Disease healthy network
Current Position : HOME > Syphilis > Mandelbrot and 16-bit fixed point multiplies ( Part II )

Mandelbrot and 16-bit fixed point multiplies ( Part II )

POST:2008-10-29 09:42:48  
The original SSE2 implementation of the Mandelbrot leveraged the PMADDWD instruction to do the workload inside of the inner loop. Unfortunately, in order to get this to work, lots of code had to be inserted to shuffle data around, and this instruction leaves the packed results in a 32-bit format. This requires PACKSSDW instructions to get the data back to 16-bits before it can be used for further computation. This adds significant overhead to the inner loop calculation. The advantage that PMULHRSW provides are that there are no dependencies on how the data is ordered within the register and it produces its results in a packed 16-bit format. After gaining an understanding of the code differences between the SSE2 and SSSE3 implementations, I believed that it was possible to gain this same advantage using SSE2, but I needed to leverage PMULHW for PMULHRSW. PMULHW writes into the upper 16 bits of the destination 32 bit temporary result, as illustrated in Figure 3 below.
Figure 3: Bit selection of PMULHW
Using 4.12 fixed point arithmetic, PMULHW produces an 8.8 fixed point number as a result, as illustrated in Figure 4.
Figure 4: 4.12 PMULHW multiply; W=Whole, F=Fractional bit
This leaves the least significant bit of the result to represent 2-8. Unfortunately, after modifying the SSE2 code to use this technique, the fidelity of the Mandelbrot picture started to degrade. Inside the edges of the Mandelbrot pattern, strings of long black pixels began to show, and the rest of the intricate pattern began to look noisy and dirty, with random pixel popping. It became evident that the precision of the Mandelbrot needed to retain the 2-9 bit. I needed to add more fractional bits (precision) to the upper 16 bits of the multiply to get this technique to work.
The only way to add more fractional bits is to take away whole bits. Stepping back a bit, I took a hard look at the data. In this particular Mandelbrot benchmark, the left and right edges of the window are represented by -2.25 and .75 respectively, and -1.25 and 1.25 for the bottom and top edges respectively. If I took away a whole bit from the fixed point data, changing the 4.12 input data to 3.13, I still have enough range to represent the default Mandelbrot zoom. For signed data, 3 whole bits can represent a range from -4 to +3. If you factor in the value of fractional bits, the upper range is actually very close to positive 4. Figure 5 below illustrates how PMULHW treats 3.13 source data.
Figure 5: 3.13 PMULHW multiply; W=Whole, F=Fractional bit
As can be seen, the precision of PMULHW now goes to 2-10, because a whole bit was reduced from each 16 bit source, so the multiplied result loses two whole bits. This reduces the range of the signed result to 6 bits (-32 to +31), but this goes beyond our modest needs. In addition, with 10 fractional bits, this exceeds the precision that PMULHRSW gave with 4.12 fixed point data. A testing pass verified that the fidelity of the rendered fractal with this new data format and algorithm was indeed tight and well formed. Overall, with the reduction of pack and data swizzling instructions, this resulted in about a 2.7x speedup over the original SSE2 implementation that used PMADDWD. Through my own internal measurements, this is actually slightly faster than the PMULHRSW optimized versions as well.
I think the biggest point that I want to get across with my writing is, "Think about your data". How can you optimize its use? Not only in terms of how much you have, but as illustrated in this fixed point example, the data format as well. While it's true that the PMULHW instruction enables us to do a fast 16-bit fixed point multiply, we had to change the format of the data to make use of the optimization possible. If you have control over your data (in this benchmark I had, but this is not always the case), the time spent optimizing your data up front can pay back huge dividends later on with simpler/fewer, faster code.
Kent Knox
Member of AMD Technical Staff
Kent Knox is a Member of Technical Staff in Solutions Enablement Engineering at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

0 Vote

Popular

Recent

Radom commend