Generating Fractals with SSE/SSE2

Peter Kankowski

4.94/5 (86 votes)

Nov 29, 2005

CPOL

32 min read

460886

2518

An article on generating Mandelbrot and Julia sets using Intel's Streaming SIMD Extensions (SSE, SSE2).

Download source - 39.2 Kb

Julia Set

Introduction

You probably have heard about fractals before. They are beautiful pictures such as the one shown above. Any fractal can be described using iterative formulas. So you can generate a fractal by evaluating these formulas and finding the color of each pixel. That is a large computational task, and drawing a fractal needs a fast CPU and a carefully optimized program.

Fractals and their properties are studied in non-linear dynamics. Besides the scientists, there are people who like to explore fractals simply because the pictures are graceful and quaint.

I will skip the mathematical background in this article. You can find necessary explanations here:

Julia and Mandelbrot Sets by David E. Joyce.
What Is Mandelbrot Set and other articles on fractals by Fabio Cesari.
Your fractal, sir. - Thank you, James by q123456789.
Julia and Mandelbrot Sets - Theory by Martin Pfingstl.
Fractal Theory by Cygnus Software.

The last three articles also provide some C code for fractal generation. If you are new to fractals, I recommend you read the articles by Fabio Cesari. His explanations are simple and clear, with tables and images that are worth a thousand words.

My Goals and Review of Several Fractal Generators

I had to find or develop educational software for learning fractals and their properties. The software will be used to show students what Mandelbrot and Julia sets are, and will give them some exercises. The students are expected to work with the program only for one lesson, thus it needs to be simple and intuitive. The main features are the following:

generating Julia and Mandelbrot sets using a fast SSE-driven algorithm;
viewing the orbit of the function (i.e., the sequence of iterations) as a graph and as a table;
showing axes and coordinates of any point;
easy zooming (left click to zoom in, right click to zoom out or something similar);
generating the Julia set with the C parameter that corresponds to the point of the Mandelbrot set.

The last point means you can pick the C value for your Julia set from the image of the Mandelbrot set. You simply hold the mouse button, move the cursor over the Mandelbrot set, and the Julia set changes as you do it:

Changing Julia Set

I will call this PnP mode (Picture iN Picture) for short.

And a few words about non-goals, namely the things that I am not going to implement:

fancy colors and palettes (if I did this, the students would get focused on coloring schemes instead of math);
any types of fractals other than Julia and Mandelbrot;
support for 3DNow (SIMD extensions found in old AMD processors; modern Athlon XP uses SSE);
porting to other OSes and processors (x86 and Win9x/XP were just enough for me);
multiprocessor support.

My first try was Chaos Pro. It's a complicated freeware application that can generate Julia and Mandelbrot sets and a lot of other fractals. You can even add your own fractal formulas using the built-in programming language. While the program is able to create cool 3D images and animations, use different palettes, etc., it still lacks some essential features. First, you cannot view the orbit and axes. Second, the Mandelbrot set could be generated in about 1.6 seconds on my Pentium M 1.5 GHz laptop, and the Julia set with C = -0.12+0.74i takes 1.4 seconds to draw. Chaos Pro supports PnP mode, but it's inconvenient to use because the rendering is so slow. Fractals cannot be fully rendered in real time (you can't get the picture changing instantly as you move the cursor). Third (and it was the main reason for me), I often got "Access violation" error or other bugs with Chaos Pro. Certainly, it's okay for the freeware application written by a hobbyist, but I wanted something more reliable.

Then I came across FracInt, an good old DOS program for drawing fractals. It has been developed several years ago, and a 20.0 version of this application is available now. The program is quite sophisticated and powerful, with a lot of intriguing features such as ability to view several dozens of fractal types, play fractal music, use the PnP mode, and create stereoscopic images. Its main disadvantage is the outdated user interface. Zooming takes a lot of time: you drag the mouse to set the desired rectangle size, then move it to the point and, finally, you click twice (left button to zoom in, right button to zoom out). This was unacceptable for me, because an average student will spend half of the lesson comprehending how to use this zooming interface. Video modes up to 1600x1200 are available (1024x768 VESA should work with most of the video cards), but when I ran FracInt in Windows XP and pressed Alt+Tab, my video driver crashed. The output to the sound card didn't work, and I had to use my PC speaker (does anybody still remember this infamous squealer? ;). Needless to say about missing MMX or SSE support in this old application.

After several hours of web-surfing, I discovered Fast Floating Fractal Fun a.k.a. FFFF. It's a very fast Mandelbrot set generator supporting SSE, SSE2, 3DNow and GPU Vertex Programs. If I had known about it before, I would not have started developing my own fractal rendering engine :). Though, FFFF has not much features besides zooming and switching between different calculating engines, so if you decide to use it, you will need to add a Julia set generator and develop a more elaborate GUI.

The Algorithm

Here is the algorithm for generating the Mandelbrot set in pseudocode:

complex P, Z;
for each point P on the complex plane {
   Z = 0;
   for i = 0 to ITER {
      if abs(Z) > 4.0
         set point color to palette[i];
         break the inner loop and go to the next point;
      Z = Z * Z + P;
   }
   set point color to black;
}

P and Z are complex variables. You can read general explanations about complex numbers in the articles mentioned above.

For every pixel P of the image, the loop is repeated ITER times. So it takes O(X*Y*ITER) time to execute, where ITER is the number of iterations, and the image has a dimension of X x Y pixels.

abs(Z) is a function that calculates the absolute value of a complex number. Imagine a circle with radius 2.0 and the center in the origin. If the point Z is outside of this circle, abs(Z) will be more than 4.0. In this case, the inner loop should be terminated, because you know that every point of a Mandelbrot set lies inside the circle. And if Z has gone outside the circle, the point P doesn't belong to the Mandelbrot set. Let's mark this point with a color that shows how fast it escapes the circle when you iterate the Z = Z * Z + P calculation. The number of complete iterations, i, shows this. Thus you take the i-th color from some palette and continue from the next point. The palette is simply an array of colors. For example, the points that go outside the circle after the first iteration are marked with dark green color; the points that leave the circle after the second iteration are marked with light green and so on.

Otherwise, if the loop has been executed for ITER times and the point Z has never gone outside the circle, then the point belongs to the Mandelbrot set. Let's paint it with black color.

The inner loop repeatedly evaluates ZN+1 = ZN² + P, the iterative formula for the Mandelbrot set. Z is initialized to zero, but it's easy to see that you can set Z = P, because Z1 = Z0² + P = 0² + P = P.

Using a palette is the easiest and the fastest way to color in the Mandelbrot set, but the resulting image will have only ITER+1 colors. There are methods that can render 24-bit images. For example, ChaosPro uses this formula:

color_index = (i - log(log(abs(Z))) * 1.442695 + 2) / 252.0;

Coefficients 252.0 and 2.0 scale down the color_index value to fit into a 0..1 range. Then, the color_index is translated into a 24-bit color. Unfortunately, calculating the logarithm is a slow operation that it will add about 30-50% to the overall execution time. That's why I sacrificed the 24-bit color for the sake of speed and chose the limited palette.

To translate the algorithm into C code, you have to remember the formulas for multiplying complex numbers and finding their absolute values (I hope you didn't miss math lessons at college :).

void PaintMandelbrotFPU(int w, int h, long* b, double left, double top,
                        double right, double bottom) {
   int x, y, i; double x2, y2, px, py, zx, zy, dx, dy;
   dx = (right - left) / w; // Find horizontal and vertical increase for each pixel
   dy = (bottom - top) / h;
   py = top;
   for(y = 0; y < h; ++y) {
      px = left;
      for(x = 0; x < w; ++x) {
         zx = px, zy = py;
         for(i = 0; i < ITER; ++i) {
            x2 = zx * zx,   y2 = zy * zy;
            if(x2 + y2 > 4.0)
               break;
            zy = zx * zy * 2 + py; // Calculate z = z * z + p
            zx = x2 - y2 + px;
         }
         *b++ = a[i];
         px += dx; // Move to the next pixel
      }
      py += dy; // Move to the next line of pixels
   }
}

The complex variable p has turned to px (the real part) and py (the imaginary part). The same happens for zx and zy. Squares of zx and zy are used twice: in the absolute value calculation and in finding z * z + p. That's why I made two variables x2 and y2 for them. Variables zx and zy are initialized to px and py instead of zeros, because then you will save time by skipping the 0*0 multiplication.

Note the integer variables x and y that counts the current pixel coordinates, and the pair (px, py) which are coordinates on the complex plane. px and py are increased independently of x and y. This improves performance dramatically, because you don't need to translate screen coordinates (x, y) to complex plane coordinates (px, py) each time the loop is iterated. The price for such a performance improvement is the loss of precision. Floating point numbers have limited precision, and errors are accumulated during multiple additions: px += dx, py += dy. This causes an unpleasant effect: when you zoom in the picture, it will "jerk". An alternate solution could be calculating px = (right - left) * x / w, py = (bottom - top) * y / h at each iteration, but this seems to be impractical, because the division has a long latency.

Finally, a is my palette, and b is a bitmap being generated by the function. The bitmap b is later blitted to the screen with DirectDraw or GDI functions. Details of this process are unimportant for us now. Just remember that the size of b is w * h pixels, and you need to fill it with colors from the palette.

When the inner loop is terminated using a break, i has a value below ITER. And when it's not terminated, i == ITER after the loop. With this property, I did a nasty trick by creating the array a[ITER+1] and making the last element of this array equal to the color black. After the loop, a[i] refers to black in a[ITER]. A more strict way to do this would be:

// ...
for(i = 0; i < ITER; ++i) {
    x2 = zx * zx,   y2 = zy * zy;
    if(x2 + y2 > 4.0) {
       *b++ = a[i];
       px += dx;
       goto NEXTPIXEL;
    }
    zy = zx * zy * 2 + py; // Calculate z = z * z + p
    zx = x2 - y2 + px;
}
*b++ = RGB(0,0,0);
px += dx; // Move to the next pixel
NEXTPIXEL:
// ...

The algorithm for generating a Julia set is very similar to the one for the Mandelbrot set, so I will not show it here. The main difference is the introduction of the complex parameter (PX, PY) in Julia.

MOV-ADD method

I did a little profiling and found that the PaintMandelbrotFPU function takes about 95%-99% of the overall execution time. Painting takes only 5% with GDI or 1% when using DirectDraw. Thus it's much more important to optimize this function than any other part of the program.

One of the best ways to do such optimizations is using SSE (Streaming SIMD Extensions). These instructions allow you to calculate four pixel values at the same time. So, at least in theory, the speed will be improved by a factor of 4 (the real figures are about 3.2-3.3). You can find more information about SSE in the Intel Architecture Software Developer's Manual, volume 1, chapter 10, and the AMD64 Architecture Programmer's Manual, vol. 1, chapter 4. In further explanations, I will assume you know the basics of SSE and x86 assembler programming.

There are a lot of different ways to arrange SSE instructions for this function, and some of these ways are faster than others. Here is a very fast sequence for the Pentium IV processor. I call it the MOV-ADD method:

; xmm0 = zx  xmm1 = zy  xmm2 = tmp  xmm3 = tmp  
;            xmm4 = px  xmm5 = py  xmm6 = result  xmm7 = 4.0
; eax = tmp    ecx = loop counter
   MOVAPS xmm0,xmm4
   XORPS  xmm6,xmm6
   MOVAPS xmm1,xmm5
   mov ecx, ITER
ILOOP:
   ; xmm0 = zx             xmm1 = zy
   MOVAPS xmm2, xmm0
   MULPS  xmm0, xmm0
   MOVAPS xmm3, xmm1
   ADDPS  xmm1, xmm1
   ; xmm0 = zx^2           xmm1 = 2 * zy     xmm2 = zx           xmm3 = zy
   MULPS  xmm1, xmm2
   MOVAPS xmm2, xmm0
   MULPS  xmm3, xmm3
   ; xmm0 = zx^2           xmm1 = 2*zy*zx    xmm2 = zx^2         xmm3 = zy^2
   ADDPS  xmm1, xmm5
   SUBPS  xmm0, xmm3
   ADDPS  xmm2, xmm3
   ; xmm0 = zx^2 - zy^2    xmm1=2*zy*zx+py   xmm2 = zx^2 + zy^2  xmm3 = zy^2
   CMPLEPS xmm2, xmm7
   ADDPS   xmm0, xmm4
   MOVMSKPS eax, xmm2
   test eax, eax
   jz EXIT
   ANDPS   xmm2, xmm7      ; xmm6 += (xmm2 < 4.0) ? 4.0 : 0.0;
   ADDPS   xmm6, xmm2
   sub ecx, 1
   jnz ILOOP
EXIT:

Only the inner loop is shown here. By the start of the loop, xmm0 and xmm1 contain px and py, the initial values for zx and zy. Registers xmm4 and xmm5 store px and py (both of them are loop invariants). The register xmm7 contains the constant 4.0 in all doublewords.

The program starts by saving xmm0 and xmm1 in temporary registers xmm2 and xmm3. Then zx is squared, and zy is doubled by adding xmm1 to itself. Other steps of calculation are shown in the comments above.

The register xmm6 holds the number of complete iterations. Note that the number in xmm6 should be counted independently for each of the 4 pixels. The CMPLEPS instruction puts a bit mask into the register xmm2, and the sequence MovMskPS-test-jz checks whether all 4 pixels have gone outside of the circle with a radius 2.0. If this happens, the loop is terminated. In other cases, the program should increase the number of complete iterations for the pixels that are still inside the circle. The mask is then ANDed with 4.0 to get 4.0 in the doublewords corresponding to such pixels, and 0.0 in other doublewords. This value is added to xmm6. In short, when the loop ends, xmm6 will contain 4.0 * number-of-complete-iterations for each of the 4 pixels. The value will be later used as an offset into our palette.

Now let's calculate the execution time for Pentium IV and find the bottlenecks in this code. The manual by Anger Fog describes how to perform such an analysis; if you are new to optimization for Pentium IV, please read the manual. See also official documents from Intel: Intel Architecture Optimization Reference Manual and Microarchitecture of the Pentium 4 Processor.

Firstly, memory access is not a bottleneck here, because the calculation takes much more time than writing to memory. For the usual resolution 1024x768, the program writes 1024*768*4 bytes = 3 Mb. Typical throughput for modern DDR memory may be about 1-2 Gb/s. A rough estimation shows that the memory access will take 1.5-3 milliseconds, while the real execution time for this code is 70-80 ms. An analysis of the code gives the same result: memory is accessed only twice for 1 pixel (see the code in the next chapter), but the inner loop is executed ITER times (typically 64 times or so), and it includes SSE instructions with long latencies.

That's why different tricks for optimizing cache performance are of no use for this code. The same for instruction decoding: the loops are so small that they fit into the trace cache entirely, and they are decoded only once. The real bottleneck is the long latencies and limited throughput of SSE instructions.

Mov-Add Function Analysis

For simplicity sake, let's presume that the pipeline is initially empty, and the whole code is in the trace cache. Register names xmm1, xmm2, etc., are shortcuts to x1, x2 to fit into the picture.

The first three instructions (MOVAPS x2, x0, MULPS x0, x0, and MOVAPS x3, x1) are sent from the trace cache into the op queue. The FP/SSE execution cluster has two ports: one for add, mul, div operations, called fp unit in Agner Fog's manual, and one for moves between registers and stores (the mov execution unit). These two units can start executing the MOVAPS and MULPS instructions at the same time.

At the second clock cycle, the next three instructions go to the queue. From them, MOVAPS x3, x1 and ADDPS x1, x1 can execute in parallel. The reciprocal throughput for the MOVAPS instruction is 1, thus the second instruction can begin at this cycle.

The MOVAPS instruction has a latency of 6 clock cycles, and MULPS x1, x2 depends on the results of MOVAPS x2, x0 (dependency is shown with arrows on the chart). So MULPS can start at the sixth cycle (by this time, the queue should be full of ops waiting to execute).

The next instruction, MOVAPS x2, x0, depends upon MULPS x0, x0. The MULPS instruction also has an additional latency of 1, meaning it needs to wait for 1 extra clock cycle if the next op goes to a different execution unit. MULPS and MOVAPS are from different execution units fp and mov, and I have to insert a delay of one cycle between them.

Now, the MULPS x3, x3 instruction. The multiplication subunit (FP multiplier in terms of Intel) can execute one multiply every two cycles, thus the second MULPS instruction has to wait for one cycle. It will be started at the eighth cycle.

The next five instructions go to the FP addition subunit (FP adder). All of them have a latency of 4 clock cycles. ADDPS x1, x5 can begin at the twelfth cycle, when the results of the MUL x1, x2 multiplication will be ready. The same for SUBPS x0, x3; it starts at the fourteenth cycle. The ADDPS x2, x3 instruction could begin earlier than at the sixteenth cycle, but the FP adder throughput limits the number of simultaneously running instructions to 2. The sequence SUBPS x0, x3; ADDPS x2, x3; ADDPS x0, x4; CMPPS x2, x7 perfectly matches this requirement of the FP adder. The two neighboring instructions are from different dependency chains, thus they are fully independent and can execute in parallel.

Our purpose is to complete the x0 = x0 ^ 2 - x1 ^ 2 + x4, x1 = 2 * x0 * x1 + x5 calculation as soon as possible. On the chart, the instructions that compute the new values for x0 and x1 are marked with different shades of gray, and non-relevant instructions are white. You need to schedule the "gray" instructions earlier than others because the next iteration of the loop can't start without the new values for x0 and x1. The ANDPS and MOVMSKPS instructions from the previous iteration can execute in parallel with the next iteration, provided that the jump will be predicted correctly. So you can schedule them later than gray instructions.

Time measurements I made confirmed these considerations. There is now a long delay between MOVAPS x2, x0 and ADDPS x2, x3 (see the chart). But if you try to put ADDPS x2, x3 earlier than SUBPS x0, x3, the overall execution time will increase. This means the delay doesn't matter as long as x0 and x1 are calculated early.

You see that the speed of the code depends on both the instruction latencies and the throughput, because there are three quite independent chains, which can execute at the same time. The critical path is marked with dark gray color. MOVAPS adds an unnecessary delay (may be, if x86 architecture had three-address instructions like RISCs, the code would have execute faster, who knows). The subunits are not utilized fully. At the first stage of the loop (cycles 0-12), a great load is put on the MOV unit and the FP multiplier; later, in cycles 14-22, the FP adder is overloaded, and the FP multiplier is not used at all. You could take instructions from two iterations and shuffle them in order to get more equable load, but you need more SSE registers for this. With AMD64 (or Intel EM64T), 8 additional registers are available, so the code can be further optimized in 64-bit mode.

After drawing the chart, I ordered the instructions by their start time. For example, MOVAPS x2, x0 and MULPS x0, x0 will be the earliest started instructions, so I put them first in the code. Then MOVAPS x3, x1 and ADDPS x1, x1 begin, and they are the next instructions in the source code. Ordering the instructions reduces the number of ops waiting in the queue. Now there is no need for your processor to reorder the instructions at run-time: they are already ordered in accordance with their latencies, throughput, and dependencies.

But what about branch prediction in this code? Can it be a bottleneck? In fact, the penalties for mispredicted branches take a lot of execution time for this program. But the number of complete iterations is the result we want to obtain. You can't perfectly predict how many times the loop will be executed, because this value is itself the result of the program. If you know exactly how many times the loop is repeated for each pixel, then why are you running this program? :)

I've randomly selected two close-up pictures of the sets, then calculated the distribution of the iteration's number for them and for one large-scale image:

Bar chart showing distributions

Mandelbrot and Julia Set Pictures

The blue bars denote the large-scale image of the Mandelbrot set; the red and yellow bars show two close-up pictures. ITER is equal to 64 in my program, so the chart shows the frequencies of the points from 0 to 5 iterations, from 6 to 11 iterations, and so on to 60 iterations and more. Obviously, the minimal and maximum numbers of iterations constitute the largest parts of the picture. For the blue bar, the minimal value is 0-5 iterations (59% of all pixels); for the red bar, it's 12-17 (43%); and for the yellow bar, it's 30-35 (8%) and 36-41 (15%). The minimal number of iterations corresponds to the vast background area filled with one color, and the maximum number of iterations means the black points that belong to the set. Thus, 26% of the pixels belong to the Mandelbrot set on the large picture, and 21% and 62% of the points belong to the set on the two close-ups.

From this analysis, you can see that if the loop is terminated, there is a high probability that it will be terminated early, at the minimal number of iterations. But if the loop is not terminated early, it usually will not be terminated at all. This statistics is interesting, but I couldn't make use of it in the code. May be, you, the reader, can do it. Any suggestions will be appreciated.

Let's make some conclusions about this MOV-ADD method:

The general strategy is to calculate the values that will be needed in the next iteration (in our case, x0 and x1) as early as possible. These calculations constitute a critical path of the program, and the overall execution time depends on their finish time.
By scheduling independent instructions to execute simultaneously, I tried to improve the instruction-level parallelism (ILP). For example, the first four instructions execute addition, multiplication, and two moves in parallel.
The instructions are ordered by their start time, which helps to reduce the length of the op queue.

FFFF method

FFFF v3.2.2, the Mandelbrot set generator mentioned above, uses another method. FFFF is open-source, so it is fully legal to publish its code here:

; xmm0 = zx  xmm1 = zy  xmm2 = tmp  xmm3 = tmp  xmm4 = px  
;            xmm5 = py  xmm6 = result  xmm7 = 4.0
; eax = tmp    ecx = loop counter
   MOVAPS xmm0,xmm4
   XORPS  xmm6,xmm6
   MOVAPS xmm1,xmm5
   mov ecx, ITER
ILOOP:
   ; xmm0=zx      xmm1=zy
   MOVAPS xmm2, xmm0
   MULPS  xmm2, xmm1
   MULPS  xmm0, xmm0
   MULPS  xmm1, xmm1
   ; xmm0=zx^2    xmm1=zy^2  xmm2=zx*zy 
   addps xmm2, xmm2
   movaps xmm3, xmm0
   addps xmm3, xmm1
   ; xmm0=zx^2    xmm1=zy^2  xmm2=2*zx*zy  xmm3=zx^2+zy^2
   cmpltps xmm3, xmm7
   movmskps eax, xmm3
   test eax, eax
   jz EXIT

   subps xmm0, xmm1
   movaps xmm1, xmm2
   ; xmm0=zx^2-zy^2  xmm1=2*zx*zy   xmm3=zx^2+zy^2
   ANDPS  xmm3, xmm7 ; xmm6 += (xmm3<radius) ? 4.0 : 0.0
   ADDPS  xmm6, xmm3
   ADDPS  xmm0, xmm4
   ADDPS  xmm1, xmm5
   dec ecx
   jnz ILOOP
EXIT:

FFFF Function Analysis

FFFF tries to compare x0^2 + x1^2 with 4.0, and only if the comparison fails, it finds new values for x0 and x1. If the comparison returns true, the loop is terminated, and there is no need to calculate x0 and x1. Although this may seem a good strategy, my experiments show it is not. Look at the long chain: MOVAPS xmm3, xmm0; ADDPS xmm3, xmm1; CMPLTPS xmm3, xmm7; MOVMSKPS eax, xmm3. When this chain runs, there are a lot of possibilities to utilize free execution units and to run the second chain in parallel. But FFFF authors didn't insert any instruction between the instructions of this chain. The scheduler has to reorder instructions itself. Sometimes, it fails, and the execution units are left unused. For example, there is so large a distance between MULPS xmm1, xmm1 and SUBPS xmm0, xmm1 in the code that the second instruction may be not sent to the queue in time. In this case, an additional delay should be added between those instructions. The same for ADDPS xmm2, xmm2 and MOVAPS xmm1, xmm2: there is a small chance that they will be executed without a delay.

The chart shows the ideal case without any delays. Even then x0 takes 20 clock cycles to calculate, and x1 takes 27. For my MOV-ADD method, the figures are 16 and 22 clock cycles, respectively. The critical path (marked with dark gray color) includes two MOVAPS instructions having long latencies, while MOV-ADD method uses only one MOVAPS on its critical path.

There are additional latencies before MOVAPS x1, x2 and MOVAPS x3, x0, because the previous instruction was executed in a different unit. The ADDPS x3, x1 is put off for one cycle due to the limited throughput of the FP adder.

In any other way, MOV-ADD is similar to FFFF. Both methods use SSE to calculate 4 pixel values together, both have the same limitation of precision and the same jerking effect when zooming. I used a similar algorithm, but arranged instructions differently to improve performance. Other distinctions between FFFF and my program will be shown later.

In the FFFF code above, the register names are changed to the names used in MOV-ADD (xmm0 for zx, xmm1 for zy, and so on). This change doesn't affect performance, but makes the methods' code easier to compare.

Vodnaya method

I also developed Vodnaya, a method optimized for Pentium M. The name is fully meaningless, so if it's hard for you to pronounce Russian words, you can just call it Vod-method or V-method ;). It outperforms both MOV-ADD and FFFF methods on Pentium M and Pentium III. Only a little is known about the Pentium M microarchitecture, so I was unable to do an analysis of latencies and throughput. After much trial and error, I found a sequence of instructions that seemed to be the fastest on these processors. The method is also a champion in the number of instructions: it is one instruction smaller than MOV-ADD or FFFF.

   MOVAPS xmm0,xmm4
   XORPS  xmm6,xmm6
   MOVAPS xmm1,xmm5
   mov ecx, ITER
ILOOP:
   ; xmm0 = zx             xmm1 = zy
   MOVAPS xmm2, xmm1
   MULPS  xmm2, xmm2
   MULPS  xmm1, xmm0
   ; xmm0 = zx             xmm1 = zy*zx           xmm2 = zy^2
   MOVAPS xmm3, xmm2 ; save zy^2
   MULPS  xmm0, xmm0
   ADDPS  xmm2, xmm0
   ; xmm0 = zx^2           xmm1 = zy*zx           xmm2 = zx^2+zy^2     xmm3 = zy^2
   CMPLEPS  xmm2, xmm7
   ADDPS  xmm1, xmm1
   SUBPS  xmm0, xmm3
   ; xmm0 = zx^2-zy^2      xmm1 = zy*zx*2         xmm2 = zx^2+zy^2
   ADDPS  xmm1, xmm5
   ADDPS  xmm0, xmm4
   MOVMSKPS eax, xmm2
   test eax, eax
   jz EXIT
   ANDPS  xmm2, xmm7
   ADDPS  xmm6, xmm2
   sub ecx, 1
   jnz ILOOP
EXIT:

The zx^2+zy^2 sum is calculated in the temporary register xmm2. zy^2 is copied here to save it from being overwritten when the program will subtract zx^2-zy^2. New values for zx and zy are computed in place, in the xmm0 and xmm1 registers. I guess the whole method is fairly graceful. Unfortunately, it's not the fastest one for Pentium-IV and Athlon XP processors.

The Great Battle: MOV-ADD vs. FFFF vs. Vodnaya

And now, the speed comparison of these three methods:

Bar Chart with Method Comparison

In the set of tests, the dimension of the image was 1024x768 pixels; the time was measured with the RDTSC instruction (see the source code in the SSEroutines_test.asm file). I repeated the tests five times and took the minimal result for each method. Times in the chart are expressed in millions of clock cycles. The lower times are better. After the names of processors, their CPUID signatures are shown. For example, Celeron F13 is a family 15, model 1, stepping 3 processor, namely a trimmed down Pentium IV. Its caches are smaller than those of Pentium IV, but the core is the same, that's why the results of this Celeron should be almost equal to the results of Pentium IV (sorry, I was unable to get Pentium IV for the tests). "Rabbit" means Julia set with C = -0.12+0.74i.

Evidently, MOV-ADD is the best method for Pentium IV and Athlon XP processors. Vodnaya shows good results on Family 6 processors from Intel (in particular, Pentium III and Pentium M). FFFF is definitely a loser on every type of processor. The speed improvement of MOV-ADD in comparison to FFFF is about 12% on Pentium IV, and the difference between Vodnaya and FFFF is 7-11% for Pentium M and Pentium III.

The readymade program available uses Vodnaya on family 6 processors from Intel and MOV-ADD method on other processors. The method is selected at run-time using the processor signature retrieved with the CPUID instruction.

CvtTss2si

After finishing the calculation, the program should translate the results into color values. The single line of code (*b++ = a[i];) can make the translation in C, but in assembler language, you have to do more work.

The program initializes the xmm6 to zero and adds 4.0 to it at each iteration (see the code of any method above). If the inner loop is not terminated, it is repeated ITER times. When the loop is over, the number of complete iterations is multiplied by 4 in the xmm6 register provided that the point doesn't belong to the Mandelbrot set, and there is the ITER * 4 value there otherwise. The same trick with the last element of array can be done: add one element to the array and make its value equal to black. By doing so, you can use one coloring code to handle points that are in the Mandelbrot set and the points that are not. FFFF v3.2.2 handles these types of points differently using a conditional move (it's a slower way, and the CMOV instruction is not supported on some old AMD processors).

Note that the xmm6 register holds the number of complete iterations multiplied by 4. FFFF divides it to get the index in the palette. For division, it uses the shr instruction, which is slow on Pentium IV processors with the model number lower than 3. But one element of the palette takes 4 bytes in memory. Thus FFFF divides the number by 4, and then multiplies it by 4 to get the memory offset. What a terrible waste of time! :) I do this in a much simpler way:

CVTTSS2SI eax, xmm6
; xmm6 contains an offset into palette, i.e. 4*number-of-iterations

mov eax, [esi + eax]  ; palette is started at [esi]
mov [edi], eax        ; the current pixel of the bitmap is at [edi]

The CvtTss2si instruction converts, with truncation, a single-precision floating-point value to an integer value, in the eax register. The number of iterations is, obviously, an integer number. It will be converted without rounding, that's why the rounding mode is unimportant here. The next instruction extracts the color from palette, and the last one writes the color into the bitmap. The bitmap may be located in the video memory or in the RAM (don't care about this for now).

There are other instructions for converting floating-point numbers (namely, CvtPs2Pi and CvtPs2Dq), but none of them write the result to general-purpose registers. You have to store the number in memory and immediately read it from the same location, which can cause additional delays on a Pentium IV. I tested the code with CvtPs2Dq and found that it is not faster than the code using CvtTss2si.

The Power of FASM Macros

Writing several variants of the same code for the Mandelbrot set and for the Julia set, for black and white image and for color image, for SSE and for SSE2, requires a lot of routine work. Thankfully, most assemblers have some special tools that can do this work for you. The tools are macroinstructions and conditional assembly. Macros let you repeat the sequence of instructions with different parameters, and conditional assembly allows inserting instructions depending on these parameters.

I chose FASM 1.62 from several assemblers, because it is tiny, fast, and has a rich macro syntax. The FASM manual describes the syntax in detail. Here is a small piece from the Vodnaya method to illustrate the use of macros in my program:

macro JuliaMandelPaint color, type {
; ...
   ADDPS  xmm1, xmm1
   SUBPS  xmm0, xmm3
if type = Julia
   ADDPS  xmm1, dqword[cy1]
   ADDPS  xmm0, dqword[cx1]
else if type = Mandel
   ADDPS  xmm1, xmm5
   ADDPS  xmm0, xmm4
end if
; ...
}

; later in the code:
   mov eax, [fractaltype]
   test eax, eax
   jz MANDELBROT
JULIA:
   JuliaMandelPaint 1, Julia
   ret
MANDELBROT:
   JuliaMandelPaint 1, Mandel
   ret

If the argument type is equal to Julia, FASM will generate the code that adds cx1 to xmm0 and cy1 to xmm1, else it will use the values from the registers xmm4-xmm5. Both variants of the code will be generated: the first at the label JULIA, and the second at MANDELBROT. You just check the variable fractaltype and decide which variant to use.

You could say that there is an easier way to do this:

   mov eax, [fractaltype]
   test eax, eax
   jz MANDELBROT
   ADDPS  xmm1, dqword[cy1]
   ADDPS  xmm0, dqword[cx1]
   jmp NEXT
MANDELBROT:
   ADDPS  xmm1, xmm5
   ADDPS  xmm0, xmm4
NEXT:

Why not use this code? Because the check is located in deeply nested loops, and the comparison will be repeated several million times, while it needs to be done only once. Needless to say about delays that can occur if the conditional jump will be mispredicted... So it's advantageous to move the comparison out of the loop. Macros help to write this shortly.

The next interesting task is adapting the methods for SSE2. SSE2 instructions operate on two double-precision values in parallel. In other aspects, they have no difference from SSE. Even the latencies and throughputs for SSE2 are the same as for SSE.

So you have to do a massive search and replace to change ADDPS to ADDPD, MULPS to MULPD, MOVAPS to MOVAPD, etc. Then you adjust pointer increments and get two functions: one for SSE and one for SSE2. This approach is used in FFFF and several other programs. There is one serious inconvenience with it. When you want to update the program, you have to change the same code in both functions. It's tedious and frustrating. Sometimes you correct one function, but forget to change another.

The idea is to create a macro called MOVAP that will be replaced with MOVAPS for SSE and MOVAPD for SSE2. Now, instead of two functions you write only one and use MOVAP, MULP, and ADDP macros in it. During the compilation, the real SSE or SSE2 instructions are substituted for these macros:

macro MOVAP a, b {
   if ssewidth = 4
      MOVAPS a, b
   else
      MOVAPD a, b
   end if
}
macro ADDP a, b {
   if ssewidth = 4
      ADDPS a, b
   else
      ADDPD a, b
   end if
}
macro MULP a, b {
   if ssewidth = 4
      MULPS a, b
   else
      MULPD a, b
   end if
}
; another 2 pages of such clumsy code
; for every SSE/SSE2 instruction

Don't hurry to write this code! A much more elegant solution exists:

macro defineinstr instr, [args] {
   common
   if ssewidth = 4
      instr#S args
   else
      instr#D args
   end if
}

macro MOVAP a, b   {defineinstr MOVAP, a, b}
macro ADDP a, b    {defineinstr ADDP, a, b}
macro MULP a, b    {defineinstr MULP, a, b}

The macro defineinstr takes a variable number of arguments. For example, it can be used for SHUFPS with three arguments and for ADDPS with two arguments. The macro issues the instruction with a S suffix if ssewidth = 4, or the same instruction with a D suffix otherwise. The # operator concatenates names; common means the instruction will be repeated only once.

You can make this code even shorter:

macro defineinstr instr {
   macro instr [args] \{
   \common
      if ssewidth = 4
    instr#S args
      else
    instr#D args
      end if
   \}
}
defineinstr MOVAP
defineinstr ADDP
defineinstr MULP

There are two nested macros here. The outer macro, defineinstr, defines the inner one, instr. Note that instr is also the argument of defineinstr, so when you write defineinstr MOVAP, a macro named MOVAP will be defined. This is exactly what I want. Curly brackets and the common directive need to be escaped with backslashes because of some strange parsing problems in the internals of FASM. Now the definition of a new SSE/SSE2 instruction is extremely short and easy: just write defineinstr and the name of the instruction. The number of arguments doesn't matter.

I hope you are convinced now that FASM macros are very powerful and graceful. If not, just try to make a C preprocessor macro with variable number of arguments and compare it with the FASM equivalent!

Blitting Bitmaps with GDI and DirectDraw

The SetPixel function seems to be the slowest possible way to output the pixels to the screen:

case WM_PAINT: {
   PAINTSTRUCT ps;
   HDC hdc = BeginPaint(hWnd, &ps);
   for(x = 0; x < w; ++x)
      for(y = 0; y < h; ++y)
          SetPixel(hdc, x,y, b[i]);
   EndPaint(hWnd, &ps);
}

Use this function only to set 1 or 2 pixels; it's too slow for large images. Here is a simplified version of my program which is much faster:

int w, h; HBITMAP hBMP = 0, holdBMP; HDC hmem;
BITMAPINFOHEADER bi = {sizeof(BITMAPINFOHEADER), 0, 0, 1, 32, BI_RGB};
void ReDraw() {
   PaintJuliaSSE(b, ceil(w, 4), h);
   SetDIBits(hmem, hBMP, 0, h, b, 
            (BITMAPINFO*)&bi, DIB_RGB_COLORS);
            // Copy the bitmap to hmem
}
LRESULT CALLBACK WndProc(HWND hWnd,UINT msg,WPARAM wParam,LPARAM lParam) {
   switch(msg) {
      case WM_SIZE: {
         HDC hdc = GetDC(hWnd);
         if(hBMP) { DeleteObject(SelectObject(hmem, holdBMP)); DeleteDC(hmem); }
         w = LOWORD(lParam), bi.biWidth = ceil(w, 4),
         h = HIWORD(lParam), bi.biHeight = h;
         bi.biSizeImage = bi.biWidth * h * 4;
         b = (long*)realloc(b, bi.biSizeImage);
         hBMP = CreateCompatibleBitmap(hdc, ceil(w, 4), h);
         hmem = CreateCompatibleDC(hdc);
         holdBMP = (HBITMAP)SelectObject(hmem, hBMP);
         ReleaseDC(hWnd, hdc);
         ReDraw();
         return 0;  }
      case WM_PAINT: {
         PAINTSTRUCT ps; HDC hdc = BeginPaint(hWnd, &ps);
         BitBlt(hdc, 0, 0, w, h, hmem, 0, 0, SRCCOPY);  // Copy hmem to the screen
         DrawPolyline(hdc);
         EndPaint(hWnd, &ps);
         return 0;   }
      case WM_DESTROY:
         DeleteObject(SelectObject(hmem, holdBMP)); DeleteDC(hmem);
         PostQuitMessage(0); return 0;
      default:
         return DefWindowProc(hWnd,msg,wParam,lParam);
   }
}

You call the ReDraw function every time the image is fully updated, e.g., zoomed in or out. It uses PaintJuliaSSE to generate the bitmap of the fractal in b. Then the SetDIBits function is called. It copies the whole b array to the hmem memory DC.

In response to WM_SIZE, the program sets a new size of the bitmap and recreates it. Also, the memory for the bitmap should be reallocated, and the fractal should be fully redrawn. The handler for WM_PAINT just blits (copies) the memory DC to the screen and adds a polyline that shows the orbit of the function.

The image of the fractal is stored in the memory DC, which exists during the whole lifetime of the program. Having this image, you can quickly draw the moving polyline by blitting the stored image and drawing the new position of the polyline above. This will cause a little flicker of the polyline, but it's tolerable. Adding the second memory DC and the second bitmap could remove the flicker completely, but then the program will be slow and memory-intensive.

SSE instructions operate on 4 values in parallel, and all SSE methods shown above require the number of pixels in the line to be divisible by 4. That's why the width (w) is rounded up with the ceil(w) function. The bitmap becomes a little larger than needed, and in the WM_PAINT handler, you blit only the w * h rectangle from it.

Besides GDI, the program also supports DirectDraw. I won't show the code here, because it's very long and boring (error handling takes a lot of space). In essence, it is very close to the GDI code. An off-screen surface is created; the fractal is rendered into it. On WM_PAINT, you blit the off-screen surface to the primary surface (to the screen) and draw the polyline. Modern video cards store both surfaces in the video memory, thus blitting with DirectDraw can be 50 times faster than GDI blitting (I get this result on my laptop with a simple on-board video chip, Intel 915 GM). DirectDraw improves the overall speed by 5-6% in comparison with GDI.

DirectDraw is supported for Windows 98 and higher. For the fans of Windows 95, I wrote a few lines of code so the program will gracefully degrade to GDI if DirectDraw is not available :).

Conclusion

My program is a very fast Mandelbrot and Julia set generator using SSE/SSE2 and DirectDraw. I will not claim it's "the fastest fractal generator for Win32", as the author of FFFF did, but you see it's pretty fast and easy-to-use. Some points still need more work, for example:

there are faster algorithms which "guess" the values of the neighboring pixels;
you can further optimize MOV-ADD and Vodnaya methods to get the maximum performance on 64-bit processors;
may be, some tricky approach to optimizing branch prediction exists, but I didn't find it;
the program can't save the current position and the C parameter to let you return to them in future (the next version will do this);
changing the palette is not supported, but it was not my goal.

So the code is good, but not perfect. Please add your comments and suggestions on how to improve it.

You can use the techniques presented in this article in many different graphic applications. The analysis of latencies and throughputs helps you to find bottlenecks. Then you can exclude long-latency operations from the critical path, make several independent chains in your code, and increase instruction-level parallelism by reordering the instructions. Finally, the assembler macros free you from such dull work as rewriting the same code for SSE2.

Credits and Thanks

Thanks to Daniele Paccaloni, the author of FFFF. Viewing the source of his program helped me to find suboptimal solutions in my own code and, generally, to make my program better. Thank you, Daniele!

Thanks to my teacher, Tatyana Valenteenovna Korablina, for providing useful information about fractals and non-linear dynamics. I also feel grateful to people who helped me with testing: Anton Orlov, Larisa Alekseevna Morozova, and Victor "vito". Special thanks to Stas, who stayed at work on that Friday evening and let me use a lot of computers for tests.