|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionYou probably have heard about fractals before. They are beautiful pictures such as the one shown above. Any fractal can be described using iterative formulas. So you can generate a fractal by evaluating these formulas and finding the color of each pixel. That is a large computational task, and drawing a fractal needs a fast CPU and a carefully optimized program. Fractals and their properties are studied in non-linear dynamics. Besides the scientists, there are people who like to explore fractals simply because the pictures are graceful and quaint. I will skip the mathematical background in this article. You can find necessary explanations here:
The last three articles also provide some C code for fractal generation. If you are new to fractals, I recommend you read the articles by Fabio Cesari. His explanations are simple and clear, with tables and images that are worth a thousand words. My Goals and Review of Several Fractal GeneratorsI had to find or develop educational software for learning fractals and their properties. The software will be used to show students what Mandelbrot and Julia sets are, and will give them some exercises. The students are expected to work with the program only for one lesson, thus it needs to be simple and intuitive. The main features are the following:
The last point means you can pick the C value for your Julia set from the image of the Mandelbrot set. You simply hold the mouse button, move the cursor over the Mandelbrot set, and the Julia set changes as you do it:
I will call this PnP mode (Picture iN Picture) for short. And a few words about non-goals, namely the things that I am not going to implement:
My first try was Chaos Pro. It's a complicated freeware application that can generate Julia and Mandelbrot sets and a lot of other fractals. You can even add your own fractal formulas using the built-in programming language. While the program is able to create cool 3D images and animations, use different palettes, etc., it still lacks some essential features. First, you cannot view the orbit and axes. Second, the Mandelbrot set could be generated in about 1.6 seconds on my Pentium M 1.5 GHz laptop, and the Julia set with C = -0.12+0.74i takes 1.4 seconds to draw. Chaos Pro supports PnP mode, but it's inconvenient to use because the rendering is so slow. Fractals cannot be fully rendered in real time (you can't get the picture changing instantly as you move the cursor). Third (and it was the main reason for me), I often got "Access violation" error or other bugs with Chaos Pro. Certainly, it's okay for the freeware application written by a hobbyist, but I wanted something more reliable. Then I came across FracInt, an good old DOS program for drawing fractals. It has been developed several years ago, and a 20.0 version of this application is available now. The program is quite sophisticated and powerful, with a lot of intriguing features such as ability to view several dozens of fractal types, play fractal music, use the PnP mode, and create stereoscopic images. Its main disadvantage is the outdated user interface. Zooming takes a lot of time: you drag the mouse to set the desired rectangle size, then move it to the point and, finally, you click twice (left button to zoom in, right button to zoom out). This was unacceptable for me, because an average student will spend half of the lesson comprehending how to use this zooming interface. Video modes up to 1600x1200 are available (1024x768 VESA should work with most of the video cards), but when I ran FracInt in Windows XP and pressed Alt+Tab, my video driver crashed. The output to the sound card didn't work, and I had to use my PC speaker (does anybody still remember this infamous squealer? ;). Needless to say about missing MMX or SSE support in this old application. After several hours of web-surfing, I discovered Fast Floating Fractal Fun a.k.a. FFFF. It's a very fast Mandelbrot set generator supporting SSE, SSE2, 3DNow and GPU Vertex Programs. If I had known about it before, I would not have started developing my own fractal rendering engine :). Though, FFFF has not much features besides zooming and switching between different calculating engines, so if you decide to use it, you will need to add a Julia set generator and develop a more elaborate GUI. The AlgorithmHere is the algorithm for generating the Mandelbrot set in pseudocode: complex P, Z;
for each point P on the complex plane {
Z = 0;
for i = 0 to ITER {
if abs(Z) > 4.0
set point color to palette[i];
break the inner loop and go to the next point;
Z = Z * Z + P;
}
set point color to black;
}
For every pixel
Otherwise, if the loop has been executed for The inner loop repeatedly evaluates Using a palette is the easiest and the fastest way to color in the Mandelbrot set, but the resulting image will have only color_index = (i - log(log(abs(Z))) * 1.442695 + 2) / 252.0;
Coefficients 252.0 and 2.0 scale down the To translate the algorithm into C code, you have to remember the formulas for multiplying complex numbers and finding their absolute values (I hope you didn't miss math lessons at college :). void PaintMandelbrotFPU(int w, int h, long* b, double left, double top,
double right, double bottom) {
int x, y, i; double x2, y2, px, py, zx, zy, dx, dy;
dx = (right - left) / w; // Find horizontal and vertical increase for each pixel
dy = (bottom - top) / h;
py = top;
for(y = 0; y < h; ++y) {
px = left;
for(x = 0; x < w; ++x) {
zx = px, zy = py;
for(i = 0; i < ITER; ++i) {
x2 = zx * zx, y2 = zy * zy;
if(x2 + y2 > 4.0)
break;
zy = zx * zy * 2 + py; // Calculate z = z * z + p
zx = x2 - y2 + px;
}
*b++ = a[i];
px += dx; // Move to the next pixel
}
py += dy; // Move to the next line of pixels
}
}
The complex variable Note the integer variables Finally, When the inner loop is terminated using a // ...
for(i = 0; i < ITER; ++i) {
x2 = zx * zx, y2 = zy * zy;
if(x2 + y2 > 4.0) {
*b++ = a[i];
px += dx;
goto NEXTPIXEL;
}
zy = zx * zy * 2 + py; // Calculate z = z * z + p
zx = x2 - y2 + px;
}
*b++ = RGB(0,0,0);
px += dx; // Move to the next pixel
NEXTPIXEL:
// ...
The algorithm for generating a Julia set is very similar to the one for the Mandelbrot set, so I will not show it here. The main difference is the introduction of the complex parameter MOV-ADD methodI did a little profiling and found that the One of the best ways to do such optimizations is using SSE (Streaming SIMD Extensions). These instructions allow you to calculate four pixel values at the same time. So, at least in theory, the speed will be improved by a factor of 4 (the real figures are about 3.2-3.3). You can find more information about SSE in the Intel Architecture Software Developer's Manual, volume 1, chapter 10, and the AMD64 Architecture Programmer's Manual, vol. 1, chapter 4. In further explanations, I will assume you know the basics of SSE and x86 assembler programming. There are a lot of different ways to arrange SSE instructions for this function, and some of these ways are faster than others. Here is a very fast sequence for the Pentium IV processor. I call it the MOV-ADD method: ; xmm0 = zx xmm1 = zy xmm2 = tmp xmm3 = tmp
; xmm4 = px xmm5 = py xmm6 = result xmm7 = 4.0
; eax = tmp ecx = loop counter
MOVAPS xmm0,xmm4
XORPS xmm6,xmm6
MOVAPS xmm1,xmm5
mov ecx, ITER
ILOOP:
; xmm0 = zx xmm1 = zy
MOVAPS xmm2, xmm0
MULPS xmm0, xmm0
MOVAPS xmm3, xmm1
ADDPS xmm1, xmm1
; xmm0 = zx^2 xmm1 = 2 * zy xmm2 = zx xmm3 = zy
MULPS xmm1, xmm2
MOVAPS xmm2, xmm0
MULPS xmm3, xmm3
; xmm0 = zx^2 xmm1 = 2*zy*zx xmm2 = zx^2 xmm3 = zy^2
ADDPS xmm1, xmm5
SUBPS xmm0, xmm3
ADDPS xmm2, xmm3
; xmm0 = zx^2 - zy^2 xmm1=2*zy*zx+py xmm2 = zx^2 + zy^2 xmm3 = zy^2
CMPLEPS xmm2, xmm7
ADDPS xmm0, xmm4
MOVMSKPS eax, xmm2
test eax, eax
jz EXIT
ANDPS xmm2, xmm7 ; xmm6 += (xmm2 < 4.0) ? 4.0 : 0.0;
ADDPS xmm6, xmm2
sub ecx, 1
jnz ILOOP
EXIT:
Only the inner loop is shown here. By the start of the loop, The program starts by saving The register Now let's calculate the execution time for Pentium IV and find the bottlenecks in this code. The manual by Anger Fog describes how to perform such an analysis; if you are new to optimization for Pentium IV, please read the manual. See also official documents from Intel: Intel Architecture Optimization Reference Manual and Microarchitecture of the Pentium 4 Processor. Firstly, memory access is not a bottleneck here, because the calculation takes much more time than writing to memory. For the usual resolution 1024x768, the program writes 1024*768*4 bytes = 3 Mb. Typical throughput for modern DDR memory may be about 1-2 Gb/s. A rough estimation shows that the memory access will take 1.5-3 milliseconds, while the real execution time for this code is 70-80 ms. An analysis of the code gives the same result: memory is accessed only twice for 1 pixel (see the code in the next chapter), but the inner loop is executed That's why different tricks for optimizing cache performance are of no use for this code. The same for instruction decoding: the loops are so small that they fit into the trace cache entirely, and they are decoded only once. The real bottleneck is the long latencies and limited throughput of SSE instructions.
For simplicity sake, let's presume that the pipeline is initially empty, and the whole code is in the trace cache. Register names The first three instructions ( At the second clock cycle, the next three instructions go to the queue. From them, The The next instruction, Now, the The next five instructions go to the FP addition subunit (FP adder). All of them have a latency of 4 clock cycles. Our purpose is to complete the Time measurements I made confirmed these considerations. There is now a long delay between You see that the speed of the code depends on both the instruction latencies and the throughput, because there are three quite independent chains, which can execute at the same time. The critical path is marked with dark gray color. After drawing the chart, I ordered the instructions by their start time. For example, But what about branch prediction in this code? Can it be a bottleneck? In fact, the penalties for mispredicted branches take a lot of execution time for this program. But the number of complete iterations is the result we want to obtain. You can't perfectly predict how many times the loop will be executed, because this value is itself the result of the program. If you know exactly how many times the loop is repeated for each pixel, then why are you running this program? :) I've randomly selected two close-up pictures of the sets, then calculated the distribution of the iteration's number for them and for one large-scale image:
The blue bars denote the large-scale image of the Mandelbrot set; the red and yellow bars show two close-up pictures. From this analysis, you can see that if the loop is terminated, there is a high probability that it will be terminated early, at the minimal number of iterations. But if the loop is not terminated early, it usually will not be terminated at all. This statistics is interesting, but I couldn't make use of it in the code. May be, you, the reader, can do it. Any suggestions will be appreciated. Let's make some conclusions about this MOV-ADD method:
FFFF methodFFFF v3.2.2, the Mandelbrot set generator mentioned above, uses another method. FFFF is open-source, so it is fully legal to publish its code here: ; xmm0 = zx xmm1 = zy xmm2 = tmp xmm3 = tmp xmm4 = px
; xmm5 = py xmm6 = result xmm7 = 4.0
; eax = tmp ecx = loop counter
MOVAPS xmm0,xmm4
XORPS xmm6,xmm6
MOVAPS xmm1,xmm5
mov ecx, ITER
ILOOP:
; xmm0=zx xmm1=zy
MOVAPS xmm2, xmm0
MULPS xmm2, xmm1
MULPS xmm0, xmm0
MULPS xmm1, xmm1
; xmm0=zx^2 xmm1=zy^2 xmm2=zx*zy
addps xmm2, xmm2
movaps xmm3, xmm0
addps xmm3, xmm1
; xmm0=zx^2 xmm1=zy^2 xmm2=2*zx*zy xmm3=zx^2+zy^2
cmpltps xmm3, xmm7
movmskps eax, xmm3
test eax, eax
jz EXIT
subps xmm0, xmm1
movaps xmm1, xmm2
; xmm0=zx^2-zy^2 xmm1=2*zx*zy xmm3=zx^2+zy^2
ANDPS xmm3, xmm7 ; xmm6 += (xmm3<radius) ? 4.0 : 0.0
ADDPS xmm6, xmm3
ADDPS xmm0, xmm4
ADDPS xmm1, xmm5
dec ecx
jnz ILOOP
EXIT:
FFFF tries to compare The chart shows the ideal case without any delays. Even then There are additional latencies before In any other way, MOV-ADD is similar to FFFF. Both methods use SSE to calculate 4 pixel values together, both have the same limitation of precision and the same jerking effect when zooming. I used a similar algorithm, but arranged instructions differently to improve performance. Other distinctions between FFFF and my program will be shown later. In the FFFF code above, the register names are changed to the names used in MOV-ADD ( Vodnaya methodI also developed Vodnaya, a method optimized for Pentium M. The name is fully meaningless, so if it's hard for you to pronounce Russian words, you can just call it Vod-method or V-method ;). It outperforms both MOV-ADD and FFFF methods on Pentium M and Pentium III. Only a little is known about the Pentium M microarchitecture, so I was unable to do an analysis of latencies and throughput. After much trial and error, I found a sequence of instructions that seemed to be the fastest on these processors. The method is also a champion in the number of instructions: it is one instruction smaller than MOV-ADD or FFFF. MOVAPS xmm0,xmm4
XORPS xmm6,xmm6
MOVAPS xmm1,xmm5
mov ecx, ITER
ILOOP:
; xmm0 = zx xmm1 = zy
MOVAPS xmm2, xmm1
MULPS xmm2, xmm2
MULPS xmm1, xmm0
; xmm0 = zx xmm1 = zy*zx xmm2 = zy^2
MOVAPS xmm3, xmm2 ; save zy^2
MULPS xmm0, xmm0
ADDPS xmm2, xmm0
; xmm0 = zx^2 xmm1 = zy*zx xmm2 = zx^2+zy^2 xmm3 = zy^2
CMPLEPS xmm2, xmm7
ADDPS xmm1, xmm1
SUBPS xmm0, xmm3
; xmm0 = zx^2-zy^2 xmm1 = zy*zx*2 xmm2 = zx^2+zy^2
ADDPS xmm1, xmm5
ADDPS xmm0, xmm4
MOVMSKPS eax, xmm2
test eax, eax
jz EXIT
ANDPS xmm2, xmm7
ADDPS xmm6, xmm2
sub ecx, 1
jnz ILOOP
EXIT:
The The Great Battle: MOV-ADD vs. FFFF vs. VodnayaAnd now, the speed comparison of these three methods:
In the set of tests, the dimension of the image was 1024x768 pixels; the time was measured with the RDTSC instruction (see the source code in the SSEroutines_test.asm file). I repeated the tests five times and took the minimal result for each method. Times in the chart are expressed in millions of clock cycles. The lower times are better. After the names of processors, their CPUID signatures are shown. For example, Celeron F13 is a family 15, model 1, stepping 3 processor, namely a trimmed down Pentium IV. Its caches are smaller than those of Pentium IV, but the core is the same, that's why the results of this Celeron should be almost equal to the results of Pentium IV (sorry, I was unable to get Pentium IV for the tests). "Rabbit" means Julia set with C = -0.12+0.74i. Evidently, MOV-ADD is the best method for Pentium IV and Athlon XP processors. Vodnaya shows good results on Family 6 processors from Intel (in particular, Pentium III and Pentium M). FFFF is definitely a loser on every type of processor. The speed improvement of MOV-ADD in comparison to FFFF is about 12% on Pentium IV, and the difference between Vodnaya and FFFF is 7-11% for Pentium M and Pentium III. The readymade program available uses Vodnaya on family 6 processors from Intel and MOV-ADD method on other processors. The method is selected at run-time using the processor signature retrieved with the CPUID instruction. CvtTss2siAfter finishing the calculation, the program should translate the results into color values. The single line of code ( The program initializes the Note that the CVTTSS2SI eax, xmm6
; xmm6 contains an offset into palette, i.e. 4*number-of-iterations
mov eax, [esi + eax] ; palette is started at [esi]
mov [edi], eax ; the current pixel of the bitmap is at [edi]
The There are other instructions for converting floating-point numbers (namely, The Power of FASM MacrosWriting several variants of the same code for the Mandelbrot set and for the Julia set, for black and white image and for color image, for SSE and for SSE2, requires a lot of routine work. Thankfully, most assemblers have some special tools that can do this work for you. The tools are macroinstructions and conditional assembly. Macros let you repeat the sequence of instructions with different parameters, and conditional assembly allows inserting instructions depending on these parameters. I chose FASM 1.62 from several assemblers, because it is tiny, fast, and has a rich macro syntax. The FASM manual describes the syntax in detail. Here is a small piece from the Vodnaya method to illustrate the use of macros in my program: macro JuliaMandelPaint color, type {
; ...
ADDPS xmm1, xmm1
SUBPS xmm0, xmm3
if type = Julia
ADDPS xmm1, dqword[cy1]
ADDPS xmm0, dqword[cx1]
else if type = Mandel
ADDPS xmm1, xmm5
ADDPS xmm0, xmm4
end if
; ...
}
; later in the code:
mov eax, [fractaltype]
test eax, eax
jz MANDELBROT
JULIA:
JuliaMandelPaint 1, Julia
ret
MANDELBROT:
JuliaMandelPaint 1, Mandel
ret
If the argument You could say that there is an easier way to do this: mov eax, [fractaltype]
test eax, eax
jz MANDELBROT
ADDPS xmm1, dqword[cy1]
ADDPS xmm0, dqword[cx1]
jmp NEXT
MANDELBROT:
ADDPS xmm1, xmm5
ADDPS xmm0, xmm4
NEXT:
Why not use this code? Because the check is located in deeply nested loops, and the comparison will be repeated several million times, while it needs to be done only once. Needless to say about delays that can occur if the conditional jump will be mispredicted... So it's advantageous to move the comparison out of the loop. Macros help to write this shortly. The next interesting task is adapting the methods for SSE2. SSE2 instructions operate on two double-precision values in parallel. In other aspects, they have no difference from SSE. Even the latencies and throughputs for SSE2 are the same as for SSE. So you have to do a massive search and replace to change The idea is to create a macro called macro MOVAP a, b {
if ssewidth = 4
MOVAPS a, b
else
MOVAPD a, b
end if
}
macro ADDP a, b {
if ssewidth = 4
ADDPS a, b
else
ADDPD a, b
end if
}
macro MULP a, b {
if ssewidth = 4
MULPS a, b
else
MULPD a, b
end if
}
; another 2 pages of such clumsy code
; for every SSE/SSE2 instruction
Don't hurry to write this code! A much more elegant solution exists: macro defineinstr instr, [args] {
common
if ssewidth = 4
instr#S args
else
instr#D args
end if
}
macro MOVAP a, b {defineinstr MOVAP, a, b}
macro ADDP a, b {defineinstr ADDP, a, b}
macro MULP a, b {defineinstr MULP, a, b}
The macro You can make this code even shorter: macro defineinstr instr {
macro instr [args] \{
\common
if ssewidth = 4
instr#S args
else
instr#D args
end if
\}
}
defineinstr MOVAP
defineinstr ADDP
defineinstr MULP
There are two nested macros here. The outer macro, I hope you are convinced now that FASM macros are very powerful and graceful. If not, just try to make a C preprocessor macro with variable number of arguments and compare it with the FASM equivalent! Blitting Bitmaps with GDI and DirectDrawThe case WM_PAINT: { PAINTSTRUCT ps; HDC hdc = BeginPaint(hWnd, &ps); for(x = 0; x < w; ++x) for(y = 0; y < h; ++y) SetPixel(hdc, x,y, b[i]); EndPaint(hWnd, &ps); } Use this function only to set 1 or 2 pixels; it's too slow for large images. Here is a simplified version of my program which is much faster: int w, h; HBITMAP hBMP = 0, holdBMP; HDC hmem; BITMAPINFOHEADER bi = {sizeof(BITMAPINFOHEADER), 0, 0, 1, 32, BI_RGB}; void ReDraw() { PaintJuliaSSE(b, ceil(w, 4), h); SetDIBits(hmem, hBMP, 0, h, b, (BITMAPINFO*)&bi, DIB_RGB_COLORS); // Copy the bitmap to hmem } LRESULT CALLBACK WndProc(HWND hWnd,UINT msg,WPARAM wParam,LPARAM lParam) { switch(msg) { case WM_SIZE: { HDC hdc = GetDC(hWnd); if(hBMP) { DeleteObject(SelectObject(hmem, holdBMP)); DeleteDC(hmem); } w = LOWORD(lParam), bi.biWidth = ceil(w, 4), h = HIWORD(lParam), bi.biHeight = h; bi.biSizeImage = bi.biWidth * h * 4; b = (long*)realloc(b, bi.biSizeImage); hBMP = CreateCompatibleBitmap(hdc, ceil(w, 4), h); hmem = CreateCompatibleDC(hdc); holdBMP = (HBITMAP)SelectObject(hmem, hBMP); ReleaseDC(hWnd, hdc); ReDraw(); return 0; } case WM_PAINT: { PAINTSTRUCT ps; HDC hdc = BeginPaint(hWnd, &ps); BitBlt(hdc, 0, 0, w, h, hmem, 0, 0, SRCCOPY); // Copy hmem to the screen DrawPolyline(hdc); EndPaint(hWnd, &ps); return 0; } case WM_DESTROY: DeleteObject(SelectObject(hmem, holdBMP)); DeleteDC(hmem); PostQuitMessage(0); return 0; default: return DefWindowProc(hWnd,msg,wParam,lParam); } } You call the In response to The image of the fractal is stored in the memory DC, which exists during the whole lifetime of the program. Having this image, you can quickly draw the moving polyline by blitting the stored image and drawing the new position of the polyline above. This will cause a little flicker of the polyline, but it's tolerable. Adding the second memory DC and the second bitmap could remove the flicker completely, but then the program will be slow and memory-intensive. SSE instructions operate on 4 values in parallel, and all SSE methods shown above require the number of pixels in the line to be divisible by 4. That's why the width ( Besides GDI, the program also supports DirectDraw. I won't show the code here, because it's very long and boring (error handling takes a lot of space). In essence, it is very close to the GDI code. An off-screen surface is created; the fractal is rendered into it. On DirectDraw is supported for Windows 98 and higher. For the fans of Windows 95, I wrote a few lines of code so the program will gracefully degrade to GDI if DirectDraw is not available :). ConclusionMy program is a very fast Mandelbrot and Julia set generator using SSE/SSE2 and DirectDraw. I will not claim it's "the fastest fractal generator for Win32", as the author of FFFF did, but you see it's pretty fast and easy-to-use. Some points still need more work, for example:
So the code is good, but not perfect. Please add your comments and suggestions on how to improve it. You can use the techniques presented in this article in many different graphic applications. The analysis of latencies and throughputs helps you to find bottlenecks. Then you can exclude long-latency operations from the critical path, make several independent chains in your code, and increase instruction-level parallelism by reordering the instructions. Finally, the assembler macros free you from such dull work as rewriting the same code for SSE2. Credits and ThanksThanks to Daniele Paccaloni, the author of FFFF. Viewing the source of his program helped me to find suboptimal solutions in my own code and, generally, to make my program better. Thank you, Daniele! Thanks to my teacher, Tatyana Valenteenovna Korablina, for providing useful information about fractals and non-linear dynamics. I also feel grateful to people who helped me with testing: Anton Orlov, Larisa Alekseevna Morozova, and Victor "vito". Special thanks to Stas, who stayed at work on that Friday evening and let me use a lot of computers for tests.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||