 |
|
 |
Hey,
I downloaded the source but when i open it, antivir found a virus on binary file.
regards,
Cem
|
|
|
|
 |
|
 |
U right!
I checked it on the virustotal site.
Result: 18/ 44 (40.9%) (Trojan inside)
|
|
|
|
 |
|
 |
Virustotal reports "2 from 44" for fractalsSSE_src.zip (CAT-QuickHeal: (Suspicious) - DNAScan, Sophos: Sus/UnkPacker).
Assembly language programs often trigger false warnings in antivirus software. You can compile it yourself; the source code is included.
|
|
|
|
 |
|
 |
Sorry..
I tested version from 2005-11-29.
Can you check it?
|
|
|
|
 |
|
 |
Hi Peter Kankowski
I'm impressed by your knowledge in software programming and wonder if you could help me in writing/editing Fractal Dimension formula code for eSignal trading system. I'm from China where professional programmers are not widely available; that's why I need someone like you who's well-versed in JavaScript that helps me in formula writing. I can show you a sample chart with Fractal Dimension Indicator as an example. If you are willing to help, please reply by email: yiyi226688@yahoo.com Also, feel free to let me know if there's any charges involved in your service. Thanks very much.
Yi
|
|
|
|
 |
|
 |
In the file "SSEroutines.asm", the ConvertAndWrite macro has the line " endif ssewidth = 2", and unless your assembler has some special abilities, I'm pretty sure that should say elseif
|
|
|
|
 |
|
 |
If you just need to see the algorithm, but don't want to struggle through assembler and DirectDraw code, you can download the simplified program. Certainly, it's slower than the SSE code, but it's very easy to understand.
Dominique asked me to publish this simple program. Here is his e-mail to me (published with his permission):
Dominique wrote: Peter,
I just downloaded your fractal program from
http://www.codeproject.com/cpp/fractalssse.asp
I have managed to import it as a project to my Microsoft Visual C++ .NET version 7.0.9955 and compiled and linked it and it runs really fine.
I am studying the iteration of transendental functions in my PhD, so I look at iterating, for example, functions like f(z) = c exp(z) where c is the parameter.
I’d like to customize your code so it calculates julia sets for such functions, and I want to ask for your permission.
I have only a basic level of skill in C++ coding and no skill in assembler, but it looks like I might just need to alter the PaintJuliaFPU method in factals.cpp.
I had the intention of adding
#include <math.h>
to the tech.h file and using the exp function from math.h but I notice you have commented out the
#include <math.h>
line in tech.h. Can you remember if there was any particular reason for this?
Are you, in principle, prepared to let me make some modifications to my copy of your code?
I hope this email gets to you.
Dominique
Peter wrote:
Dominique,
Are you, in principle, prepared to let me make some modifications to my copy of your code?
Certainly, you can make any modifications. If you will distribute the modified version, credits would be appreciated (just mention me somewhere in your readme file or help file).
But it may be easier for you to start from small and simple code base. The program from my article is bloated with trivial details such as switching between SSE and FPU modes or creating DirectDraw surfaces. If you don’t need SSE or DirectDraw, you may found this small program useful. It has poor performance, but is easier to understand and modify.
#include line in tech.h. Can you remember if there was any particular reason for this?
The standard functions from math.h require startup code, which increases size of the executable by 27 Kb. I replaced these functions with their home-made analogs [atod(), ceil(), dtoi(), and so on] to keep the file size small. If you wish to exclude the startup code too, use intrinsic transcendental functions described in the article by Dierk “Chaos” Ohlerich. I included them in the small program mentioned above.
Peter Kankowski
|
|
|
|
 |
|
 |
Hi Peter,
As you may have already guessed I am new to assembler. I tried increasing the ITER value but the SSE routines don't seem to iterate for longer, I just get a peach colour inside the set. Any hints? Thanks.
The main methods for speeding up the calculation (apart from great assembler) are guessing and periodicity checking.
Guessing can either be implemented by a raster scan of successive refinement, boundary tracing or the Mariani/Silver algorithm. The guessing algorithms work on the principle that the iteration bands bounding the Mandelbrot set (and similar fractals) is continuous; thus if the points surrounding a point are all the same value then that point is the same value too. This can save a lot of time inside the set, where you would normally iterate to your maximum number of iterations.
The periods inside the set repeat themselves (become periodic); therefore if you can detect this repetiton then you can bail out of the loop early. Knuth has an algorithm that is good for this:
double ZxSave = 0.0;
double ZySave = 0.0;
int K = 1;
int L = 1;
bool PeriodicityTest(double Zx, double Zy)
{
if(Zx == ZxSave && Zy == ZySave)
{
return true;
}
else
{
K--;
if(K == 0)
{
ZxSave = Zx;
ZySave = Zy;
L *= 2;
K = L;
}
return false;
}
I found this site to be very useful when writing my own Mandelbrot program: http://www.mrob.com/pub/muency.html
zenzero
-- modified at 20:02 Sunday 27th August, 2006
|
|
|
|
 |
|
 |
No worries! I've found the ITER variable in the asm file.
|
|
|
|
 |
|
 |
zenzero,
thank you very much for ideas about pixel guessing and periodicity checking. I also found a lot of useful information on your site. May be, I will add these speed improvements to my program later (I'm too busy for that now, sorry).
Peter.
Peter Kankowski
|
|
|
|
 |
|
 |
Peter,
It's not my web site, I'm not that clever
The two speed up methods really come into their own at high iterations. 64 is a very low number of iterations, so you wouldn't see much (any?) improvement due to your loop being so efficient.
The inner loop in my program is just written in C++, and only does one pixel at a time, but it is not too slow. I think that it would be difficult to modify my code to work with 4 pixels at a time and to integrate it with the period checking. Still, it has given me something to think about!
zenzero
|
|
|
|
 |
|
 |
Hi,
I couldn't get this to compile on VS 2005. I receive the following error:
The system cannot find the path specified.
Project : error PRJ0019: A tool returned an error code from " "
I looked at the log file and it shows:
set include=c:\Program Files\fasm\INCLUDE
"c:\Program Files\fasm\FASM.EXE" "d:\Temp\Programming\FractalsSSE\SSEroutines.asm"
I have no fasm directory or FASM.EXE on my computer. What am I doing wrong? I am probably being very stupid, please save me from my ineptness.
zenzero
|
|
|
|
 |
|
 |
No worries, I have found FASM on the net. I did not realise it was a seperate program. Great article by the way!
|
|
|
|
 |
|
 |
Hi Peter,
with the help of another coder, I developed a benchmark using your algorithm. My base idea was to compare it to a straight forward FPU code and to see how good the different cores are performing.
Furthermore the other coder helped invoking the use of multi-threading, so we achieve more than 190% of a single threading version now. I use your MOV-ADD method for double precision SSE2, just changed some sourrounding stuff.
You can find it on:
http://www.mikusite.de/pages/x86.htm
Of course I gave you full credit in the Read_Me ! I hope it's okay in this way...
The interesting thing during developing the multi-threading version was how to set up the calculation for each core...we found that making each core calculating like 4 or 5 lines on screen ist the best...because of the non symetric style of most fractals it wasn't good to just divide the screen in two same halfs for calcualtions...
|
|
|
|
 |
|
 |
Hi, Michael,
thank you, your benchmark looks very promising!
I have no experience with dual core processors,
but it's interesting for me to see what you have
done. Here are my results:
Pentium M 715, 1.5 GHz
(family 6 model D stepping 8 revision C0 frequency 600-1500 MHz)
FPU 102.0 millions iterations/sec
SSE2 131.2 millions iterations/sec
Best wishes,
Peter Kankowski
|
|
|
|
 |
|
 |
Thanks for your results ! I will put them on my webpage soon.
As you can see in the results on my webpage, the parallelism
with a dual core is almost perfect, especially for AMD CPUs.
I also always relate the results to Iterations/MHz/Core
so that somebody can detect the performance of the actual core.
I still got to have a look at local/global variables but it
shouldn't have a big impact on the result. Also due to the
fast execution on fast CPUs, I will include a looping of the
whole benchmark for more comparable results.
The performance of the dual cores and the hyper-threading
CPUs is really impressing compared to other single cores and
the non hyper-threading ones...also the point, how bad the
FPU is running on P4 without hyper-threading.
|
|
|
|
 |
|
 |
It's me again...
If you look again on my webpage you might see an interesting result from the new leader from your algoritm...a taiwanese guy got hands on a Intel Core 2 Merom, it's more than 65% percent better performing that the old architecture per MHz !!! Well done Intel !!!
May be there could be even more of a gain if your special Pentium-M algoritm is used.
By the way...I will look myself also, but do you got any improvements ins mind using SSE3 or SSE4 for the algoritm itself ?
@EDIT: I looked at the SSE3 instructions (SSE4 isn't really out yet...), so it seems that there could be a little enhancement for moving data in registers and stuff like that, to avoid some of the 'shuf' commands, but for the parallel calculation the instruction seem not to help, more for combining the parallel data from the instructions...may be very good for other algoritms...but I don't see a big potential for the iteration loop...
-- modified at 17:59 Monday 15th May, 2006
|
|
|
|
 |
|
 |
Interesting article, thanks. I've played with this problem a bit and came up with the approach below. The main thing I concentrated on was removing all branching from the code. By using "cmove" and calculating the iterations in an XMM register it can all be done with only once branch decision, the main iteration loop that checks on the counters. The only penalty is that I used up all the XMM registers and so need to keep one contant ("4.0") im memory. It seems that there is little or no penalty for referencing constant source operands from memory in SSE instructions once they are in L0 cache. Does this match anyone elses experience? Any comments are appreciated, I'm still learning tricks for optimising on the Pentium. --arne Hope it's OK to post in "ATT" format, I use GCC. #### # Mandelbrot inner loop # Calculates two points at once with SSE2 # calling from C (GCC) as "int *counters=iterPoint(iterLim,xmmBuffer);" # Free for any use as long as this comment is included # Arne Thormodsen - April 2006 #### .data .p2align 6,0 #align at 2^6 for SSE2 and cache line COUNTER: .double 0.0,0.0 # return location for counters, actually used for 64 bit ints, not doubles FOUR: .double 4.0,4.0
.globl _iterPoint .def _iterPoint; .scl 2; .type 32; .endef .text _iterPoint: pushl %ebx # used to hold sse2 result compare pushl %ecx # holds the constant 1 movl 16(%esp), %eax # pointer to unaligned buffer passed from C that contains # Cim and Cre for two adjacent points movupd (%eax),%xmm2 # CRe movupd 0x10(%eax),%xmm3 # CIm # movl 12(%esp), %eax # get iteration limit movl $1, %ecx # constant 1, used a couple of places # initialize XMM registers pxor %xmm7, %xmm7 # counter for iterations = 0 movapd %xmm2, %xmm0 # =Cre movapd %xmm3, %xmm1 # =Cim call MAIN_LOOP movapd %xmm7, COUNTER # save iteration counters movl $COUNTER,%eax # return pointer to counters popl %ecx popl %ebx ret MAIN_LOOP: ### A^2 -> A, MAG^2(A)->M^2 ### movapd %xmm0, %xmm4 # t1=re movapd %xmm1, %xmm5 # t2=im mulpd %xmm0, %xmm0 # re(partial)=re*re mulpd %xmm1, %xmm5 # t2=im*im movapd %xmm0, %xmm6 # M^2(partial)=re*re addpd %xmm1, %xmm1 # im=2*im addpd %xmm5, %xmm6 # M^2=(re*re)+(im*im) mulpd %xmm4, %xmm1 # im=2*im*re subpd %xmm5, %xmm0 # re=(re*re)-(im*im) ### ### cmpltpd FOUR, %xmm6 # compare magnitude squared to 4.0 psubq %xmm6, %xmm7 # increment counters (note that -1 is put in for # "true" so sub increments counters) movmskpd %xmm6, %ebx # get compare flags andl %ebx, %ebx # =0? (i.e. both counters have stopped incrementing) cmove %ecx, %eax # if =0 set iter counter to 1, will bail at next decrement ### ### addpd %xmm2, %xmm0 # zre=(zre*zre)-(zim*zim)+cre addpd %xmm3, %xmm1 # zim=(2*zim*zre)+cim ### subl %ecx,%eax #decrement counter that is tracking both calcs ja MAIN_LOOP ret
|
|
|
|
 |
|
 |
Hi, Arne,
thank you for the interesting approach. I like your idea of using CMOVE instruction, but it has a long latency on Pentium IV (6 cycles). The test showed that the code with CMOVE is slower than the code with usual TEST-JZ. So CMOVE may be good, but not in this life and not with these processors...
Other comments:
* Yes, you are right: there's a very little penalty for accessing constants in L0 cache. Intel manual even recommends putting constants in memory, not in registers.
* PXOR is for XORing integer data; it should be replaced with XORPD in your code. Don't mix XMM instructions that operate with different data types, it may slow down the code.
And please, forget about AT&T syntax. It's almost unreadable! AFAIK, gcc supports Intel syntax too.
Good luck and thank you again!
Peter Kankowski
|
|
|
|
 |
|
 |
No doubt about it, the JZ version is about 4% faster than the CMOVE version. Thanks for the tip. As far as the code syntax, I guess I've just spent too much of my previous life as a UNIX hack. I'll check out the gcc options, it is a bit of a pain constantly transposing published code examples. --arne
|
|
|
|
 |
|
 |
Dear Peter,
Very nice article !!! As I'm a fan of fractals and currently
getting into x86 assembly language with FASM I just want
to ask you some further questions about it:
- As far as I can see you are calculating either 2 or
4 pixels at one time in your routines (single or double
precision SSE/SSE2. I just wonder if there isn't a huge
penalty if the neighbour pixels have very different
depth of iterations, what sometimes is true especially
for the spiral areas in a mandelbrot fractal...?
- if yes, it might be even usefull to use SSE2 to
improve speed for a single pixel, I think...(got to
try to implement some code on this...)
- When you go deeper and deeper in the fractal, is
single precision a problem, so that you got to do
double precision to get a proper result ?
Regards,
Michael
|
|
|
|
 |
|
 |
Hi, Michael,
thank you for your letter.
xKuemmelx wrote: As far as I can see you are calculating either 2 or
4 pixels at one time in your routines (single or double
precision SSE/SSE2. I just wonder if there isn't a huge
penalty if the neighbour pixels have very different
depth of iterations, what sometimes is true especially
for the spiral areas in a mandelbrot fractal...?
There is no problem, because operations are done in parallel.
Let's say you have 2 neighbouring pixels: the first
with 1 iteration, and the second with 5 iterations.
Imagine that one iteration takes 1 second (of course,
in the real program it takes less time). If you process
two pixels in parallel, you have to do 5 iterations
in 5 seconds. But it's the best you can do with this
two pixels, because processing them serially takes
1+5=6 seconds. Here is a simple diagram:
Parallel computation Serial computation
the first pixel A - - - - A
the second pixel B B B B B B B B B B
seconds 1 2 3 4 5 1 2 3 4 5 6
And why we can't reorder pixels to get use of 4 wasted
seconds (marked with - on the diagram)? Because it's too
complex. We don't know how much does the depth of iterations
vary between two pixels before we start calculating this
pixels.
xKuemmelx wrote: When you go deeper and deeper in the fractal, is
single precision a problem, so that you got to do
double precision to get a proper result ?
Yes, you have to do double precision in such cases.
Good luck,
Peter Kankowski
|
|
|
|
 |
|
 |
Hi Peter,
Thanks for your fast answer !
Peter Kankowski wrote: There is no problem, because operations are done in parallel.
Let's say you have 2 neighbouring pixels: the first
with 1 iteration, and the second with 5 iterations.
Imagine that one iteration takes 1 second (of course,
in the real program it takes less time). If you process
two pixels in parallel, you have to do 5 iterations
in 5 seconds. But it's the best you can do with this
two pixels, because processing them serially takes
1+5=6 seconds. Here is a simple diagram
Okay, that's clear. I forgot to stress out, that
I thought probably SSE2 code (for serial processing)
is slower compared to standard FPU (don't know if
it's true...), so that in the cases you showed the
overall performance done by standard FPU would be
faster than the parallel processing !?
Cheers,
Michael
|
|
|
|
 |
|
 |
In general, SSE is faster than FPU:
- SSE uses registers instead of stack, which is much easier for both programmer and processor.
- All SSE instructions have lower latencies, for example FADD takes 6 cycles to execute on Pentium IV model 3, while ADDPD or ADDSD takes only 5.
- Though, throughput for FADD is higher than for ADDPD on Pentium IV (throughputs for multiplication are equal). That's why Intel's Optimization Reference Manual says: "For applications with a large number of adds relative to the number of multiplies, x87 FPU may be a better choice". The algorithm for Mandelbrot set contains 3 multiplications and 5 additions, so, in theory, it may be the case. But in practice, SSE code for calculating Mandelbrot set is faster than FPU code, mostly because it processes pixels in parallel.
I will cite Intel's manual again: "Use scalar SSE/SSE2 unless you need an x87 feature. Most scalar SSE2 arithmetic operations have shorter latency then their x87 counterpart and they eliminate the overhead associated with the management of the x87 register stack". So SSE is usually faster than FPU even for scalar instructions (such as ADDSD or ADDSS versus the vector instructions ADDPD or ADDPS). That's why MS had included the option to generate scalar SSE code in their Visual C++ compiler (/arch:SSE2).
Also, there are rumors that FPU, MMX and 3DNow will not be supported in 64-bit Windows, because Windows will not save FPU stack when doing task switch (see http://en.wikipedia.org/wiki/Talk:AMD64#FPU.2FMMX_Registers). Note that I only heard the rumor, I have not 64-bit CPU nor 64-bit Windows and cannot confirm or disprove it. The latest information from Wikipedians is that the instructions will work, but none of Microsoft compilers will generate them. But you still may think that optimizing FPU code to death is not the best way to spend you time, because FPU instructions will work in 64-bit Windows, but they will be considered as a legacy and will not be officially supported.
So, thank you for interesting considerations, but the overall perfomance will never be higher for FPU code than for SSE code.
Peter
|
|
|
|
 |
|
 |
You indicate that the section of code from FFFF is Open Source. Most Open Source software is not in the public domain, requires crediting the author, and probably requires the inclusion of a specific copyright notice and license text. Who owns the copyright on that FFFF code? and what are the terms under which I can also reproduce it?
|
|
|
|
 |
|