Comments and Discussions
I downloaded the zip file but am unable to extract it. It says "the archive is either in unknown format or is damaged". Can you please upload the zip file once more so that I can use it?
Edited:
Sorry - there was something wrong with my machine, which made it download only part of the file. I am able to extract the zip properly. Thanks and sorry for any inconvenience caused.
modified 3-Aug-14 11:26am.
This is a great, inspiring article. I am very pleased with your good work; you have included some really helpful information. Keep it up.
Thanks 
You didn't show us the test cases you used. I was curious about the comparison with ttmath, especially why multiplication was so slow. First of all, you should give us code that we can compile. You are using some Windows-specific headers and macros, so it will not compile on other operating systems. I have prepared a little patch for your library:
http://pastebin.com/xb6GLgmG[^]
It changes BOOL to 'typedef bool', and similarly FALSE to 0 and TRUE to 1 (and the 'const' keyword should be used in some places). Now some tests can be made. I wrote code which:
- generates two random 200-digit values
- multiplies them 100000 times
- does the whole run 10 times (a rough sketch of such a harness follows the samples below)
sample for ttmath:
http://pastebin.com/fitYdCjA[^]
sample for ckrf:
http://pastebin.com/gZFi5HaE[^]
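Such a harness looks roughly like this (a sketch only, not the pastebin code; the CKRFloat.h header name and the string assignment are assumptions taken from snippets later in this thread, and the pastebin samples differ in details such as where the operands are assigned):

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include "CKRFloat.h"                                  // header name assumed

// Build a random decimal string with n digits and a non-zero leading digit.
static std::string random_digits(int n) {
    std::string s(1, (char)('1' + rand() % 9));
    while ((int)s.size() < n) s += (char)('0' + rand() % 10);
    return s;
}

int main() {
    for (int run = 0; run < 10; ++run) {               // repeat the whole test 10 times
        std::string sa = random_digits(200), sb = random_digits(200);
        CKRFloat a, b, c;
        a = sa.c_str();                                // parse the two random 200-digit values
        b = sb.c_str();
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 100000; ++i)               // multiply them 100000 times
            c = a * b;
        auto t1 = std::chrono::steady_clock::now();
        long long ms = (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        printf("time: %lld\n", ms);
    }
    return 0;
}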
I have compiled them in the following way:
$ clang++ -O2 -s -march=native -o mul_ttmath -I../ttmath mul_ttmath.cpp
$ clang++ -O2 -s -march=native -o mul_ckrf -I../ttmath mul_ckrf.cpp CKRFloat.cpp
on a 64-bit FreeBSD:
$ uname -srm
FreeBSD 10.0-CURRENT amd64
and the results are:
$ ./mul_ttmath
time: 406
$ ./mul_ckrf
time: 1333
So your library is about 3 times slower than ttmath at multiplication. If you disagree, then please give us:
1. the name of the compiler you have been using
2. the compile options
3. source code which we can compile and test
Hi and thanks for the reply.
1. Visual Studio 10
2. Default options, however with MASM enabled in order to be able to compile ttmath, since it contains assembler code
3. I would need some time to do this...
I was not aiming to create a competition between ttmath and CKRFloat; however, if you take the challenge and win by a factor of 3, I would gladly use your library instead of mine.
I had a look at your test case, and one main difference is that you are assigning the variables within the time measurement. It might be that the string parser of CKRFloat is very slow.
Instead, I run my test several times and note the measured time in order to pick the lowest value, to avoid delays caused by the operating system. At the end I print the result, a 400-digit float, and validate the result between the libraries.
Here is a snippet from my test code:
SYSTEMTIME st1, st2;              // Win32 timestamps, from <windows.h>
int i, nTest = 10000;
CKRFloat f_a, f_b, f_c;
f_a = szA;                        // parse the operand strings...
f_b = szB;                        // ...before the timed region starts
GetLocalTime(&st1);
for (i = 0; i < nTest; i++)
    f_c = f_a * f_b;
GetLocalTime(&st2);
Kind regards
Karl
"are assigning the variables within the time measurement"
Good point.
And it is important to run the test many times, since other programs or the operating system might compete for the CPU.
The time taken by a test run can be randomly longer than required if the CPU is given to another program, but it can never be shorter than what is required.
A fair test should therefore only count the lowest values from many runs.
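In code, that strategy amounts to something like this (a sketch; run_once is a placeholder for one timed pass of the benchmark):

#include <algorithm>
#include <chrono>

// Run the benchmark several times and keep the lowest wall-clock time:
// interference from other processes can only make a run slower, never faster.
template <class Benchmark>
double best_of(int runs, Benchmark run_once) {
    double best = 1e300;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        run_once();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return best;                  // milliseconds of the fastest run
}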
... yes, especially while the left mouse button is held down.
I would use the statistical average time and not the lowest [with priority maxed].
Trying to get the same conditions can be tricky.
Of course, you could boot to DOS and run it there.
Why use the average? You should never, by any chance, get a lower time value than what is actually needed to do the calculation; there is no magic...
Unless you write code that tries to use a cache, but that seems unreasonable when multiplying arbitrary values.
The average behavior of all three tests would be truer, i.e. closer to what another tester would get on another system. The absolute numbers you get are probably impossible to duplicate. You could include the standard deviation to highlight the very best and worst cases.
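For reference, reporting the average together with a spread could look like this (a sketch over a list of per-run timings in milliseconds):

#include <cmath>
#include <numeric>
#include <vector>

// Mean and standard deviation of per-run timings: the mean shows the typical
// case, the deviation shows how far the best and worst runs stray from it.
static void summarize(const std::vector<double>& times, double& mean, double& stddev) {
    mean = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
    double var = 0.0;
    for (size_t i = 0; i < times.size(); ++i)
        var += (times[i] - mean) * (times[i] - mean);
    stddev = std::sqrt(var / times.size());
}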
I do not agree. If it happens that one of the libraries is tested during heavy memory swapping, it would not be a fair test...
All libraries have to be tested in near-identical situations.
Taking the lowest ttmath values from a C64 and the lowest CKRFloat values from a Cray is simply unfair.
Your assumption that the lowest can't be undercut is false: it's all a matter of tweaks.
Have you tried porting to a GPU, or to SSE or MPI? Your program is noteworthy, but I'm surprised you don't know of Boost, BLAS, statistics, etc.
Welcome to the world of speed freaks.
JohnWallis42 wrote: Your assumptions that the lowest can't be underdone are very false: It's all a matter of tweaks.
Why? Can you explain this, or can you give me any reference to read?
This test case runs in one thread only, so I am sorry, I cannot understand why the lowest value should not be picked and the rest discarded...
It's the other running threads you're not measuring.
My impression is that you're running the program in a multitasking environment, along with other programs.
The time and CPU usage of other programs, mouse detection, timer updates, window repaints, etc., along with other programs such as Age of Empires, are hard to predict.
In Linux, when you ask for times it will give you three back: real time, user time and system time.
Hence, if you turn off all the games and very carefully, in a special sequence, shut down many of the background Windows tasks [you won't be able to shut down, switch user, etc. normally afterwards, but a reboot restores everything], your lowest program time will improve.
To save you all that trouble: run each program 10^6 times (enough to smooth out outliers) and compare the averages and standard deviations. Plot them, and note the machine make and model, OS build, and CPU make and clock speed (plus anything in the hardware that could be considered particular to your machine).
Try a DOS-only boot.
- and turn off your internet, unplug all USB devices AND use no mouse.
As I wrote in the article, after some runs I find, several times, a value for each test case that is never undercut. A test run can give times in milliseconds like:
32, 36, 14, 17, 18, 14, 14, 17, 65
I then assume 14 is the value when my test program gets all of the CPU, and if I run more tests I do not get less than 14. I also assume that the other values occur when my program is competing with other processes on my system, so I disregard them.
Are my assumptions correct? If not, why?
Whew ...
0. [you've found a lower boundary]
Regarding your scores: to be professional about it, you're going to have to include your PC details and how many times you ran the test.
Your timings, plotted against precision, may not behave as asymptotically linearly as you assume (this may not be the case, but you have run only one kind of test).
1. [observational]
Let's say you had a virus, and ... you reinstalled the basic stuff.
Your PC and compiling/testing environment would be the same, yet I suspect that under these circumstances your result would change for the better(!).
If that were the case, then the assumption is false.
2. [environmental factors]
The Boost, ttmath and whatever-else libraries usually run optimized according to the hardware's capabilities.
If you were to buy new hardware, it might affect the times asymmetrically.
In the wildest case, just changing some BIOS settings might change the scores. Try turning hyper-threading off; that will get you closer to a single-threaded world. And make sure your program, as well as ttmath etc., runs pinned to one CPU only.
3. [absolute vs relative]
Suppose another PC, say mine, runs the tests. The numbers will most likely be different. The difference could be scaled linearly or exponentially ... so relative scores might be better.
Rambling on to myself ...
The different cases, ttmath, CKRFloat, etc., could, depending on their internal logic paths, show different performance behaviours under random PC time fluctuations: each part of the logic will have varying performance depending on caching and the momentary state of the PC ...
--> Using an average would smooth out the curves.
--> I can only gather that, if you stay with your code long enough, you'll figure out how to clear the cache, prioritize sections and get the best case for all parts of your code (vector math, memory transfers, etc.), so that you reach the potential of having the average equal your lowest.
Do you have optimizations turned off for this test?
The compiler may optimize your code down to just one iteration of the loop if it detects that it is going to get the same result every time.
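A generic illustration of the pitfall and one guard against it (plain doubles here, not CKRFloat):

#include <cstdio>

int main() {
    volatile double a = 1.0000001;   // volatile input: reloaded every iteration,
    double b = 0.9999999, c = 0;     // so the multiply cannot be hoisted out
    for (int i = 0; i < 100000; ++i)
        c += a * b;
    printf("%f\n", c);               // using the sum keeps the loop from being deleted
    return 0;
}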
No, I have only default options in my project (except for MASM enabled, which is not default, in order to be able to compile ttmath).
If the compiler detected that the same operations are performed in every iteration, the test would not take longer when I increase the number of digits, which it does.
Hi again Member 9997896
What compiler do you use?
Can you please verify that KRFLOAT_TYPE is defined as a 64-bit signed integer, that KRFLOAT_DIGITS is 8 and that KRFLOAT_PARTMAX is 100000000?
I think you need to add
#define _WIN64
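A compile-time check along those lines could look like this (a sketch; it assumes the macros are visible after including the library header, whose exact name may differ, and requires C++11 for static_assert):

#include <type_traits>
#include "CKRFloat.h"   // header name assumed

static_assert(sizeof(KRFLOAT_TYPE) == 8 && std::is_signed<KRFLOAT_TYPE>::value,
              "KRFLOAT_TYPE should be a 64-bit signed integer");
static_assert(KRFLOAT_DIGITS == 8, "expected 8 decimal digits per part");
static_assert(KRFLOAT_PARTMAX == 100000000, "expected a part limit of 10^8");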
Kind regards
Karl
modified 21-Apr-13 17:41pm.
Karl Runmo wrote: What compiler do you use?
I have used clang 3.2:
$ clang++ -v
FreeBSD clang version 3.2 (tags/RELEASE_32/final 170710) 20121221
Target: x86_64-unknown-freebsd10.0
Thread model: posix
Karl Runmo wrote: I think you need to add
#define _WIN64
Good point; this speeds up ckrf, only I had to change __int64 to int64_t, as the former is Windows-specific. I prepared a new test where some 200-digit values are prepared first and then the multiplication is performed.
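That type change can be made portably with a small shim (a sketch; where exactly it belongs in CKRFloat's headers may differ):

#include <cstdint>

// clang/gcc do not provide Microsoft's __int64; map it to the standard type.
#ifndef _MSC_VER
#define __int64 int64_t
#endif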
ckrf:
http://pastebin.com/YrMRKYfK[^]
ttmath:
http://pastebin.com/n7PdU6FV[^]
Let us first try ckrf:
$ clang++ -O2 -s -march=native -o mul_ckrf mul_ckrf.cpp CKRFloat.cpp
$ ./mul_ckrf
time: 214
now ttmath -- the first test (the mantissa is just large enough to store 200 decimal digits):
$ clang++ -O2 -s -march=native -o mul_ttmath -I../ttmath mul_ttmath.cpp
$ ./mul_ttmath
mantissa size in the machine words: 11
time: 124
As you see, ttmath is faster, but the problem arises when we want all the digits from the multiplication. If we multiply a 200-digit number by a 200-digit number we can get a 400-digit value. In ttmath we can achieve this only by using a mantissa twice as large. So let us make a second test with ttmath where the mantissa is two times larger:
$ clang++ -O2 -s -march=native -o mul_ttmath -I../ttmath -DSECOND_TEST mul_ttmath.cpp
$ ./mul_ttmath
mantissa size in the machine words: 22
time: 382
And now ttmath is slower. It is slower because ttmath operates on the whole mantissa; it doesn't care about the real value of the mantissa, so even if the mantissa is equal to one, ttmath performs the same calculation as if all the digits were set. So in the above test ttmath is doing twice as many calculations as ckrf. Currently ttmath lacks variable-size mantissas/exponents, and if you need such a feature then ttmath is not a good choice.
Summarizing
===========
If you want to compare two libraries or programs for speed, you should first check what algorithms they use, especially their computational complexity:
https://en.wikipedia.org/wiki/Computational_complexity_theory[^]
Your library uses code like this for multiplication:
KRFLOAT_TYPE pResValues[KRFLOAT_ENTRIES+KRFLOAT_ENTRIES];
for (a = 0; a < m_nValues; a++) {
    for (b = 0; b < A.m_nValues; b++) {
        pResValues[a+b] += m_pValues[a] * A.m_pValues[b]; // one limb product per pair
    }
}
This is the schoolbook algorithm, which is O(n^2). For small vectors ttmath uses it too, but for larger vectors it switches to Karatsuba multiplication, which is O(n^(ln(3)/ln(2))) ≈ O(n^1.585):
https://en.wikipedia.org/wiki/Karatsuba_multiplication[^]
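To see what that gains, here is the identity behind Karatsuba applied to a single 64x64 -> 128-bit product split into 32-bit halves (a sketch only; ttmath applies the same recursion to whole limb vectors, and unsigned __int128 is a gcc/clang extension):

#include <cstdint>

typedef unsigned __int128 u128;

// With x = xh*2^32 + xl and y = yh*2^32 + yl, Karatsuba replaces the four
// half-products of schoolbook multiplication with three, using
//   xh*yl + xl*yh == (xh + xl)*(yh + yl) - xh*yh - xl*yl
u128 karatsuba_mul(uint64_t x, uint64_t y) {
    uint64_t xh = x >> 32, xl = x & 0xffffffffULL;
    uint64_t yh = y >> 32, yl = y & 0xffffffffULL;
    u128 hh  = (u128)xh * yh;                          // multiply #1
    u128 ll  = (u128)xl * yl;                          // multiply #2
    u128 mid = (u128)(xh + xl) * (yh + yl) - hh - ll;  // multiply #3
    return (hh << 64) + (mid << 32) + ll;              // recombine the pieces
}

Applied recursively to the halves of a long number, doing three multiplications instead of four at every level is exactly where the O(n^(ln(3)/ln(2))) bound comes from.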
So your library will never be as fast as ttmath, especially as you get more digits to multiply. You can test this yourself; try changing DIGITS to:
#define DIGITS 10000
You have to change KRFLOAT_ENTRIES as well:
#define KRFLOAT_ENTRIES 3000
and for ttmath you can set a 66560-bit mantissa (this stores more than 20000 decimal digits):
#define TTMATH_MANTISSA_SIZE_BITS (1040*64)
and you'll see that ttmath's Karatsuba multiplication outperforms your schoolbook multiplication (even though in this case ttmath is using a mantissa twice as large).
But the problem is when you really need a mantissa of variable size. In ttmath you can define some temporary objects, e.g.:
ttmath::Big<1, 10> small_number;
ttmath::Big<1, 100> medium_number;
ttmath::Big<1, 1000> large_number;
but this is only a workaround and is not convenient for programming.
You can give GMP a try (it has variable-size numbers and better multiplication algorithms than Karatsuba):
http://gmplib.org/[^]
and mpfr:
http://www.mpfr.org/[^]
GMP is treated as the standard for high-precision arithmetic, but I remember there were some problems compiling it on Windows in the past.
For Windows, MPIR (a fork of GMP) would probably be better, but I have never used it:
http://www.mpir.org/[^]
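For a quick feel of the GMP interface, a multiplication looks roughly like this (compile with -lgmp; this sketch uses the integer type mpz_t, while mpf_t and MPFR's mpfr_t provide the floating-point side):

#include <cstdio>
#include <gmp.h>

int main() {
    mpz_t a, b, c;
    // Short operands keep the sketch readable; 200-digit strings work the same way.
    mpz_init_set_str(a, "123456789012345678901234567890", 10);
    mpz_init_set_str(b, "987654321098765432109876543210", 10);
    mpz_init(c);
    mpz_mul(c, a, b);                 // full product, no digits truncated
    gmp_printf("%Zd\n", c);
    mpz_clear(a); mpz_clear(b); mpz_clear(c);
    return 0;
}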
Hi Tomasz
Thanks for your detailed reply.
You may use the function SetMaxSignificant and compare CKRFloat according to your first test, where ttmath is truncating the result to 200 digits. I tested this and CKRFloat is still twice as fast as ttmath, at least in my environment. I used SetMaxSignificant(TEST_SIZE+16) in order to get the same number of correct decimals as ttmath:
ttmath::Big<1,TEST_SIZE/18> a,b,c;
...
CKRFloat f_a, f_b, f_c;
f_a.SetMaxSignificant(TEST_SIZE+16);
My point is that it is most important to keep the code in the inner loop as small as possible, and as you showed, I have only one line of code there.
Even though it would be possible to pack more decimals per integer and have shorter multiplication loops, that would require overflow handling inside the loops and in the end be slower.
I assume you would be able to improve ttmath if its inner loop could also be reduced to one single line.
Is it even possible to make O(n^2) multiplication faster than this, up to a couple of thousand decimals (which is well enough for rendering fractals)?
Kind regards Karl
modified 22-Apr-13 7:19am.
Thanks
Where can I download this library?