|
Hi PIEBALD,
PIEBALDconsult wrote: Have any of you done this?
no I have not done sieves for a long time. I have done some large prime stuff, where simple sieves are not appropriate.
Wheel sieves become effective only if you use several primes at once, so do not use just 2 and 3 to get 2 spokes out of 6; instead use say 2, 3, 5, 7, 11, 13, 17 and notice you will have a large number of spokes (equal to the product of the primes used, hence more than 1000) of which most don't need to be allocated.
The aim of the wheel stuff is to reduce memory needs, hence extending the possible range for a given
memory footprint; when implemented appropriately, I expect it to be somewhat faster than a regular sieve with the same range, since a number of divisions/remainder operations are avoided, at the expense of some bookkeeping code. Unfortunately the missing spokes correspond mopstly to the most obvious non-primes, those that are recognized easily since they fail one or more of the very early division tests (with low divisors such as 2, 3, 5...) , so it is essentially the cheapest non-primes that get skipped, resulting in a modest cycles gain; the memory gain however is significant.
One optimization aspect is about how to store the spokes; they should aim for a good locality of data
(hence reusing data that is already in cache), while not causing cache trashing, so if each spoke were an array, don't give array dimensions that are powers of 2, but rather go for some strange value (a prime is often a good choice for array sizes) in this respect.
Once you got the spoke logic correct and efficient, you can easily experiment and enlarge the collection of basic primes, yielding more spokes, and more cycles gained. You then can research the optimal set of spoke primes as a function of available memory...
Enjoy!
|
|
|
|
|
Luc Pattyn wrote: Once you got the spoke logic correct and efficient, you can easily experiment and enlarge the collection of basic primes
Right, that's where I'm at. And it looks like the tactic that seemed to work with six spokes won't work with thirty spokes so it's not a general solution.
|
|
|
|
|
PIEBALDconsult wrote: won't work with thirty spokes
So it seems a change of tactics is what you need.
|
|
|
|
|
Yes, it's time to use "the little grey cells".
|
|
|
|
|
Got it working* last night, with one through thirty spokes.
* Well, mostly working, when I ran it with thirty spokes up to 2^32 I got one more "prime" than I did with the old version, so I have to track that down.
P.S. The new one is correct, the old one missed 65537 -- SQRT(2^32)+1 -- so I need to fix the old one.
modified on Wednesday, November 26, 2008 3:29 PM
|
|
|
|
|
Hi,
1.
65537 is F4; failing to report it is something Fermat will not take lightly.
2.
The most popular bug in prime programs seems to be a "less than SQRT(n)" test ignoring rounding stuff;
I tend to make sure and do "less/equal 1+SQRT(n)" occasionally wasting some cycles but not generating false primes. But your mistake must have been something different, since you skipped a real prime.
3.
now may be the time to switch to long integers in order to extend the capabilities of your app.
After that Win64 and lots of memory may become relevant.
4.
By entering an article on the subject, you might be awarded a CodeProject Spokesperson Certificate...
|
|
|
|
|
I am stepping by two, I get to 65537 and stop the first loop but don't report it as a prime, then I begin the second loop by incrementing again. I'm not terribly happy with either solution I've tried.
I am using long integers, but until I got my wheel working I could only allocate enough memory for up to 2^32. I'll try for more later (overnight). 2^32 is taking about an hour and forty minutes (I think it's IO-bound).
Luc Pattyn wrote: Spokesperson
Not if they call it that.
Besides I don't want students to cheat by using my code.
|
|
|
|
|
PIEBALDconsult wrote: I think it's IO-bound
looking for primes is IO-bound?
Intel would better redirect its research efforts then.
|
|
|
|
|
|
OK,
tried a text file, took forever (I killed it after some 10 minutes, when it reached 4GB). It is both huge and slow, due to the string formatting stuff I guess.
tried a binary file, took 6 minutes all together for almost 2GB (storing more than 200 million longs); that seems acceptable (simplified model: 5 minutes calculation, then 1 minute writing, i.e. writing 30MB/sec; most primes are above SQRT and don't need much calculation).
|
|
|
|
|
I just finished a little experiment, calculating all primes < 2^32 in a simple Eratosthenes sieve. With long integers where appropriate, 1 bit per candidate (32 candidates in an int), half a gigabyte of working set, and without spokes, without threading, without assembly or SIMD instructions, it took less than 5 minutes (on a simple Intel T8300 at 2.4GHz) to find there are 203280221 primes, which got confirmed by http://www.psc-consulting.ca/fenske/cpjav18s.txt[^]. The only sophistication I applied was avoiding almost all divide/modulo operations.
And indeed listing them to a ListBox would dramatically increase the time (I did not wait for it).
With indices in C# being int (i.e. effectively 31-bit), I could theoretically stretch this to 2G of longs, or 128G candidates; obviously I would run out of physical RAM long before that, without spokes. With spokes, not sure yet how far I could go.
[ADDED] One factor strongly in favor of spoked sieves is marking the smallest primes is what takes the longest, since the smallest primes have the largest number of multiples that must get marked. [/ADDED]
|
|
|
|
|
Pentium 4, 3GHz, 1GB RAM
Using an array of ulong s, without the IO...
Thirty-spoke : 203280221 primes found in 00:08:25.7221373.
Implied Two-spoke: 203280221 primes found in 00:13:15.0680712.
Luc Pattyn wrote: The only sophistication I applied was avoiding almost all divide/modulo operations
I'll have to figure out how to do that. It's all these calculations that I was concerned about.
|
|
|
|
|
When you count primes, do you count them as you go (as I'm doing so far) or wait until the sieve is complete and then count them?
I'm considering changing my code to do all the sieving, then write the raw arrays to a file, and count or print them later.
I keep putting off working on threading.
|
|
|
|
|
Hi,
I do everything as I go, when the sieve is complete all has been done already, the only thing left is
to print a summary line (and close the output file, if any).
Threading is not helping that much, I got the code that fast that the only thing threading buys me
is having a second thread preparing the next buffer, shaving off some 30% of total elapsed time.
I see no way to gain more from threading than that; reason is in the end the problem is cache-bandwidth limited and not CPU-cycle limited.
|
|
|
|
|
Luc Pattyn wrote: shaving off some 30% of total elapsed time
I'd take that. With threading, mine was slower, I except because of all the thread creates.
I expect that I could create one or two threads that perform work as I provide it and otherwise sleep.
Luc Pattyn wrote: cache-bandwidth limited
Which I'm probably having too because I can't perform all the work on one spoke, then go to the next; I have to do a small amount of work on one, go to the next, and continue cycling through all the spokes until they're all done.
...
I just had a thought... I could copy a vertical slice of the table into a separate array and work on it with many fewer (1/64?) iterations through my List<Spoke>! I can also make that part of the process quicker; I just realized how badly I implemented it.
|
|
|
|
|
PIEBALDconsult wrote: I could create one or two threads that perform work as I provide it and otherwise sleep.
Yep, I have the main thread, one backgroundworker to do all the work, and one more thread to help the BGW; it gets created once, and two AutoResetEvents are needed to synchronize with the BGW.
PIEBALDconsult wrote: I can't perform all the work on one spoke, then go to the next; I have to do a small amount of work on one, go to the next,
Yes, that is what I meant long ago when I said I would try to merge the data of the spokes. The spokes theory makes a nice picture, but it does not really show in my code.
|
|
|
|
|
I don't know anything about Wheel Factorization, but I did my playing around with a massive bit map. I only used the odd bits so the least significant bit was 1, then 3, 5, .... I created byte bounded masks for each of the smaller primes, then masked out blocks of bits through the array. I saved the lowest primes on a file, then I scanned the array for the first bit still set, then masked bigger and bigger blocks of bits and masked some more. Save a bit, make a mask, mask bits in the array, then find a new bit.
I also saved the masks in a file. My massive bit map was 2GB representing numbers to 4GB. When I had done that, I recalculated the start for each mask for the next 2GB of bits, and continued.
I finally gave up and just found a source for the first 15 million primes.
Dave.
|
|
|
|
|
Member 4194593 wrote: to 4GB
That's 2^32, yes? How long did it take? I haven't gone beyond that yet because I want to improve the speed of what I have so far.
Member 4194593 wrote: I finally gave up and just found a source for the first 15 million primes.
It's the journey, not the destination.
|
|
|
|
|
It is an endless journey with many scenery routes. Not even sure what the destination is.
|
|
|
|
|
With many places to stop for wine and cheese.
I made some severe changes yesterday, but the worst of the calculations still need work.
Update:
Target=4294967296; Spokes=(30,8); Bytes=143165632; Elapsed=00:04:48.9588449; Primes=203280221
Target=17179869184; Spokes=(30,8); Bytes=572662336; Elapsed=00:19:56.7513899; Primes=762939111
(That's 2^34. This is decreased from thirty-one minutes yesterday.)
I still need to eliminate a div and a mod that get used a gazillion times.
modified on Sunday, November 30, 2008 11:44 AM
|
|
|
|
|
I have my Sieve working, I think. What I am looking for is the 4 or 5 primes
around the boundary of 2^(32*k) to validate the Sieve Blocks that I have
created - primes just less and just greater than 2^32, 2^64, 2^96, 2^128, 2^160,
2^192, 2^224, 2^256 and 2^288. I would like to know that my algorithm is correct
before I try to create 2048 (or more) Sieve Blocks.
Does anyone have such numbers gathering dust? The first 15 Million Primes do not
even come close to the first boundary, much less the 9th.
Prime number counts for each block: Millisecond timing
Sieve Block 0 = 203,280,220 primes. 4203 I/O 310344 Masking
Sieve Block 1 = 190,335,585 primes. 4390 I/O 512266 Masking
Sieve Block 2 = 186,011,076 primes. 7515 I/O 564594 Masking
Sieve Block 3 = 183,312,229 primes. 7468 I/O 651703 Masking
Sieve Block 4 = 181,354,450 primes. 6941 I/O 619688 Masking
Sieve Block 5 = 179,823,053 primes. 7722 I/O 832343 Masking
Sieve Block 6 = 178,574,138 primes. 5438 I/O 647093 Masking
Sieve Block 7 = 177,515,527 primes. 5765 I/O 871422 Masking
Sieve Block 8 = 176,604,424 primes. 5282 I/O 570749 Masking
50024 5580202
5630226 = 1:33:50.226
Dave.
|
|
|
|
|
I have just thought of a way to self check my calculations for determining the
correct bit to use as a start for a Sieve Block that is not Sieve Block 0. The
first Sieve Block, Sieve Block 0, is a lead pipe cinch. It IS correct. The
problem is whether or not the subsequent blocks are correct - whether or not all
bits have been examined to remove from the subsequent blocks, or whether or not
the correct starting bit was calculated for any Prime Number that is not in that
Sieve Block. My thought was to increase the size of the Sieve Block being
processed to twice the size of the original. Essentially, calculate the Sieve
for two blocks as if they were one. There would be no calculations to make for
switching from one Sieve Block to the next, any more than would need to be made
in passing from one half of a single Sieve Block to the second half of the
block, the data bits are contiguous. The original Sieve Block 0 would be copied
to a .SAV file. Once the new pair of Sieve Blocks has been calculated, compare
the new Sieve Block 0 to the saved copy, if they are the same, then continue
(they WILL be the same for Sieve Block 0), then save the new Sieve Block 0 as a
.SIV file, and save the second Sieve Block as a .SAV file (it is a valid Sieve
Block 1 since it was masked contiguously with Sieve Block 0). Now calculate the
start of Sieve Block 1, and create a new pair of Sieve Block 1 and Sieve Block 2
as was done for Sieve Block 0 and Sieve Block 1. Again compare Sieve Block 1
with the prior saved .SAV file, and if they match, save the new Sieve Block 1 as
a .SIV file, and Sieve Block 2 as a .SAV file. Each Sieve Block build builds a
correct block to validate the next block when it is created. Twice the work, but
you have validation. Actually, the actual masking of the bits is quite fast, the
reading and writing and calculations for the starting bit are the time killers.
One question. Would it be sufficient to take an MD5 checksum of the second
block, then an MD5 checksum of the next newly created block and compare the MD5
checksum to determine if valid? I think it wise to save a file of the MD5
checksums anyway to be able to validate the correctness of the blocks when read
back in. I would actually save them in a file instead of attaching them to the
end of the file data. The file data is sector and cluster bounded, to add an
extra DWORD would take extra sectors, whereas a file with MD5 checksums would
only appear once. I had planned to use 2 750 GB external hard drives to save the
files giving 1500 GB storage, and 6000 Sieve Blocks (1/4 GB per Sieve Block).
This would be all primes below 2^188000.
Does anyone see any problems with this approach?
Dave.
|
|
|
|
|
I did some research yesterday and came up with a banding technique, similar to what Dave describes. I currently scan [0,2^32) for primes in less than 4 minutes, using a small buffer only (less than 1MB). Apart from speed it has the advantage of being able to scan much larger ranges too. I think I'll turn it into a CP article in a few days/weeks, there still are some areas that need more investigation first.
|
|
|
|
|
I have it working for up to the first six primes, but it appears that using the first three offers the best performance (to 2^32):
C:\>Eratosthenes 4294967296 Zero
Target=4294967296; Spokes=(1,1); Bytes=(536870920,536870920); Elapsed=00:21:24.1512386; Primes=203280221
C:\>Eratosthenes 4294967296 One
Target=4294967296; Spokes=(2,1); Bytes=(268435464,268435464); Elapsed=00:09:32.2740256; Primes=203280221
C:\>Eratosthenes 4294967296 Two
Target=4294967296; Spokes=(6,2); Bytes=(89478488,178956976); Elapsed=00:05:51.1191606; Primes=203280221
C:\>Eratosthenes 4294967296 Three
Target=4294967296; Spokes=(30,8); Bytes=(17895704,143165632); Elapsed=00:04:53.7847739; Primes=203280221
C:\>Eratosthenes 4294967296 Four
Target=4294967296; Spokes=(210,48); Bytes=(2556536,122713728); Elapsed=00:07:46.0915766; Primes=203280221
C:\>Eratosthenes 4294967296 Five
Target=4294967296; Spokes=(2310,480); Bytes=(232416,111559680); Elapsed=00:41:44.5200097; Primes=203280221
And when I tried it using the first six it ran overnight before I killed it.
I still have more to do on it.
|
|
|
|
|
Hi,
I haven't done the actual spoke stuff yet, I decided to first optimize my "implied two-spoke" strategy as you called it; it is now below 1 minute for 2^32 (using longs and memory bands, so I can also search much larger ranges).
I am not sure how to interpret your "Bytes=(..,..)" stuff.
You having six-primed spokes run forever seems to indicate something is still not correct in your code.
I do expect spokes to improve performance a lot, so your results are a surprise to me. I plan to store all relevant spokes in a single array.
Greetings,
|
|
|
|
|