|
Any possibility of writing an article about it and posting to CP? I would be interested!
|
|
|
|
|
Yes, I plan to write at least one article on the subject; I still need to finish some research first, so it will be a few more weeks before anything shows though.
|
|
|
|
|
|
No! It'll be misused by students.
|
|
|
|
|
Potential misuse should not stop us now; it did not stop mankind from coming up with good and bad things such as fire, steam engines, cars, roads, arms, computers, Internet, and Windows.
I'll make the article long enough so none of the lazy students will ever read it.
|
|
|
|
|
That might work. It would be better in French, though.
|
|
|
|
|
PIEBALDconsult wrote: It would be better in French
The CP police won't allow it; they are very reluctant to accept articles and messages in which they don't understand most of the words.
modified on Wednesday, December 3, 2008 11:52 PM
|
|
|
|
|
Certain optimizations can be made to a classic or implied-two-spoke sieve that won't work for more spokes.
If it's hard-coded to only count the primes, it'll be quicker too.
For mine, I pass in a delegate of what to do with each prime, so I have a lot more method calls.
C:\>Eratosthenes 4294967296 Three
Target=4294967296; Spokes=(30,8); Bytes=(17895704,143165632); Elapsed=00:04:53.1354700; Primes=203280221
In that output, Spokes = (total spokes, allocated spokes) and Bytes = (bytes per array, total bytes allocated for arrays).
Luc Pattyn wrote: I do expect spokes to improve performance a lot
Try it, you'll see.
How would you cross off the 25?
1 7 13 19 25 31
5 11 17 23 29 35
This is an area that still needs work, and I think I may have an idea...
|
|
|
|
|
Hi,
thanks for the clarification.
PIEBALDconsult wrote: I pass in a delegate of what to do with each prime
I don't. I might do that when I am done; right now I do not want to wait for a machine to execute millions of delegates I do not really need. I call a non-virtual method for each prime; it displays a few of them and keeps track of the count.
PIEBALDconsult wrote: Try it, you'll see.
I will.
PIEBALDconsult wrote: How would you cross off the 25?
I am considering only one array containing the primes of all relevant spokes; I currently store 1, 3, 5, 7, 9, 11, 13, 15 etc because the odd one is the only relevant spoke in a two-spoke setup. In the six-spoke scenario that would be 1, 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, etc.
Therefore 25, just like any other non-prime, will be flagged as composite, every time it appears as a multiple of one of its factors; in this case that would be only once, and it would also be the first relevant multiple of prime 5.
|
|
|
|
|
Update:
I now have a Core 2 Quad and 4GB RAM (still running XP Pro), and after a few more tweaks I have:
C:\>eratosthenes 4294967296 Zero
Target=4294967296; Spokes=(1,1); Bytes=(536870920,536870920); Elapsed=00:11:24.2121472; Primes=203280221
C:\>eratosthenes 4294967296 One
Target=4294967296; Spokes=(2,1); Bytes=(268435464,268435464); Elapsed=00:05:14.2174818; Primes=203280221
C:\>eratosthenes 4294967296 Two
Target=4294967296; Spokes=(6,2); Bytes=(89478488,178956976); Elapsed=00:03:21.4457024; Primes=203280221
C:\>eratosthenes 4294967296 Three
Target=4294967296; Spokes=(30,8); Bytes=(17895704,143165632); Elapsed=00:02:59.1627191; Primes=203280221
C:\>eratosthenes 4294967296 Four
Target=4294967296; Spokes=(210,48); Bytes=(2556536,122713728); Elapsed=00:05:38.4098073; Primes=203280221
C:\>eratosthenes 4294967296 Five
Target=4294967296; Spokes=(2310,480); Bytes=(232416,111559680); Elapsed=00:36:54.0737311; Primes=203280221
Now I have to add threading... I'll be back.
Update to the update:
I tweaked a bit more (inlined some methods) so now I get:
C:\>eratosthenes 4294967296 Zero
Target=4294967296; Spokes=(1,1); Bytes=(536870920,536870920); Elapsed=00:12:00.3882952; Primes=203280221
C:\>eratosthenes 4294967296 One
Target=4294967296; Spokes=(2,1); Bytes=(268435464,268435464); Elapsed=00:05:24.4898381; Primes=203280221
C:\>eratosthenes 4294967296 Two
Target=4294967296; Spokes=(6,2); Bytes=(89478488,178956976); Elapsed=00:03:24.4878007; Primes=203280221
C:\>eratosthenes 4294967296 Three
Target=4294967296; Spokes=(30,8); Bytes=(17895704,143165632); Elapsed=00:02:46.4519220; Primes=203280221
C:\>eratosthenes 4294967296 Four
Target=4294967296; Spokes=(210,48); Bytes=(2556536,122713728); Elapsed=00:04:22.4536602; Primes=203280221
C:\>eratosthenes 4294967296 Five
Target=4294967296; Spokes=(2310,480); Bytes=(232416,111559680); Elapsed=00:24:59.2846800; Primes=203280221
C:\>eratosthenes 4294967296 Six
Target=4294967296; Spokes=(30030,5760); Bytes=(17880,102988800); Elapsed=06:30:34.0253234; Primes=203280221
Notice the first successful run of 2^32 with a six-prime wheel!
Further update:
C:\>eratosthenes 4294967296 Six
Target=4294967296; Spokes=(30030,5760); Bytes=(17880,102988800); Elapsed=05:53:56.1556709; Primes=203280221
I did try threading, but the results were awful; I'll try again later.
modified on Monday, January 12, 2009 10:05 AM
|
|
|
|
|
My requirement is: I will have billions of data records coming in per day. From those, I have to filter out some 1000-2000 records based on record names, which are stored in a filter.
My question is: what is the suitable data structure to use, and which search algorithm will meet this high-volume requirement?
Sanchit
|
|
|
|
|
Sanchit,
You need to be more specific. In what format are the input records or is this something which can be defined for this implementation? In what form are the record names in the filter and can this format be defined for this implementation?
Dave.
|
|
|
|
|
The most efficient data structure for lookups is the hash table.
Create a hash table whose size is a prime number roughly three times the number of names you're looking for, then insert the 1000-2000 names.
Then for each data record that comes in, look up the name in the hash table. Use a language like C++, which executes faster than C#.
|
|
|
|
|
You may also want to look at a radix tree:
http://en.wikipedia.org/wiki/Radix_tree[^]
It will save the time required to calculate a hash for each of the billions of records at the cost of a longer search per record.
Hash table:
- calc hash of record (slow)
- check table for hash value (quick)
- handle searching collisions (quick)
- if a matching hash is found in the table, do a full string compare between the record and the stored entry, since the hash algorithm can produce the same hash for different strings (slow)
Radix tree:
- search tree for record string (slow)
The radix should be faster. The hash function will likely have a similar cost to the radix tree search. It may even be worse as the hash looks at each character in the string whereas the radix search can fail out early.
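For illustration, here is a deliberately simplified character trie in C (an uncompressed cousin of the radix tree, and my own sketch rather than a real radix implementation, which would compress single-child chains). It shows the "fail out early" property: a lookup stops at the first character with no matching edge, without ever hashing the whole string:

```c
#include <stdlib.h>

/* One node per character; child[c] is the edge labelled c.
   (Nodes are never freed -- this is only a sketch.) */
struct node {
    struct node *child[256];
    int terminal;                 /* 1 if a stored name ends here */
};

static struct node *trie_new(void)
{
    return calloc(1, sizeof(struct node));
}

static void trie_insert(struct node *root, const char *s)
{
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (!root->child[c])
            root->child[c] = trie_new();
        root = root->child[c];
    }
    root->terminal = 1;
}

static int trie_contains(const struct node *root, const char *s)
{
    for (; *s; s++) {
        root = root->child[(unsigned char)*s];
        if (!root)
            return 0;             /* early exit: no name has this prefix */
    }
    return root->terminal;
}
```

Each lookup is at worst O(length of the name) and often exits well before the end.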
...cmk
The idea that I can be presented with a problem, set out to logically solve it with the tools at hand, and wind up with a program that could not be legally used because someone else followed the same logical steps some years ago and filed for a patent on it is horrifying.
- John Carmack
|
|
|
|
|
The initial post mentioned that the search would be for record names that were stored in a filter. If only a field of the record (the name) needed to be checked, wouldn't that reduce the effort to create a hash - in fact wouldn't a Radix search be faster? That is why I asked for more information about the input and filter format.
Dave.
|
|
|
|
|
Member 4194593 wrote: If only a field of the record (the name) needed to be checked,
I understood that each record would contain a (single) filter field; that's why I suggested a radix tree. A hash would likely be more appropriate if you wanted to match several fields.
Member 4194593 wrote: wouldn't that reduce the effort to create a hash
Yes. Sounds like you're in favour of using a hash table.
Member 4194593 wrote: in fact wouldn't a Radix search be faster?
Yes, based on my op. Sounds like you're in favour of using a radix tree.
Original post: 'based on record names'
I assume this to mean strings, in which case my op stands.
...cmk
The idea that I can be presented with a problem, set out to logically solve it with the tools at hand, and wind up with a program that could not be legally used because someone else followed the same logical steps some years ago and filed for a patent on it is horrifying.
- John Carmack
|
|
|
|
|
Good point. Some other pros and cons: A radix tree would require more memory accesses. But on the other hand, a radix tree would take less memory than a hash table, so would be more likely to fit entirely in a cache.
The hash-value calculation wouldn't necessarily be slow; it would be all integer operations on a key that could be in a single cache line. Also the number of collisions can be reduced in two ways:
1. Randomly-hashed values follow a Poisson distribution, so the expected number of collisions can be calculated for a given hash-table size. The table size can be increased to achieve an acceptable collision rate.
2. Assuming the list of names is fixed, a perfect hash function http://en.wikipedia.org/wiki/Perfect_hash_function[^] can be constructed, which will have no collisions.
|
|
|
|
|
If it's necessary to check billions of records per day, that sounds like enough computer work--regardless of algorithm--that profiling will be necessary to achieve optimal performance. Radix trees can be good if the records to be searched for differ from other records in the characters that are examined first. They can be bad if records often differ only in the characters that are examined much later.
With hash functions, there are often trade-offs involving the time required to compute a hash, the hash table size, and the frequency of hash collisions and false matches. Increasing the complexity of a hash function such that its cost grows by more than half the cost of a direct record comparison won't make much sense unless doing so saves an average of one direct record comparison for every two records. Having a hash table so large that every access entails a cache miss may slow things down more than the occasional hash collision or false match would.
One approach that may be helpful would be to compute a somewhat long hash of a record, hash that to produce an index into a hash table, and have the table itself store the 'long' hash in addition to the actual record. One could thus at very little expense test the long hash of the record to be checked against the long hash in the table. This would eliminate a large fraction of the direct record comparisons that would otherwise be needed.
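A minimal C sketch of that idea (the names here are my own, purely illustrative): each table slot stores the full "long" hash beside the record, so most mismatches are rejected by a single integer compare:

```c
#include <stdint.h>
#include <string.h>

/* The expensive full record comparison only runs when the stored
   long hash agrees with the probe's long hash. */
struct slot {
    uint64_t long_hash;       /* 'long' hash of the stored record */
    const char *record;       /* the stored record itself */
};

static int slot_matches(const struct slot *s, uint64_t h, const char *rec)
{
    if (s->long_hash != h)
        return 0;                          /* cheap reject */
    return strcmp(s->record, rec) == 0;    /* rare full comparison */
}
```

With a 64-bit long hash, false agreements are rare enough that direct record comparisons become a negligible fraction of the work.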
BTW, how fast are modern machines at 32x32 integer multiplication? If the strings are zero-padded as well as zero-terminated and multiplies are cheap, it might be good to do something like:
unsigned long *ptr;          /* points at the zero-padded, zero-terminated string */
unsigned long hash = 0, temp;
do
{
    temp = *ptr++;                   /* fetch 32 bits of string data */
    hash += hash >> 23;              /* fold the high bits back in */
    hash ^= temp;
    hash *= some_big_prime_number;   /* any large odd prime constant */
} while(temp);                       /* the terminating zero word ends the loop */
That would gobble up 32 bits of string data per loop iteration; I would expect the multiply, even if it takes multiple cycles, could be done concurrently with the next word fetch. Not the world's most sophisticated hash function by any stretch, but the execution time should be pretty minimal. If you want to use the hash function to yield a table index, use the upper bits. Taking the modulus of a prime number (distinct from the one used for the multiply) may be better, but modulus operations are more expensive than shifts.
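For the "use the upper bits" suggestion, one possible shape (assuming a 32-bit hash and a power-of-two table; TABLE_BITS is my own illustrative parameter, not from the post above):

```c
#include <stdint.h>

#define TABLE_BITS 16   /* table of 2^16 slots (assumed size) */

/* Index with the upper bits of the 32-bit hash, which the final
   multiply mixes most thoroughly. */
static uint32_t table_index(uint32_t hash)
{
    return hash >> (32 - TABLE_BITS);
}
```

A shift like this costs one cycle, versus the handful a prime modulus would take.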
|
|
|
|
|
Interesting points. Your comment about better performance if the records differ in the characters examined first reminded me of the ID3 algorithm http://en.wikipedia.org/wiki/ID3_algorithm[^]. Basically it builds a decision tree that first examines the attributes of a record that differ the most. This avoids wasting time examining attributes that don't differ much among the records.
Your point about cache misses is good; for this volume of data the best data structure is one that fits in the cache.
|
|
|
|
|
|
Vij@y wrote: anyone can convert this code to quick sort.
So convert it, then.
If you can't convert it, which part are you having problems with?
|
|
|
|
|
I need help from those of you who have experience with the wumpus problem.
I would like to ask: at each step, how should we choose which adjacent square to move to?
I know one method: choose the adjacent OK square (one without mummies or pits) that has been visited the fewest times. My teacher said that it is not optimal...
Would you mind helping me? Thanks in advance.
|
|
|
|
|
|
I just think that if you cannot help other people, you shouldn't treat them as a joke...
|
|
|
|
|
I think it was a way of indicating that you should look at other currently available sources before asking others to spend their time trying to help you. Did you Google "Wumpus"? There are many available references. If you have checked those and still do not have an answer, then ask for individual help, but maybe on a gaming forum.
|
|
|
|
|