If your keys fit in memory fine, and sort fine, then that is about as fast as it's going to get. If the seek to pull data off the hard disk by key is the issue, then you need more RAM.
Think of it this way: you have the sorted keys, so your only challenge is to pull the records off disk, and that's going to be slow. I'd start by looking at why you have a 2 GB flat file in the first place, and why it needs to be sorted so quickly.
Did you understand Mr. Balkany's suggestion? If you can fit 1/16 of the data in RAM along with all the indices, then you should be able to process everything (after the sort) in 16 passes through the data file, with no random seeks.
Suppose, for example, that there are 16,000,000 records and you can hold an array recBuff of 1,000,000 records in RAM along with an array finalPos of 16,000,000 integers. First, fill in finalPos such that finalPos(0) says where record #0 in the original file should go, finalPos(1) says where record #1 should go, and so on. Given the sorted keys, this can be done in linear time.
Next, read through the entire source file; after reading record #n from the file, look at finalPos(n). If it's less than 1,000,000 then store the record in recBuff(finalPos(n)). Otherwise discard it. Once this is done, recBuff(0..999999) will hold the first million records. Write them to disk.
Now read through the source file again. This time, look for records where finalPos(n) is in the range 1,000,000 to 1,999,999 and store those records in recBuff(finalPos(n)-1000000). Once all records have been read, recBuff will hold the next million records. Write those to disk.
If recBuff and finalPos fit in RAM without swapping, the program should run very fast. Doubling the number of items in recBuff halves the number of passes and so roughly doubles the speed, provided it does not cause swapping. If it does cause swapping, performance will suffer badly.
If there are so many records that the finalPos array itself takes an excessive amount of space, a temp file could be created which interleaves the source data with the finalPos items (since finalPos is always read in order). That would free up more space for recBuff.
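To make the passes concrete, here is a minimal Python sketch of the scheme described above, assuming fixed-size records and an already-computed finalPos array; the record size, chunk size, and file handling are illustrative assumptions, not part of the original suggestion:

    # Multi-pass output: each pass re-reads the whole source file sequentially
    # and keeps only the records destined for the current output window.
    RECORD_SIZE = 128               # bytes per record (assumption)
    CHUNK = 1_000_000               # records that fit in RAM per pass (assumption)

    def multipass_write(src_path, dst_path, final_pos, n_records):
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            for lo in range(0, n_records, CHUNK):
                hi = min(lo + CHUNK, n_records)
                rec_buff = [None] * (hi - lo)        # recBuff for this pass
                src.seek(0)                          # sequential re-read, no random seeks
                for n in range(n_records):
                    rec = src.read(RECORD_SIZE)
                    p = final_pos[n]
                    if lo <= p < hi:                 # record belongs in this window
                        rec_buff[p - lo] = rec
                    # records outside the window are simply discarded this pass
                for rec in rec_buff:                 # window lo..hi-1, now in sorted order
                    dst.write(rec)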
First of all, I have to ask whether this is the algorithm called "bucket sort" / "radix sort", because you didn't mention any comparison operations.
The point I don't get is how one record is classified relative to the current recBuff() boundaries.
I will demonstrate my lack of understanding with an example.
Given an unsorted array: 5,3,2,9,1,8,2,4,7,2,6. Let's assume that my internal memory can only hold 3 values.
First, I read the entire array, and my goal is to classify 1,2,2 into the first recBuff(0..2). And that's my problem of understanding: how can I know that "3" belongs to the second recBuff(3..5)?
There are three instances of "2", so the array is not uniformly distributed.
How many records are there, how big are the keys, and how big are the records? Do you have one, two, or more disk drives available for processing?
If the keys are small enough (and there are few enough of them) that they can all fit into memory, you should start by sorting the keys (each one accompanied by an integer giving the location in the original file). Then proceed as I described.
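In case it isn't obvious how finalPos falls out of that, here is a rough Python sketch; extract_key, the fixed record size, and the file layout are assumptions for illustration only:

    # Build finalPos from sorted (key, original position) pairs.
    def build_final_pos(src_path, n_records, record_size, extract_key):
        keys = []
        with open(src_path, "rb") as src:
            for n in range(n_records):
                rec = src.read(record_size)
                keys.append((extract_key(rec), n))   # key plus its original location
        keys.sort()                                  # sort by key
        final_pos = [0] * n_records
        for rank, (_key, orig) in enumerate(keys):
            final_pos[orig] = rank                   # record orig belongs at position rank
        return final_pos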
If the number of records is so large that, say, only 10% of the keys will fit into memory, then I would suggest that you come up with some means of partitioning the keys into, say, 65,536 buckets. It doesn't matter whether the distribution is particularly even, provided that no single bucket holds more than 10%, and preferably no more than 2% or so. Make a pass through the file and count how many keys fall into each bucket.
Once that is done, count how many buckets you could take, starting at the bottom, before they total 10% of the records. Make a pass through the original file, reading into RAM all the records that fall into those buckets, then sort them in RAM and write them out. Then repeat the procedure, starting with the bucket after the last one that was used in the first pass; sort those records and write them out.
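A rough Python sketch of that two-phase idea, under the stated assumption that no single bucket exceeds the RAM budget; bucket_of, key_of, read_records, and write_out are placeholders I've invented for illustration:

    NUM_BUCKETS = 65_536

    def external_bucket_sort(read_records, bucket_of, key_of, write_out, ram_budget):
        # Counting pass: how many records land in each bucket of the key space.
        counts = [0] * NUM_BUCKETS
        for rec in read_records():
            counts[bucket_of(rec)] += 1

        # Repeatedly take the longest run of remaining buckets whose total count
        # fits in RAM, load just those records, sort them, and write them out.
        start = 0
        while start < NUM_BUCKETS:
            end, total = start, 0
            while end < NUM_BUCKETS and total + counts[end] <= ram_budget:
                total += counts[end]
                end += 1
            # Assumes no single bucket exceeds ram_budget, as noted above.
            batch = [rec for rec in read_records() if start <= bucket_of(rec) < end]
            batch.sort(key=key_of)                   # in-RAM sort of this key range
            write_out(batch)
            start = end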
The exact procedure you should use will vary depending upon what your data looks like and on the number of separate disks available. Nonetheless, the key observations are:
(1) it's often good to sort records containing just the key and a reference to the original record, since many more of those fit in RAM than full records would;
(2) it's better to think in terms of reading through a whole file, fetching some data into RAM and ignoring the rest, than in terms of grabbing lots of little pieces of data scattered through a file;
(3) though I haven't touched on this much, for really big jobs, having two or three hard drives will help a lot.
Looks like it was done using distributed computing.
Lol, how long will they need to read the number in the official presentation?
I'm working on a non-recursive implementation of a scapegoat tree (Scapegoat tree "partial" paper). Section 6.2 summarizes an implementation of a non-recursive rebalancing algorithm. I think I almost understand how I should use those two stacks, but my problem is how to "plug" the nodes into the right position (how should I tell whether a node is connected to another?). I think I have reduced the problem to determining the height of a node in the rebalanced tree from the weight of the whole tree and the position of the node in the inorder traversal. The height determination should be only O(1), because the whole rebalance should be O(n) in time and O(log n) in space.
Thanks in advance!