External sorting of large data sets (and related algorithms like merge, join, group-by, and unique) is an important part of many data organization efforts, such as database construction, search engine indexing, and data-mining algorithms.
What can you do when you have to sort 5,000,000,000 (5*10^9) integers? (This is around 18GB of data when you store the integers in binary with 4 bytes per integer, or around 30GB if you store the data as text.)
- Use a database. Boring!
- Use a commercial product such as nsort from ordinal.com. Expensive!
- Use an open source tool such as the unix sort command. Uncertain!
- Roll your own and see how it performs. Fun!
I rolled my own external sort implementation in C#, but only compared it against the unix sort utility. Guess what: my implementation runs more than 6 times faster than unix sort. To be fair, there are research reports that show off fast sort implementations, but those are tuned to particular hardware and in general cannot be found online.
My goals:
- Show a program that sorts a large number of integers (but only integers) using disk.
- My program should be comparable to or better than unix sort in terms of efficiency. It turns out mine is around 6 times faster, which is a big gap when you measure a program's running time in hours.
- My program is written in C#
- Post my code
- Keep it short
Non-goals:
- Write the fastest possible sort program
- Write a complete solution, i.e. a program that works not only for integers but for any sort of input and that supports multiple options
- Evaluate and compare all the possible options for sorting large datasets
- Provide extensive discussion of my approach or others' approaches to external sorting
After implementing my program, I ran one large-scale experiment. I generated a text file of 5*10^9 integers. One needs approximately 18GB of space to store this much data, given that one spends 4 bytes per integer. When the data is stored as text, with one number per line, it takes either 28GB or 33GB depending on whether you use "\n" or "\r\n" to terminate the lines. I used two external USB hard drives with a transfer rate of 30-35MB per second from one to the other. (The results when using one USB hard drive were similar.) Given that the sorting algorithm requires two reads and two writes of the data-set, and assuming a transfer rate between 30 and 35MB per second, it should take, in the best case, between 49 and 58 minutes to sort the data-set for this experiment. I used the following formula for this calculation:
( 33.0 (*initial read*)
+ 18.0 (*sort chunks and write*)
+ 18.0 (*read sorted chunks and merge*)
+ 33.0 (*write sorted file*)
) * 1024.0 (*work in GB*)
/ ( 30.0 (*transfer rate MB/s*)
* 60.0 (*convert to minutes*));;
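The same back-of-the-envelope estimate can be written in C# (the 33GB/18GB figures are the text and binary sizes from above; 30 and 35 MB/s bracket the measured transfer rate of my drives):

```csharp
using System;

class TimeEstimate
{
    // Best-case sort time in minutes: two reads and two writes of the
    // data set, divided by the sequential transfer rate of the disk.
    public static double EstimateMinutes(double rateMBps)
    {
        double workGB = 33.0   // initial read (text input)
                      + 18.0   // sort chunks and write (binary)
                      + 18.0   // read sorted chunks and merge
                      + 33.0;  // write sorted file (text output)
        return workGB * 1024.0 / (rateMBps * 60.0);
    }

    static void Main()
    {
        Console.WriteLine("at 30 MB/s: {0:F0} min", EstimateMinutes(30.0)); // ~58
        Console.WriteLine("at 35 MB/s: {0:F0} min", EstimateMinutes(35.0)); // ~50
    }
}
```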
I split the original file into chunks of 200MB, which led to 96 files to merge. Overall, my program did not consume more than 1GB of memory. The timings of parts of my program are shown below:
Time to sort intermediate files: 00:28:34.9422320
Number of files to merge: 96
Time to merge intermediate files: 00:24:50.6632475
(When '\n' for end of line is replaced by '\r\n' it took 00:27:43.6421148)
Total Sort Time: 00:53:25.9329689
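For reference, the first phase (splitting the input into sorted chunk files) can be sketched as follows. This is a minimal illustration, not my actual implementation: the method names are invented, the chunk size is counted in numbers rather than megabytes, and there is no error handling.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ExternalSortSketch
{
    // Phase 1 of the external sort: read the text input in chunks that
    // fit in memory, sort each chunk, and write each sorted chunk to
    // its own temporary file. The returned list names the chunk files
    // that the merge phase then combines.
    public static List<string> SortChunks(string input, int chunkSize, string tmpDir)
    {
        var chunkFiles = new List<string>();
        using (var reader = new StreamReader(input))
        {
            var chunk = new List<int>(chunkSize);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                chunk.Add(int.Parse(line));
                if (chunk.Count == chunkSize)
                {
                    chunkFiles.Add(WriteSorted(chunk, tmpDir));
                    chunk.Clear();
                }
            }
            if (chunk.Count > 0)
                chunkFiles.Add(WriteSorted(chunk, tmpDir));
        }
        return chunkFiles;
    }

    // Sort one chunk in memory and write it out, one number per line.
    static string WriteSorted(List<int> chunk, string tmpDir)
    {
        chunk.Sort();
        string path = Path.Combine(tmpDir, Path.GetRandomFileName());
        File.WriteAllLines(path, chunk.Select(n => n.ToString()));
        return path;
    }
}
```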
I got the following timing for unix sort:
$ time sort -n --buffer-size=1000M --temporary-directory=/cygdrive/e/tmpSort/tmp/
real 422m35.296s (this is around 7 hours)
I monitored unix sort through Windows Performance Monitor and noticed that most of the time was spent on CPU. This is strange, since external sorting should be mostly IO-bound. As another baseline, I ran the wc (word count) command on the output file:
$ time wc -l output.txt
Word count spent most of its time doing IO, which is as expected.
One can observe from this experiment that one read of the data took 14 minutes, so one read plus one write (as in the initial sort pass, and again in the merge pass) took 28 = 2*14 minutes. It seems that one cannot sort faster than 4*14 minutes (around 56 min.) for my particular choice of hardware. If one used compressed files for input/output, or multiple non-USB disks, one could do better.
Everyone who implements an external sort program has a number of tricks. The most significant trick that I used was the way I performed the merge of multiple files. The standard approach in the literature is to use a priority queue. In my experience, the priority queue shows poor locality of reference and ends up slow. Instead, I used merging of streams, where each stream is represented as array+rest. Rest contains code to generate a new "array+rest". For more than two streams, I built a binary tree of merges up front and allocated all storage buffers at that point; no allocation of memory happens after that. A drawback of this strategy is the larger amount of memory used: to merge n files, where I use a buffer of k bytes for each file, I need log(n)*n*k bytes. Usually n <= 1024, so log(n) is at most 10.
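The post does not include the code for this trick, so here is a rough sketch of the array+rest idea in C#. The type and method names are my own, and for simplicity the streams here are backed by in-memory arrays rather than file buffers; the shape of the merge tree is what matters.

```csharp
using System;
using System.Collections.Generic;

// A stream is a pre-filled buffer (the "array") plus a Refill callback
// (the "rest") that overwrites the buffer with the next block and
// returns how many items it produced. Refill must keep returning 0 on
// every call once the stream is exhausted.
class BlockStream
{
    public int[] Buffer;
    public int Count;                 // valid items currently in Buffer
    public Func<int[], int> Refill;

    public BlockStream(int bufSize, Func<int[], int> refill)
    {
        Buffer = new int[bufSize];
        Refill = refill;
        Count = refill(Buffer);       // prime the buffer up front
    }
}

class MergeTree
{
    // Merge two block streams into one. The output stream's Refill
    // interleaves items from both inputs in sorted order. All buffers
    // are allocated here, up front; no allocation happens while merging.
    public static BlockStream Merge(BlockStream a, BlockStream b, int bufSize)
    {
        int ia = 0, ib = 0;           // read positions, captured by the closure
        return new BlockStream(bufSize, outBuf =>
        {
            int n = 0;
            while (n < outBuf.Length)
            {
                if (ia == a.Count) { ia = 0; a.Count = a.Refill(a.Buffer); }
                if (ib == b.Count) { ib = 0; b.Count = b.Refill(b.Buffer); }
                if (a.Count == 0 && b.Count == 0) break;   // both exhausted
                if (b.Count == 0 || (a.Count > 0 && a.Buffer[ia] <= b.Buffer[ib]))
                    outBuf[n++] = a.Buffer[ia++];
                else
                    outBuf[n++] = b.Buffer[ib++];
            }
            return n;
        });
    }

    // Build a binary tree of pairwise merges over all input streams and
    // return the root; pulling from the root drains the whole tree.
    public static BlockStream MergeAll(List<BlockStream> streams, int bufSize)
    {
        while (streams.Count > 1)
        {
            var next = new List<BlockStream>();
            for (int i = 0; i + 1 < streams.Count; i += 2)
                next.Add(Merge(streams[i], streams[i + 1], bufSize));
            if (streams.Count % 2 == 1)
                next.Add(streams[streams.Count - 1]);      // odd one carries over
            streams = next;
        }
        return streams[0];
    }
}
```

Each of the log(n) levels of the tree holds on the order of n*k buffered bytes at once, which is where the log(n)*n*k memory figure above comes from; the payoff is that every comparison touches only two sequentially-scanned buffers, unlike a priority queue.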