12,291,230 members (56,362 online)
alternative version

33K views
27 bookmarked
Posted

# Sorting Huge Text Files

, 11 Aug 2014 CPOL
 Rate this:
Algorithm of sorting lines in text files size of which exceeds available memory

## Introduction

One interviewer asked me how I would sort lines in a 5GB text file, if I had only 1GB of RAM. The task seemed to me quite interesting and I couldn't resist from implementing it the next morning.

## Background

I wish I had some background on it, but I really don't know much about external sorting algorithms. I wrote this code mostly using intuition, so I would appreciate if you share your knowledge on more advanced techniques in comments.

## Algorithm

In order to sort the file, I first split it to smaller files. On the first iteration, I run through the input file and find all lines that start from the same character. If some resultant files are still larger than the amount of available memory, I split them by first two characters and so on. Then each smaller file is being sorted in memory and saved on the disk. The final stage is merging the sorted list of sorted files into a single file.

Example. Consider the following text file:

```apricot
apple
orange
milk
meat```

For simplicity let's assume that we run a very ancient computer and can't afford sorting files more than 10 bytes long. Our input file is larger, so we need to split it. We start the process by splitting by one character. So after the first step, we will have three files:

```a

apricot
apple
m

milk
meat
o

orange```

Files m and o can now be sorted in memory. However file a is still too large. So we need to split further.

```ap

apricot
apple
av

File av is less than ten bytes, however file ap is still large. So we split once again.

```apr

apricot
app

apple```

Now that we have five small sorted files instead of a single big one, we range them in order of their beginnings and merge them together, saving results into output file.

 app apple apr apricot av avocado m meat milk o orange

Looks good, however this algorithm has a flaw: consider that the input file contains five gigabytes of the same line repeated many times. It's easy to see that in this case the algorithm will be stuck in an endless loop trying to split this file over and over again. A similar problem is illustrated below. Consider we have the following `string`s and our memory is not sufficient to sort them.

```a
ab
abc
abcd
abcde
abcdef```

As they all start from '`a`', they all will be copied into the same chunk in the first iteration. In the second iteration, we are going to split line by first two characters, however line '`a`' consists of only one character! We'll face the same situation in each iteration.

I handle these two problems by separating `string`s smaller than current substring length into a special unsorted file. As we don't need to sort it, the file can be of any size. (If only case-sensitive sorting were supported, it wouldn't be necessary to save short lines into a file, but only calculate their number.)

Incidentally the algorithm is stable, i.e., it maintains the relative order of records with equal values.

## Using the Code

The class `HugeFileSort` contains the following properties, which specify how sorting will be performed:

• `MaxFileSize` - the maximum file size that can be sorted in-memory (100MB by default)
• `StringComparer `- comparer used for sorting (case-sensitive `CurrentCulture `by default)
• `Encoding `- the input file encoding (UTF8 by default)

The main method is called simply `Sort`. It accepts two `string`s: input and output file names. If size of input file is less than `MaxFileSize`, its content is simply loaded into memory and being sorted. Otherwise the procedure described above is being performed.

```/// <summary>
/// Performs sorting
/// </summary>
/// <param name="inputFileName">The name of the file that needs to be sorted
/// </param>
/// <param name="outputFileName">The name of the file in which
/// sorted lines from the <paramref name="inputFileName"/> should be saved
/// </param>
public void Sort(string inputFileName, string outputFileName)
{
chunks = new SortedDictionary<string, ChunkInfo>(Comparer);
var info = new FileInfo(inputFileName);
//If file size is less than maxFileSize simply sort its content in memory.
if (info.Length < maxFileSize)
SortFile(inputFileName, outputFileName);
//Otherwise create temp directory and split file into chunks,
//saving each chunk as a new file into this directory
else
{
var dir = new DirectoryInfo("tmp");
if (dir.Exists)
dir.Delete(true);
dir.Create();
SplitFile(inputFileName, 1);
Merge(outputFileName);
}
} ```

During the execution, temporary directory tmp is created in the current folder. For the sake of better display, final set of temporary files is not deleted and stays in the folder. In production code, please uncomment two lines in the `Merge `method.

```/// <summary>
/// Merges all chunks into one file
/// </summary>
private void Merge(string outputFileName)
{
using (var output = File.Create(outputFileName))
{
foreach (var name in chunks)
{
name.Value.Close();
if (name.Value.NoSortFileName != null)
{
CopyFile(name.Value.NoSortFileName, output);
//File.Delete(name.Value.NoSortFileName);
}
if (name.Value.FileName != null)
{
CopyFile(name.Value.FileName, output);
//File.Delete(name.Value.FileName);
}
}
}
} ```

The core of the algorithm is the splitting method.

```/// <summary>
/// Splits big file into some chunks by matching starting characters in each line
/// </summary>
/// <param name="inputFileName">Big file name</param>
/// <param name="chars">Number of starting characters to split by</param>
private void SplitFile(string inputFileName, int chars)
{
var files = new Dictionary<string, FileChunk>(Comparer);
using (var sr = new StreamReader(inputFileName, Encoding))
{
while (sr.Peek() >= 0)
{
//The length of the line is less than the current
//number of characters we split by
//In this cases we add the line to the non-sorted file
if (entry.Length < chars)
{
ChunkInfo nameInfo;
if (!chunks.TryGetValue(entry, out nameInfo))
}
//Otherwise we add the line to the file corresponding
//to the first char characters of the line
else
{
string start = entry.Substring(0, chars);
FileChunk sfi;
if (!files.TryGetValue(start, out sfi))
{
sfi = new FileChunk(Encoding);
}
sfi.Append(entry, Encoding);
}
}
}
//For each of the chunk we check if size of the chunk is
//still greater than the maxFileSize
foreach (var file in files)
{
file.Value.Close();
//If it is - split to smaller chunks
if (file.Value.Size > maxFileSize)
{
SplitFile(file.Value.FileName, chars + 1);
File.Delete(file.Value.FileName);
}
//Otherwise save it to the dictionary
else
{
SortFile(file.Value.FileName, file.Value.FileName);
ChunkInfo nameInfo;
if (!chunks.TryGetValue(file.Key, out nameInfo))
nameInfo.FileName = file.Value.FileName;
}
}
} ```

`FileChunk `and `ChunkInfo `are auxiliary nested classes. The former is the helper that corresponds to new files on each iteration and is used to write lines into files. The latter contains information about the data that will be merged into the resultant file. During the recursive work of the algorithm, the program populates the sorted dictionary that maps text line startings to the instances of `ChunkInfo`.

`SortedDictionary<string, ChunkInfo> chunks;`

`ChunkInfo `contains the following information:

• `FileName `- the name of the file that contains sorted lines starting with the given substring
• `NoSortFileName `- the name of the file that contains non-sorted lines equal to the given substring (may differ by case)

Both properties can be `null`.

The class also contains method `AddSmallString() `that writes a given `string `into file `NoSortFileName`.

Test application requires three command-line arguments: input file name, output file name and max file size in bytes. It performs case-insensitive sorting of UTF8 files.

## Limitations

If you want to break it, pass in a large file with little or no line ends. As the algorithm uses standard `TextReader `to read text line by line from the file, it is not designed to handle such input data.

## History

• 20th November, 2011: Initial post

## Share

No Biography provided

## You may also be interested in...

 First Prev Next
 Averbud The large social network averbud24-Nov-15 19:39 averbud 24-Nov-15 19:39
 external sorting Member 112003452-Nov-14 8:11 Member 11200345 2-Nov-14 8:11
 Another Alternative PizzaPie13-Aug-14 6:17 PizzaPie 13-Aug-14 6:17
 Re: Another Alternative ideafixxxer13-Aug-14 7:57 ideafixxxer 13-Aug-14 7:57
 Re: Another Alternative PizzaPie13-Aug-14 7:58 PizzaPie 13-Aug-14 7:58
 Re: Another Alternative ideafixxxer13-Aug-14 8:06 ideafixxxer 13-Aug-14 8:06
 Windows Sort Ctznkane52512-Aug-14 3:02 Ctznkane525 12-Aug-14 3:02
 Numerical Values? tmag8-Aug-14 19:14 tmag 8-Aug-14 19:14
 Re: Numerical Values? ideafixxxer11-Aug-14 5:05 ideafixxxer 11-Aug-14 5:05
 What if? Petro Protsyk23-Nov-11 1:50 Petro Protsyk 23-Nov-11 1:50
 Re: What if? ideafixxxer23-Nov-11 5:31 ideafixxxer 23-Nov-11 5:31
 I don't understand... Axel Rietschin21-Nov-11 10:54 Axel Rietschin 21-Nov-11 10:54
 Re: I don't understand... ideafixxxer22-Nov-11 10:14 ideafixxxer 22-Nov-11 10:14
 Re: I don't understand... Kevin Bewley12-Aug-14 3:34 Kevin Bewley 12-Aug-14 3:34
 Re: I don't understand... arussell7-Aug-15 11:57 arussell 7-Aug-15 11:57
 Please show and explain the code. PIEBALDconsult20-Nov-11 9:40 PIEBALDconsult 20-Nov-11 9:40
 Re: Please show and explain the code. ideafixxxer20-Nov-11 10:14 ideafixxxer 20-Nov-11 10:14
 Last Visit: 31-Dec-99 18:00     Last Update: 24-May-16 10:30 Refresh 1