Introduction

This article is a contribution to the Code Lean and Mean challenge. The competition presentation states: In days of yore, it was a badge of honour to be able to fit a full file manager, spreadsheet application, appointment reminder, and Tetris Easter Egg within a 32KB executable.

This executable is as much as 22.5 KB, and only computes the smallest differences between two files. Not much room left for the spreadsheet.

The Challenge

Create an application that will calculate and display the changes between two text files as fast as possible using the least amount of memory possible. Provide timing data and maximum memory use data.

Background

The diff utility is described here, the related LCS (Longest Common Subsequence) algorithm is very well detailed here. The main application of this algorithm is in DNA studies, with our lines only composed of one letter, but our files being several millions lines long.

Issues

The LCS algorithm requires each line in a file to be compared (for equality) with all lines in the other file. The only shortcut is to exclude matching lines at the beginning and end of both files.

The algorithm is quadratic, and requires to fully populate a table with one row per line in one file and one column per line in the other. If the files are around a hundred lines, that's 10000; for files around 3000 lines, 10 millions. Don't hope to quickly compare two translations of the Bible.

The processing time also grows quadratically with the file sizes. So there is no global design trade off between memory usage and processing speed, at least with this proven algorithm. Finding new ways, unexplored by several thousands of researchers world-wide is, except for some geniuses, kind of Adventures in Wonderland, so I will stick to this algorithm.

Design

The process is purely sequential:

Check the file names and start the timer.
Open and access the files.
Exclude the matching lines at the beginning and end (if any).
Index the remaining lines.
Fill up the LCS array with the result of line comparisons.
Extract the mismatching line indexes from the LCS array.
Output the results and stop the timer.
Report time and memory consumption.

The aim for each step is to be thrifty with memory, albeit efficient and quick in processing.

Working code

Data Design

The contents of the files are accessed through memory mapping, saving some space.

The line addresses in the memory mapping space are stored in two vectors.

CAtlFileMapping<CHAR> FileMaps[2]; // File memory mappings
UINT iFirstWorkLine = 0; // Pos of first processed line
vector<LPCSTR> Lines[2]; // Beginning of processed lines

The LCS data for each pair of lines is stored as:

struct LCSInfo
{
    SHORT size;  // LCS size
    bool bMatch; // lines equality
};

This is a trade-off: recording the matching status of the pair avoids multiple readings and comparisons of the same lines, but increases the memory footprint. Packing the bool in the short would divide the array footprint by two, but will increase, more than proportionally, the processing time for recording and comparing, and be quickly compensated by an increase of 41% of the file sizes.

The LCSInfo data is stored in a single vector with bi-dimensional access:

class LCSArray : public vector<LCSInfo>

{
public:
    reference operator()(UINT i, UINT j)
    {
        static const size_t s1 = Lines[1].size();
        ATLASSERT((i < Lines[0].size()) && (j < s1));  
        return operator[](s1 * i + j);    }
} LCS;

LCS is allocated once at its unmoving size.

Process Flow

Steps 1 & 2: Check the file names, start the timer, open and access the files.

At the end of step 1, a timer is initialized and the files are opened and mapped. On multi processor machines, the step is parallelized, which may be unnecessary in most cases.

Step 3: Exclude the matching lines at the beginning and end (if any).

Two parallel threads (if multiprocessing) look for the first mismatching char from the beginning and end of the files:

// Find matching lines at begin and end (multithreaded)
size_t 
    iFirstMismatch = 0, 
    iLastMismatch = 0;
#pragma omp parallel sections 
{
    #pragma omp section
    {
        // Find first mismatch
        PCHAR 
            pc0 = FileMaps[0],
            pc0Max = pc0 + FileMaps[0].GetMappingSize(),
            pc1 = FileMaps[1];
        for (; (*pc0 == *pc1) && (pc0 != pc0Max); ++pc0, ++pc1)
            if (*pc0 == '\n')
                ++iFirstWorkLine;
        // Adjust to line begining
        for (; pc <= pcLast ; ++pc)
            if (*pc == '\n')
                Lines[i].push_back(pc + 1);
    }
// ... backwards

If no mismatch is found, the files are identical and the program terminates.

Step 4: Index the remaining lines.

In each file, a thread (if multiprocessing) looks for the '\n' characters between the first and last mismatches, and stores the addresses in a Lines vector.

The vectors are first allocated for an approximate line length computed from the beginning matching lines, or 1024 lines, if none.

// Build line indexes (multithreaded)

#pragma omp parallel for
for (int i = 0; i < 2; ++i)

{
    PCHAR 
        pc = FileMaps[i] + iFirstMismatch,
        pcMax = FileMaps[i] + FileMaps[i].GetMappingSize(),
        pcLast = pcMax - iLastMismatch ;

    if (iFirstWorkLine)
        Lines[i].reserve(FileMaps[i].GetMappingSize() * iFirstWorkLine / 
            iFirstMismatch);
    else 
        Lines[i].reserve(1024);

    // Never accessed index 0
    Lines[i].push_back(NULL);
    // First mismatching line at index 1
    Lines[i].push_back(pc++);

    // Add line indexes including last mismatching
    for (; pc <= pcLast ; ++pc)
        if (*pc == '\n')
            Lines[i].push_back(pc + 1);
}

Step 5: Fill up the LCS array with the result of line comparisons.

The LCS array is allocated at its known size, and the line pairs are processed sequentially. From now, no parallelizing is possible as each step depends on its predecessor.

// Fill up LCS array
// Allocate and set to {0, false}
size_t sizeLCS = (Lines[0].size() + 1) * (Lines[1].size() + 1);
LCS.resize(sizeLCS);
// Start with first mismatching lines
UINT 
    iStart = 1, 
    jStart = 1;
// End before last indexed lines
const UINT 
    iEnd = Lines[0].size() - 2, 
    jEnd = Lines[1].size() - 2;
// http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
for (UINT i = iStart; i <= iEnd; ++i) 
    for (UINT j = jStart; j <= jEnd; ++j)
    {
        if (MatchLines(i, j))
            LCS(i, j).size = LCS(i - 1, j - 1).size + 1, 
            LCS(i, j).bMatch = true;
        else if (LCS(i - 1, j).size > LCS(i, j - 1).size) 
            LCS(i, j).size = LCS(i - 1, j).size;
        else 
            LCS(i, j).size = LCS(i, j - 1).size;
    }

The MatchLines() function is called there once on each line pair, and uses approximately one third of the processing time.

// Lines equality

inline bool MatchLines(UINT i, UINT j) 
{
     static const size_t 
        s0 = Lines[0].size() - 1,
        s1 = Lines[1].size() - 1;

    ATLASSERT((i < s0) && (j < s1));

    if (Lines[0][i+1] - Lines[0][i] != Lines[1][j+1] - Lines[1][j])
        return false;

    PCCH 
        ps0 = Lines[0][i], 
        ps1 = Lines[1][j];

    while (*ps0 == *ps1)
        if (*ps1 == '\n')
            return true;
        else
            ++ps0, ++ps1;

    return false;
}

If sizes are equals a char comparison is performed until a '\n' is found.

Step 6: Extract the mismatching line indexes from the LCS array.

The mismatching line indexes and types are extracted in reverse order from the LCS array and pushed to a Diffs stack.

// Extract diff lines from LCS array

UINT 
    y = iEnd,
    x = jEnd;

while ((y > 0) || (x > 0))
{
    if (LCS(y, x).bMatch)
       --y, --x;

    else if ((x > 0) && ((y == 0) || (LCS(y, x - 1).size > LCS(y - 1, x).size)))
        AddDiffLine(1, x, DIFF_ADDED), --x;
    
    else if ((y > 0) && ((x == 0) || (LCS(y, x - 1).size <= LCS(y - 1, x).size)))
        AddDiffLine(0, y, DIFF_REMOVED), --y;
 }

// Record number of diffs and output computing time
size_t nDiffLines = Diffs.size();
cout << endl << nDiffLines << " difference(s) found in " 
    << omp_get_wtime() - fstartTime << " second(s)." << endl;

At this step, all the computing is done, so the intermediate time is output. Less than half of the total execution time is spent.

Step 7: Output the results and stop the timer.

The result line indexes and types are popped from the Diffs stack and output through std::cout::operator<<() to stdout with a simple formatting. The timer is then stopped.

For a test with 21 lines, the output time is over the computing time.

Step 8: Report time and memory consumption.

The memory resource footprint is reasonable, the processing time is highly dependant on the system load, varying sometimes by a factor 2 in some seconds.

Conclusion

The small memory vs. high efficiency compromises are easily spotted on this simple linear process.

Nobody is perfect, cheers!

History

30 Aug 2009: Initial post.
3 Sep 2009: Update
- Correct handling of mismatches in first and last lines.
- Correct handling of insertion before first line.
- Various local optimzations.