Content
This is a candidate entry for The Code Project's Lean and Mean competition to write an efficient text diffing tool.
The problem can be described like this: given two text files file1 and file2, where file2 is supposed
to be a minor revision of file1, create a program that generates and displays a description of the differences,
i.e. produces instructions for how to turn file1 into file2, and do so in the most economical way both for
elapsed time and memory usage (specified as the amount of data needed on top of whatever .NET needs to get going).
My solution is based on the LCS algorithm, augmented with two personal modifications, and some mean and lean programming. The resulting program is compact and very fast, and scales perfectly with any file size.
For simplicity's sake, we will perform a line-by-line comparison, and represent text lines by a single letter; equal letters for equal lines, different letters for different lines.
A simple approach would be to start reading both files, and comparing the lines consecutively; as long as the lines in both files are identical, we continue without any problem. However as soon as there is a mismatch, we are in trouble as we have to decide whether the current line in file1 has disappeared, or the current line in file2 has been inserted, or the one line has been edited to become the other.
An example:
file1 = A B C D E F
file2 = A B X Y D E F Z
We recognize A=A, B=B, then reach C and X; now we should decide whether C has been removed, or X has been inserted, or C has been modified to become X (which can be achieved by removing C and inserting X). Depending on the algorithm used, if we make an unfortunate decision we might never get in sync again, and completely miss the commonality around DEF in both files.
In the earlier example the common parts of file1 and file2 (present in both files in the same order) would be "ABDEF":
file1 = A B C D E F
file2 = A B X Y D E F Z
LCS = AB DEF
edit = AB -C +X +Y DEF +Z
so the edits could be "AB -C +X +Y DEF +Z", which should be understood as: start from file1, copy lines as long as there are no plus or minus signs; remove the line after a minus sign, and insert the line after the plus sign. Obviously there can be multiple equivalent solutions, again in the example the sequence -C +X +Y can be rearranged in several ways, as long as X comes before Y.
We must conclude simply scanning the files once, without looking ahead, is bound to fail. And it gets worse, much worse, when there are duplicate lines, as in the next example:
file3 = ABACADA
file4 = AAAAACADAEA
If we match the first four A's from file4 with all the A's from file3, we would run out of A's before reaching the C, and so we fail to discover the common "CADA" sequence. So a greedy algorithm will not necessarily generate the best solution.
The problem isn't new of course, it has been studied over and over. Maybe the best known algorithm in this domain is the LCS algorithm, where LCS stands for "Longest Common Subsequence". I'm not going to repeat the theory here, it is well documented in many places.
The LCS algorithm uses a two-dimensional matrix, with N1 rows and N2 columns, where N1 and N2 are the number of lines in the files, so it could be quite huge. Basically the matrix is a playground where one has to move from the top left corner (the beginning of both files) to the bottom right corner (the end of both files) by stepping either one to the right, one down, or one step diagonally (down and right) corresponding to a removal, an insertion, and a copy operation respectively. Of course one can only step diagonally ("copy") if the lines in both files at that location are identical. The N1*N2 matrix is calculated based on N1*N2 string compares, in the right order, and accumulating intermediate results.
Once the matrix is filled, one needs to trace a way back from bottom left to top right according to some rules, and one takes notes of the path followed. Then these notes are played again in reverse order, resulting in a string similar to the
edit = AB -C +X +Y DEF +Z
that we saw earlier.
[New paragraph since V1.1]
And then there is a famous
article by Eugene W.Myers entitled "An O(ND) Difference Algorithm and Its Variations". I haven't studied it
in any detail yet; on a first glance it looked very thorough but quite some overkill for the problem at hand.
The solution to the "mean and lean" challenge should be very economical (both in CPU cycles and memory bytes) and scale well, so we will try and avoid:
- loading all the text into memory at once;
- using a 2-D matrix with dimensions that grow with the file sizes (typical for algorithms such as LCS).
Furthermore for maximum performance we will try and:
- read the files only once;
- use a "sliding window" technique, i.e. base our decisions on partial file content (the lines around the current position), rather than full file content (which we never possess due to our choices #1 and #3)
- and output results as we go.
So the first challenge is to derive something like a "local LCS algorithm". It constitutes major improvement #1. Here is how it goes:
- we fix a "window size" (or WS for short), that is the (maximum) number of lines we are willing to store for each of the files;
- we set up a WS*WS matrix exactly like the LCS algorithm would do (there will be a twist later);
- we trace back from bottom left to top right, note the path, then execute it, i.e. we generate edit instructions, but only for as long as there are still matches (i.e. diagonal steps). Once the last match has been reached, rather than emitting the remaining removals and insertions, we advance in both files.
The above is bound to work rather well, as long as the distance between matches never exceeds WS, the chosen window size. The "rather" means, if only one match were present in the matrix, we wouldn't be considering alternative matching, hence the file3-file4 example would suffer (that would require a window size of 2!). We remedy most of the locality problems by choosing a window size twice as high as the tallest difference we expect to encounter.
To summarize: we start reading WS lines from both files, apply local LCS on it, and generate the corresponding edit instructions; however, as soon as we reach the last match in the current pair of windows, we advance in both files until we again have WS lines from both, starting were we stopped emitting instructions. And we repeat until all file data has been consumed, at which point we do not stop on the last match, we continue with the extra removals and insertions as dictated by the file data.
Note: there is an obvious refinement that we must include; when the two windows start with several matching lines, we call those a match immediately, and move on, as it does not make sense to calculate LCS matrix data for those lines. It corresponds to directly accepting "AB" as the start of the edit instructions in our very first example.
There is something silly going on in regular LCS: all implementations I have seen (well, that is only a couple), build the matrix from top left to bottom right, then trace back to the top left taking notes, then work on those notes (which corresponds to another top left to bottom right journey). So in a way the matrix is traversed three times.
Obviously a file difference algorithm is basically symmetrical, you could as well start at the end of the files and work your way up to their beginning. The net result might be different (i.e. non-identical), but it would be equally economical, since there are many good solutions (remember the permutation choices we had in "-C +X +Y").
So the second improvement consists of building the LCS matrix from bottom right to top left (again, since we use sliding windows, it is from bottom of window to top of window); then we "trace back" from top to bottom, however we don't need to store anything now, as we can generate editing instructions right away, in the correct order. Obviously this is bound to save some storage, some cycles, and some code.
I basically have created three classes:
- a Document class, representing a text file, and implementing a list of text lines with a limited length;
a Document can be indexed from 0 to WS-1, and yields the text lines visible within the sliding window.
So it basically is a generic List of strings, knowing how to fill itself by reading a file.
- an LPTextFileDiff class, implementing the actual file comparison;
- an LPTextFileDiffTester class, providing the main() method for a console app that we need to test it all for correctness and performance.
The source for the entire app consists of only two files: one for Document and LPTextFileDiff
(about 360 lines of code); and one for LPTextFileDiffTester (about 140 lines of code).
The coding style is pretty standard for my doing, avoiding unnecessary spaces and most of the lines that are either empty or would just contain a single opening curly brace line. I prefer to see the largest part of the code possible on a single screen.
Here are some coding details, some of which could be considered slightly mean:
Contrary to my normal behavior, I did not go as far as smashing the two-dimensional arrays into one-dimensional ones; and neither did I opt for using pointers. I was confident the sliding window and the upside-down LCS techniques would perform pretty well already.
The Document class is very simple:
public class Document : IDisposable {
private int windowSize;
private static int counter=0;
private int docNumber;
private StreamReader sr;
private List<string> list;
private int lineNumber=0;
public Document(int windowSize, string filename) {
this.windowSize=windowSize;
docNumber=++counter;
sr=File.OpenText(filename);
list=new List<string>(windowSize);
}
public bool ReadMoreLines() {
bool someLinesAdded=false;
while (list.Count<windowSize) {
string s=sr.ReadLine();
if (s==null) break;
list.Add(s);
someLinesAdded=true;
}
return someLinesAdded;
}
public int Count { get { return list.Count; } }
public string this[int index] { get { return list[index]; } }
public string DropLine(out int lineNumber) {
string text=list[0];
list.RemoveAt(0);
lineNumber=++this.lineNumber;
return text;
}
public void Dispose() {
if (sr!=null) sr.Close();
}
}</string></string>
The LPTestFileDiff class holds one big constructor, no public methods, and some small private methods that generate output; The small ones are pretty obvious, and can be found in the download. The big chunk of code however is in the public constructor. There basically is a major for loop, that advances through the files, then some code to refill the sliding windows, then the LCS construction, the "back tracing", and the generation of edit instructions.
public class LPTextFileDiff {
private StreamWriter streamWriter;
private int lastReport;
private bool logMatchingLines;
private int linesMatched;
private int linesRemoved;
private int linesInserted;
public LPTextFileDiff(string file1, string file2, int windowSize, StreamWriter streamWriter, bool logMatchingLines) {
this.streamWriter=streamWriter;
this.logMatchingLines=logMatchingLines;
output2("File1="+file1);
output2("File2="+file2);
output2("window size="+windowSize+" lines");
using (Document doc1=new Document(windowSize, file1), doc2=new Document(windowSize, file2)) {
bool slidingWindowTooSmall=false;
for (int loop=1; ; loop++) {
bool moreDataLoaded1=doc1.ReadMoreLines();
bool moreDataLoaded2=doc2.ReadMoreLines();
bool moreDataLoaded=moreDataLoaded1 || moreDataLoaded2;
details("loop="+loop+" more="+moreDataLoaded1+" "+moreDataLoaded2);
if ((loop & 0x3)==0) GC.Collect();
int NW1=doc1.Count;
int NW2=doc2.Count;
if (NW1+NW2<=0) break;
bool matchedSome=false;
while (doc1.Count>=2 && doc2.Count>=2 && doc1[0]==doc2[0] && doc1[1]==doc2[1]) {
reportCopy(doc1, doc2);
matchedSome=true;
}
if (!matchedSome) {
int[,] LCS=new int[windowSize+1, windowSize+1];
bool[,] match=new bool[windowSize+1, windowSize+1];
for (int i2=0; i2<=NW2; i2++) LCS[NW1, i2]=0;
for (int i1=0; i1<=NW1; i1++) LCS[i1, NW2]=0;
for (int i1=NW1-1; i1>=0; i1--) {
for (int i2=NW2-1; i2>=0; i2--) {
if (doc1[i1]==doc2[i2]) {
LCS[i1, i2]=LCS[i1+1, i2+1]+1;
match[i1, i2]=true;
} else {
LCS[i1, i2]=LCS[i1, i2+1]>LCS[i1+1, i2]?LCS[i1, i2+1]:LCS[i1+1, i2];
}
}
}
int matchingLines=LCS[0, 0];
if (matchingLines==0) {
while (doc1.Count!=0) reportRemoval(doc1);
while (doc2.Count!=0) reportInsertion(doc2);
if (!moreDataLoaded) break;
if (!slidingWindowTooSmall) {
slidingWindowTooSmall=true;
output2("ERROR: difference too large for sliding window");
}
} else {
for (int i1=0, i2=0; ;) {
if (match[i1, i2]) {
reportCopy(doc1, doc2);
i1++;
i2++;
if (--matchingLines<1) break;
} else if (matchingLines==LCS[i1, i2+1]) {
reportInsertion(doc2);
i2++;
} else {
reportRemoval(doc1);
i1++;
}
}
}
}
}
do { while (doc1.Count!=0) reportRemoval(doc1); } while (doc1.ReadMoreLines());
do { while (doc2.Count!=0) reportInsertion(doc2); } while (doc2.ReadMoreLines());
}
output2("");
output2("Statistics: "+linesMatched+" lines matched, "+linesRemoved+" removed, "+
linesInserted+" inserted");
}
}
I don't think it makes much sense to explain all the nitty-gritty details; if you must, you should be able to work it out once you are familiar with straight LCS, and then apply the modifications I explained before.
I am not going to show the LPTextFileDiffTester code here; it is all pretty straightforward and can be found in the download.
[New paragraph since V1.1]
Note: the code shown stems from V1.0 and is the best starting point to understand how it all works;
the actual code in the download is slightly different due to further enhancements
(mainly the dynamic window sizing) and some new performance optimizations.
[New chapter since V1.1]
The idea of a fixed window size might be a bit frightening; too small a value would fail to find the optimal
edit path; too large a value would waste both CPU cycles and working set memory. However there isn't really anything
keeping us from adjusting the window size dynamically. So here is the third improvement on LCS: start with a small
window size, and increase it when the need occurs.
Increasing the window size is very simple, all that needs to be done is read more text lines,
and perform a larger LCS algorithm.
So we decided to implement a simple algorithm that can double the initial window size
up to eight times, so the maximum window size is fixed to 256 times the original one (which is still
user selectable and has a default value of 30 lines). We did not implement an automatic window size reduction,
as that could be a bit trickier, and its merits would probably be small if any: once two files have a large difference
block, chances are they have more of those and the peak working set has grown already.
Of course the critical part now is deciding when to double the window size. That too has a simple solution: we define
a minimum (or threshold) value T
for the number of matching lines that have to be present
in the sliding windows. T
itself is user selectable, and has a default value of 10. So by default,
we read 30 lines of both files; we skip all the lines that are identical, so the first line in the window is known
to be different, then we perform LCS and require the best solution has at least T
matching lines. If not,
we double the window size and redo the LCS, until it succeeds; in the extremely bizar case that we never get
a sufficient number of matching lines, we
generate edit instructions for removing and inserting all those lines, and we move on.
Note: the implementation of dynamic window sizing is absent in the code snippets above;
however it is available in the download code.
The comparison class is intended to work correctly for all kinds of situations,
so what comes out must be correct; there are no buts here.
Most design decisions were inspired by the typical usage as set forth in the challenge in general,
and confirmed by the test example provided.
So having both files rather similar has been assumed when making decisions that influence
performance (cycles and bytes).
[List updated in V1.1]
This is the list of limitations that apply as far as I know:
- The entire app is line oriented, lines are either identical or they are different.
No effort was spent to discover possible similarity between lines, nor to describe how to edit
one line from file1 to become some line for file2.
So all output is: copy line, remove line, insert line.
- The app needs newline characters to split the input files into text lines; when no carriage returns/linefeeds are
present the entire file is interpreted as a single line of text, and the only result one gets is whether both
files are identical or not.
- The dynamic window sizing is heuristic in nature, it isn't strict science, so there are no 100% guarantees.
The editing instructions generated are expected to be always correct, and optimal most of the time; under special
circumstances they might be suboptimal, i.e. describe more steps than strictly necessary to turn file1
into file2. Those circumstances depend on the height of the difference blocks (and maybe their distance), all
relative to the window size at that point. While a threshold value of 1 is strictly speaking enough
to expect decent behavior, I have set its default value at 10 to get the best results almost all the time.
Experience alone, i.e. using the app with lots of test cases, will confirm the validity of the above statements.
I strongly believe that even a mean-and-lean application deserves a good user interface; a bit of comfort is advantageous both for the developer and for the user. So the test application has a command line parser, built-in help, and some options. The help prints this:
LPTextFileDiff version V1.1, by Luc Pattyn, August 2009
Compares two text files
usage: LPTextFileDiff [ /option ...] file1 file2
options:
/B = create big files (expand by 10, 100, 1000)
/N = generate an output file and open it with Notepad
/M = also list matching lines to the output file
/Q = wait for the user to hit a key before quitting
/S = optimize for speed (no GC.Collect)
/Tdd = change threshold to dd matching lines (minimum 1, default 10)
/Wdd = change initial sliding window size to dd lines (min 5, def 30)
One of the options enables generation of an output file, and opening it with notepad when done;
having a real output file is a lot more comfortable than having to select all, and copy,
from a DOS window.
Another option lets one choose the sliding window size, so when both files are quite different,
one can still give it a try with a larger window size;
I have ran it with a window size of 100, without noticing a real effect on execution speed,
however the working set does go up.
The big files option generates six files, each containing from 10 to 1000 times the original content
of the two input files; that made it easy to test the scalability claim (see at the end).
In order to make life easier for the developer (that's me), I also included some extra debugging statements, and a couple of defines to control them. This is an example of how it is done:
#define WITHOUT_DETAILS_LOGGED // choose WITH or WITHOUT (default is WITHOUT)S
...
#if WITH_DETAILS_LOGGED
output2(s);
#endif
I could have achieved similar things using command line options, but that would complicate the user interface, and cause run-time checks, which I want to avoid when lots of data could be involved.
The application generates one line of text for each insertion or removal, and it inserts an empty line for every change in the sequence (change of edit instruction, or jump in line numbers). By default it does not mention lines that remain unchanged. For the test data, the following gets generated:
Jitted in XYZ msec
File1=grid1.html
File2=grid2.html
+++ 00041 : <H2>Version 2
+++ 00042 : <p>This is the modified version. There are 5 changes
+++ 00043 : <ul>
+++ 00044 : <li>A visible addition</li>
+++ 00045 : <li>A visible deletion</li>
+++ 00046 : <li>A visible text change</li>
+++ 00047 : <li>A linebreak removal</li>
+++ 00048 : <li>A non-visible change to the underlying HTML markup.</li>
+++ 00049 : </ul>
+++ 00050 : <p>Find them all as fast and as efficiently as you can.</p>
+++ 00172 : <TD><CODE lang="Text">gridctrl.cpp, gridctrl.h</CODE></TD>
00162 --- : <TD><CODE>gridctrl.cpp, gridctrl.h</CODE></TD>
+++ 00191 : <TD noWrap><CODE>GridDropTarget.cpp, GridDropTarget.h</CODE><...
00181 --- : <TD noWrap><CODE>GridDropTarget.cpp, GridDropTarget.h</CODE><...
00182 --- : <TD>Grid control OLE drag and drop target. Only necessary if ...
01241 --- : <TD noWrap><CODE>void SetShadedPrintOut(BOOL bEnable = TRUE)<...
01242 --- : <TD class=smallText width="100%">If TRUE, colored cells will ...
01243 --- : FALSE, all text prints as black on white.</TD></TR>
01244 --- : <TR vAlign=top>
Statistics: 1777 lines matched, 7 removed, 12 inserted
Window sizes: initial=30, final=30; threshold=10, minimum matches=20
LPTextFileDiff done (used XYZ msec and working set increase of XYZ KB)
The text lines get preceded by two columns containing either line numbers or symbols. The first column holds the line number for lines that exist in file1 (and +++ for those that don't); the second column holds the line number for lines that exist in file2 (and --- for those that don't). So
+++ 00172 : <TD><CODE lang="Text"...
means insert at line 172 the specified text, and
00162 --- : <TD><CODE>gridctrl...
means remove line 162. When option /M is selected, the matching lines would show with both line numbers (only in the optional output file, not on the console, as that would cause lots of scrolling and slowing down).
Note: on the console, lines are truncated so long lines don't cause extra scrolling due to wrapping; on the optional output file, no truncation gets applied.
Obviously the results of performance measurements depend on a lot of factors. To get optimal circumstances, one must
And here are the results: on a reasonably powerful laptop (Dell Inspiron 1720, Vista) with an Intel Core 2 Duo T8300 running at 2400 GHz, LPTextFileDiffTester takes no more than:
| 6.4 milliseconds (elapsed time) and 324 kilobytes (working set increase) without option /S
6.1 milliseconds (elapsed time) and 928 kilobytes (working set increase) with option /S
excluding an elapsed JIT-compile time of 8.4 milliseconds
|
| |
[List reworked in V1.1] Some more comments:
- elapsed times are jittering a bit; it is necessary to execute several runs to get a correct impression of
the actual time it takes.
- a significant part of the elapsed time is spent writing to the console, which seems to be a rather slow device;
its speed may depend on its window height, its buffer height, its color depth, etc., all characteristics
of the target machine; the
console time can be eliminated by redirecting the output to a file, however that violates the requirement of
calculating and displaying the differences.
- if this code were to be used on a web site, it would probably need JIT compilation every time, in which case
the sum of JIT and run times is more relevant; that would yield less than 15 milliseconds of total elapsed time.
GC.Collect()
calls are used to trade some CPU cycles for a smaller peak working set; their usage
is optional, specifying option /S disables them.- The algorithm is basically linear in the number of text lines:
for any given window size, making the input files twice as large,
LPTextFileDiff should take twice as long, no more; and the memory requirement should be constant
(assuming there are some
GC.Collect()
calls,
otherwise it would grow till the GC runs by itself). As a test case I concatenated the test files 1000 times
(yielding two 90MB files, see option /B) and compared them in less than 2.3 seconds (with redirected output),
still causing no more than a 440KB increase in the working set. Algorithms that read all the text lines at once,
and/or calculate a full LCS matrix, would require tens or hundreds of megabytes in this test.
When all was running fine and looking good, I obviously did another analysis pass to check for anything
I might have missed.
The one dimension I did not research initially was multi-threading: on a multi-core CPU,
only one core would be contributing to the file comparison, except that probably Windows
would pre-fetch the files as required. Anyway, using more than one thread would be rather
hard as everything by its very nature has to happen sequentially: filling an LCS matrix,
back tracing the matrix, generating edit instructions, sliding the windows.
They all have to wait for each other all the time, there isn't really much opportunity
for parallelism. So at first I dropped the idea.
[New paragraph since V1.1]
I later performed some experiments; first I verified a fully handshaked round-trip thread switch
takes about 10 microseconds on my PC. So I could shave of some milliseconds on the test case provided I only needed
a couple hundred of those round trips, which meant I would need to load and store some 20 lines of text at a time.
Alternatively I could assume there was going to be a multi-core CPU anyway, so why not get some work done by the
other core for free.
I did implement two such schemes, however they never were any faster than the single-threaded simple approach.
The main problem continues to be the inherent sequential nature of the algorithm:
as soon as some lines get dropped from the list, I immediately need the new text lines in order to calculate LCS.
Actually this is not quite correct: I could start filling the LCS matrix while the lines come in, but that would
be very disruptive for the program structure, so I dropped the idea, again.
A second point of concern was the arrays I used: the way I implemented LCS, I needed an integer array
for counting matches, and a boolean array for flagging matches (i.e. diagonal steps).
The dilemma here was: should I just create new ones for each iteration of the for loop,
or should I create them once and reuse them, saving the GC some work (and keeping the peak
working set smaller), but requiring re-initialization every so often.
After a simple experiment, I didn't see much difference so I went for new arrays,
so in this respect the code is more lean and less mean than it would have been otherwise.
[New paragraph since V1.1]
When both files are rather similar, most time is spent outside the LCS algorithm
itself, hence the arrays aren't that important as far as performance goes. In the test example the LCS part
only runs three times (that is dealing with 90 out of over 1700 lines, assuming a window size of 30).
| | Mission accomplished, I would say. The file comparison performs extremely well; all my initial goals have been met. And as the execution time is strictly linear, one could use this class and app to compare huge files, holding millions of lines.
|
- LPTextFileDiff V1.0 (2009, August 27; first release).
- LPTextFileDiff V1.1 (2009, August 31; second release).
- Added a table of content;
- Added a comment on newline characters;
- Removed an unclear and rather irrelevant comment on symmetry;
- Added dynamic window sizing (some logic, and a new threshold option /T);
- Made GC.Collect optional (some logic, and a new speed option /S);
- Applied some performance improvements;
- Performed some multi-threading experiments and reported on them being rejected.
I am an engineer with a background in electronics, software and mathematics.
I develop technical software, both for embedded systems and for desktop equipment. This includes operating systems, communication software, local networks, image processing, machine control, automation, etc.
I have been using all kinds of microcontrollers and microprocessors (Intel 4004/8080/8051/80386/Pentium, Motorola 680x/680x0/ColdFire/PowerPC, Microchip PIC, Altera NIOS, and many more), lots of programming languages (all relevant assemblers, Fortran, Basic, C, Java, C#, and many more), and different operating systems (both proprietary and commercial).
For desktop applications and general development tools I have been using both UNIX systems and Mac/MacOS for many years, but I have switched to x86-based PCs with Windows, Visual Studio and the .NET Framework several years ago.
I specialize in:
- cross-platform development (making software that runs on diverse hardware/OS combinations)
- instruction set simulation
- improving software performance, i.e. making sure the software runs the job at hand in as short a time as possible on the given hardware. This entails algorithm selection, implementation design, accurate measurements, code optimisation, and sometimes implementing virtual machines, applying SIMD technology (such as MMX/SSE), and more.