Click here to Skip to main content
Licence 
First Posted 18 Apr 2002
Views 81,195
Downloads 673
Bookmarked 33 times

Minimum Difference

By | 22 Apr 2002 | Article
Identifying the minimum difference between two data sets

Introduction

Calculating the minimum difference between two sets of data is a common requirement. The compare<> class can be used to identify the minimum difference between two sets of arbitrary types. It is a template class that uses a generic data source class interface to supply data to it. This means that the algorithm is entirely independent of the data, and the representation of the data.

Algorithm

To calculate the minimum difference between two data sets is to reverse the problem and calculate the Longest Common Subsequence and use the result of this to determine the differences. The LCS problem is well documented on the Internet and in research papers. I have implemented the Iterative LCS algorithm described by Professor David Eppstein at the University of California. The LCS will identify the longest groups of elements common between the two sources. By definition, extracting these groups from the original will identify the minimum non-common elements between the two.

Implementation

The algorithm is encapsulated entirely within one class, compare<> which is implemented within the cmp namespace. cmp::compare<> is a template class that is instantiated with a data source class to supply its data. I have included two data sources for demonstration purposes. The first is another template class called cmp::data_source<>. This can be used for any basic data type, for example to compare to text strings. The second is cmp::data_source_text_file. used to compare two text files

cmp::data_source<> example

To make life a little easier, we can typedef the cumbersome template classes:

typedef cmp::data_source<const char>        compare_data_source_t;
typedef cmp::compare<compare_data_source_t> compare_t;

Here's the data to compare:

const char str1[] = "little jack horner";
const char str2[] = "sat in a corner";

So we simply instantiate two data sources, and give them the data:

compare_data_source_t compare_data1(str1, strlen(str1));
compare_data_source_t compare_data2(str2, strlen(str2));

Now we can instantiate the cmp::compare<> class:

compare_t compare(&compare_data1, &compare_data2);

And finally we can call process()

compare_t::result_set seq;
int lcs = compare.process(&seq);

process() returns an integer which is the length of the Longest Common Subsequence. This isn't really very useful to us when trying to determine the difference, but the return value can give two useful pieces of information. A return value of -1 indicates an error, and 0 (zero) indicates that the two data sets are identical.

The seq parameter will, on successful return, contain the result sequence. This contains a list of records from the two data sets, along with information about their relationship to the second dataset. Each record will be marked as KEEP, REMOVE or INSERT. These are descriptive of creating the second data set from the first, thus

  • KEEP - records in both data sources
  • REMOVE - records in the first, but not the second data set
  • INSERT - records not in the first, but in the second data set

Memory Requirements

The LCS implementation is heavy on memory allocation. It requires n*m*sizeof(short) bytes of storage where n and m are the number of records in each of the data sets. For example, to compare two 1Kb datasets requires 1Mb of storage!

To overcome this overhead, we compare larger blocks of data and subdivide differing blocks and perform a separate LCS process on those. For example, the cmp::data_source_text_file class compares text files line by line to yield a result similar to that of a version control system, identifying which lines in the two files are different. To identify character changes in a text file, each line that is different could be passed through a cmp::data_source<char> as in the first example.

Contact

Web Site:http://homepage.virgin.net/cdm.henderson
e-mail:cdm.henderson@virgin.net

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Craig Henderson

Other
Centrix Software
United Kingdom United Kingdom

Member

Follow on Twitter Follow on Twitter
Craig graduated with a B.SC. Honours in Computing for Real Time Systems from the University of the West of England, Bristol in 1995 after completing an HND in Computing in 1993.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board. (secure sign-in)
 
Search this forum  
 FAQ
    Noise  Layout  Per page   
  Refresh
GeneralVC++ how to compare constant char with a char PinmemberOriocat7:22 16 Jul '04  
GeneralCool PinmemberHockey3:54 16 May '04  
Generalfile compare using perl PinsussGraham Bright4:11 18 Jun '03  
GeneralComparison of text files PinmemberPhilippe Lhoste2:12 2 May '02  
GeneralRe: Comparison of text files PinmemberCraig Henderson5:24 3 May '02  
GeneralRe: Comparison of text files PinmemberPhilippe Lhoste6:05 3 May '02  
GeneralRe: Comparison of text files PinmemberCraig Henderson9:00 3 May '02  
GeneralRe: Comparison of text files PinmemberPhilippe Lhoste2:13 29 May '02  
Generalmisprint PinmemberIgor20:33 29 Apr '02  
GeneralInteresting article - thanks PinmemberNeville Franks1:10 23 Apr '02  
GeneralRe: Interesting article - thanks PinmemberCraig Henderson9:55 23 Apr '02  
Neville, thanks for your comments, I'm glad you found it useful. You're right about the runtime error with the large files. There was a minor oversight (!) in my test program: The cmp::compare<> class throws a C++ exception if the memory allocation fails. The test program fails to catch the exception and therefore causes a runtime error and exits. I've updated the test program with this fix Smile | :)
 
To compare your two 70,000 line files will require 70001*70001*2 bytes = 9346.3 Mb ! This is on top of the memory to hold the actual data and the result lists. A good way to overcome the problem of loading the entire files into memory would be to use memory mapped files in an alternative class to cmp::data_source_text_file.
 
Overcoming the working array size is a little more complex. The files could be compared block-by-block, but getting the optimal result is tricky. To achieve the ultimate, each 'block' should end on matching records so that the beginning of a new block is functionally equivalent to beginning a new file. If the block does not end on a matching record, the algorithm to find the difference will work in respect that the file1+changes=file2, but the changes will not be the minimum set required. For example, a record may be added to the result set as a removal from file1 only to be added as an insertion to file2 later.
GeneralRe: Interesting article - thanks PinmemberCraig Henderson0:33 7 Nov '02  
GeneralFancy page PinmemberJason Hooper0:55 20 Apr '02  
GeneralRe: Fancy page PinmemberCraig Henderson9:08 20 Apr '02  
GeneralRe: Fancy page PinmemberJason Hooper9:46 20 Apr '02  

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Mobile
Web03 | 2.5.120529.1 | Last Updated 23 Apr 2002
Article Copyright 2002 by Craig Henderson
Everything else Copyright © CodeProject, 1999-2012
Terms of Use
Layout: fixed | fluid