12,630,455 members (32,220 online)
alternative version

90.9K views
35 bookmarked
Posted

Minimum Difference

, 22 Apr 2002 CPOL
 Rate this:
Identifying the minimum difference between two data sets.

Introduction

Calculating the minimum difference between two sets of data is a common requirement. The `compare<>` class can be used to identify the minimum difference between two sets of arbitrary types. It is a template class that uses a generic data source class interface to supply data to it. This means that the algorithm is entirely independent of the data, and the representation of the data.

Algorithm

To calculate the minimum difference between two data sets is to reverse the problem and calculate the Longest Common Subsequence and use the result of this to determine the differences. The LCS problem is well documented on the Internet and in research papers. I have implemented the Iterative LCS algorithm described by Professor David Eppstein at the University of California. The LCS will identify the longest groups of elements common between the two sources. By definition, extracting these groups from the original will identify the minimum non-common elements between the two.

Implementation

The algorithm is encapsulated entirely within one class, `compare<>` which is implemented within the `cmp` namespace. `cmp::compare<>` is a template class that is instantiated with a data source class to supply its data. I have included two data sources for demonstration purposes. The first is another template class called `cmp::data_source<>`. This can be used for any basic data type, for example to compare to text strings. The second is `cmp::data_source_text_file`. used to compare two text files

cmp::data_source<> example

To make life a little easier, we can `typedef` the cumbersome template classes:

```typedef cmp::data_source<const char>        compare_data_source_t;
typedef cmp::compare<compare_data_source_t> compare_t;```

Here's the data to compare:

```const char str1[] = "little jack horner";
const char str2[] = "sat in a corner";```

So we simply instantiate two data sources, and give them the data:

```compare_data_source_t compare_data1(str1, strlen(str1));
compare_data_source_t compare_data2(str2, strlen(str2));```

Now we can instantiate the `cmp::compare<>` class:

`compare_t compare(&compare_data1, &compare_data2);`

And finally we can call `process()`

```compare_t::result_set seq;
int lcs = compare.process(&seq);```

`process()` returns an integer which is the length of the Longest Common Subsequence. This isn't really very useful to us when trying to determine the difference, but the return value can give two useful pieces of information. A return value of `-1` indicates an error, and `0` (zero) indicates that the two data sets are identical.

The `seq` parameter will, on successful return, contain the result sequence. This contains a list of records from the two data sets, along with information about their relationship to the second dataset. Each record will be marked as `KEEP`, `REMOVE`, or `INSERT`. These are descriptive of creating the second data set from the first, thus:

• `KEEP` - records in both data sources
• `REMOVE` - records in the first, but not the second data set
• `INSERT` - records not in the first, but in the second data set

Memory requirements

The LCS implementation is heavy on memory allocation. It requires `n*m*sizeof(short)` bytes of storage where `n` and `m` are the number of records in each of the data sets. For example, to compare two 1KB datasets requires 1MB of storage!

To overcome this overhead, we compare larger blocks of data and subdivide differing blocks and perform a separate LCS process on those. For example, the `cmp::data_source_text_file` class compares text files line by line to yield a result similar to that of a version control system, identifying which lines in the two files are different. To identify character changes in a text file, each line that is different could be passed through a `cmp::data_source<char>` as in the first example.

Contact

 Website: http://homepage.virgin.net/cdm.henderson e-mail: cdm.henderson@virgin.net

Share

Craig graduated with a B.SC. Honours in Computing for Real Time Systems from the University of the West of England, Bristol in 1995 after completing an HND in Computing in 1993.

You may also be interested in...

 Pro

 First Prev Next
 VC++ how to compare constant char with a char Oriocat16-Jul-04 8:22 Oriocat 16-Jul-04 8:22
 Cool Hockey16-May-04 4:54 Hockey 16-May-04 4:54
 file compare using perl Graham Bright18-Jun-03 5:11 Graham Bright 18-Jun-03 5:11
 Comparison of text files Philippe Lhoste2-May-02 3:12 Philippe Lhoste 2-May-02 3:12
 Re: Comparison of text files Craig Henderson3-May-02 6:24 Craig Henderson 3-May-02 6:24
 Re: Comparison of text files Philippe Lhoste3-May-02 7:05 Philippe Lhoste 3-May-02 7:05
 Re: Comparison of text files Craig Henderson3-May-02 10:00 Craig Henderson 3-May-02 10:00
 Re: Comparison of text files Philippe Lhoste29-May-02 3:13 Philippe Lhoste 29-May-02 3:13
 misprint Igor29-Apr-02 21:33 Igor 29-Apr-02 21:33
 Interesting article - thanks Neville Franks23-Apr-02 2:10 Neville Franks 23-Apr-02 2:10
 Re: Interesting article - thanks Craig Henderson23-Apr-02 10:55 Craig Henderson 23-Apr-02 10:55
 Re: Interesting article - thanks Craig Henderson7-Nov-02 1:33 Craig Henderson 7-Nov-02 1:33
 Re: Interesting article - thanks Craig Henderson8-Sep-13 23:17 Craig Henderson 8-Sep-13 23:17
 Fancy page Jason Hooper20-Apr-02 1:55 Jason Hooper 20-Apr-02 1:55
 Re: Fancy page Craig Henderson20-Apr-02 10:08 Craig Henderson 20-Apr-02 10:08
 Re: Fancy page Jason Hooper20-Apr-02 10:46 Jason Hooper 20-Apr-02 10:46
 Last Visit: 31-Dec-99 19:00     Last Update: 7-Dec-16 18:02 Refresh 1

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Web01 | 2.8.161205.3 | Last Updated 23 Apr 2002
Article Copyright 2002 by Craig Henderson