Click here to Skip to main content

Article

FileDiff2 Optimized

Matthew Hazlett

Rate me:

3.80/5 (3 votes)

13 Aug 2009CPOL

31.6K

163

13

14

A file diff utility.

Download source code - 48.4 KB

Introduction

This application is pretty basic, it uses FileStream objects to perform its task.

Using the code

C#

ASCIIEncoding Encode = new ASCIIEncoding();

// Open the files
//
FileStream streamA = File.OpenRead(args[0]);
FileStream streamB = File.OpenRead(args[1]);

// Get the stream length
// (so we don't have to caluculate this a million times)
//
long lenA = streamA.Length - 1;
long lenB = streamB.Length - 1;

// Read the bytes
//
int byteA;
int byteB;

do
{
    // Read the streams
    //
    byteA = streamA.ReadByte();
    byteB = streamB.ReadByte();

    // Are they the same
    //
    if (byteA != byteB)
    {
        // Remember where we parked the car
        //
        long startPos = streamB.Position;

        // Read streamB until we = StreamA
        //
        do
        {
            byteB = streamB.ReadByte();
        }
        while (byteA != byteB && streamB.Position <= lenB);

        // How long is the difference?
        //
        long length = streamB.Position - startPos;

        // Read the bytes
        //
        byte[] theseBytes = new byte[length];
        streamB.Seek(length * -1, SeekOrigin.Current);|
        streamB.Read(theseBytes, 0, (int)length);

        Console.WriteLine("Pos:{0}, Len:{1}, Str:{2}", startPos, 
                          length, Encode.GetString(theseBytes));
    }
}
while (streamA.Position <= lenA && streamB.Position <= lenB);

streamA.Close();
streamB.Close();

History

Aug 12, 2009: Written.
Aug 13, 2009: Rewritten to be more explicit, and I modified the file seeking and some variables to bring this down form 57ms run time to 27ms run time.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Matthew Hazlett

Web Developer

United States

I started programming for fun when I was about 10 on an Franklin Ace 1000.

I still do it just for fun but it has gotten me a few jobs over the years. More then I can say for my Microsoft Certifications. Smile | :)

Smile | :)

The way I learned was by example, now its time to give back to the next generation of coders.

Comments and Discussions

Works, but that's it...

Rasqual Twilight22-Aug-09 0:58

Rasqual Twilight

22-Aug-09 0:58

It's about as naive (and slow) as you can get for achieving this task (and as pointed out, works in a limited way, too).
Should work well enough on very short texts, though.
The rationale for rating this a 5 is beyond understanding.

Pete Souza IV14-Aug-09 6:31

14-Aug-09 6:31

The solution I had mentioned earlier takes it about 2,200ms -- the vast majority of which goes to file seeking so it can do string-by-string comparisons. I'd normally read the whole thing in memory for text files under a few MB, but that wouldn't be very efficient.

Here's the output for my application:

grid2.html:52     (+) <H2>Version 2</H2>
grid2.html:43     (+) <p>This is the modified version. There are 5 changes</p>
grid2.html:44     (+) <ul>
grid2.html:45     (+)   <li>A visible addition</li>
grid2.html:46     (+)   <li>A visible deletion</li>
grid2.html:47     (+)   <li>A visible text change</li>
grid2.html:48     (+)   <li>A linebreak removal</li>
grid2.html:49     (+)   <li>A non-visible change to the underlying HTML markup.</li>
grid2.html:50     (+) </ul>
grid2.html:51     (+) <p>Find them all as fast and as efficiently as you can.</p>
--------------------------------------------------------------------------------------------
grid1.html:162    (-) <TD><CODE>gridctrl.cpp, gridctrl.h</CODE></TD>
grid2.html:175    (+) <TD><CODE lang="Text">gridctrl.cpp, gridctrl.h</CODE></TD>
--------------------------------------------------------------------------------------------
grid1.html:181    (-) <TD noWrap><CODE>GridDropTarget.cpp, GridDropTarget.h</CODE></TD>
grid1.html:182    (-) <TD>Grid control OLE drag and drop target. Only necessary if you don't define
grid2.html:196    (+) <TD noWrap><CODE>GridDropTarget.cpp, GridDropTarget.h</CODE></TD><TD>Grid control OLE drag and dro
p target. Only necessary if you don't define
--------------------------------------------------------------------------------------------
grid1.html:1246   (-) <TD noWrap><CODE>void SetShadedPrintOut(BOOL bEnable = TRUE)</CODE> </TD>
grid1.html:1243   (-) <TD class=smallText width="100%">If TRUE, colored cells will print as-is. If
grid1.html:1244   (-) FALSE, all text prints as black on white.</TD></TR>
grid1.html:1245   (-) <TR vAlign=top>
--------------------------------------------------------------------------------------------

Is this overkill, you think?

Re: My solution

Matthew Hazlett14-Aug-09 8:51

Matthew Hazlett

14-Aug-09 8:51

No, if your solution works that's great!

The output is very nice any clearly readable. Looks like a solid contender to me.
I'm embarrassed, I think I didn't read the instructions very carefully.

Matthew Hazlett
Fighting the good fight for web usability.

Re: My solution

Pete Souza IV20-Aug-09 10:33

20-Aug-09 10:33

I've taken your encouragement into consideration and have posted my solution after adding some comments and fixing formatting. It hasn't been optimized any more than what I'd mentioned earlier. Interested in your thoughts...

PSFileDiff - A Response to the File diff CodeProject Contest[^]

Re: My solution

Matthew Hazlett21-Aug-09 20:07

Matthew Hazlett

21-Aug-09 20:07

Very interesting. You create a string builder object there by allocating memory and desttroying the object over and over.
Not sure I would have gone with this approach but my program is not complete and yours is Smile | :)

Smile | :)

I did write another version or my program doing line by line like you suggested but I kept all the file handling in the main() block of code. Instead of passing references, and doing a lot of seeking in the file I extended the string object and added new methods to compare strings. I'm not sure how it will look when it's totally finished but the code is very simple thus far (fast too).

EX: if (StringA.isApprox(StringB))

the isApprox extension method is so simple, it takes every char form StringA and removes a corresponding char from StringB (1 to 1). Then takes the length of the resulting strings and calculates a percentage match, somthing like:

int difference = stringA.Length - stringB.Length;
int percent = (difference * 100) / stringA.Length;

then if percent is over 85 they are most likely a match, this will work for 100% matches as well as partial matches so you need only call a compare one time. This works well, you can basically just read the file one time, line by line and do all the comparisons to satisfy 3 out of the 4 matching requirements.

However, the trick comes in for matching file deletions, I have some ideas but haven't implemented anything yet as I have had other things that I have been working on (cough, lazy). I'll try to get my code finished and uploaded by the deadline, if I don't make it and you would like to look at it just let me know and i'll upload it to my web site (spelling errors and all).

Matthew Hazlett
Fighting the good fight for web usability.

Re: My solution

Pete Souza IV25-Aug-09 6:56

25-Aug-09 6:56

I just submitted v1.03 of my solution and am waiting for it to be published. In the meanwhile, I did strip out the StringBuilder entirely. I'm also reading in bytes chunks at a time instead of one by one. On my home PC, this brought execution time down to about 80ms at 900 KB memory use, which I am finally happy with.

I'm not sure I understand how your new approach is going to catch modifications to files where line endings are the same length, but contain different content. Why not just create a hash? Here's my code to produce an MD5 hash:

C#

public static string MD5(string input)
{
    System.Security.Cryptography.MD5CryptoServiceProvider md5_hasher = new System.Security.Cryptography.MD5CryptoServiceProvider();
    byte[] md5_bytes = System.Text.Encoding.UTF8.GetBytes(input);
    md5_bytes = md5_hasher.ComputeHash(md5_bytes);
    System.Text.StringBuilder md5_textbits = new System.Text.StringBuilder();
    foreach (byte b in md5_bytes)
        md5_textbits.Append(b.ToString("x2").ToLower());
    string md5_final = md5_textbits.ToString();
    return md5_final.ToUpper();
}

It's not the quickest, but it's a lot more accurate than a simple string length comparison, I would think. If you're determined to go that route, I mean.

Re: My solution

Matthew Hazlett25-Aug-09 8:09

Matthew Hazlett

25-Aug-09 8:09

So, now you are converting to a byte array and computing the hash and changing the bytes it to a hex string. Am I missing something? This would only be equal for 100% matches what about additions or deletions. Why do you want the strings in HEX, seems like a needless step to me.
---

Also, Mine isn't based of string length but string content.

For example:
String A: The quick brown fox jumped over the lazy dog.
String B: The quick brown fox leaped over the lazy dog.

Whats the same (length 39):
The quick brown fox over the lazy dog.

Whats different (length 6):
leaped

String Match: 73%

Then you have your program make the judgment call on what percentage constitutes a match.

Also:
You could instruct your program to match on chars:
String A: The quick brown fox jumped over the lazy dog.
String B: The quick brown fox leaped over the lazy-ass dog.
Difference: -ass

or based on words:
String A: The quick brown fox jumped over the lazy dog.
String B: The quick brown fox leaped over the lazy-ass dog.
Difference: lazy-ass

Matthew Hazlett
Fighting the good fight for web usability.

Re: My solution

Pete Souza IV25-Aug-09 8:15

25-Aug-09 8:15

No, no -- you completely misunderstand.

I was under the (mis)understanding that you were simply comparing the lengths of strings to determine their approximation to each other. I figured a hash comparison would be better than simply comparing their lengths, though if you're going to do that, you may as well just compare the strings directly (so scratch that).

I don't use hashes whatsoever in my solution.

Now that I have a better understanding of your new approach, though, it looks interesting. Kind of a tokenized, what-part-of-the-line changed method. I like it. How ready are you to write up a new article?

Re: My solution

Matthew Hazlett25-Aug-09 19:38

Matthew Hazlett

25-Aug-09 19:38

ya, I know I need to finish, its mapped out in my head how to implement the final features, but I have not sat down to bang it out.
I will try and make an effort to finish it up tomorrow.

Matthew Hazlett
Fighting the good fight for web usability.

Some thoughts...

Pete Souza IV13-Aug-09 13:50

13-Aug-09 13:50

While reviewing your code, I came across the following considerations:

Your approach does provide a (very) simple solution to comparing the bytes of a second file against a first with limited detection when discrepancies between the two end.   Please correct me if I'm wrong, but doesn't your approach suffer greatly when a character appearing in fileA doesn't appear in fileB?

Example, if your fileA's text only had the letter 'z' once and the only differences between the two files was a missing 'z' in fileB, your approach would mark the entire remainder of the contents of the file as a change.

i.e.:

fileA:
A1234
B1234
Z1234
T1234
R1234

fileB:
A1234
B1234
1234
T1234
R1234

Instead of your approach detecting only the missing 'Z', you will instead detect the entire remainder of the file as a difference in text.   This is because your approach is very aware of additional changes to the text from fileA -> fileB but is not at all aware of deletions of text from fileA -> fileB.

Also, and this is a gray area in regards to the actual contest, but I would think it'd be a poor idea to do a byte-by-byte comparison.   While there's more overhead in line-by-line (or other methods), you don't run the risk of this error:

This is some text.
This is slightly more text.

You should detect the removal of 'some' and addition of 'slightly more', but your application will actually detect the addition of 'lightly m' and 're text.'.   Even worse, it will at this point ignore the fact that 'me text.' was removed from the file (it had left over parts in fileA that weren't detected in fileB because it kept waiting for an 'm' to follow an 'o' -- something that never occurs in fileB).

What are your thoughts?

Re: Some thoughts...

Matthew Hazlett13-Aug-09 14:52

Matthew Hazlett

13-Aug-09 14:52

You are quite correct in your analysis

If fileA contains abcde and fileB contains acde according to this comparison the difference would be cde not simply b. However, if you reversed the compare order the results would indeed be b. (It depends on who is being compared to whom (you can't compare both A to B and B to A when you were explicitly asked to compare A to B).

In your second example --
This is some text.
This is slightly more text.

You are again quite correct, I had considered comparing words or lines instead of bytes but that would create some unique problems as well as only work for ascii compares and not binary comparisons (yes, binary compares was not a requirement).

If you were comparing by line-by-line as you suggested

A: "This is some text"
B: "This is some
text"

Both lines would be flagged when in reality there is only a character difference between the two files. The same applies if you are splitting by words

A: "This is so
me text"
"This is some text"

Again only a character difference but if it were going off words there would be three discrepancies "so, me and some".

Byte comparison has it's advantages and disadvantages. But the contest did say compare the files in the fastest time with the least amount or overhead (code bulk). Somehow, adding more logic thus increasing memory usage, executable size and execution time when it's not needed seemed counter productive.

Matthew Hazlett
Fighting the good fight for web usability.

Re: Some thoughts...

Pete Souza IV13-Aug-09 16:07

13-Aug-09 16:07

The reason I bring up the tokenized/line-by-line approach rather than byte-by-byte is because the grid2.html file included with this problem explicitly mentions finding five changes. It seems that if your code finds more than that (or less) because of the way you determine when the files sync and de-sync, then you won't be satisfying the parameters of the problem. In fact, I think that's really what adds flavor to the proposal -- if it were really so easy as to just do a byte-by-byte comparison, then there would arguably be very few solutions and not much of a contest.

Also, I think the phrasing of the problem is something to be reconsidered too: "calculate and display the changes between two text files". This implies that both should be compared to each other and that order should not matter.

Though I could be entirely wrong in my interpretation, I'd be interested to see how you would develop a solution that specifically targets this problem. I have written one earlier today, but I'm not satisfied with its performance by any means and I'd rather not use unsafe code (or why even bother with the managed environment, right?) -- so I'm still considering various techniques.

Re: Some thoughts...

Matthew Hazlett13-Aug-09 16:41

Matthew Hazlett

13-Aug-09 16:41

I must have totally missed the part the part where they talk about the 5 changes to find in the file.

In a retail compare program I get the following results:
1777 Line(s) match 4 Block(s) diff 12 : 7 Line(s) diff

Matthew Hazlett
Fighting the good fight for web usability.

Re: Some thoughts...

Pete Souza IV14-Aug-09 1:59

14-Aug-09 1:59

I use Beyond Compare in my day-to-day programming and get four block changes as well. This is because the non-visible text change and line-break removal occur at the same time. In fact, the solution I wrote detects four changes two (one of which has both). Another reason why I haven't submitted my solution.

This is a much more interesting problem than it appeared on the surface...

General News Suggestion Question Bug Answer Joke Praise Rant Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Go to top