Alternative Article

Fault tolerant file comparison

Alexander Schwoch, 17 Oct 2012, CPOL
This is an alternative to "Fault Tolerance for Large Files on Cranky Hardware"

Introduction

Copying large files can produce errors, as described in the original article, especially on so-called cranky hardware. I ran into this with one particular USB drive, and it drove me crazy because I wanted to use the drive for backups of large containers. If you need to recover a large backup file or verify the integrity of a backup, the code discussed in this article may be of interest.

In this version I concentrate on file comparison only. The source code repository is available on GitHub.

Background

The original article provided two tools, one for error-tolerant copying and one for verifying a copy. It explains not only how to compare such files but also how to achieve certain UI effects with WPF and Windows. If you run Windows and have the .NET Framework installed, you may want to check out the original article.

If you are interested in a native C++ version (or you just love reading alternative versions), read on. This article focuses on C++ code that compares two files and reports whether they are believed to be identical.

The native C++ version has some advantages. I will not discuss performance benefits of native code, as performance is not the focus of the C++ implementation and I believe any difference is negligible here.

  • C++ code is more portable, so you can likely use this code wherever a C++ compiler is available. I suspect that a C++ compiler and common libraries like Boost are more widespread on *nix systems than Mono, and there may still be systems where Mono is not available at all.
  • Such a system utility might be an important tool when everything else goes wrong. Imagine you have just reinstalled your OS and have no way to download and install .NET or Mono, but you desperately need to verify or copy your backup with this tool.
  • A decent UI is most definitely something I prefer, but in some situations you might want (or even be forced) to use a shell utility.
  • A static build of this program runs with no dependencies at all.

The code to compare file contents 

This is an excerpt of my version of the file comparison code, the part that does the actual work.

c_buffer_bytes is set to 1024 * 1024 * 10 (10 MB). It should be configurable, and I plan to make it so in an upcoming version.

m_max_retries is set to 3 by default and can be set via the command line.
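
As an illustration of how both values could be made configurable, here is a minimal sketch. The struct CompareOptions, the function parse_options and the flag names --buffer-mb and --max-retries are invented for this example and are not taken from the actual program; only the defaults (10 MB, 3 retries) come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical options holder; the defaults mirror the values named in the article.
struct CompareOptions {
    std::size_t buffer_bytes = 10u * 1024u * 1024u;  // 10 MB
    int max_retries = 3;
};

// Hypothetical parser. The flag names are assumptions made for this sketch.
// Arguments that are not recognised (for example the two file paths) are skipped.
CompareOptions parse_options(int argc, const char* const argv[])
{
    assert(argc >= 1);
    CompareOptions opts;
    for (int i = 1; i + 1 < argc; ++i) {  // i + 1 < argc: a value must follow the flag
        const std::string flag = argv[i];
        if (flag == "--buffer-mb") {
            // std::stoul throws if the value is not a number
            opts.buffer_bytes =
                static_cast<std::size_t>(std::stoul(argv[++i])) * 1024u * 1024u;
        } else if (flag == "--max-retries") {
            opts.max_retries = std::stoi(argv[++i]);
        }
    }
    return opts;
}
```

Keeping both settings in one small struct means the comparison code only needs a single extra parameter when further options are added.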

bool CSequenceComparer::Compare(const std::string& first_file, const std::string& second_file)
{
    // [...] checking that files exist and are regular files (not directories)
    // [...] open files

    char* p_buffer;
    char* p_compare_buffer;
    // [...] allocating c_buffer_bytes of memory for each buffer
    
    short retries = 0;
    while ( ! input_stream.eof() )
    {
        streamoff input_offset = input_stream.tellg();
        input_stream.read(p_buffer, c_buffer_bytes);
        const streamsize num_of_input_bytes = input_stream.gcount();

        streamoff compare_offset = compare_stream.tellg();
        compare_stream.read(p_compare_buffer, c_buffer_bytes);
        const streamsize num_of_compare_bytes = compare_stream.gcount();

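        // A chunk counts as equal only if both reads returned the same
        // number of bytes and those bytes match.
        bool equal = false;
        if ( num_of_input_bytes == num_of_compare_bytes )
        {
            if ( memcmp(p_buffer, p_compare_buffer, (size_t)num_of_input_bytes) == 0 )
            {
                equal = true;
            }
        }

        // A successful comparison resets the counter; a failed one counts
        // as one more retry of the same chunk.
        retries = equal ? 0 : (retries + 1);
        if (retries > m_max_retries)
        {
            break; // give up: the chunk never compared equal
        }

        if (retries > 0)
        {
            // [...] seek to previously stored offsets to retry read
        }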
    }

    delete [] p_buffer;
    delete [] p_compare_buffer;
    
    return retries == 0;
}
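
For readers who want something they can compile directly, the following is a self-contained sketch of the same idea: read both inputs chunk by chunk and re-read a chunk when the comparison fails. It is not the repository code. The names compare_streams and compare_files are made up for this example, the elided parts are filled in with my own assumptions, and std::vector replaces the manually allocated buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <istream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: compares two seekable streams chunk by chunk and
// re-reads a chunk up to max_retries times when it does not compare equal.
bool compare_streams(std::istream& first, std::istream& second,
                     std::size_t chunk_bytes, int max_retries)
{
    assert(chunk_bytes > 0);
    std::vector<char> buf_first(chunk_bytes);   // released automatically on return
    std::vector<char> buf_second(chunk_bytes);
    const std::streamsize want = static_cast<std::streamsize>(chunk_bytes);
    int retries = 0;

    for (;;) {
        // Remember where this chunk starts so that it can be read again.
        const std::istream::pos_type pos_first = first.tellg();
        const std::istream::pos_type pos_second = second.tellg();

        first.read(buf_first.data(), want);
        second.read(buf_second.data(), want);
        const std::streamsize n_first = first.gcount();
        const std::streamsize n_second = second.gcount();

        const bool equal = !first.bad() && !second.bad()
            && n_first == n_second
            && std::memcmp(buf_first.data(), buf_second.data(),
                           static_cast<std::size_t>(n_first)) == 0;

        if (equal) {
            retries = 0;
            if (first.eof() && second.eof()) return true;  // both inputs exhausted
            continue;                                      // next chunk
        }
        if (++retries > max_retries) return false;

        // Reset the error state before seeking back, then retry the chunk.
        first.clear();
        second.clear();
        first.seekg(pos_first);
        second.seekg(pos_second);
    }
}

// Hypothetical convenience wrapper that opens both files in binary mode.
bool compare_files(const std::string& first_path, const std::string& second_path,
                   std::size_t chunk_bytes, int max_retries)
{
    std::ifstream first(first_path, std::ios::binary);
    std::ifstream second(second_path, std::ios::binary);
    if (!first || !second) throw std::runtime_error("cannot open input file");
    return compare_streams(first, second, chunk_bytes, max_retries);
}
```

A call such as compare_files("backup.img", "copy.img", 10 * 1024 * 1024, 3) would correspond to the defaults mentioned above; the file names are placeholders.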

Points of Interest  

If you are paranoid, you will notice that the code above regards a chunk (and ultimately the file) as equal as soon as one comparison succeeds. You might therefore want to add code that double-checks a successful comparison, because the smaller the compare buffer, the higher the probability of a false positive.
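
One possible shape for such a double check is sketched below. It accepts a chunk only if several consecutive re-reads of both sources agree. The names Chunk, ChunkReader and confirmed_equal are hypothetical, and the readers are plain callables so that the idea stays independent of how the bytes are actually fetched.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Chunk = std::vector<char>;
// Hypothetical reader type: every call re-reads the same chunk from its source.
using ChunkReader = std::function<Chunk()>;

// Returns true only if required_passes consecutive re-reads of both sources
// all compare equal; the first disagreement rejects the chunk.
bool confirmed_equal(const ChunkReader& read_first,
                     const ChunkReader& read_second,
                     int required_passes)
{
    assert(required_passes >= 1);
    for (int pass = 0; pass < required_passes; ++pass) {
        if (read_first() != read_second()) {
            return false;
        }
    }
    return true;
}
```

Note that a re-read may be served from the operating system's file cache rather than from the device, so a check like this is only as strong as the read path underneath it.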

Since I wanted to provide a portable version, I used the Boost.Filesystem library for the sake of readable code. In my first version, created under Windows, I used some Windows-only functions to check whether the files were available. This is the part that should work on both Linux and Windows (it is elided in the excerpt above):

path filePath1(first_file);
path filePath2(second_file);
if ( !exists(filePath1) || !is_regular_file(filePath1) 
    || !exists(filePath2) || !is_regular_file(filePath2) )
{
    throw std::invalid_argument("Please provide regular file paths as arguments");
}

I also discovered that ifstream::seekg() in MSVC resets some status bits. On Linux I have to call ifstream::clear() explicitly before continuing.

History 

Submitted to CodeProject 17 October, 2012.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Alexander Schwoch
Software Developer
Germany

Last Updated 17 Oct 2012
Article Copyright 2012 by Alexander Schwoch