Click here to Skip to main content
Licence CPOL
First Posted 2 Aug 2008
Views 19,722
Downloads 707
Bookmarked 33 times

Duplicate File Finder

By | 2 Aug 2008 | Article
A tool that finds duplicate files in your system.

Introduction

I wrote this tool to find duplicate files in my system, and I wanted to share it with you. As the main purpose of this article is to present the tool, there is relatively little to write about. However, I'll try to describe the basic mechanisms of the tool's implementation.

How to Use

Actually, it's very easy to use the tool. You can see its interface in action here:

duplicatefinder_src

In addition to having a Start button, the tool also provides Stop and Pause buttons as it may be a lengthy operation to find duplicates in a large number of files. Duplicated files found during scanning are immediately inserted into the tree control (CTreeCtrl), as you can see. It's a two level hierarchy, that holds duplicated file names in its second level, and general information in its first level. You also can log the resulting hierarchy in a textual form (the Log... button will be enabled when the search process ends up). In general, the tool works very fast; for example, it takes ~40 minutes to scan a directory of ~70 GB size. But, isn't that size too much for a single shot scan?

How it Works

Here, I will describe the underlying strategy of the implementation. The user interface is implemented via MFC (VC7.1 was used, sorry I don't have VC9). The scanning process is done in a separate worker thread so that the user can stop/pause the execution at any point. This thread starts when the user starts a new scan, and finishes when the user stops it or no more files remain to scan. And, as you guess, the main engine is concentrated in the worker thread itself. The steps it performs may be itemized into the following outline:

  • Gather necessary information about the directory size
  • Reserve enough amount of virtual memory to keep scanned file names
  • Start scanning the directory
  • Free the resources

Here is the description of the points above, in a little bit more details:

  • Gather necessary information about the directory size: This process is accomplished with the CFileFind MFC utility class by which the procedure walks through each file to compute the overall file count in a directory. I was struggling to find a Win32 equivalent, but in vein. And, if you happen to know a Win32 API returning the number of files residing in a directory, please let me know.
  • Reserve enough amount of virtual memory to keep scanned file names: The implementation uses VirtualAlloc and VirtualFree to allocate/free virtual memory chunks. As the directory to scan may be very large, it's proper to store the scanned file names in virtual memory, not in RAM directly. So, this virtual memory is needed to store file names that have already been scanned. The next file that is being scanned should be tested against duplicates in the already scanned file table. To optimize this process (i.e., not to traverse all the entries in the table), it keeps the table sorted (by file content, not their names), and uses binary search to find a duplicate (and if not found, it finds the position where to insert the file name in the table).
  • Start scanning the directory: The scanning process picks the next file and tests it against being a duplicate. To pick the next file, again, the CFileFind MFC class is used. So, how do we compare two files? The tool performs a bitwise comparison of two files. That is, it opens them, and compares their content in lexicographical order (as if two strings of bytes). To optimize this process, the files are opened as memory mapped filesm and their contents are compared by the C-runtime routine memcmp. To automate this process, I wrote a class CMMFile which handles all this procedure. It also ensures that if a file cannot be mapped (due to insufficient memory), it gets compared part by part (these parts will be small enough to be successfully mapped).
  • Free the resources: After the scanning process finishes, it's necessary to free the resources, e.g., the allocated virtual memory.

Acknowledgment

There are two articles that were used by this tool. Thanks to Dana Holt for his article about browsing folders. And, the second one was written by me, which provides a simple class encapsulating the VirtualAlloc/VirtualFree Win32 API functions. And certainly, many thanks to the readers and users of this article!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Arman S.

Software Developer
13
United States United States

Member



Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board. (secure sign-in)
 
Search this forum  
 FAQ
    Noise  Layout  Per page   
  Refresh
AnswerCFileFind Win32 equivalent PinmemberAdrian Cole15:41 8 Feb '09  
GeneralMethodology PinmembereRRaTuM5:03 31 Oct '08  
GeneralDuplicate File names Pinmemberpyrodood5:20 9 Sep '08  
I've tried a couple of these types of tools, and though I can understand their purpose, it does not fulfill my needs.
 
I need to check for files with the same name in different sub-directories only 1-layer deep even if the files contain different information.
 
I was wondering if there were a way to add this option into the program.
 
Thanks.
GeneralThanks and updates Pinmemberedger18:13 25 Aug '08  
GeneralStandard Technique PinPopularmemberArash Partow20:00 3 Aug '08  
Generalcompare the size of the files before the content [modified] PinmemberMember 667613:28 3 Aug '08  
GeneralRe: compare the size of the files before the content PinmemberArman Z. Sahakyan2:17 4 Aug '08  
GeneralEarly optimization PinmemberK-ballo10:33 2 Aug '08  

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Mobile
Web03 | 2.5.120604.1 | Last Updated 2 Aug 2008
Article Copyright 2008 by Arman S.
Everything else Copyright © CodeProject, 1999-2012
Terms of Use
Layout: fixed | fluid