
Duplicate Files Finder

eRRaTuM, 15 Dec 2008, CPOL
A utility that finds duplicate files on your hard drives using MD5 hashing.

[Figure: Search results]

[Figure: File deleted]

Introduction

Once a year, I take on the dreaded job of cleaning up the files I have created or downloaded on my drives. The last time I tried, it was so tedious that I decided to do it semi-automatically. I needed a free utility that could find duplicate files, but found none that met my needs, so I decided to write one.

Background

The CRC calculation method is available here. I use the MD5 hashing provided by the standard libraries. To report hashing progress, I added an event to the MD5 computing method: a separate thread reads the stream position while the MD5 computing method is reading the same stream.
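As an illustration of progress-reporting hashing, here is a minimal sketch. Note that it swaps in a different technique from the one described above: instead of a second thread polling the stream position, it feeds the file to MD5 in chunks with TransformBlock and reports the running byte count after each chunk. The class name ProgressMd5, the method HashWithProgress, and the 64 KB buffer size are assumptions made for this example, not names from the DuplicateFinder sources.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not part of the article's code.
public static class ProgressMd5
{
    // Hashes the stream in chunks and calls onBytesHashed with the
    // total number of bytes hashed so far after every chunk.
    public static string HashWithProgress(Stream input, Action<long> onBytesHashed)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] buffer = new byte[64 * 1024]; // assumed chunk size
            long total = 0;
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Passing null as the output buffer is allowed for hash algorithms.
                md5.TransformBlock(buffer, 0, read, null, 0);
                total += read;
                if (onBytesHashed != null)
                    onBytesHashed(total);
            }
            // Finish the hash with an empty final block.
            md5.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(md5.Hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

A caller that knows the stream length can turn the byte count into a percentage for a progress bar. Because the callback runs on the hashing thread, a UI would still need to marshal the update to its own thread.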

Using the code

The utility uses two main classes, DirectoryCrawler and Comparers, and their use is straightforward. Note that instead of comparing every file against every other file (list.Count × list.Count comparisons), DuplicateFinder first fills a Hashtable with <size, count> pairs. Once it is populated, every file whose size occurs only once (count = 1) is removed from the list, since it cannot have a duplicate. This is much faster:

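// Requires: using System.Collections; using System.Collections.Generic; using System.IO;
int len = filesToCompare.Length;
List<long> alIdx = new List<long>(); // sizes that occur exactly once
System.Collections.Hashtable HLengths = new System.Collections.Hashtable(); // size -> count
// Pass 1: count how many files share each size.
foreach (FileInfo fileInfo in filesToCompare)
{
    if (!HLengths.Contains(fileInfo.Length))
        HLengths.Add(fileInfo.Length, 1);
    else
        HLengths[fileInfo.Length] = (int)HLengths[fileInfo.Length] + 1;
}
// Pass 2: collect the sizes seen only once; those files cannot have duplicates.
foreach (DictionaryEntry hash in HLengths)
    if ((int)hash.Value == 1)
    {
        alIdx.Add((long)hash.Key);
        setText(stsMain, string.Format("Will remove File with size {0}", hash.Key));
    }
// Pass 3: keep only the files whose size is shared with at least one other file.
FileInfo[] fiZ = new FileInfo[len - alIdx.Count];
int j = 0;
for (int i = 0; i < len; i++)
{
    if (!alIdx.Contains(filesToCompare[i].Length))
        fiZ[j++] = filesToCompare[i];
}
return fiZ;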
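
The same size pre-filter can also be sketched with generic collections. This is only an illustrative variant, not the utility's actual code: the class name SizeFilter and the method KeepSharedSizes are hypothetical, and the size is obtained through a delegate so the logic can be tried without touching the disk. Looking each size up in a Dictionary also avoids the linear List.Contains search in the final loop.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the article's code.
public static class SizeFilter
{
    // Returns the items whose size is shared by at least one other item,
    // preserving the original order. Items with a unique size are dropped
    // because they cannot have a duplicate.
    public static List<T> KeepSharedSizes<T>(IEnumerable<T> items, Func<T, long> sizeOf)
    {
        List<T> all = new List<T>(items);
        Dictionary<long, int> counts = new Dictionary<long, int>();

        // Pass 1: count how many items have each size.
        foreach (T item in all)
        {
            long size = sizeOf(item);
            int seen;
            counts.TryGetValue(size, out seen); // seen stays 0 when the size is new
            counts[size] = seen + 1;
        }

        // Pass 2: keep only items whose size occurs more than once.
        List<T> candidates = new List<T>();
        foreach (T item in all)
        {
            if (counts[sizeOf(item)] > 1)
                candidates.Add(item);
        }
        return candidates;
    }
}
```

With an array of FileInfo objects, a call could look like SizeFilter.KeepSharedSizes(filesToCompare, f => f.Length).ToArray(). Only the surviving candidates then need to be hashed.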

Points of interest

  • (Done) Optimize file moving; the UI may be unresponsive while moving big files.
  • (Useless, my MD5 is better ^_^) Add an option to choose between CRC32 and MD5 hashing.
  • Maybe use an XML configuration file. For now, moving duplicate files to D:\DuplicateFiles (which is hard-coded, viva Microsoft!) and skipping that folder during scanning is sufficient for me.
  • Don't forget that your posts make POIs.
  • (Done) Code an event-enabled MD5 hashing class that reports hashing progress; imagine hashing a 10 GB file!
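
Regarding the hard-coded destination folder mentioned in the list above, the following is a minimal sketch of a scan that takes the folder to skip as a parameter instead. It is not the DirectoryCrawler class from the sources: the names Crawler, ListFiles, and Walk are hypothetical, and the case-insensitive path comparison assumes a Windows-style file system.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not the article's DirectoryCrawler.
public static class Crawler
{
    // Recursively lists the files under root, skipping skipFolder
    // (for example the duplicates destination) and everything beneath it.
    public static List<string> ListFiles(string root, string skipFolder)
    {
        List<string> files = new List<string>();
        string skip = skipFolder == null ? null : Normalize(skipFolder);
        Walk(root, skip, files);
        return files;
    }

    // Full path without a trailing separator, so two spellings of one folder compare equal.
    private static string Normalize(string path)
    {
        return Path.GetFullPath(path)
                   .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
    }

    private static void Walk(string dir, string skip, List<string> files)
    {
        // Assumption: case-insensitive comparison, as on Windows file systems.
        if (skip != null && string.Equals(Normalize(dir), skip, StringComparison.OrdinalIgnoreCase))
            return;
        try
        {
            files.AddRange(Directory.GetFiles(dir));
            foreach (string sub in Directory.GetDirectories(dir))
                Walk(sub, skip, files);
        }
        catch (UnauthorizedAccessException)
        {
            // Folders that cannot be read are left out of the scan.
        }
    }
}
```

The folder to skip could then come from a settings file or a text box rather than from a constant, for example Crawler.ListFiles(@"D:\", @"D:\DuplicateFiles").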

History

  • v0.2
    • Optimized duplicate retrieval (duplicate sizes and duplicate hashes).
    • Added Move to Recycle Bin.
    • Added a file size criterion.
    • The files-to-delete info is updated on every check/uncheck in the list view.
    • Added colors and fonts to the UI.
    • Debug-enabled sources (#if DEBUG synchronous #else threaded).
    • Used List<FileInfo> and List<string[]> instead of array lists.
    • MD5 hashing is used instead of CRC32 (supercat9).
    • Added Skip Source Folder option.
    • Added Drop SubFolder.
    • Some optimizations...
  • v0.1
    • First publication. Waiting for bug reports :)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

eRRaTuM
Chief Technology Officer
Morocco
During his studies, eRRaTuM discovered C/C++ and appreciated it.
When he met Oracle products in his job, he fell in love.
He uses C# .NET and MS SQL.

He created a "F.R.I.E.N.D.S"-like soap movie, blending all of the above.
He then went back to university.
After taking courses in artificial vision and imagery, he finished his studies with a successful license plate recognition project.


Comments and Discussions

 
ContextSwitchDeadlock was detected
ToothRobber, 22-Apr-10 4:45
The CLR has been unable to transition from COM context 0x1a7ff8 to COM context 0x1a7e88 for 60 seconds. The thread that owns the destination context/apartment is most likely either doing a non pumping wait or processing a very long running operation without pumping Windows messages. This situation generally has a negative performance impact and may even lead to the application becoming non responsive or memory usage accumulating continually over time. To avoid this problem, all single threaded apartment (STA) threads should use pumping wait primitives (such as CoWaitForMultipleHandles) and routinely pump messages during long running operations.

Using Visual Studio 2008

private void btnGo_Click(object sender, EventArgs e)
{
    lockControls(true);
    lstFiles.Items.Clear();
    alFiles.Clear();
    this.Cursor = Cursors.WaitCursor;
    SearchFiles();
}

The application does not handle large directories well. When searching a large directory, the main window is not refreshed and the progress bars do not update.

Robert

Last Updated 15 Dec 2008
Article Copyright 2008 by eRRaTuM