I needed a simple, reliable utility that could find duplicate files in a directory. The main requirement was that detection should not be based on file names, because a directory can hold the same file under two different names. If you copy and paste a file, the copy gets a new name, so a name-based comparison will not detect the duplicate.
Secondly, I could not use the file time, because two different files can have the same timestamp. Similarly, two files can be the same size but completely different.
I settled on computing an MD5 hash for each file. Identical files always produce the same hash, and different files produce different hashes (most of the time). The utility scans the directory, computes the MD5 hash of each file, and saves it in a SQLite database. I then query the database to extract the list of files that share a hash value.
Because MD5 is computed over the bytes of the file, two different files will generate two different hash values even if they have the same size and the same file time, barring a hash collision, which is extremely unlikely for ordinary files.
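As an illustration of the hashing step, here is a minimal sketch that computes the MD5 hash of a file as a hex string. The class and method names (FileHasher, ComputeMd5) are hypothetical and are not taken from the downloadable source; only the standard .NET MD5 API is assumed.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not the article's actual class.
public static class FileHasher
{
    // Returns the MD5 hash of the file's content as an uppercase hex string.
    public static string ComputeMd5(string path)
    {
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            // ComputeHash reads the stream to the end, so large files
            // are hashed without loading them fully into memory.
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```

Two files with identical content return the same string regardless of their names or timestamps, which is exactly the property the utility relies on.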
The utility should also provide an option to remove duplicate files based on the user's selection.
The utility is a simple WPF application that uses a SQLite database to store the file information and then work out which files are duplicates.
A directory can contain a lot of files, and calculating the MD5 hash takes a while, especially for bigger files, yet the UI still needs to stay responsive. If the application performs the directory scanning and MD5 calculation on the UI thread, it might appear to freeze or hang even though it is busy doing work.
To resolve this, the simple solution is to use a BackgroundWorker to offload the directory scanning and MD5 calculation.
If the user stops the scan partway through, the application finishes calculating the MD5 hash for the current file and then terminates the scan. It also populates the results with whatever duplicate files it has found so far.
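The stop behaviour described above can be sketched as follows. This is an illustrative outline, not the application's actual code: ScanWorkerSketch and the processFile callback are hypothetical names, and the per-file work (hashing and storing) is passed in by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

// Hypothetical sketch; names are not taken from the article's source.
public static class ScanWorkerSketch
{
    // Builds a worker that processes files one by one and checks for
    // cancellation only between files, so the current file always finishes.
    public static BackgroundWorker Create(IList<string> files, Action<string> processFile)
    {
        var worker = new BackgroundWorker
        {
            WorkerSupportsCancellation = true,
            WorkerReportsProgress = true
        };

        worker.DoWork += (sender, e) =>
        {
            var processed = new List<string>();
            for (int i = 0; i < files.Count; i++)
            {
                if (worker.CancellationPending)
                {
                    // e.Cancel is deliberately left false: reading Result in
                    // RunWorkerCompleted throws when Cancelled is true, and
                    // the partial results are still wanted.
                    break;
                }
                processFile(files[i]);
                processed.Add(files[i]);
                worker.ReportProgress((i + 1) * 100 / files.Count, files[i]);
            }
            e.Result = processed; // Full or partial list of processed files.
        };
        return worker;
    }
}
```

The UI would subscribe to ProgressChanged and RunWorkerCompleted, call RunWorkerAsync to start the scan, and call CancelAsync from the Stop button.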
Once I have generated the MD5 hash for a file, I need to store the file attributes along with the hash value in some data structure. The simplest data structure that comes to mind is a list. .NET provides one, and I could run LINQ queries against it to get the information I needed.
For a small number of files, the .NET list was very efficient. As the number of files increased, the list started consuming a lot of memory. That was not a deal-breaker, but I thought I would try something different.
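For reference, the list-based approach could look roughly like this. It is a sketch under assumptions: the FileEntry type and its properties are hypothetical stand-ins for whatever the utility stored per file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type for illustration only.
public class FileEntry
{
    public string FullName { get; set; }
    public string Hash { get; set; }
    public long Size { get; set; }
}

public static class DuplicateFinder
{
    // Groups the entries by hash and keeps only groups with more than one file.
    public static List<List<FileEntry>> FindDuplicates(IEnumerable<FileEntry> files)
    {
        return files
            .GroupBy(f => f.Hash)
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }
}
```

Each inner list holds the files that share one hash value, which is the same information the SQL query later in the article returns.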
One option was to use SQLite to store this information and query it. Some might say it is overkill for what this utility does, and I might agree with them.
The main motivation for using SQLite was to get my hands dirty with a local SQL database and learn how it works with the .NET Framework. SQLite was a perfect fit: it is lightweight and has no installation hassles.
I first decided to use a file-based database, but quickly realized that continuously writing to disk for each file's information caused a lot of overhead and a big performance hit. I read more about SQLite and found that I could create an in-memory database, similar to the .NET list implementation. That turned out to be really fast, with no disk reads or writes, and it boosted the performance.
Why not use Directory.GetFiles to get the list of files in a directory and its sub-directories?
.NET provides a simple API, Directory.GetFiles, which can return all the files in a directory and, recursively, in its sub-directories. I chose not to use this API for three main reasons:
- If there is a permission issue with any file or directory, the API throws an exception, and there is no way to get the partial list of files scanned before the exception was thrown. I wanted to swallow the exception for items the application has no right to access and still continue reading the other files in the directory. A typical example would be scanning the c:\windows directory.
- The API returns only after it has retrieved the list of all the files in a directory and its sub-directories. This is OK for a directory with a small number of files, but the typical scenario is a directory tree with a few thousand files. I needed a way to get files as the scan proceeds: list the files in a directory, then go into each sub-directory and list its files. To solve this problem, I wrote my own breadth-first/depth-first file and directory enumeration class.
- Since this is a UI application, it has to be responsive. That means it should notify the user about the scan status and offer the option to cancel midway. While this API is scanning a directory, I could not find a way to cancel the operation.
The DirFileEnumeration class addresses the above problems. It raises events when a file or a directory is found, which the calling class can subscribe to in order to pass information on to the user. It also provides a mechanism to cancel the enumeration if the user requests it.
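To make the idea concrete, here is a minimal breadth-first sketch with the same three properties: it skips inaccessible directories, reports files as they are found, and can be cancelled. It is not the DirFileEnumeration class from the download; the DirectoryWalker name, its events, and its methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for illustration; not the article's DirFileEnumeration.
public class DirectoryWalker
{
    public event Action<string> FileFound;
    public event Action<string> DirectoryFound;

    private volatile bool _cancelRequested;

    // Can be called from another thread (e.g. the UI) to stop the walk.
    public void Cancel() { _cancelRequested = true; }

    // Breadth-first walk: a queue holds the directories still to be visited.
    public void Walk(string root)
    {
        var pending = new Queue<string>();
        pending.Enqueue(root);

        while (pending.Count > 0 && !_cancelRequested)
        {
            string dir = pending.Dequeue();
            DirectoryFound?.Invoke(dir);
            try
            {
                foreach (string file in Directory.EnumerateFiles(dir))
                {
                    if (_cancelRequested) return;
                    FileFound?.Invoke(file); // Reported immediately, not at the end.
                }
                foreach (string sub in Directory.EnumerateDirectories(dir))
                {
                    pending.Enqueue(sub);
                }
            }
            catch (UnauthorizedAccessException)
            {
                // No permission: skip the rest of this directory and carry on.
            }
            catch (IOException)
            {
                // Directory vanished or path problem: skip it and carry on.
            }
        }
    }
}
```

Swapping the Queue for a Stack would turn the same loop into a depth-first walk.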
SQLite Database Queries
The SQLite database is created in memory using this connection string:
string ConnectionString = "Data Source=:memory:;Version=3;New=True;";
The table is created using the query below.
CREATE TABLE FileDB(Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, Name TEXT, FullName TEXT, Hash TEXT, Size INTEGER, FileExt TEXT, Directory TEXT, LastModTime INTEGER);
Each file, together with its calculated MD5 hash, is stored in the database using the query below.
INSERT INTO FileDB (Name, FullName, Hash, Size, FileExt, Directory, LastModTime)
VALUES (@Name, @FullName, @Hash, @Size, @FileExt, @Directory, @LastModTime);
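A hedged sketch of how the INSERT above could be executed with named parameters follows. It is written against the provider-neutral System.Data interfaces so that it does not assume a particular SQLite package; with a SQLite ADO.NET provider you would pass in its connection object, opened with the connection string shown earlier. The FileDbWriter class is hypothetical, and storing LastModTime as UTC ticks is an assumption, since the article does not say how the INTEGER value is encoded.

```csharp
using System;
using System.Data;
using System.IO;

// Hypothetical helper for illustration; not the article's actual class.
public static class FileDbWriter
{
    public const string InsertSql =
        "INSERT INTO FileDB (Name, FullName, Hash, Size, FileExt, Directory, LastModTime) " +
        "VALUES (@Name, @FullName, @Hash, @Size, @FileExt, @Directory, @LastModTime);";

    // Inserts one row for the given file; returns the number of rows affected.
    public static int Insert(IDbConnection connection, FileInfo file, string hash)
    {
        using (IDbCommand cmd = connection.CreateCommand())
        {
            cmd.CommandText = InsertSql;
            AddParameter(cmd, "@Name", file.Name);
            AddParameter(cmd, "@FullName", file.FullName);
            AddParameter(cmd, "@Hash", hash);
            AddParameter(cmd, "@Size", file.Length);
            AddParameter(cmd, "@FileExt", file.Extension);
            AddParameter(cmd, "@Directory", file.DirectoryName);
            // Assumption: the timestamp is stored as UTC ticks.
            AddParameter(cmd, "@LastModTime", file.LastWriteTimeUtc.Ticks);
            return cmd.ExecuteNonQuery();
        }
    }

    private static void AddParameter(IDbCommand cmd, string name, object value)
    {
        IDbDataParameter p = cmd.CreateParameter();
        p.ParameterName = name;
        p.Value = value;
        cmd.Parameters.Add(p);
    }
}
```

Using parameters rather than string concatenation keeps file names containing quotes or other special characters from breaking the statement.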
Once the database is populated, we need the list of all files that share an MD5 hash value. The query below does the trick, and I use a data grid to display the results.
SELECT s.id, s.Name, s.Size, s.Hash, s.Fullname
FROM FileDB s INNER JOIN
(SELECT Hash FROM FileDB GROUP BY Hash HAVING COUNT(*) > 1) q ON s.Hash = q.Hash
ORDER BY s.Hash DESC;
The utility provides a simple grid view that shows the files with the same hash. The user can Ctrl+click to select multiple items and delete them in one go.
The ideal control for this would probably be a TreeView with checkboxes. I am keeping that as an enhancement, because I wanted to get the first cut out with all the functionality.
This utility does what it claims, i.e., it finds duplicate files using the MD5 signature. That said, there are a few things that can be improved and enhanced:
- The columns in the DataGrid view are obtained directly from the public properties of the class. This can be changed, and a DataTemplate can be used to add or remove fields.
- Instead of a DataGrid, I feel a TreeView control would have been a better choice. An extended TreeView with checkboxes, with one node per hash value, would have presented the information in a more organised manner. However, in the interest of time, I used the simplest option available to me.