C#
string[] arrfiles = Directory.GetFiles(@"C:\", "*.txt", SearchOption.AllDirectories); // now I have an array of strings with all the file paths

// Error here: arrfiles is a string array with all the file paths. I want to add to a list only the files that share the same name, but something is wrong in these two loops.

 for (int i = 0; i < arrfiles.Length; i++) {
       for (int j = 1; j < arrfiles.Length; j++) {
       ...
       }
}
Posted
Updated 11-Feb-12 8:12am
v3
Comments
shek124 10-Feb-12 5:53am    
Can you post the description of the error?
CRDave1988 10-Feb-12 6:01am    
Are you getting a syntax error or a logic problem?
BobJanova 10-Feb-12 6:20am    
Do you want to:
i) *remove* duplicates, i.e. return a list of all the items but once only
ii) show *only non-duplicated items*, i.e. (a, a, b, c) -> (b, c)
iii) show *only duplicated items*, i.e. (a, a, b, c) -> (a)

Your question says one thing (i) and your code something completely different (iii).
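
For illustration, a minimal LINQ sketch of the three cases, grouping on the file name (the variable names here are made up):
var groups = arrfiles.GroupBy(p => Path.GetFileName(p));
var once = groups.Select(g => g.First()).ToList();                                       // (i) keep each name once
var nonDuplicated = groups.Where(g => g.Count() == 1).Select(g => g.First()).ToList();   // (ii) only names that occur once
var duplicated = groups.Where(g => g.Count() > 1).SelectMany(g => g).ToList();           // (iii) every path whose name occurs more than once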
zyck 11-Feb-12 19:33pm    
nice question (n_n)

Your problem is that your loops don't work the way you think they do.
Your inner loop always starts at one, so it always checks against the same list of files.
What happens when i == 1? The first item you check it against is j == 1, which is the file itself. Naturally it matches, so every file except the first will find at least one "duplicate".

Why not use a better structure?
C#
string[] files = Directory.GetFiles(@"C:/example", "*.txt", SearchOption.AllDirectories);
// key = file name, value = first full path seen with that name
Dictionary<string, string> unique = new Dictionary<string, string>();
foreach (string path in files)
    {
    string file = Path.GetFileName(path);
    if (!unique.ContainsKey(file))
        {
        unique.Add(file, path);
        }
    }
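
If it is actually the duplicates you are after (as your comments suggest), the same lookup can collect them as it goes - a sketch along the same lines (the duplicatePaths list is my addition, not part of the original):
C#
Dictionary<string, string> unique = new Dictionary<string, string>();
List<string> duplicatePaths = new List<string>();
foreach (string path in files)
    {
    string file = Path.GetFileName(path);
    if (!unique.ContainsKey(file))
        {
        unique.Add(file, path);
        }
    else
        {
        // this name has been seen before, so the current path duplicates unique[file]
        duplicatePaths.Add(path);
        }
    }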




Updated code from OP:
C#
 for (int i = 0; i < arrfiles.Length; i++) {
    for (int j = 0; j < arrfiles.Length - 2; j++) {
        fi = new FileInfo(arrfiles[i]);  // Console.WriteLine("arrfiles[i] " + fi.Length.ToString());
        fi2 = new FileInfo(arrfiles[j]); // Console.WriteLine("arrfiles[j] " + fi2.Length.ToString());
        Console.ReadLine();
        if (Path.GetFileName(arrfiles[i]) == Path.GetFileName(arrfiles[j]) && i != j)
        {
            arrfilenames.Add(arrfiles[j]); // here this list must include all duplicates
        }
    }
}


The first thing that springs to mind is that it doesn't find all duplicates:
Inputs:
   D:\Temp\ContactsShort.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt
   D:\Temp\Tip - enum in combobox.txt
   D:\Temp\New folder\ListSQL Server databases.txt
   D:\Temp\New folder\myFile.txt
   D:\Temp\New folder\myFileCopied.txt
   D:\Temp\New folder\MyLargeTextFile.txt
Outputs:
   D:\Temp\New folder\ListSQL Server databases.txt
   D:\Temp\New folder\myFile.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt
It has missed out the "New Folder" version of two of the files. There is also the problem that if you give it three files of the same name it reports four...

Secondly, it's not very efficient - you keep making the same comparisons over and over again. Why start j at zero? You have already checked all the combinations before the current value of i anyway.
So, start the inner loop later, and throw away the test for i and j being the same. Then report both files when you find a match. (I have thrown out the FileInfo stuff, just for clarity)

C#
for (int i = 0; i < files.Length; i++)
    {
    for (int j = i + 1; j < files.Length; j++)
        {
        if (Path.GetFileName(files[i]) == Path.GetFileName(files[j]))
            {
            arrfilenames.Add(files[i]);
            arrfilenames.Add(files[j]);
            }
        }
    }
You can also end the outer loop one iteration earlier without losing data if you want to.
As it is, this assumes a single duplication and will always report pairs of files together (so it has the same second problem as yours). It also has some inefficiency built in. The solution is to remove entries once you have reported them:
C#
for (int i = 0; i < files.Length - 1; i++)
    {
    bool reportNeeded = false;
    string checkFile = Path.GetFileName(files[i]);
    for (int j = i + 1; j < files.Length; j++)
        {
        if (files[j] != null && Path.GetFileName(files[j]) == checkFile)
            {
            reportNeeded = true;
            arrfilenames.Add(files[j]);
            files[j] = null;
            }
        }
    if (reportNeeded)
        {
        arrfilenames.Add(files[i]);
        }
    }
(I renamed the folders to make it a bit more readable)
Inputs:
   D:\Temp\ContactsShort.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt
   D:\Temp\Tip - enum in combobox.txt
   D:\Temp\F 1\ListSQL Server databases.txt
   D:\Temp\F 1\myFile.txt
   D:\Temp\F 1\myFileCopied.txt
   D:\Temp\F 1\MyLargeTextFile.txt
   D:\Temp\F 2\ListSQL Server databases.txt
   D:\Temp\F 2\myFile.txt
   D:\Temp\F 2\myFileCopied.txt
   D:\Temp\F 2\MyLargeTextFile.txt
Outputs:
   D:\Temp\F 1\ListSQL Server databases.txt
   D:\Temp\F 2\ListSQL Server databases.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\F 1\myFile.txt
   D:\Temp\F 2\myFile.txt
   D:\Temp\myFile.txt
   D:\Temp\F 1\myFileCopied.txt
   D:\Temp\F 2\myFileCopied.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\F 1\MyLargeTextFile.txt
   D:\Temp\F 2\MyLargeTextFile.txt
   D:\Temp\MyLargeTextFile.txt
 
 
v2
Comments
BobJanova 10-Feb-12 6:18am    
This finds the unique items, not the duplicated items, which is what the original code is doing.

For unique items, can't you just use the Distinct LINQ extension method? i.e. List<string> unique = files.Distinct().ToList();
OriginalGriff 10-Feb-12 6:20am    
Nope - because the original strings contain the full path, not just the file name.
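(For completeness: projecting to the file name first would make Distinct usable, at the cost of losing the full paths - roughly List<string> uniqueNames = files.Select(Path.GetFileName).Distinct().ToList();)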
OriginalGriff 10-Feb-12 6:21am    
BTW: I was working from the description of the problem:
"The objective is to remove duplicates"
csharpnew 10-Feb-12 8:39am    
Hello, I think I have improved my code. Can you check it, please, OriginalGriff? Should I post it here?
OriginalGriff 10-Feb-12 9:42am    
Sure!
With the following code you have everything in hand to cover duplicates, uniques, removal, etc.

C#
private static Dictionary<string, List<string>> GetFiles(string rootPath, string filePattern)
{
    var rawdata = Directory.GetFiles(rootPath, filePattern, SearchOption.AllDirectories);
    // key = lower-cased file name, value = all full paths carrying that name
    Dictionary<string, List<string>> files = new Dictionary<string, List<string>>();
    foreach (var path in rawdata)
    {
        string name = Path.GetFileName(path).ToLower();
        List<string> entries;
        if (files.TryGetValue(name, out entries)) entries.Add(path);
        else files.Add(name, new List<string>() { path });
    }
    return files;
}


E.g. you could run this main program to test it:

C#
static void Main(string[] args)
{
    var files = GetFiles(@"c:\examples", "*.txt");
    var duplicates = from e in files.Values where e.Count > 1 select e;
    var uniques = from e in files.Values where e.Count == 1 select e;

    Func<int, List<string>, int> print = (a, e) =>
      { Console.WriteLine("{0}.\t{1}", ++a, string.Join("\n\t", e)); return a; };

    int pos = 0;
    Console.WriteLine("Duplicates:");
    pos = duplicates.Aggregate(pos, print);
    Console.WriteLine("Unique entries:");
    pos = uniques.Aggregate(pos, print);
}


Cheers

Andi
 
 
v2
Comments
BillWoodruff 10-Feb-12 19:01pm    
+5 thank you for another "master level" lesson in Linq ... which I will study very closely.
Andreas Gieriet 10-Feb-12 19:45pm    
Thanks for your 5!
BTW: I came across some discussion about the performance of LINQ versus for/foreach (finding the last element of 100,000,000 items by evaluating some relational operation on each element). The figures I tried out myself were roughly:
- Debug: some difference, but not relevant as a performance measure
- Release: "for-break" 2x faster than "foreach-break" and 3x faster than "where-select".

I would nonetheless prefer "foreach" over "for" unless it is really performance-critical: index handling is rather error-prone IMHO. And why LINQ? Well, I like expressive code: once you get used to reading from-in-where-select queries, I find them far superior regarding error possibilities compared with any explicit loop-if-break-if-not-found chain.

And, as you see, Aggregate is my friend ;-) But I admit, one can overuse it... maybe I already crossed that line above. Most people would probably prefer a foreach over the Aggregate. Performance is no issue here anyway, since I guess the Console.WriteLine is slower by at least an order of magnitude than the loop increment.

Cheers

Andi
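
For reference, the three variants being compared look roughly like this (a sketch with made-up data; it assumes System.Linq is imported, and actual timings depend on the setup):
int[] data = Enumerable.Range(0, 100000000).ToArray();
int target = data.Length - 1; // the "last element" case

// "for-break"
int foundFor = -1;
for (int i = 0; i < data.Length; i++)
{
    if (data[i] == target) { foundFor = data[i]; break; }
}

// "foreach-break"
int foundForeach = -1;
foreach (int x in data)
{
    if (x == target) { foundForeach = x; break; }
}

// "where-select"
int foundLinq = (from x in data where x == target select x).First();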
I hope the code is self-explanatory and that it solves the problem.
C#
string[] searchLocations = new string[] { @"C:\" };
string searchFilter = "*.txt";

// Since the number of files can range from a few hundred to hundreds of thousands,
// a sorted list is used: it offers fast key lookups because the keys are kept sorted
SortedList<string, List<FileInfo>> searchedList = new SortedList<string, List<FileInfo>>();
// Since there will be far fewer duplicates, a plain list is used for them;
// use a SortedSet<string> if required
List<string> duplicatesList = new List<string>();

foreach (string location in searchLocations)
{
  if (Directory.Exists(location))
  {
    try
    {
      string[] matchedFiles = Directory.GetFiles(location, searchFilter, SearchOption.AllDirectories);
      if ((matchedFiles != null) && (matchedFiles.Length > 0))
      {
        foreach (string currentFile in matchedFiles)
        {
          if (File.Exists(currentFile))
          {
            // Core Logic: Group all the files based on the file name (without the extension) while keeping track of duplicates
            // Solution: Check whether a key already exists with the current file name. 
            //           If key exists, possible duplicate. Therefore mark the file name as a duplicate.
            //           If not, create a List<FileInfo> under the file name.
            //           Then add the FileInfo object of the current file to a list under the file name regardless whether it's a duplicate.
            //           ----------------------------------------------
            //           When the loop is complete, you got a list of duplicate file names and a all files grouped by file names.
            //           Deal with it as you please :)

            FileInfo currentFileInfo = new FileInfo(currentFile);
            if (searchedList.ContainsKey(currentFileInfo.Name))
            {
              if (!duplicatesList.Contains(currentFileInfo.Name))
                duplicatesList.Add(currentFileInfo.Name);
            }
            else
              searchedList.Add(currentFileInfo.Name, new List<FileInfo>());

            searchedList[currentFileInfo.Name].Add(currentFileInfo);
          }
        }

        // Releasing memory taken by the array, we no longer need this
        matchedFiles = null;
      }
    }
    catch
    {
      // Directory.GetFiles() method can throw loads of exceptions
      // Do the necessary handling as fit :)
    }
  }
}

foreach (string fileName in duplicatesList)
{
  // All files sharing this (duplicated) file name
  List<FileInfo> duplicates = searchedList[fileName];
}
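
To make the final loop do something visible, here is a small usage sketch (the output format is my own choice, not part of the original):
C#
foreach (string fileName in duplicatesList)
{
  Console.WriteLine("Duplicate file name: " + fileName);
  foreach (FileInfo duplicate in searchedList[fileName])
    Console.WriteLine("\t" + duplicate.FullName);
}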
 
 
v3
if ((i != j) && (Path.GetFileName(arrfiles[i]) == Path.GetFileName(arrfiles[j])))


You need to exclude the case where it matches because you access the same element (thus, it's not a duplicate).
 
 
Comments
lukeer 10-Feb-12 6:39am    
This way, every duplicate is listed twice because "A" == "B" as well as "B" == "A".
Better start the inner loop at i + 1. That way you save half the comparisons. And the double duplicates.
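A minimal sketch of that variant, reusing the variable names from the question:
for (int i = 0; i < arrfiles.Length - 1; i++)
    for (int j = i + 1; j < arrfiles.Length; j++)
        if (Path.GetFileName(arrfiles[i]) == Path.GetFileName(arrfiles[j]))
            arrfilenames.Add(arrfiles[j]);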

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


