C#
string[] arrfiles = Directory.GetFiles(@"C:\", "*.txt", SearchOption.AllDirectories); // now I have an array of strings with all the file paths

// Error here: arrfiles is a string array with all the file paths. I want to add to a list only the files that share the same name, but something is wrong in these two loops.

 for (int i = 0; i < arrfiles.Length; i++) {
       for (int j = 1; j < arrfiles.Length; j++) {
       ...
       }
}
Posted
Updated 11-Feb-12 8:12am
v3
Comments
shek124 10-Feb-12 5:53am    
Can you post the description of the error?
CRDave1988 10-Feb-12 6:01am    
Are you getting a syntax error or a logic problem?
BobJanova 10-Feb-12 6:20am    
Do you want to:
i) *remove* duplicates, i.e. return a list of all the items but once only
ii) show *only non-duplicated items*, i.e. (a, a, b, c) -> (b, c)
iii) show *only duplicated items*, i.e. (a, a, b, c) -> (a)

Your question says one thing (i) and your code something completely different (iii).
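
For illustration, a minimal LINQ sketch of the three cases, grouping on the file name (the variable names here are made up):
var groups = arrfiles.GroupBy(p => Path.GetFileName(p));
var once = groups.Select(g => g.First()).ToList();                                       // (i) keep each name once
var nonDuplicated = groups.Where(g => g.Count() == 1).Select(g => g.First()).ToList();   // (ii) only names that occur once
var duplicated = groups.Where(g => g.Count() > 1).SelectMany(g => g).ToList();           // (iii) every path whose name occurs more than once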
zyck 11-Feb-12 19:33pm    
nice question (n_n)

Your problem is that your loops don't work the way you think they do.
Your inner loop always starts at one, so it always checks against the same list of files.
What happens when i == 1? The first item you check it against is j == 1, which is the file itself. Naturally it matches, so every file except the first will find at least one "duplicate".

Why not use a better structure?
C#
string[] files = Directory.GetFiles(@"C:/example", "*.txt", SearchOption.AllDirectories);
// key = file name, value = first full path seen with that name
Dictionary<string, string> unique = new Dictionary<string, string>();
foreach (string path in files)
    {
    string file = Path.GetFileName(path);
    if (!unique.ContainsKey(file))
        {
        unique.Add(file, path);
        }
    }
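
If it is actually the duplicates you are after (as your comments suggest), the same lookup can collect them as it goes - a sketch along the same lines (the duplicatePaths list is my addition, not part of the original):
C#
Dictionary<string, string> unique = new Dictionary<string, string>();
List<string> duplicatePaths = new List<string>();
foreach (string path in files)
    {
    string file = Path.GetFileName(path);
    if (!unique.ContainsKey(file))
        {
        unique.Add(file, path);
        }
    else
        {
        // this name has been seen before, so the current path duplicates unique[file]
        duplicatePaths.Add(path);
        }
    }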




Updated code from OP:
C#
 for (int i = 0; i < arrfiles.Length; i++) {
    for (int j = 0; j < arrfiles.Length - 2; j++) {
        fi = new FileInfo(arrfiles[i]);  // Console.WriteLine("arrfiles[i] " + fi.Length.ToString());
        fi2 = new FileInfo(arrfiles[j]); // Console.WriteLine("arrfiles[j] " + fi2.Length.ToString());
        Console.ReadLine();
        if (Path.GetFileName(arrfiles[i]) == Path.GetFileName(arrfiles[j]) && i != j)
        {
            arrfilenames.Add(arrfiles[j]); // here this list must include all duplicates
        }
    }
}


The first thing that springs to mind is that it doesn't find all duplicates:
Inputs:
   D:\Temp\ContactsShort.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt
   D:\Temp\Tip - enum in combobox.txt
   D:\Temp\New folder\ListSQL Server databases.txt
   D:\Temp\New folder\myFile.txt
   D:\Temp\New folder\myFileCopied.txt
   D:\Temp\New folder\MyLargeTextFile.txt
Outputs:
   D:\Temp\New folder\ListSQL Server databases.txt
   D:\Temp\New folder\myFile.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt
It has missed out the "New Folder" version of two of the files. There is also the problem that if you give it three files of the same name it reports four...

Secondly, it's not very efficient - you keep making the same comparisons over and over again. Why start j at zero? You have already checked all the combinations before the current value of i anyway.
So, start the inner loop later, and throw away the test for i and j being the same. Then report both files when you find a match. (I have thrown out the FileInfo stuff, just for clarity)

C#
for (int i = 0; i < files.Length; i++)
    {
    for (int j = i + 1; j < files.Length; j++)
        {
        if (Path.GetFileName(files[i]) == Path.GetFileName(files[j]))
            {
            arrfilenames.Add(files[i]);
            arrfilenames.Add(files[j]);
            }
        }
    }
You can also end the outer loop one iteration earlier without losing data if you want to.
As it is, this assumes a single duplication and will always report pairs of files together (so it has the same second problem as yours). It also has some inefficiency built in. The solution is to remove entries once you have reported them:
C#
for (int i = 0; i < files.Length - 1; i++)
    {
    bool reportNeeded = false;
    string checkFile = Path.GetFileName(files[i]);
    for (int j = i + 1; j < files.Length; j++)
        {
        if (files[j] != null && Path.GetFileName(files[j]) == checkFile)
            {
            reportNeeded = true;
            arrfilenames.Add(files[j]);
            files[j] = null;
            }
        }
    if (reportNeeded)
        {
        arrfilenames.Add(files[i]);
        }
    }
(I renamed the folders to make it a bit more readable)
Inputs:
   D:\Temp\ContactsShort.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt
   D:\Temp\Tip - enum in combobox.txt
   D:\Temp\F 1\ListSQL Server databases.txt
   D:\Temp\F 1\myFile.txt
   D:\Temp\F 1\myFileCopied.txt
   D:\Temp\F 1\MyLargeTextFile.txt
   D:\Temp\F 2\ListSQL Server databases.txt
   D:\Temp\F 2\myFile.txt
   D:\Temp\F 2\myFileCopied.txt
   D:\Temp\F 2\MyLargeTextFile.txt
Outputs:
   D:\Temp\F 1\ListSQL Server databases.txt
   D:\Temp\F 2\ListSQL Server databases.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\F 1\myFile.txt
   D:\Temp\F 2\myFile.txt
   D:\Temp\myFile.txt
   D:\Temp\F 1\myFileCopied.txt
   D:\Temp\F 2\myFileCopied.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\F 1\MyLargeTextFile.txt
   D:\Temp\F 2\MyLargeTextFile.txt
   D:\Temp\MyLargeTextFile.txt
 
 
v2
Comments
BobJanova 10-Feb-12 6:18am    
This finds the unique items, not the duplicated items, which is what the original code is doing.

For unique items, can't you just use the Distinct LINQ extension method? i.e. List<string> unique = files.Distinct().ToList();
OriginalGriff 10-Feb-12 6:20am    
Nope - because the original strings contain the full path, not just the file name.
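(For completeness: projecting to the file name first would make Distinct usable, at the cost of losing the full paths - roughly List<string> uniqueNames = files.Select(Path.GetFileName).Distinct().ToList();)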
OriginalGriff 10-Feb-12 6:21am    
BTW: I was working from the description of the problem:
"The objective is to remove duplicates"
csharpnew 10-Feb-12 8:39am    
Hello, I think I have improved my code. Can you check it, please, OriginalGriff? Should I post it here?
OriginalGriff 10-Feb-12 9:42am    
Sure!
With the following code you have everything in hand to cover duplicates, uniques, removal, etc.

C#
private static Dictionary<string, List<string>> GetFiles(string rootPath, string filePattern)
{
    var rawdata = Directory.GetFiles(rootPath, filePattern, SearchOption.AllDirectories);
    // key = lower-cased file name, value = all full paths carrying that name
    Dictionary<string, List<string>> files = new Dictionary<string, List<string>>();
    foreach (var path in rawdata)
    {
        string name = Path.GetFileName(path).ToLower();
        List<string> entries;
        if (files.TryGetValue(name, out entries)) entries.Add(path);
        else files.Add(name, new List<string>() { path });
    }
    return files;
}


E.g. you could run this main program to test it:

C#
static void Main(string[] args)
{
    var files = GetFiles(@"c:\examples", "*.txt");
    var duplicates = from e in files.Values where e.Count > 1 select e;
    var uniques = from e in files.Values where e.Count == 1 select e;

    Func<int, List<string>, int> print = (a, e) =>
      { Console.WriteLine("{0}.\t{1}", ++a, string.Join("\n\t", e)); return a; };

    int pos = 0;
    Console.WriteLine("Duplicates:");
    pos = duplicates.Aggregate(pos, print);
    Console.WriteLine("Unique entries:");
    pos = uniques.Aggregate(pos, print);
}


Cheers

Andi
 
 
v2
Comments
BillWoodruff 10-Feb-12 19:01pm    
+5 thank you for another "master level" lesson in Linq ... which I will study very closely.
Andreas Gieriet 10-Feb-12 19:45pm    
Thanks for your 5!
BTW: I came across some discussion about the performance of LINQ versus for/foreach (finding the last element of 100,000,000 items by evaluating some relational operation on each element). The figures I tried out myself were roughly:
- Debug: some difference, but not relevant as a performance measure
- Release: "for-break" 2x faster than "foreach-break" and 3x faster than "where-select".

I would nonetheless prefer "foreach" over "for" unless it is really performance-critical: index handling is rather error-prone IMHO. And why LINQ? Well, I like expressive code: once you get used to reading from-in-where-select queries, I find them far superior regarding error possibilities compared with any explicit loop-if-break-if-not-found chain.

And, as you see, Aggregate is my friend ;-) But I admit, one can overuse it... maybe I already crossed that line above. Most people would probably prefer a foreach over the Aggregate. Performance is no issue here anyway, since I guess the Console.WriteLine is slower by at least an order of magnitude than the loop increment.

Cheers

Andi
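
For reference, the three variants being compared look roughly like this (a sketch with made-up data; it assumes System.Linq is imported, and actual timings depend on the setup):
int[] data = Enumerable.Range(0, 100000000).ToArray();
int target = data.Length - 1; // the "last element" case

// "for-break"
int foundFor = -1;
for (int i = 0; i < data.Length; i++)
{
    if (data[i] == target) { foundFor = data[i]; break; }
}

// "foreach-break"
int foundForeach = -1;
foreach (int x in data)
{
    if (x == target) { foundForeach = x; break; }
}

// "where-select"
int foundLinq = (from x in data where x == target select x).First();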
I hope the code is self-explanatory and that it solves the problem.
C#
string[] searchLocations = new string[] { @"C:\" };
string searchFilter = "*.txt";

// Since the number of files can range from a few hundred to hundreds of thousands,
// a sorted list is used: it offers fast key lookups because the keys are kept sorted
SortedList<string, List<FileInfo>> searchedList = new SortedList<string, List<FileInfo>>();
// Since there will be far fewer duplicates, a plain list is used for them;
// use a SortedSet<string> if required
List<string> duplicatesList = new List<string>();

foreach (string location in searchLocations)
{
  if (Directory.Exists(location))
  {
    try
    {
      string[] matchedFiles = Directory.GetFiles(location, searchFilter, SearchOption.AllDirectories);
      if ((matchedFiles != null) && (matchedFiles.Length > 0))
      {
        foreach (string currentFile in matchedFiles)
        {
          if (File.Exists(currentFile))
          {
            // Core Logic: Group all the files based on the file name (without the extension) while keeping track of duplicates
            // Solution: Check whether a key already exists with the current file name. 
            //           If key exists, possible duplicate. Therefore mark the file name as a duplicate.
            //           If not, create a List<FileInfo> under the file name.
            //           Then add the FileInfo object of the current file to a list under the file name regardless whether it's a duplicate.
            //           ----------------------------------------------
            //           When the loop is complete, you got a list of duplicate file names and a all files grouped by file names.
            //           Deal with it as you please :)

            FileInfo currentFileInfo = new FileInfo(currentFile);
            if (searchedList.ContainsKey(currentFileInfo.Name))
            {
              if (!duplicatesList.Contains(currentFileInfo.Name))
                duplicatesList.Add(currentFileInfo.Name);
            }
            else
              searchedList.Add(currentFileInfo.Name, new List<FileInfo>());

            searchedList[currentFileInfo.Name].Add(currentFileInfo);
          }
        }

        // Releasing memory taken by the array, we no longer need this
        matchedFiles = null;
      }
    }
    catch
    {
      // Directory.GetFiles() method can throw loads of exceptions
      // Do the necessary handling as fit :)
    }
  }
}

foreach (string fileName in duplicatesList)
{
  // All files sharing this (duplicated) file name
  List<FileInfo> duplicates = searchedList[fileName];
}
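
To make the final loop do something visible, here is a small usage sketch (the output format is my own choice, not part of the original):
C#
foreach (string fileName in duplicatesList)
{
  Console.WriteLine("Duplicate file name: " + fileName);
  foreach (FileInfo duplicate in searchedList[fileName])
    Console.WriteLine("\t" + duplicate.FullName);
}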
 
 
v3
if ((i != j) && (Path.GetFileName(arrfiles[i]) == Path.GetFileName(arrfiles[j])))


You need to exclude the case where it matches because you access the same element (thus, it's not a duplicate).
 
 
Comments
lukeer 10-Feb-12 6:39am    
This way, every duplicate is listed twice because "A" == "B" as well as "B" == "A".
Better start the inner loop at i + 1. That way you save half the comparisons. And the double duplicates.
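A minimal sketch of that variant, reusing the variable names from the question:
for (int i = 0; i < arrfiles.Length - 1; i++)
    for (int j = i + 1; j < arrfiles.Length; j++)
        if (Path.GetFileName(arrfiles[i]) == Path.GetFileName(arrfiles[j]))
            arrfilenames.Add(arrfiles[j]);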

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


