Let us say you have a Notepad file containing the lines below. I have to remove the duplicates, and I have achieved it partially: my program below works and prints the result to the console. Notice that "user1, user2" appears twice; the second copy should be removed, and it is. However, I have to handle another scenario as well: "user2, user1" should also be treated as a duplicate of "user1, user2" and removed, which my program does not do.

user1, user2
user3, user1
user1, user2
user5, user6
user2, user1


Below is the program:
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ex
{
    class Program
    {
        static void Main(string[] args)
        {
            string path = @"C:\Users\Documents\Visual Studio 2010\Friends.txt";
            List<string> lines = new List<string>();

            // Read the file line by line; the using block closes the reader when done.
            using (StreamReader sr = new StreamReader(path))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }

            // Distinct() only removes exact matches, so "user2, user1" survives.
            List<string> removingduplicates = lines.Distinct().ToList();

            foreach (string item in removingduplicates)
            {
                Console.WriteLine(item);
            }
        }
    }
}
Comments
Sergey Alexandrovich Kryukov 21-Aug-15 12:02pm
   
"Notepad file" is nonsense. This is the same as saying: "Microsoft digit 7".
—SA

If you have to handle "user1, user2" as matching "user2, user1", then you will have to be a bit more constructive.

But...this is your homework, so no code!

Start by reading your lines, and using Split to "break" them into a left-of-the-comma and a right-of-the-comma part. Use Trim to remove any miscellaneous spaces.
Sort the parts so they are always in the same order.
Rebuild your strings, using string.Join to add the comma and space back in.
Now you can remove your duplicates.
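For readers who have already worked through the exercise, the four steps above could be sketched roughly as follows. This is only one possible reading of them: the names PairNormalizer, Normalize and RemoveDuplicates are hypothetical, and the sketch assumes every non-blank line holds comma-separated names that can be compared with an ordinal sort.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, not part of the original question or answer.
static class PairNormalizer
{
    // Split on the comma, trim stray spaces, sort the parts, and rebuild
    // the line so "user2, user1" and "user1, user2" become the same string.
    public static string Normalize(string line)
    {
        string[] parts = line.Split(',')
                             .Select(p => p.Trim())
                             .OrderBy(p => p, StringComparer.Ordinal)
                             .ToArray();
        return string.Join(", ", parts);
    }

    // Normalize every non-blank line, then let Distinct() drop exact matches.
    public static List<string> RemoveDuplicates(IEnumerable<string> lines)
    {
        return lines.Where(l => !string.IsNullOrWhiteSpace(l))
                    .Select(l => Normalize(l))
                    .Distinct()
                    .ToList();
    }
}
```

One consequence of this approach is that the output contains the normalized strings, so a line such as "user3, user1" would be printed as "user1, user3". If the original spelling of each line must be kept, a custom comparer is probably the better fit.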
   
Another option, similar to Griff's, would be to process line by line without loading the whole file into memory*:
Use File.ReadLines(path) to get an IEnumerable<string> for the input.
Pass that through .Distinct(IEqualityComparer<string>), which gives another IEnumerable<string> for the output.
Then you can use File.WriteAllLines(path, IEnumerable<string>) to make an output file, or use a foreach loop to write all the lines to the Console.
So now the exercise is to write a small class that implements IEqualityComparer<string>. This can split the string into the parts and use whatever a priori information you may have about them to check if they match (and ensure matching inputs have the same HashCode).
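A minimal sketch of such a comparer might look like the following. The class names UnorderedPairComparer and Demo and the method PrintDistinct are hypothetical, and the sketch assumes the only a priori knowledge is that each line holds comma-separated names whose order does not matter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical comparer: treats "a, b" and "b, a" as the same entry.
sealed class UnorderedPairComparer : IEqualityComparer<string>
{
    // Split on the comma, trim each part, and sort so order no longer matters.
    private static string[] SortedParts(string line)
    {
        string[] parts = line.Split(',').Select(p => p.Trim()).ToArray();
        Array.Sort(parts, StringComparer.Ordinal);
        return parts;
    }

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return SortedParts(x).SequenceEqual(SortedParts(y));
    }

    // Built from the sorted parts, so lines that compare equal hash equally.
    public int GetHashCode(string line)
    {
        if (line == null) return 0;
        int hash = 17;
        foreach (string part in SortedParts(line))
        {
            hash = unchecked(hash * 31 + part.GetHashCode());
        }
        return hash;
    }
}

static class Demo
{
    // Streams the file through Distinct() and prints each surviving line.
    public static void PrintDistinct(string path)
    {
        foreach (string line in File.ReadLines(path).Distinct(new UnorderedPairComparer()))
        {
            Console.WriteLine(line);
        }
    }
}
```

Because the comparer never rewrites the strings, the lines that come out keep their original spelling; in practice Distinct() yields the first occurrence of each group. Splitting and sorting on every comparison is wasteful, which is one of the places an optimization could go.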

There are a couple of other optimizations I can think of, but I'll leave those as "exercises".

* .Distinct() does internally build a collection holding one entry for each unique string, but this is (potentially) much smaller than the whole file, and definitely smaller than holding both the whole-file collection and the .Distinct() internal representation at once.
   

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)