Hi there

I have a file with a lot of sentences, and I need to build a dictionary from the words in that file. So far I have separated the words and sorted them using the Split() and Sort() methods. My problem is producing a list without duplicate words. How can I do that?
C#
static int n = 0;

public static string[] NoDuplicate(string[] array)
{
    int i;
    string[] res = (string[])array.Clone();
    for (i = 0; i < array.Length - 1; i++)
    {
        if (array[i + 1] != array[i])
            res[n++] = (string)array[i];
    }
    return res;
}


1) How can I do this more neatly?
2) I don't like that the method initializes the result with Clone(), because the resulting array is too long.

Many thanks
Posted
Updated 30-Nov-11 21:53pm
Comments
[no name] 1-Dec-11 3:54am
   
EDIT: added "code" tag

Take a look at the HashSet<String> class (.NET 3.5 only). It provides an optimised hash collection that doesn't allow duplicates (it simply ignores attempts to add them), and you can call ToArray() when you are done if you really need a string array.
   
Comments
Andrew Rissing 9-Jun-10 13:59pm
   
If you don't have .NET 3.5, you can also use Dictionary<string, object> as an alternative to HashSet and just set the value to null in all cases.
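A minimal sketch of the Dictionary-as-set idea described above (class and method names are my own; any object value works, null included):

```csharp
using System;
using System.Collections.Generic;

public static class DictionarySetDemo
{
    // Pre-.NET 3.5 substitute for HashSet<string>: the keys are the set,
    // the values are never read (always null).
    public static string[] RemoveDuplicates(string[] words)
    {
        Dictionary<string, object> set = new Dictionary<string, object>();
        foreach (string word in words)
            set[word] = null; // the indexer overwrites silently, so duplicates are absorbed

        string[] result = new string[set.Count];
        set.Keys.CopyTo(result, 0);
        return result;
    }
}
```

Using the indexer rather than Add avoids the ArgumentException that Add throws for an existing key.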
Xeqtr 27-Jun-10 19:07pm
   
Reason for my vote of 5
Useful thing
Dalek Dave 27-Aug-10 5:02am
   
Agreed.
Here is your code using HashSet:
public static string[] RemoveDuplicates(string[] s)
{
    HashSet<string> set = new HashSet<string>(s);
    string[] result = new string[set.Count];
    set.CopyTo(result);
    return result;
}
   
Comments
Laxman Auti 30-Jun-10 6:34am
   
Reason for my vote of 5
The method provided is very efficient compared to those in the other answers.
Dalek Dave 27-Aug-10 5:02am
   
Good call.
Laurence1234 27-Jan-11 8:00am
   
Can anyone give a more formal proof of the time complexity?
You can also filter the input array without defining a predicate.

If the input array is not null, the method allocates a List<string>.

The allocated collection is filled iteratively, adding each element only if it has not been previously inserted.

Finally, the collection is sorted and converted to a string[].

public static string[] NoDuplicate(string[] inputSentences)
{
    if (inputSentences == null)
        return inputSentences;

    List<string> sentences = new List<string>();

    foreach (string inputSentence in inputSentences)
    {
        // List.Contains is a linear scan of the elements added so far
        if (sentences.Contains(inputSentence))
            continue;

        sentences.Add(inputSentence);
    }

    sentences.Sort();
    return sentences.ToArray();
}
   
Comments
Laxman Auti 30-Jun-10 6:29am
   
If inputSentences is large, the time taken to remove the duplicates grows significantly.
Alessandro Cislaghi 30-Jun-10 6:49am
   
Removing duplicate items is not part of the solution shown.

The 'sentences' list is built while avoiding duplicates, adding an 'inputSentence' item only if it has not been previously inserted.
You can use List<string>.RemoveAll(Predicate<string> match)
Try this (Microsoft says it is O(n)):

private static string PreviousItem;

private static bool Match(string item)
{
    bool result = (item == PreviousItem);
    PreviousItem = item;
    return result;
}

public static string[] NoDuplicates(string[] input)
{
    PreviousItem = null;
    List<string> result = new List<string>(input);
    result.Sort();
    result.RemoveAll(Match);
    return result.ToArray();
}
   
Comments
Andrew Rissing 9-Jun-10 13:58pm
   
The check for the predicate is O(n), but the sort is at least O(n*Ln(n)). So, you can only claim O(n*Ln(n)) for speed.
I don't think LINQ was available at that time, but now it's quite easy to do with LINQ.

C#
string[] str = new string[] { "Hiren", "Solanki", "Hiren" };
List<string> lst = str.ToList<string>();
lst = lst.Distinct().ToList<string>();
str = lst.ToArray();
   
Comments
Nish Nishant 27-Jan-11 9:53am
   
Why reactivate a 2 year old thread?
Nish Nishant 27-Jan-11 12:33pm
   
BTW the 1 vote was not me :-)
Hiren solanki 28-Jan-11 0:29am
   
I just saw that it still isn't accepted. BTW, don't worry about the 1 vote; my question was intentionally downvoted by some platinum member. I'm not worrying about votes now.
The simple approach I suggest is to convert the array to a List, use list.Contains(key) to check whether the key is already present, and only then add it to the dictionary.

Cheers...
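A sketch of the List.Contains approach described above; the Dictionary's value type and the choice to store the index of first occurrence are my own assumptions, since the question only says "make a dictionary":

```csharp
using System;
using System.Collections.Generic;

public static class ListContainsDemo
{
    public static Dictionary<string, int> BuildDictionary(string[] words)
    {
        List<string> seen = new List<string>();
        Dictionary<string, int> dict = new Dictionary<string, int>();

        for (int i = 0; i < words.Length; i++)
        {
            // List.Contains is a linear scan, so this loop is O(n^2) overall
            if (!seen.Contains(words[i]))
            {
                seen.Add(words[i]);
                dict.Add(words[i], i); // value: index of the word's first occurrence
            }
        }
        return dict;
    }
}
```

Note that the List is redundant here: Dictionary.ContainsKey (a hash lookup) would answer the same question much faster.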
   
Comments
Toli Cuturicu 26-Aug-10 8:50am
   
Reason for my vote of 3
very inefficient
Hello! Why don't you use a Hashtable, which has Key and Value properties? Using a Hashtable will solve your problem.


Regards ,
Hemanth Kumar VLN
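A sketch of the Hashtable suggestion (names are mine; the indexer is used instead of Add, because Hashtable.Add throws on a duplicate key):

```csharp
using System;
using System.Collections;

public static class HashtableDemo
{
    public static string[] UniqueWords(string[] words)
    {
        Hashtable table = new Hashtable();
        foreach (string word in words)
            table[word] = null; // re-assigning an existing key acts as a duplicate filter

        string[] result = new string[table.Count];
        table.Keys.CopyTo(result, 0); // key order is undefined for a Hashtable
        return result;
    }
}
```

On .NET 2.0 and later, the generic Dictionary<string, object> does the same job with type safety.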
   
Comments
Christian Graus 27-Aug-10 5:03am
   
This question is two years old (not sure how that is possible)? Why are you adding to a question that's been well and truly answered?
This is a method to remove the duplicates and then sort the result in the neatest way possible :)

C#
public string[] FilterAndSort(string[] array)
{
    List<string> retList = new List<string>();
    foreach (string item in array)
    {
        if (!retList.Contains(item))
        {
            retList.Add(item);
        }
    }
    retList.Sort();
    return retList.ToArray();
}
   
Comments
Dimitri Witkowski 9-Jun-10 13:36pm
   
Its performance will be O(n*n); this is the worst way to achieve the goal.
Toli Cuturicu 26-Aug-10 8:49am
   
Reason for my vote of 3
poor performance
If it were me, I wouldn't use an array. Instead, I'd use a List. Arrays can't grow/shrink to contain their content (without some manual manipulation).

However, since your spec is currently for an array, here's an extension method that you can use to add unique strings to your array. It assumes that your array is already allocated to the appropriate size, but attempts a sanity check before adding the item.

C#
public static class ExtensionMethods
{
    public static bool AddUnique(this string[] collection, string text, bool caseSensitive)
    {
        StringComparison comparison = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;
        var count = (from item in collection
                     where item != null && item.Equals(text, comparison)
                     select item).Count();
        // Sanity check: there must be an unused (null) slot left in the array
        int slot = Array.IndexOf(collection, null);
        bool added = false;
        if (count == 0 && slot >= 0)
        {
            collection[slot] = text;
            added = true;
        }
        return added;
    }
}
   
Comments
Nish Nishant 27-Jan-11 9:52am
   
Wow, what's with everyone answering a 2 year old thread?
#realJSOP 27-Jan-11 12:22pm
   
I didn't notice the date; it was at the top of the list, so I looked at it. I think I suggested last year that questions be locked against answering/editing after they reach a certain age so this kind of thing wouldn't happen.
Nish Nishant 27-Jan-11 12:34pm
   
Yeah it wasn't you, someone else replied to it bringing it to the top!
thatraja 27-Jan-11 21:19pm
   
I remember this movie & dialogue
"Program Alice Activated" - Resident Evil: Apocalypse (2004)
LOL :):):)
'Collection' is the answer to your question.

You can use any collection with key/value pairs, such as Hashtable or Dictionary. On finding each word, insert it as a key. When a duplicate word is read, the existing key is simply replaced by the newly read word, so duplicates are avoided automatically.

Beyond this, you can count the number of occurrences during the same operation.
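The occurrence-counting idea can be sketched like this (class and method names are mine):

```csharp
using System;
using System.Collections.Generic;

public static class WordCountDemo
{
    public static Dictionary<string, int> CountWords(string[] words)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            int current;
            // TryGetValue leaves current at 0 when the word has not been seen yet
            counts.TryGetValue(word, out current);
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

The keys give you the duplicate-free word list, and the values give the occurrence counts, all in the same single pass.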
   
Your solution is already close to the right one. You just don't have to clone the array; edit it in place instead.

C#
int i, j;

// Remove duplicates from the sorted array by shifting the elements
for (i = 1, j = 1; i < array.Length; i++)
{
    if (array[i] != array[i - 1])
    {
        // Different element, keep it
        array[j] = array[i];
        j++;
    }
}


In the end, j contains the number of valid entries in array.

Simple and effective. You can resize the array, but this involves a copy.
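If you do want the trimmed array, Array.Resize performs exactly that copy; here is a complete sketch wrapping the loop above (class and method names are mine; assumes the input is already sorted):

```csharp
using System;

public static class InPlaceDedupeDemo
{
    public static string[] Dedupe(string[] array)
    {
        if (array.Length == 0)
            return array;

        int j = 1;
        for (int i = 1; i < array.Length; i++)
        {
            if (array[i] != array[i - 1])
                array[j++] = array[i]; // shift distinct elements down
        }

        // Trim to the j valid entries; this allocates a new array and copies
        Array.Resize(ref array, j);
        return array;
    }
}
```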

[I am not questioning the approach. Creating the array, sorting, then purging will do the trick. This takes time proportional to N.Log(N).L for N words of L characters (on average) and requires storage for N.L characters.

Using a hash table, as some suggested, will tend to reduce the time to N.L and the storage to N'.L, where N' denotes the number of distinct words, but there is some overhead in both time and space.

Determining the best approach would take some comparative experimentation.]
   
Please check this; I have not tested it.

string[] testing(string[] StringArray)
{
    List<string> StringList = new List<string>();
    foreach (string str in StringArray)
    {
        if (!StringList.Contains(str))
        {
            StringList.Add(str);
        }
    }
    return StringList.ToString();
}
   
Comments
raju melveetilpurayil 25-Jul-10 12:24pm
   
Reason for my vote of 1
The function's return type is string[], but the code returns a string.
Toli Cuturicu 26-Aug-10 8:51am
   
Reason for my vote of 1
Does not even compile. What about making some sense?
If all your words are already in an array, you can get the distinct words directly:
string[] distinctArray = myarray.Select(word => word.Trim()).Distinct().ToArray();
   
Comments
Rajesh Anuhya 21-Oct-10 4:14am
   
This question was posted in 2008; why are you answering it now?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
