Hi there

I have a file with a lot of sentences. I need to make a dictionary with the words from that file. So far I've separated the words and sorted them using the Split() and Sort() methods. My problem is making a list without duplicate words. How can I do that?
C#
static int n = 0;

public static string[] NoDuplicate(string[] array)
{
    int i;
    string[] res = (string[])array.Clone();
    for (i = 0; i < array.Length-1; i++)
	{
        if (array[i + 1] != array[i])
            res[n++] = (string)array[i];
	}
    return res;
}


1) How can I do this more neatly?
2) I don't like this method because the result is initialized using Clone(), so its length is too big.

Many thanks
Updated 30-Nov-11 21:53pm
Comments
[no name] 1-Dec-11 3:54am    
EDIT: added "code" tag

Take a look at the HashSet<String> class (.NET 3.5 and later). It provides an optimised hash collection that doesn't allow duplicates (it simply ignores attempts to add them), and you can call ToArray() when you are done with it if you really need a string array.
 
Comments
Andrew Rissing 9-Jun-10 13:59pm    
If you don't have .NET 3.5, you can also use Dictionary<string, object> and just set the value to null in all cases, as an alternative to HashSet.
Xeqtr 27-Jun-10 19:07pm    
Reason for my vote of 5
Useful thing
Dalek Dave 27-Aug-10 5:02am    
Agreed.
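The Dictionary-based alternative mentioned in the comment above could look roughly like the sketch below. The class name DictionaryDedup is made up for illustration, and the code assumes the input contains no null entries (a null key would throw). Only the keys are used; the values are never read.

```csharp
using System.Collections.Generic;

// Hypothetical helper for runtimes without HashSet<T>: the dictionary's
// keys act as a makeshift set.
public static class DictionaryDedup
{
    public static string[] RemoveDuplicates(string[] words)
    {
        Dictionary<string, object> seen = new Dictionary<string, object>();
        foreach (string word in words)
        {
            // The indexer overwrites an existing key instead of throwing,
            // so repeated words are silently collapsed into one entry.
            seen[word] = null;
        }

        // Copy the distinct keys out; their order is not guaranteed.
        string[] result = new string[seen.Count];
        seen.Keys.CopyTo(result, 0);
        return result;
    }
}
```

Since the key order is not guaranteed, call Array.Sort on the result if you need the words sorted.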
Here is your code using HashSet:
public static string[] RemoveDuplicates(string[] s)
{
    HashSet<string> set = new HashSet<string>(s);
    string[] result = new string[set.Count];
    set.CopyTo(result);
    return result;
}
Comments
Laxman Auti 30-Jun-10 6:34am    
Reason for my vote of 5
The method provided is very efficient compared to those in the other answers.
Dalek Dave 27-Aug-10 5:02am    
Good call.
Laurence1234 27-Jan-11 8:00am    
Can anyone give a more formal proof of the time complexity?
You can also filter the input array without defining a predicate.

If the input array is not null, the method allocates a List<string>.

The allocated collection is filled iteratively, adding each element only if it has not been previously inserted. Note that List<string>.Contains is itself a linear search, so the overall cost is O(n*n) rather than linear.

Finally, the collection is sorted and converted to string[].

public static string[] NoDuplicate(string[] inputSentences)
{
    // Nothing to filter: return null as-is.
    if (inputSentences == null)
        return inputSentences;

    List<string> sentences = new List<string>();

    foreach (string inputSentence in inputSentences)
    {
        // List<T>.Contains is a linear search, so this loop is O(n*n) overall.
        if (sentences.Contains(inputSentence))
            continue;

        sentences.Add(inputSentence);
    }

    sentences.Sort();
    return sentences.ToArray();
}
 
Comments
Laxman Auti 30-Jun-10 6:29am    
If inputSentences is large, the time taken to remove duplicates grows significantly.
Alessandro Cislaghi 30-Jun-10 6:49am    
Removing duplicate items is not part of the solution shown.

The 'sentences' list is built so that duplicates never get in: each 'inputSentence' item is added only if it has not been previously inserted.
You can use List<string>.RemoveAll(Predicate<string> match)
Try this (Microsoft says it is O(n)):

private static string PreviousItem;

private static bool Match(string item)
{
    bool result = (item == PreviousItem);
    PreviousItem = item;
    return result;
}

public static string[] NoDuplicates(string[] input)
{
    PreviousItem = null;
    List<string> result = new List<string>(input);
    result.Sort();
    result.RemoveAll(Match);
    return result.ToArray();
}
 
Comments
Andrew Rissing 9-Jun-10 13:58pm    
The check for the predicate is O(n), but the sort is at least O(n*Ln(n)). So, you can only claim O(n*Ln(n)) for speed.
I don't think LINQ was available at that time, but now it's quite easy to do with LINQ.

C#
string[] str = new string[] { "Hiren", "Solanki", "Hiren" };
List<String> lst = str.ToList<String>();
lst = lst.Distinct().ToList<string>();
str = lst.ToArray();
 
Comments
Nish Nishant 27-Jan-11 9:53am    
Why reactivate a 2 year old thread?
Nish Nishant 27-Jan-11 12:33pm    
BTW the 1 vote was not me :-)
Hiren solanki 28-Jan-11 0:29am    
I just saw that it still isn't accepted. BTW, don't worry about the 1 vote; my 5 questions were intentionally downvoted by some platinum member. I am not worrying about votes now.
The simple approach I would suggest is to convert the array to a List, then use list.Contains(key) to check whether the key is already present in the list before adding it to the dictionary.

Cheers...
 
Comments
Toli Cuturicu 26-Aug-10 8:50am    
Reason for my vote of 3
very inefficient
Hello! Why don't you use a Hashtable, which has Key and Value properties?
Using a Hashtable will solve your problem.


Regards ,
Hemanth Kumar VLN
 
Comments
Christian Graus 27-Aug-10 5:03am    
This question is two years old ( not sure how that is possible ) ? Why are you adding to a question that's been well and truly answered?
This is a method to remove the duplicates and then sort the result in the neatest way possible :)

C#
public string[] FilterAndSort(string[] array)
{
    List<string> retList = new List<string>();
    foreach (string item in array)
    {
        if (!retList.Contains(item))
        {
            retList.Add(item);
        }
    }
    retList.Sort();
    return retList.ToArray();
}
 
Comments
Dimitri Witkowski 9-Jun-10 13:36pm    
Its performance will be O(n*n); this is the worst way to achieve the goal.
Toli Cuturicu 26-Aug-10 8:49am    
Reason for my vote of 3
poor performance
If it were me, I wouldn't use an array. Instead, I'd use a List. Arrays can't grow/shrink to contain their content (without some manual manipulation).

However, since your spec is currently for an array, here's an extension method that you can use to add unique strings to your array. It assumes that your array is already allocated to the appropriate size, but attempts a sanity check before adding the item.

C#
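public static class ExtensionMethods
{
    // Adds text to the first unused slot of a pre-allocated array, unless an
    // equal string is already present. Assumption: unused slots are null.
    public static bool AddUnique(this string[] collection, string text, bool caseSensitive)
    {
        StringComparison comparison = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;
        int freeSlot = -1;
        for (int i = 0; i < collection.Length; i++)
        {
            if (collection[i] == null)
            {
                if (freeSlot < 0) freeSlot = i;   // remember the first empty slot
            }
            else if (string.Equals(collection[i], text, comparison))
            {
                return false;                     // already present
            }
        }
        if (freeSlot < 0) return false;           // sanity check: the array is full
        collection[freeSlot] = text;
        return true;
    }
}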
 
Comments
Nish Nishant 27-Jan-11 9:52am    
Wow, what's with everyone answering a 2 year old thread?
#realJSOP 27-Jan-11 12:22pm    
I didn't notice the date, It was at the top of the list so I looked at it. I think I suggested last year that questions be locked against answering/editing after they got to be a certain age so this kind of thing wouldn't happen.
Nish Nishant 27-Jan-11 12:34pm    
Yeah it wasn't you, someone else replied to it bringing it to the top!
thatraja 27-Jan-11 21:19pm    
I remember this movie & dialogue
"Program Alice Activated" - Resident Evil: Apocalypse (2004)
LOL :):):)
'Collection' is the answer to your question.

You can use a hash-based collection such as HashSet, or a key/value collection such as Dictionary. For every word you find, insert that word as the key. When a duplicate word is read, the existing key is kept (HashSet.Add) or overwritten with the same word (Dictionary indexer), so duplicates are avoided automatically.

Beyond this, you can count the number of occurrences as part of the same operation.
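As a rough illustration of counting occurrences while removing duplicates, here is a minimal sketch using Dictionary<string, int>. The names WordCounter and CountWords are invented for this example, and it assumes the words have already been split out of the file.

```csharp
using System.Collections.Generic;

// Hypothetical helper: maps each distinct word to how often it occurs.
public static class WordCounter
{
    public static Dictionary<string, int> CountWords(IEnumerable<string> words)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            int current;
            // TryGetValue leaves current at 0 when the word has not been seen yet.
            counts.TryGetValue(word, out current);
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

The keys of the returned dictionary are the distinct words, so counts.Keys gives you the duplicate-free list as a by-product.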
 
Your solution is already close to the right one. You just don't have to clone the array, just edit it in-place.

C#
int i, j;

// Remove duplicates from the sorted array, by shifting the elements
for (i= 1, j= 1; i < array.Length; i++)
{
    if (array[i] != array[i - 1])
    {
        // Different element, keep it
        array[j]= array[i];
        j++;
    }
}


In the end, j contains the number of valid entries in array.

Simple and effective. You can resize the array but this involves a copy.
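If you do want the array trimmed to its final size, one possible way is to wrap the loop above in a method and finish with Array.Resize. This is only a sketch: the names SortedArrayDedup and Compact are made up, and the input is assumed to be already sorted and non-null.

```csharp
using System;

// Hypothetical wrapper around the in-place loop, assuming a sorted input.
public static class SortedArrayDedup
{
    public static string[] Compact(string[] array)
    {
        if (array.Length == 0)
            return array;

        int j = 1;
        for (int i = 1; i < array.Length; i++)
        {
            if (array[i] != array[i - 1])
            {
                // Different element, keep it
                array[j] = array[i];
                j++;
            }
        }

        // Array.Resize allocates a new array of length j and copies the kept
        // entries into it; the caller's original array keeps its old length.
        Array.Resize(ref array, j);
        return array;
    }
}
```

Note that the caller's original array is still modified in place by the shifting step, so use the returned array from then on.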

[I am not questioning the approach. Creating the array, sorting then purging will do the trick. This will take time proportional to N.Log(N).L for N words of L characters (on average) and require storage of N.L characters.

Using a hash table as some suggested will tend to reduce the time to N.L and the storage to N'.L, where N' denotes the number of distinct words. But there will be some overhead on time and space.

Telling the best approach would take some comparative experimentation.]
 
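Please check this; I have not tested it.

string[] testing(string[] StringArray)
{
    List<string> StringList = new List<string>();
    foreach (string str in StringArray)
    {
        // Add each string only the first time it is seen.
        if (!StringList.Contains(str))
        {
            StringList.Add(str);
        }
    }
    // Was ToString(), which returns a string and does not compile here.
    return StringList.ToArray();
}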
 
Comments
raju melveetilpurayil 25-Jul-10 12:24pm    
Reason for my vote of 1
The function's return type is string[], but the code returns a string.
Toli Cuturicu 26-Aug-10 8:51am    
Reason for my vote of 1
Does not even compile. What about making some sense?
If all your words are in an array you can directly get distinct words.
string[] distinctArray = myarray.Select(word => word.Trim()).Distinct().ToArray();
 
Comments
Rajesh Anuhya 21-Oct-10 4:14am    
This question was posted in 2008. Why are you answering now?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


