Dictionary Sorting and Inverse Duplication Removal

Sep 10, 2012

CPOL

A few fun Dictionary utilities.

Introduction

I have been using Dictionary data containers since VB6. As neat as they are, there are some limits and my latest stabs revealed them in a big ugly way.

The biggest? Changing the dictionary while looping over it. Do that and its keys can no longer be enumerated. Toss in the lack of a decent sort and all sorts of sad things happen.
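As a minimal sketch of that enumeration limit, assuming the standard .NET `Dictionary(Of TKey, TValue)` class: adding a key inside a `For Each` over the same dictionary raises `InvalidOperationException`, while looping over a snapshot of the keys avoids it. The module, function, and variable names are illustrative only.

```vbnet
Imports System
Imports System.Collections.Generic

Module EnumerationDemo

    ' Returns True when adding to the dictionary inside For Each raises an error.
    Function AddDuringEnumerationFails() As Boolean
        Dim Pairs As New Dictionary(Of String, String) From {{"AAA", "VVV"}, {"BBB", "XXX"}}
        Try
            For Each KeyPair As KeyValuePair(Of String, String) In Pairs
                ' Adding a new key invalidates the running enumerator.
                Pairs.Add(KeyPair.Value, KeyPair.Key)
            Next
        Catch ex As InvalidOperationException
            Return True
        End Try
        Return False
    End Function

    ' Loops over a copy of the keys, so the dictionary itself can be changed safely.
    Function AddInversesFromSnapshot() As Dictionary(Of String, String)
        Dim Pairs As New Dictionary(Of String, String) From {{"AAA", "VVV"}, {"BBB", "XXX"}}
        For Each CurrentKey As String In New List(Of String)(Pairs.Keys)
            Pairs(Pairs(CurrentKey)) = CurrentKey
        Next
        Return Pairs
    End Function

    Sub Main()
        Console.WriteLine(AddDuringEnumerationFails())
        Console.WriteLine(AddInversesFromSnapshot().Count)
    End Sub

End Module
```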

Let us say that you have this set of data:

AAA - VVV

BBB - XXX

CCC - WWW

VVV - AAA

We want to remove that VVV - AAA set, because it is simply an inverse of the first set. Holding both halves of a pair in the values of a dictionary allows many-to-many relationships, whereas using one half as the dictionary key would throw an error on the first repeat.
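To illustrate the point about keys, here is a small sketch. It assumes one half of each pair would be used as the dictionary key; `Dictionary.Add` throws `ArgumentException` when a key repeats, so a many-to-many set cannot be stored that way. Storing the whole pair in the value under a running index sidesteps this. The extra pair AAA - WWW and all names are made up for the example.

```vbnet
Imports System
Imports System.Collections.Generic

Module KeyVersusValueDemo

    ' Returns True when the second Add with the same key raises ArgumentException.
    Function DuplicateKeyFails() As Boolean
        Dim ByKey As New Dictionary(Of String, String)
        ByKey.Add("AAA", "VVV")
        Try
            ByKey.Add("AAA", "WWW")
        Catch ex As ArgumentException
            Return True
        End Try
        Return False
    End Function

    ' Stores each pair as "left:right" under a running index key.
    Function StoreInValues() As Dictionary(Of String, String)
        Dim ByIndex As New Dictionary(Of String, String)
        Dim Pairs As String() = {"AAA:VVV", "AAA:WWW", "VVV:AAA"}
        For i As Integer = 0 To Pairs.Length - 1
            ByIndex.Add((i + 1).ToString(), Pairs(i))
        Next
        Return ByIndex
    End Function

    Sub Main()
        Console.WriteLine(DuplicateKeyFails())
        Console.WriteLine(StoreInValues().Count)
    End Sub

End Module
```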

Background

My project is based on the main idea of deduplication. There are many sub tables, with things like phone number and email address, that are very easy to match up. This way, I can get hard matches without a lot of work. The problem? Sorting them and INVERSE MATCHES!!!

Using the code

This code is simple to use. There are three separate routines. One sorts; the other two look for inverse duplicates, the first within the values and the second across the key/value pairs.

The first one is a basic sort of a dictionary, with a string as the key:

' Requires Imports System.Collections.Generic at the top of the file.
Public Function SortDictionaryKeyString(Unsorted As Dictionary(Of String, String)) As Dictionary(Of String, String)

    Dim Working As List(Of String)

    SortDictionaryKeyString = New Dictionary(Of String, String)

    ' Copy the keys into a list so they can be sorted.
    Working = New List(Of String)(Unsorted.Keys)

    Working.Sort()

    ' Re-add the entries in sorted key order. Dictionary does not formally
    ' guarantee enumeration order, so do not rely on it after later removals.
    For Each Item As String In Working
        SortDictionaryKeyString.Add(Item, Unsorted.Item(Item))
    Next

End Function
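If the goal is only to walk the entries in key order, the framework's `SortedDictionary(Of TKey, TValue)` may be a simpler alternative to the routine above. This is a swapped-in technique, not the author's method: `Dictionary` documents its enumeration order as undefined, whereas `SortedDictionary` enumerates in key order. A minimal sketch, with illustrative names:

```vbnet
Imports System
Imports System.Collections.Generic

Module SortedDictionaryDemo

    ' Copies the entries into a SortedDictionary, which enumerates in key order.
    Function SortByKey(Unsorted As Dictionary(Of String, String)) As SortedDictionary(Of String, String)
        Return New SortedDictionary(Of String, String)(Unsorted)
    End Function

    Sub Main()
        Dim Unsorted As New Dictionary(Of String, String) From {
            {"CCC", "WWW"}, {"AAA", "VVV"}, {"BBB", "XXX"}}
        For Each KeyPair As KeyValuePair(Of String, String) In SortByKey(Unsorted)
            Console.WriteLine(KeyPair.Key & " - " & KeyPair.Value)
        Next
    End Sub

End Module
```

The trade-off is that lookups and inserts in a `SortedDictionary` are logarithmic rather than constant time, which only matters for large tables.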

The next one is real clever. When you have many-to-many relationships, the key has to be an index, so both halves of each pair go into the value, separated by a colon. This allows manipulation of the string value to find inverse duplicates.

Public Function DeDupeDictionaryValues(ByVal Dupe As Dictionary(Of String, String)) As Dictionary(Of String, String)

    Dim KeyPair As KeyValuePair(Of String, String)
    Dim sValue As String
    Dim sTemp As String
    Dim iIdx As Int64
    Dim sSplit() As String

    DeDupeDictionaryValues = New Dictionary(Of String, String)

    For Each KeyPair In Dupe
        sValue = KeyPair.Value
        ' Each value is expected to hold "left:right".
        sSplit = Split(sValue, ":")
        ' Build the inverse pair, "right:left".
        sTemp = sSplit(1) & ":" & sSplit(0)
        ' Keep the value only if its inverse has not been kept already.
        If Not DeDupeDictionaryValues.ContainsValue(sTemp) Then
            iIdx = iIdx + 1
            ' The key type is String, so convert the index explicitly.
            DeDupeDictionaryValues.Add(iIdx.ToString(), sValue)
        End If
    Next

End Function
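Because `ContainsValue` scans every kept entry, the routine above does more work as the table grows. One possible variation, sketched here as an assumption rather than as the author's approach, is to reduce each value to an order-independent form and track those forms in a `HashSet(Of String)`. The names `CanonicalPair` and `DeDupeByCanonicalPair` are hypothetical, and this variant also drops exact repeats, not only inverses.

```vbnet
Imports System
Imports System.Collections.Generic

Module CanonicalPairDemo

    ' Builds an order-independent form of "left:right",
    ' so "VVV:AAA" and "AAA:VVV" both become "AAA:VVV".
    Function CanonicalPair(Value As String) As String
        Dim Parts As String() = Value.Split(":"c)
        If Parts.Length <> 2 Then Return Value ' Leave malformed values untouched.
        If String.CompareOrdinal(Parts(0), Parts(1)) <= 0 Then
            Return Parts(0) & ":" & Parts(1)
        End If
        Return Parts(1) & ":" & Parts(0)
    End Function

    ' Keeps the first occurrence of each pair and re-indexes the kept values.
    Function DeDupeByCanonicalPair(Dupe As Dictionary(Of String, String)) As Dictionary(Of String, String)
        Dim Seen As New HashSet(Of String)
        Dim Result As New Dictionary(Of String, String)
        For Each KeyPair As KeyValuePair(Of String, String) In Dupe
            ' HashSet.Add returns False when the canonical form was already seen.
            If Seen.Add(CanonicalPair(KeyPair.Value)) Then
                Result.Add((Result.Count + 1).ToString(), KeyPair.Value)
            End If
        Next
        Return Result
    End Function

    Sub Main()
        Dim Dupe As New Dictionary(Of String, String) From {
            {"1", "AAA:VVV"}, {"2", "BBB:XXX"}, {"3", "CCC:WWW"}, {"4", "VVV:AAA"}}
        Console.WriteLine(DeDupeByCanonicalPair(Dupe).Count) ' 3
    End Sub

End Module
```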

The last one removes inverse duplicates from a (String, String) dictionary:

Public Function DeDupeDictionary(ByVal Dupe As Dictionary(Of String, String)) As Dictionary(Of String, String)

    Dim KeyPair As KeyValuePair(Of String, String)
    Dim sKey As String
    Dim sValue As String
    Dim sFound As String = Nothing

    DeDupeDictionary = New Dictionary(Of String, String)

    For Each KeyPair In Dupe
        sKey = KeyPair.Key
        sValue = KeyPair.Value
        ' Skip this pair only when its exact inverse (sValue -> sKey) has already been kept.
        ' Checking the key alone would also drop pairs that merely share a name.
        If Not (DeDupeDictionary.TryGetValue(sValue, sFound) AndAlso sFound = sKey) Then
            DeDupeDictionary.Add(sKey, sValue)
        End If
    Next

End Function
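A short usage sketch with the sample data from the introduction. The helper below, named `RemoveInversePairs` here, is a hypothetical stand-in for the routine above, written so that a pair is dropped only when its exact inverse has already been kept.

```vbnet
Imports System
Imports System.Collections.Generic

Module InversePairDemo

    ' Hypothetical stand-in for the routine above: a pair is dropped only
    ' when its exact inverse (value -> key) has already been kept.
    Function RemoveInversePairs(Dupe As Dictionary(Of String, String)) As Dictionary(Of String, String)
        Dim Result As New Dictionary(Of String, String)
        Dim Found As String = Nothing
        For Each KeyPair As KeyValuePair(Of String, String) In Dupe
            Dim IsInverse As Boolean =
                Result.TryGetValue(KeyPair.Value, Found) AndAlso Found = KeyPair.Key
            If Not IsInverse Then
                Result.Add(KeyPair.Key, KeyPair.Value)
            End If
        Next
        Return Result
    End Function

    Sub Main()
        Dim Pairs As New Dictionary(Of String, String) From {
            {"AAA", "VVV"}, {"BBB", "XXX"}, {"CCC", "WWW"}, {"VVV", "AAA"}}
        Dim Cleaned As Dictionary(Of String, String) = RemoveInversePairs(Pairs)
        Console.WriteLine(Cleaned.Count) ' 3
        For Each KeyPair As KeyValuePair(Of String, String) In Cleaned
            Console.WriteLine(KeyPair.Key & " - " & KeyPair.Value)
        Next
    End Sub

End Module
```

Which half of an inverse pair survives depends on the order in which the input is enumerated, and `Dictionary` does not formally guarantee that order.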