Click here to Skip to main content
Licence CPOL
First Posted 4 Oct 2007
Views 29,193
Downloads 88
Bookmarked 32 times

DataTable with SelectDistinct in VB

By | 11 Oct 2007 | Article
Allows you to perform a Distinct query against a DataTable.
 
Part of The SQL Zone sponsored by
See Also

Screenshot - Demo_VB.gif

Introduction

The ADO.NET objects provided with VS.NET allow you to retrieve data from a database or other sources, and stores this data in objects called DataSets or DataTables. In fact, the DataSet is a collection of DataTables. One commonly used method available with DataTables is Select(). It allows a developer to perform simple filtering against a DataTable and returns the result as a collection of DataRows. Presently, there is no way to perform a Distinct Query against a DataTable. I just found out that there is a way to perform a Distinct query in .NET. The DataView object exposes a method called ToTable(). This method expects a boolean flag to determine distinct or not, and the fields you wish to distinct on. While this method actually does return distinct rows from a DataTable, it appears that it should only be used for small recordsets. When used on a DataTable with 40,000 rows, it took about 18 seconds. I would imagine that it is pretty quick with a hundred or so. But, if you wish to distinct a large table, there is no better way. Until now.

Background

I had recently encountered a problem with a server being too slow to process user requests for the large amount of data that was returned. I needed to retrieve data and completely manipulate the data in memory to reduce the number of trips made to SQL Server. One major obstacle was the fact that I needed to apply a Distinct query against my DataTable. I quickly found out that it wasn't possible. There was an article from Microsoft that demonstrated a simple Distinct, but it wasn't robust enough to serve my needs. It only allowed for a single column to be specified, and I needed an unknown number of columns at different times. I worked and reworked the Microsoft example until I was finally able to handle multiple columns, filtering/conditions, and sorting.

Using the Code

Within the object, there is only one major method called SelectDistinct(). It is overloaded about ten times to allow flexibility in coding. To use my object, simply instantiate the object and call the SelectDistinct method.

I have included a demo application that will allow you to specify a SQL connection string and a query or Stored Procedure, so that you can test the performance and accuracy of the SelectDistinct object.

dsh = New clsDSH(dt)
Dim dt2 as DataTable = dsh.SelectDistinct(Fields,Filter,Sort)
'Or
dsh = New clsDSH()
Dim dt2 as DataTable = dsh.SelectDistinct(dt,Fields,Filter,Sort)

The code examples assume that "dt" has been defined and filled already. Take note that the "dsh" object, in both cases, will actually retain a copy of the DataTable, and can be accessed via the .Table property.

After completing the object, I discovered that it could be made to run faster. I originally did the whole thing with two DataTables, taking only the unique rows and placing them into the new table and returning that table from the object. To my dismay, that was slower than anyone could have imagined. To make a very long story short, I ended up hashing each DataRow and using that hash to compare and produce a distinct table. Unfortunately, the DataRow.GetHashCode() method wasn't distinct enough to be utilized. Luckily, I found that if I carefully converted each DataRow into a string and then executed String.GetHashCode(), it worked. And it worked fast. Currently, the object can take 40,000 records and turn them into around 2000 distinct records in a little over a second! Maybe it could be faster, maybe not. **Update** - Based on some of the code provided to me by srkinyon, I rewrote the object to incorporate some of his ideas, and reorganized and stripped out anything that was no longer necessary. That includes the previously mentioned hashing. Now the object is capable of performing a Distinct query on a DataTable with over 40,000 rows in about 0.6 second. That's ~3 times faster than it was before. Now, that's fast. I have to admit though that I had a slight redundancy in my code as well. Although, the majority of the speed increase seems to have come from the fact that I am now storing the data column values as the objects they are; "object" instead of converting them to a string, and then hashing that. I also have to admit that hashing isn't reliable enough to be used for distinct purposes. So, kudos to srkinyon. If anyone else can make it even faster or better, don't hesitate.

I also added a couple of extra properties for the heck of it. One is RecordCount, which should be self explanatory. Also, ElapsedTime which stores the total time the last Distinct call took. A slight waste of time, but I wanted to know how fast the object is.

Points of Interest

I found out during my development cycle that even the smallest things can slow you down tremendously. Even a .Trim() or a .Split(). So use them wisely.

Also, I wrote the same object in C# which is also available on this site, so you can have it both ways :) Incidentally, there was no real difference in speed between the two languages.

History

None.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

vacen2

Web Developer

United States United States

Member



Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board. (secure sign-in)
 
Search this forum  
 FAQ
    Noise  Layout  Per page   
  Refresh
GeneralCompare with LINQ PinmemberMichael Freidgeim12:00 22 Oct '07  
QuestionIndex? Pinmemberdherrmann2:23 18 Oct '07  
General*.mdb Pinmember-tusk-12:31 16 Oct '07  
GeneralThanks & An improved version... Pinmembersrkinyon18:40 10 Oct '07  
Thanks for the code.   This contribution actually came up right when I needed to make a similar function for my application.   I used this function and the "properly referened" MS KB article as a basis for my own rontine, which ended up being 5 times faster for string comparisons and up to 10 times faster on non-strings (especially Guids).
 
My recordset was a table of 22,500 records.   To see the speed gain, you'll need a table of similar size.  
 
To use it, add the 2 private variable to the top and then replace the functions of the same name from the orginal project.   If you are "vacen2", feel free to update the project, just please remember to reference me if you do.
 
-----------------------------------------------------------------
 
      Private coLastValues() As Object = {}
      Private cdclFields() As DataColumn
 
      Private Function RowsToTable(ByVal drs As DataRow()) As DataTable 'converts dr() to datatable
 
            Dim dt As New DataTable

            If _Fields Is Nothing Then _Fields = GetFields(drs(0))
 
            For Each field As String In _Fields
                  dt.Columns.Add(field.Trim, drs(0).Table.Columns(field.Trim).DataType)
            Next
 
            For Each dr As DataRow In drs
                  dt.ImportRow(dr)
            Next
 
            Return dt
 
      End Function
 
      Private Function GetFields(ByVal dr As DataRow) As String() ' creates comma delineated string of column names
 
            Dim strFields() As String = Nothing
            Dim a As Integer = 0
 
            ReDim strFields(dr.Table.Columns.Count - 1)
 
            For Each col As DataColumn In dr.Table.Columns
                  strFields(a) = col.ColumnName
                  a += 1
            Next
 
            Return strFields
 
      End Function
 
      Private Function IsExists(ByVal dr As DataRow) As Boolean ' converts dr to string values for hashing
 
            Dim oValue As Object
            Dim iIndex As Integer
            Dim bExists As Boolean
 
            If coLastValues.Length = 0 Then
                  ReDim coLastValues(cdclFields.Length - 1)
                  bExists = False
            Else
                  bExists = True
            End If
 
            For Each dCol As DataColumn In cdclFields
                  oValue = dr.Item(dCol)
                  If bExists Then
                        If TypeOf oValue Is String Then
                              If Not Trim(oValue).ToLower.Equals(Trim(coLastValues(iIndex)).ToLower) Then
                                    bExists = False
                              End If
                        Else
                              If Not oValue.Equals(coLastValues(iIndex)) Then
                                    bExists = False
                              End If
                        End If
                  End If
                  coLastValues(iIndex) = oValue
                  iIndex = iIndex + 1
            Next
 
            Return bExists
 
      End Function

      Public Function SelectDistinct(ByVal SourceTable As DataTable, ByVal Fields() As String, ByVal Filter As String, ByVal Sort As String) As DataTable
 
            Dim dt As New DataTable
            Dim t As Long = Now.Ticks ' timing
            Dim iIndex As Integer
 
            ' Trim fields
            For iIdx As Integer = 0 To Fields.Length - 1
                  Fields(iIdx) = Fields(iIdx).Trim
            Next
 
            _Fields = Fields
            _FieldsList = Join(Fields, ",")
 
            ReDim cdclFields(Fields.Length - 1)
            coLastValues = New Object() {}
 
            'select distinct
            For Each field As String In Fields
                  cdclFields(iIndex) = SourceTable.Columns(field)
                  If IsNothing(cdclFields(iIndex)) Then Return Nothing
                  dt.Columns.Add(field.Trim, SourceTable.Columns(field.Trim).DataType)
                  iIndex += 1
            Next
 
            For Each dr As DataRow In SourceTable.Select(Filter, _FieldsList)   'SourceTable.Rows
                  If Not IsExists(dr) Then dt.ImportRow(dr)
            Next
            If dt.Columns.Count = 0 And dt.Rows.Count = 0 Then Return Nothing
 
            If Sort <> "" Then
                  Dim sSortFields As String
 
                  sSortFields = Sort.Replace(" ", "").Replace("desc", "").Replace("asc", "").ToLower
                  If sSortFields <> _FieldsList.ToLower Then
                        dt = RowsToTable(dt.Select("", Sort)) ' sorting
                  End If
            End If
 
            _ElapsedTime = Format(New TimeSpan(Now.Ticks - t).TotalSeconds, "#0.00")
            _RecordCount = dt.Rows.Count
            _SourceTable = dt
            Return _SourceTable
 
      End Function

GeneralRe: Thanks & An improved version... Pinmembervacen28:49 11 Oct '07  
GeneralRe: Thanks & An improved version... Pinmembersrkinyon9:45 11 Oct '07  
GeneralThanks for sharing Pinmember-tusk-16:27 7 Oct '07  
GeneralBubble Pinmemberjabryanwalmart7:06 4 Oct '07  
GeneralRe: Bubble Pinmembervacen28:06 4 Oct '07  
GeneralRe: Bubble Pinmembersrkinyon18:46 10 Oct '07  
Generalthanks Pinmemberpeter gabris6:52 4 Oct '07  

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Mobile
Web04 | 2.5.120529.1 | Last Updated 11 Oct 2007
Article Copyright 2007 by vacen2
Everything else Copyright © CodeProject, 1999-2012
Terms of Use
Layout: fixed | fluid