Introduction
The ADO.Net objects provided with VS.Net allow you to retrieve data from a database or other sources and stores this data in objects called DataSets or DataTables. In fact, the DataSet is a collection of DataTables. One commonly used method available with DataTables is Select(). It allows a developer to perform simple filtering against the DataTable and returns the result as a collection of DataRows. Presently, there is no way to perform a Distinct Query against the DataTable. I just found out that there is a way to perform a distinct query in .Net. The DataView object exposes a method called ToTable(). This method expects Boolean flag to determine distinct or not and the fields you wish to distinct on. While this method actually does return distinct rows from a DataTable, it appears that it should only be used for small recordsets. When used on a DataTable with 40,000 rows, it took about 18 seconds. I would imagine that it is pretty quick with a hundred or so. But, if you wish to distinct a large table, there is no better way. Until now.
Background
I had recently encountered a problem with a server being too slow to process user requests for the large amount of data that was returned. I needed to retrieve the data and completely manipulate the data in memory to reduce the number of trips made to SQL Server. One major obstacle was the fact that I needed to apply a Distinct query against my DataTable. I quickly found out that it wasn't possible. There was an article from Microsoft that demonstrated a simple Distinct, but it wasn't robust enough to serve my needs. It only allowed for single column to be specified and I needed an unknown number of columns at different times. I worked and reworked the Microsoft example until I was finally able to handle multiple columns, filtering/conditions and sorting.
Using the code
Within the object there is only one major method called SelectDistinct(). It is overloaded about ten times to allow flexibility in coding. To use my object, simply instantiate the object and call the SelectDistinct method.
I have included a demo application that will allow you to specify a Sql Connection string and a query or stored procedure, so that you can test the performance and accuracy of the SelectDistinct Object.
dsh = New clsDSH(dt)
Dim dt2 as DataTable = dsh.SelectDistinct(Fields,Filter,Sort)
dsh = New clsDSH()
Dim dt2 as DataTable = dsh.SelectDistinct(dt,Fields,Filter,Sort)
The code examples assume that "dt" has been defined and filled already. Take note that the "dsh" object in both cases will actually retain a copy of the datatable and can be accessed via the .Table property.
After completing the object, I discovered that it could be made to run faster. I originally did the whole thing with two DataTables, taking only the unique rows and placing them into the new table and returning that table from the object. To my dismay, that was slower than anyone could have imagined. To make a very long story short, I ended up hashing each datarow and using that hash to compare and produce a distinct table. Unfortunately, the DataRow.GetHashCode() method wasn't distinct enough to be utilized. Luckily, I found that if I carefully converted each DataRow into a string and then executed String.GetHashCode() it worked. And it worked fast. Currently, the object can take 40,000 records and turn them into around 2000 distinct records in a little over a second! Maybe it could be faster, maybe not. **UPDATE** - Based on some of the code provided to me by srkinyon, I rewrote the object to incorporate some of his ideas and reorganized and stripped out anything that was no longer necessary. That includes the previously mentioned hashing. Now the object is capable of performing a distinct on a DataTable with over 40,000 rows in about 0.6 second. That's ~3 times faster than it was before. Now that's fast. I have to admit though that I had a slight redundancy in my code as well. Although, the majority of the speed increase seems to have come from the fact that I am now storing the datacolumn values as the objects they are, "object" instead of converting them to a string and then hashing that. I also have to admit that hashing isn't reliable enough to be used for distinct purposes. So, kudos to srkinyon. If anyone else can make it even faster or better, don't hesitate.
I also added a couple of extra properties for the heck of it. One is RecordCount, which should be self explanatory. Also, ElapsedTime which stores the total time the last Distinct call took. A slight waste of time, but I wanted to know how fast the object is.
Points of Interest
I found out during my development cycle that even the smallest thing can slow you down tremendously. Even a .Trim() or a .Split(). So use them wisely.
Also, I wrote the same object in C# which is also available on this site, so you can have it both ways:) Incidentally, there was no real difference in speed between the two languages.
History
None
| You must Sign In to use this message board. |
|
|
 |
|
 |
Have you tested your methods performance compare to LINQ(LINQ for collections, not LINQ for SQL)? Do you think, that these methods will be useful, after LINQ will be released?
Michael Freidgeim. Blog: http://geekswithblogs.net/mnf/
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
 | Index?  dherrmann | 3:23 18 Oct '07 |
|
 |
Hallo, I have to use the new Datatable 'dt2' as a table in a dataset for a datagridviewcomboboxcolumn. That means, that the table should have a new keycolumn. How can I do that?
Greetings from Austria Dietrich
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
 | *.mdb  -tusk- | 13:31 16 Oct '07 |
|
 |
Me again..
Short time after you posted your initail release (btw it would be a goo idea to track history of your version on this page I "adapted" the code to run with *.mdb files.
I do not use sql server, rather access files.
I actually worked.. because I just adapted the basic needs (connections strings.. oledbobjects instead of sqlobjects)
I will dl the new version and try it again.. could be usefull for the futur.. but if for some reason you think you can do a good copy of it specialized for mdb files... it would be nice.
I'm happy you changed your hashing thing. because I didnt like it (lol). I'll now dl & check the new version.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Thanks for the code. This contribution actually came up right when I needed to make a similar function for my application. I used this function and the "properly referened" MS KB article as a basis for my own rontine, which ended up being 5 times faster for string comparisons and up to 10 times faster on non-strings (especially Guids).
My recordset was a table of 22,500 records. To see the speed gain, you'll need a table of similar size.
To use it, add the 2 private variable to the top and then replace the functions of the same name from the orginal project. If you are "vacen2", feel free to update the project, just please remember to reference me if you do.
-----------------------------------------------------------------
Private coLastValues() As Object = {} Private cdclFields() As DataColumn
Private Function RowsToTable(ByVal drs As DataRow()) As DataTable 'converts dr() to datatable
Dim dt As New DataTable If _Fields Is Nothing Then _Fields = GetFields(drs(0))
For Each field As String In _Fields dt.Columns.Add(field.Trim, drs(0).Table.Columns(field.Trim).DataType) Next
For Each dr As DataRow In drs dt.ImportRow(dr) Next
Return dt
End Function
Private Function GetFields(ByVal dr As DataRow) As String() ' creates comma delineated string of column names
Dim strFields() As String = Nothing Dim a As Integer = 0
ReDim strFields(dr.Table.Columns.Count - 1)
For Each col As DataColumn In dr.Table.Columns strFields(a) = col.ColumnName a += 1 Next
Return strFields
End Function
Private Function IsExists(ByVal dr As DataRow) As Boolean ' converts dr to string values for hashing
Dim oValue As Object Dim iIndex As Integer Dim bExists As Boolean
If coLastValues.Length = 0 Then ReDim coLastValues(cdclFields.Length - 1) bExists = False Else bExists = True End If
For Each dCol As DataColumn In cdclFields oValue = dr.Item(dCol) If bExists Then If TypeOf oValue Is String Then If Not Trim(oValue).ToLower.Equals(Trim(coLastValues(iIndex)).ToLower) Then bExists = False End If Else If Not oValue.Equals(coLastValues(iIndex)) Then bExists = False End If End If End If coLastValues(iIndex) = oValue iIndex = iIndex + 1 Next
Return bExists
End Function Public Function SelectDistinct(ByVal SourceTable As DataTable, ByVal Fields() As String, ByVal Filter As String, ByVal Sort As String) As DataTable
Dim dt As New DataTable Dim t As Long = Now.Ticks ' timing Dim iIndex As Integer
' Trim fields For iIdx As Integer = 0 To Fields.Length - 1 Fields(iIdx) = Fields(iIdx).Trim Next
_Fields = Fields _FieldsList = Join(Fields, ",")
ReDim cdclFields(Fields.Length - 1) coLastValues = New Object() {}
'select distinct For Each field As String In Fields cdclFields(iIndex) = SourceTable.Columns(field) If IsNothing(cdclFields(iIndex)) Then Return Nothing dt.Columns.Add(field.Trim, SourceTable.Columns(field.Trim).DataType) iIndex += 1 Next
For Each dr As DataRow In SourceTable.Select(Filter, _FieldsList) 'SourceTable.Rows If Not IsExists(dr) Then dt.ImportRow(dr) Next If dt.Columns.Count = 0 And dt.Rows.Count = 0 Then Return Nothing
If Sort <> "" Then Dim sSortFields As String
sSortFields = Sort.Replace(" ", "").Replace("desc", "").Replace("asc", "").ToLower If sSortFields <> _FieldsList.ToLower Then dt = RowsToTable(dt.Select("", Sort)) ' sorting End If End If
_ElapsedTime = Format(New TimeSpan(Now.Ticks - t).TotalSeconds, "#0.00") _RecordCount = dt.Rows.Count _SourceTable = dt Return _SourceTable
End Function
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
Thanks for the feedback. I did incorporate some of your code into the new object and it runs circles around the old version. Unfortunately, I couldn't just copy and paste, as your code had some unnecessary code just as mine did. So, I rewrote mine, incorporating your ideas. And, I gave credit at the top to you for the ideas as well as in the article, which has already been updated.
Once again thanks.
Corey
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
In the final version, the one I put in my app this morning, I added beginloaddata() & endloaddata() to the datatables when doing the copying. I also made it a shared function of the class rather than needing to start a object instance for the one call.
It could also sped up again if you manually create & copy each row & column across to the new table (i.e. loop around each column & row in the RowsToTable() rather than using dt.ImportRow()). Add that & it might drop it .01 seconds on larger return sets. I didn't get that far with it since it got to a speed that served by purpose: take a couple hundred rows of de-normalized records and find about 1-5 foriegn keys out of the bunch.
I also took a look at similar MS Kb's out there and there are parallel Group By and Join articles. If someone was so inclinded, you could put them all together and make LINQ-style set of functions for .Net 1.0 to 3.0, rather than needing 3.5 on all your client machines...
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
 |
 | Bubble  jabryanwalmart | 8:06 4 Oct '07 |
|
 |
I am sure you code is elangant and well intnentioned but you can get a distinct by using the the ".toTable" method of a Dataview. regards
|
| Sign In·View Thread·PermaLink | 2.00/5 |
|
|
|
 |
|
 |
I appreciate the feedback. I was unaware of the ToTable method available through the DataView.
Although I must say, after testing the method I was unimpressed. The result took almost 18 seconds to return data. Whereas, by comparison, my object took about 2 seconds.
So far, so good. I guess there could be something I am missing regarding this dataview method. But it seems pretty simple.
This is what I executed...Dim dt As DataTable = dv.ToTable(True, _Fields)
This line took almost 18 seconds to return about 2500 records.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
 |
If you're using .Net 2.0+ that is true. Not all projects are new or able to jump to the lastest version for a varity of reasons.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
 | thanks  peter gabris | 7:52 4 Oct '07 |
|
|
 |
|
|