Click here to Skip to main content
15,877,103 members
Please Sign up or sign in to vote.
3.00/5 (2 votes)
See more:
I got a .csv file containing 9 columns that I need to sort out. I need to check for duplicates in one column and if I find any I need to compare the duplicates and depending on which row is the oldest decide which one to delete, one of the columns contains datetime data.

I'm not really sure on how to proceed even with the most straight forward method of iterating through all rows and compare them all to each other. Performance wise this shouldn't be a problem for me.

Most of the problems comes from dealing with the file io. But I'm considering reading the file and for each line store the entire row in one collection, then just the time column in one collection and lastly the column which I will do the duplicate check on in one. As long as they are indexed the same way it should be simple enough to do the time comparison and then remove the correct row and save back to file.

I just feel as if this solution is highly inefficient and that there should be some better way of doing it.

One simple thing that I think I should be able to do is making sure that the file is sorted by the column where I'll do the duplicate check because then all I've got to check is the next value if it's equal or not to the previous before moving on to the next but this leaves me with having to sort the list a new at the end instead.

Any tips on how a good way to dealing with this is appreciated. I'm quite new to VB6 and both fileio and string manipulation is something I'm not very good at. My level is at saving appending some sort of tag and then just save line to file and when reading back the file read the tag and usually do very simple operations.

Since this was a workaround to another problem on Chill60's suggestion I include the original problem too.

The data I'm trying to format is contained in three different tables and is a total of 9 fields. What I need to do is to check for duplicates in one column and if I find any select the newest entry. This turned in to a nightmare because of my current SQL skills which is at the same level as my VB6 skills. Meaning barely 2 months of splotchy experience.

SQL
SELECT Stamps.Stampnr as Stampnr, Stamps.Time as 'Time', Stamps.amount as 'Amount', Products.Productname as 'Productname', Products.Articelnumber as 'Articlenumber', FlagContainer.id as 'FlagId', FlagContainer.FlagId as 'Flag', Process.prnr as 'CurrentPrNr', Process.numProcesses as 'ProcessNumbers' 
FROM     Stamps INNER JOIN 
process ON Stamps.prnr = Process.prnr INNER JOIN 
Products ON Process.productnr = Products.productnr INNER JOIN 
Flagcontainer ON Stamps.ID = Flagcontainer.id 
WHERE  (Stamps.Time > '" & dtmYesterday & "' + ' 06:00:00') 
and (Stamps.Time < '" & dtmNow & "' + ' 06:00:00') 
and Flagcontainer.flagid = 5 order by FlagId


It might look a bit weird with the tables called flagid but I translated from Swedish and tried to make it as readable as possible.

The column I'm looking for duplicates in is the one that's called Flagcontainer.id and then selecting the oldest using Stamps.Time.
Posted
Updated 1-Jun-15 3:27am
v2
Comments
CHill60 1-Jun-15 5:07am    
Is there some constraint in your workplace meaning that you "have" to use VB6? This would be quite a simple task in VB.NET and the Express version is free.
Member 11683251 1-Jun-15 6:06am    
Sadly yes. Currently the entire system is VB6 and I've got the task of maintaining as well as adding new features. We will most likely move away from it in the future but not this year at least.
_Asif_ 1-Jun-15 6:26am    
Is this csv file activity is one time or is it part of some business use case? and what Database server are you using?
Member 11683251 1-Jun-15 6:36am    
The csv is created by the program and then mailed to certain users. The data come from a SQL 2012 server.

1 solution

If you absolutely must use VB6 (see my comment above - oh and I've just seen your response!) then this method should work for you ... my apologies but I can't test any of this as I no longer have VB6.

Read the entire CSV file into a RecordSet. This link[^] should help you with that. Incidentally that site (which is nothing to do with me) is quite handy for finding code snippets in VB6 (at least until it disappears). I'm reproducing the code here in case the link breaks in the future...
VB
Dim connCSV As New ADODB.Connection
Dim rsTest As New ADODB.Recordset
Dim adcomm As New ADODB.Command
Dim path As String

path = "C:\Testdir\"  'Here Test dir is the Directory where
' the text file is located. don't write the file name here.

'This is connection for a text file without Header

 'connCSV.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" _
 & path & ";Extended Properties='text;HDR=NO;FMT=Delimited'"


'This is connection for a text file with Header (i.e., columns
 
connCSV.Open "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" _
& path & ";Extensions=asc,csv,tab,txt;HDR=NO;Persist Security Info=False"
    
    
   rsTest.Open "Select * From test.txt", _
       connCSV, adOpenStatic, adLockReadOnly, adCmdText
Do While Not rsTest.EOF
MsgBox rsTest(0)   'You can select the required data
rsTest.movenext
Loop

'IF YOU WANT TO TEST THIS,
'SAVE THE FOLLOWINT TO C:\TESTDIR\TEST.TXT
'AND RUN THE ABOVE CODE WITH THE TWO DIFFERENT
'CONNECTION OPEN STATEMENTS

'Name,Address,City,State,Zip
'John , Doe, NY, NY, 910
Note it also shows how you can iterate through the data.

A recordset can be sorted - but there may be issues depending on the content of the data - if you have problems then have a look at the suggestions on this link[^]. You can either loop through the dataset applying your selection criteria, try to do something clever with Distinct or try the dictionary approach advocated here[^] (This is where my lack of being able to test anything is making this a bit vague - sorry).

I would suggest have a 2nd dataset (same schema) to copy the records you want into which can then be saved or whatever when you are complete e.g. save to CSV[^]

Having said all that, if you have access to a database (e.g. MS Access) then it might be worth saving the data into that and using database functions to manipulate the information.

Have a crack at it then come back if you hit any issues

[EDIT - alternative - a suggested method for de-duplicating on the database side]
If you put your current query into a CTE (Common Table Expression) you can do the de-duplication as a subsequent query against that CTE (see Common Table Expressions(CTE) in SQL SERVER 2008[^] for a more detailed explanation)

For example:

-- Assume these are passed in
DECLARE @dtmYesterday DATETIME
DECLARE @dtmNow DATETIME
DECLARE @flagid int

-- test data
SET @dtmYesterday = dateadd(dd, datediff(dd, 0, getdate()) - 1, 0)
SET @dtmNow = dateadd(dd, datediff(dd, 0, getdate()), 0)

-- Date's are passed is as midnight (based on what I saw)
-- so set to 06:00 hours
SET @dtmYesterday = dateadd(hh, 6, @dtmYesterday)
SET @dtmNow = dateadd(hh, 6, @dtmNow)
SET @flagid = 5

;WITH CTE AS
( 
	SELECT 
		Stamps.Stampnr as Stampnr, Stamps.Time as 'Time', Stamps.amount as 'Amount', 
		Products.Productname as 'Productname', Products.Articelnumber as 'Articlenumber', 
		FlagContainer.id as 'FlagId', FlagContainer.FlagId as 'Flag', 
		Process.prnr as 'CurrentPrNr', Process.numProcesses as 'ProcessNumbers' 
	FROM     
		Stamps 
		INNER JOIN process ON Stamps.prnr = Process.prnr 
		INNER JOIN Products ON Process.productnr = Products.productnr 
		INNER JOIN Flagcontainer ON Stamps.ID = Flagcontainer.id 
	WHERE  
		Stamps.Time > @dtmYesterday
		and Stamps.Time < @dtmNow 
		and Flagcontainer.flagid = @flagid 
)
select CTE.* from CTE
inner join (SELECT FlagId, Productname, MIN([Time]) as mintime FROM CTE GROUP BY FlagId, ProductName) A
	ON CTE.FlagId=A.FlagId AND CTE.Productname = A.Productname
	AND CTE.[Time] = A.mintime
ORDER BY FlagId, Productname
This might need tweaking for your purposes as I used both FlagContainer.id and Products.Productname to drive query. I assumed that if there were more than one id + Productname pairing then all the rest of the data had to come from the oldest entry - just remove Productname from the second query if you don't need it, or add in other columns if you need them.

Also note the way I've used local (sql) variables - ideally you would put this into a Stored Procedure that accepts those arguments.
 
Share this answer
 
v2
Comments
Member 11683251 1-Jun-15 6:39am    
Looks interesting, wasn't aware that you could use recordset for other things than SQL queries. Some other things has come up but will go through this once I've got the time.
Member 11683251 1-Jun-15 7:45am    
Just pointing it to my file and it opens fine, works like a charm. I'm just having some trouble getting it delimited to get to each column, as it is now each row contains a string including the semi colons. Tried adding FMT=Delimited to this row

connCSV.Open "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" _
& path & ";Extensions=asc,csv,tab,txt;HDR=NO;Persist Security Info=False"

and it still runs but cant get the values to separate.
CHill60 1-Jun-15 8:13am    
Ah - if you want to use a delimiter other than the "standard" then you will need a schema.ini file in the same folder as the input file which defines the separator - this link[^] has an example (I was almost actually considering digging out my VB6 install so I can trial this ... that link saved me!)

I notice you said the data is coming from SQL Server - do you have the option of removing the duplicates as the data is extracted? (Or is the file also used for other purposes?)
Member 11683251 1-Jun-15 8:33am    
Yea, just figured out that I needed a schema.ini to do that, found several links that discussed it but none that stated what should go in to it at first. Now I got my different fields. :)

I could try to do that but getting the query to work would be too much work I figured. As new to sql as I am to vb6. The data comes from 4 joined tables with 9 different fields and every time I tried to check for duplicates I just got a bunch of aggregate errors. And that would be only to find the duplicates, then to include selecting between two or more depending on the values in one of the selected columns is a bit out of my league right now.

I thought I could cheat by doing it in the program instead but as this has turned out I'm not sure if I've saved any time. At least I've learned a bit more about the up and coming language called VB6.
CHill60 1-Jun-15 9:02am    
:-D Probably the best way to address the query would be to use a CTE to get the data as it stands, and then push that into a de-duplication exercise. If you use the Improve question link to post the sql query I'll have a go at it if you like (although will be off line briefly while I uninstall the Windows Update that is messing with me)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900