Find duplicate entries in file and perform correct action

Question

I have a .csv file with 9 columns that I need to clean up. I need to check for duplicates in one column and, when I find any, compare the duplicate rows and decide which one to delete depending on which row is the oldest. One of the columns contains datetime data.

I'm not really sure how to proceed, even with the most straightforward method of iterating through all rows and comparing them all to each other. Performance-wise this shouldn't be a problem for me.

Most of my problems come from dealing with the file I/O. I'm considering reading the file and, for each line, storing the entire row in one collection, just the time column in a second collection, and the column I will check for duplicates in a third. As long as they are indexed the same way, it should be simple enough to do the time comparison, remove the correct row and save back to file.

I just feel as if this solution is highly inefficient and that there should be some better way of doing it.

One simple thing I think I should be able to do is make sure the file is sorted by the column where I'll do the duplicate check, because then all I have to check is whether the next value equals the previous one before moving on. But this leaves me having to sort the list anew at the end instead.

Any tips on a good way of dealing with this are appreciated. I'm quite new to VB6, and both file I/O and string manipulation are things I'm not very good at. My level is at appending some sort of tag and then just saving the line to file, and when reading the file back, reading the tag and usually doing very simple operations.

Since this was a workaround to another problem, on CHill60's suggestion I include the original problem too.

The data I'm trying to format is contained in three different tables and has a total of 9 fields. What I need to do is check for duplicates in one column and, if I find any, select the newest entry. This turned into a nightmare because of my current SQL skills, which are at the same level as my VB6 skills, meaning barely 2 months of splotchy experience.

SQL

SELECT Stamps.Stampnr as Stampnr, Stamps.Time as 'Time', Stamps.amount as 'Amount', Products.Productname as 'Productname', Products.Articelnumber as 'Articlenumber', FlagContainer.id as 'FlagId', FlagContainer.FlagId as 'Flag', Process.prnr as 'CurrentPrNr', Process.numProcesses as 'ProcessNumbers' 
FROM     Stamps INNER JOIN 
process ON Stamps.prnr = Process.prnr INNER JOIN 
Products ON Process.productnr = Products.productnr INNER JOIN 
Flagcontainer ON Stamps.ID = Flagcontainer.id 
WHERE  (Stamps.Time > '" & dtmYesterday & "' + ' 06:00:00') 
and (Stamps.Time < '" & dtmNow & "' + ' 06:00:00') 
and Flagcontainer.flagid = 5 order by FlagId

It might look a bit weird with the tables called flagid but I translated from Swedish and tried to make it as readable as possible.

The column I'm looking for duplicates in is Flagcontainer.id, and I then want to select the oldest entry using Stamps.Time.

Posted 31-May-15 23:00pm

Member 11683251

Updated 1-Jun-15 3:27am

v2

Comments

CHill60 1-Jun-15 5:07am

Is there some constraint in your workplace meaning that you "have" to use VB6? This would be quite a simple task in VB.NET and the Express version is free.

Member 11683251 1-Jun-15 6:06am

Sadly yes. Currently the entire system is VB6 and I've got the task of maintaining as well as adding new features. We will most likely move away from it in the future but not this year at least.

_Asif_ 1-Jun-15 6:26am

Is this CSV file activity a one-time task, or is it part of some business use case? And what database server are you using?

Member 11683251 1-Jun-15 6:36am

The csv is created by the program and then mailed to certain users. The data come from a SQL 2012 server.

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

CHill60 · Accepted Answer · 2015-06-01T00:09:00

If you absolutely must use VB6 (see my comment above - oh and I've just seen your response!) then this method should work for you ... my apologies but I can't test any of this as I no longer have VB6.

Read the entire CSV file into a RecordSet. This link[^] should help you with that. Incidentally that site (which is nothing to do with me) is quite handy for finding code snippets in VB6 (at least until it disappears). I'm reproducing the code here in case the link breaks in the future...

VB

Dim connCSV As New ADODB.Connection
Dim rsTest As New ADODB.Recordset
Dim adcomm As New ADODB.Command
Dim path As String

' Directory where the text file is located.
' Don't write the file name here.
path = "C:\Testdir\"

' This is the connection for a text file without a header:
'connCSV.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" _
'    & path & ";Extended Properties='text;HDR=NO;FMT=Delimited'"

' This is the connection for a text file with a header (i.e., column names):
connCSV.Open "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" _
    & path & ";Extensions=asc,csv,tab,txt;HDR=NO;Persist Security Info=False"

rsTest.Open "Select * From test.txt", _
    connCSV, adOpenStatic, adLockReadOnly, adCmdText

' Iterate through the rows
Do While Not rsTest.EOF
    MsgBox rsTest(0)   ' You can select the required data
    rsTest.MoveNext
Loop

' Release the recordset and the connection when done
rsTest.Close
connCSV.Close

' IF YOU WANT TO TEST THIS,
' SAVE THE FOLLOWING TO C:\TESTDIR\TEST.TXT
' AND RUN THE ABOVE CODE WITH THE TWO DIFFERENT
' CONNECTION OPEN STATEMENTS

'Name,Address,City,State,Zip
'John , Doe, NY, NY, 910

Note it also shows how you can iterate through the data.

A recordset can be sorted, but there may be issues depending on the content of the data. If you have problems, have a look at the suggestions on this link[^]. You can either loop through the recordset applying your selection criteria, try to do something clever with Distinct, or try the dictionary approach advocated here[^]. (This is where not being able to test anything makes this a bit vague, sorry.)

I would suggest having a second recordset (same schema) into which you copy the records you want. It can then be saved, or whatever you need, once you are done, e.g. save to CSV[^]
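As a rough, untested sketch of the dictionary approach mentioned above, the following reads the file line by line instead of going through a recordset, keeps only the oldest row per key, and writes the result to a second file. The procedure name `KeepOldestPerKey`, the file names and the column indexes are all assumptions for illustration. It also assumes there is no header row, no quoted fields containing commas, and that the time column is in a format `CDate` understands under your locale.

```vb
' Sketch only: keep the oldest row for each key value in a CSV file.
' keyCol and timeCol are zero-based column indexes (assumed by the caller).
' Assumes no header row and no commas inside quoted fields.
Public Sub KeepOldestPerKey(ByVal inFile As String, ByVal outFile As String, _
                            ByVal keyCol As Long, ByVal timeCol As Long)
    Dim rows As Object      ' key value -> full text line of the oldest row so far
    Dim times As Object     ' key value -> timestamp of that row
    Dim f As Integer
    Dim textLine As String
    Dim fields() As String
    Dim k As Variant
    Dim t As Date

    Set rows = CreateObject("Scripting.Dictionary")
    Set times = CreateObject("Scripting.Dictionary")

    ' Pass 1: read every line and remember the oldest one per key
    f = FreeFile
    Open inFile For Input As #f
    Do While Not EOF(f)
        Line Input #f, textLine
        If Len(textLine) > 0 Then
            fields = Split(textLine, ",")
            k = Trim$(fields(keyCol))
            t = CDate(Trim$(fields(timeCol)))
            If Not rows.Exists(k) Then
                rows.Add k, textLine
                times.Add k, t
            ElseIf t < times(k) Then
                ' This row is older than the one stored, so replace it
                rows(k) = textLine
                times(k) = t
            End If
        End If
    Loop
    Close #f

    ' Pass 2: write the surviving rows to the output file
    f = FreeFile
    Open outFile For Output As #f
    For Each k In rows.Keys
        Print #f, rows(k)
    Next k
    Close #f
End Sub
```

If the CSV columns follow the order of the SELECT in the question, the call might look like `KeepOldestPerKey "C:\Testdir\in.csv", "C:\Testdir\out.csv", 5, 1` (FlagId is the sixth column, Time the second). To keep the newest row instead, flip the comparison to `t > times(k)`.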

Having said all that, if you have access to a database (e.g. MS Access) then it might be worth saving the data into that and using database functions to manipulate the information.

Have a crack at it then come back if you hit any issues

[EDIT - alternative - a suggested method for de-duplicating on the database side]
If you put your current query into a CTE (Common Table Expression) you can do the de-duplication as a subsequent query against that CTE (see Common Table Expressions(CTE) in SQL SERVER 2008[^] for a more detailed explanation)

For example:

-- Assume these are passed in
DECLARE @dtmYesterday DATETIME
DECLARE @dtmNow DATETIME
DECLARE @flagid int

-- test data
SET @dtmYesterday = dateadd(dd, datediff(dd, 0, getdate()) - 1, 0)
SET @dtmNow = dateadd(dd, datediff(dd, 0, getdate()), 0)

-- Dates are passed in as midnight (based on what I saw)
-- so set to 06:00 hours
SET @dtmYesterday = dateadd(hh, 6, @dtmYesterday)
SET @dtmNow = dateadd(hh, 6, @dtmNow)
SET @flagid = 5

;WITH CTE AS
( 
	SELECT 
		Stamps.Stampnr as Stampnr, Stamps.Time as 'Time', Stamps.amount as 'Amount', 
		Products.Productname as 'Productname', Products.Articelnumber as 'Articlenumber', 
		FlagContainer.id as 'FlagId', FlagContainer.FlagId as 'Flag', 
		Process.prnr as 'CurrentPrNr', Process.numProcesses as 'ProcessNumbers' 
	FROM     
		Stamps 
		INNER JOIN process ON Stamps.prnr = Process.prnr 
		INNER JOIN Products ON Process.productnr = Products.productnr 
		INNER JOIN Flagcontainer ON Stamps.ID = Flagcontainer.id 
	WHERE  
		Stamps.Time > @dtmYesterday
		and Stamps.Time < @dtmNow 
		and Flagcontainer.flagid = @flagid 
)
select CTE.* from CTE
inner join (SELECT FlagId, Productname, MIN([Time]) as mintime FROM CTE GROUP BY FlagId, ProductName) A
	ON CTE.FlagId=A.FlagId AND CTE.Productname = A.Productname
	AND CTE.[Time] = A.mintime
ORDER BY FlagId, Productname

This might need tweaking for your purposes, as I used both FlagContainer.id and Products.Productname to drive the query. I assumed that if there was more than one row for an id + Productname pairing, then all the rest of the data had to come from the oldest entry. Just remove Productname from the second query if you don't need it, or add in other columns if you need them.

Also note the way I've used local (sql) variables - ideally you would put this into a Stored Procedure that accepts those arguments.
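On the VB6 side, calling such a stored procedure could look roughly like the sketch below, which I also can't test. The procedure name `usp_GetOldestStamps` and the function name are hypothetical, and `conn` is assumed to be an already open ADODB.Connection to the SQL Server database. Passing the dates as typed parameters avoids building them into the SQL string as in the original query.

```vb
' Sketch only: call a stored procedure with typed parameters through ADO.
' "usp_GetOldestStamps" is a hypothetical procedure name.
' Requires a project reference to Microsoft ActiveX Data Objects.
Public Function GetOldestStamps(ByVal conn As ADODB.Connection, _
                                ByVal dtmYesterday As Date, _
                                ByVal dtmNow As Date, _
                                ByVal flagId As Long) As ADODB.Recordset
    Dim cmd As ADODB.Command

    Set cmd = New ADODB.Command
    Set cmd.ActiveConnection = conn
    cmd.CommandType = adCmdStoredProc
    cmd.CommandText = "usp_GetOldestStamps"

    ' Append the parameters in the same order the procedure declares them
    cmd.Parameters.Append cmd.CreateParameter("@dtmYesterday", adDBTimeStamp, _
                                              adParamInput, , dtmYesterday)
    cmd.Parameters.Append cmd.CreateParameter("@dtmNow", adDBTimeStamp, _
                                              adParamInput, , dtmNow)
    cmd.Parameters.Append cmd.CreateParameter("@flagid", adInteger, _
                                              adParamInput, , flagId)

    ' Returns a forward-only, read-only recordset by default
    Set GetOldestStamps = cmd.Execute
End Function
```

The returned recordset can then be looped over with `Do While Not rs.EOF` in the same way as the CSV example earlier in this answer.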