Click here to Skip to main content
Click here to Skip to main content

Remove duplicate records from SQL tables

, 13 May 2007
Rate this:
Please Sign up or sign in to vote.
A generic procedure to retrieve and delete duplicate records from MS SQL Server tables.

Introduction

A generic stored procedure to retrieve and delete duplicate records from a table, specifying a custom list of fields identifying each record.

Background

When I did optimizations on a customer's production database, I found problems creating proper primary keys because of duplicate records found in tables. Doing a little web search, I found plenty of scripts to remove duplicate records. All existing scripts had two major problems:

  1. They are not generic and must be written for every table
  2. They are all assuming that duplicate records are identical in all fields and not only primary keys

In my situation, I had to process 150 tables, and I didn't want to write a script for every table. Also, in some tables, duplicate records had the same fields designated for the Primary Key, but other fields were different! So Select Distinct, used in all scripts, was still returning these records twice.

Therefore, I had to write my own script to solve the problem.

Using the code

There are two Stored Procedures:

GetDuplicates @TableName, @PK
DeleteDuplicates @TableName, @PK

Both procedures accept two string parameters: table name and a comma delimited list of primary key fields. Primary key fields are fields which uniquely identify records, they are not necessarily an actual table primary key.

Example to call:

--Get all duplicate recordes from table 
--        OrderDetails having unique CustomerID, OrderID and ItemID
EXEC GetDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

Both Stored Procedures are using dynamic SQL to be generic. They are also calling the function GetTableFieldsList to get a string with all the table fields delimited by comma. The function allows to set a padding function to all the fields by adding prefix and suffix parameters. Every field in the returned list will be prefixed by the prefix and suffixed by the suffix parameter. The use of @ within the parameters allows getting the current field name.

Example to call:

SELECT dbo.GetTableFieldsList('Customers', 'MAX(', ') AS @')

will return MAX(f1) AS f1, MAX(f2) AS f2, ... where f1, f2, ... are fields of the Customers table.

CREATE     Function GetTableFieldsList(
 @TableName AS VARCHAR(255),
 @FieldPrefix AS VARCHAR(255) = '',
 @FieldSuffix AS VARCHAR(255) = '')
RETURNS VARCHAR(5000)
AS

BEGIN
declare @fields_list as varchar(5000)
set @fields_list = ''
declare names_curr cursor read_only forward_only
for 
select syscolumns.name from syscolumns inner join 
       sysobjects on syscolumns.id = sysobjects.id 
where sysobjects.name=@TableName
order by syscolumns.colid
declare @tmp_field as varchar(255)
declare @tmp_prefix as varchar(255)
declare @tmp_suffix as varchar(255)
open names_curr
fetch next from names_curr into @tmp_field
while (@@FETCH_STATUS = 0)
begin
 set @tmp_prefix = Replace(@FieldPrefix, '@', @tmp_field)
 set @tmp_suffix = Replace(@FieldSuffix, '@', @tmp_field)
 set @tmp_field = @tmp_prefix + @tmp_field + @tmp_suffix
 set @fields_list = CASE WHEN @fields_list='' THEN 
     @tmp_field ELSE @fields_list + ', ' + @tmp_field END
 fetch next from names_curr into @tmp_field
end
close names_curr
deallocate names_curr
RETURN @fields_list
END

The first Stored Procedure, GetDuplicates, will return all duplicate records of the table, ordered by the Primary Key fields list + the DuplicateCount field to indicate the number of duplicate records. This procedure has two string parameters: the table name and the Primary Key fields list. Note that Primary Key fields list must contain the field names only, without the table name.

Example to call:

EXEC GetDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

will return all duplicate records of the OrderDetails table, having the same CustomerID, OrderID, and ItemID fields.

CREATE    PROCEDURE GetDuplicates(@TableName AS VARCHAR(255), @PK AS VARCHAR(4096))
AS
declare @table_pk as varchar(5000)
declare @tmp_pk as varchar(5000)
set @table_pk = replace(@pk, ' ', '') --Remove blanks
set @tmp_pk = @table_pk
--Add table name prefix
set @table_pk = @TableName+'.'+replace(@table_pk, ',', ', '+@TableName+'.')
--Add table name prefix
set @tmp_pk = '#dup_table_tmp.'+replace(@tmp_pk, ',', ', #dup_table_tmp.')
declare @sql as varchar(5000)
set @sql = 'select ' + @PK
set @sql = @sql + ', count(*) AS DuplicateCount '
set @sql = @sql + 'INTO #dup_table_tmp from ' + @TableName
set @sql = @sql + ' group by ' + @PK
set @sql = @sql + ' having count(*) > 1; '
set @sql = @sql + 'select '
set @sql = @sql + dbo.GetTableFieldsList(@TableName, @TableName+'.', ' as @')
set @sql = @sql + ', #dup_table_tmp.DuplicateCount'
set @sql = @sql + ' from ' + @TableName + ' inner join #dup_table_tmp'
set @sql = @sql + ' on Checksum(' + @table_pk + ') = Checksum(' + @tmp_pk + ')'
set @sql = @sql + ' order by ' + @table_pk
--print @sql
exec (@sql)

The next Stored Procedure, DeleteDuplicates, will delete all duplicate records of the table, leaving only one record for each Primary Key fields combination. Assuming that duplicate records might have different fields other than the Primary Key, we will use the MAX aggregate function to select records with maximal values in all fields.

Example to call:

EXEC DeleteDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

will delete all duplicate records of the OrderDetails table, having the same CustomerID, OrderID, and ItemID fields, leaving only one record with the same CustomerID, OrderID, and ItemID.

CREATE PROCEDURE DeleteDuplicates(@TableName AS VARCHAR(255), @PK AS VARCHAR(4096))
AS
declare @sql as varchar(5000)
set @sql = 'begin transaction; '
set @sql = @sql + 'if exists (select ' + @PK + ' from ' + @TableName
set @sql = @sql + ' group by ' + @PK
set @sql = @sql + ' having count(*) > 1)'
set @sql = @sql + ' begin '
set @sql = @sql + 'select '
set @sql = @sql + dbo.GetTableFieldsList(@TableName, 'max(', ') as @')
set @sql = @sql + ' INTO #dup_table_tmp from ' + @TableName
set @sql = @sql + ' group by ' + @PK + '; '
set @sql = @sql + 'delete from ' + @TableName + '; '
set @sql = @sql + 'insert into ' + @TableName
set @sql = @sql + ' select * from #dup_table_tmp;'
set @sql = @sql + ' end '
set @sql = @sql + 'commit;'
--print @sql
exec (@sql)

During the work, this procedure copies the filtered contents of the table to a temporary table and then selects all the contents back into the original table.

The attached zip file includes both the Stored Procedures and the function creation script. There is also a sample script to create a table, fill in some test data, and execute the procedures.

History

  • May 13, 2007 - The first version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Eugene Ochakovsky
Architect Visicom
Israel Israel
No Biography provided
Follow on   Twitter

Comments and Discussions

 
QuestionUsage of replace in GetTableFieldsList function Pinmembersirajtcs30-May-13 4:19 
AnswerRe: Usage of replace in GetTableFieldsList function PinmemberEugene Ochakovsky30-May-13 4:23 
Generalmerge row values PinmemberPeterSmith4-Dec-09 9:37 
QuestionWhat about NULL values? PinmemberPetter Halseth30-Jun-09 22:26 
QuestionDelete both duplicates [modified] Pinmemberplimon18-Dec-08 11:34 
QuestionStored Procedure PinmemberPradeep Tiwari25-May-08 20:45 
Questionwhat happens with bit colums?? Pinmemberdanielr207815-Nov-07 10:25 
AnswerRe: what happens with bit colums?? PinmemberEugene Ochakovski17-Nov-07 21:00 
QuestionThis is great PinmemberSreevidyaK5-Jul-07 0:09 
AnswerRe: This is great PinmemberSreevidyaK5-Jul-07 0:42 
AnswerRead this instead of previous post Re: This is great PinmemberSreevidyaK5-Jul-07 0:55 
GeneralRe: Read this instead of previous post Re: This is great PinmemberEugene Ochakovski5-Jul-07 1:05 
QuestionProduction environment? PinmemberHåkan Nilsson30-May-07 4:03 
AnswerRe: Production environment? PinmemberEugene Ochakovski30-May-07 5:18 
GeneralUnique Column PinmemberHamed_ji14-May-07 4:23 
GeneralRe: Unique Column PinmemberEugene Ochakovski14-May-07 7:49 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web02 | 2.8.140721.1 | Last Updated 13 May 2007
Article Copyright 2007 by Eugene Ochakovsky
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid