Click here to Skip to main content
15,867,453 members
Articles / Database Development / SQL Server

Remove duplicate records from SQL tables

Rate me:
Please Sign up or sign in to vote.
3.12/5 (11 votes)
13 May 2007CPOL3 min read 97.9K   503   34   16
A generic procedure to retrieve and delete duplicate records from MS SQL Server tables.

Introduction

A generic stored procedure to retrieve and delete duplicate records from a table, specifying a custom list of fields identifying each record.

Background

When I did optimizations on a customer's production database, I found problems creating proper primary keys because of duplicate records found in tables. Doing a little web search, I found plenty of scripts to remove duplicate records. All existing scripts had two major problems:

  1. They are not generic and must be written for every table
  2. They are all assuming that duplicate records are identical in all fields and not only primary keys

In my situation, I had to process 150 tables, and I didn't want to write a script for every table. Also, in some tables, duplicate records had the same fields designated for the Primary Key, but other fields were different! So Select Distinct, used in all scripts, was still returning these records twice.

Therefore, I had to write my own script to solve the problem.

Using the code

There are two Stored Procedures:

SQL
GetDuplicates @TableName, @PK
DeleteDuplicates @TableName, @PK

Both procedures accept two string parameters: table name and a comma delimited list of primary key fields. Primary key fields are fields which uniquely identify records, they are not necessarily an actual table primary key.

Example to call:

SQL
--Get all duplicate recordes from table 
--        OrderDetails having unique CustomerID, OrderID and ItemID
EXEC GetDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

Both Stored Procedures are using dynamic SQL to be generic. They are also calling the function GetTableFieldsList to get a string with all the table fields delimited by comma. The function allows to set a padding function to all the fields by adding prefix and suffix parameters. Every field in the returned list will be prefixed by the prefix and suffixed by the suffix parameter. The use of @ within the parameters allows getting the current field name.

Example to call:

SQL
SELECT dbo.GetTableFieldsList('Customers', 'MAX(', ') AS @')

will return MAX(f1) AS f1, MAX(f2) AS f2, ... where f1, f2, ... are fields of the Customers table.

SQL
CREATE     Function GetTableFieldsList(
 @TableName AS VARCHAR(255),
 @FieldPrefix AS VARCHAR(255) = '',
 @FieldSuffix AS VARCHAR(255) = '')
RETURNS VARCHAR(5000)
AS

BEGIN
declare @fields_list as varchar(5000)
set @fields_list = ''
declare names_curr cursor read_only forward_only
for 
select syscolumns.name from syscolumns inner join 
       sysobjects on syscolumns.id = sysobjects.id 
where sysobjects.name=@TableName
order by syscolumns.colid
declare @tmp_field as varchar(255)
declare @tmp_prefix as varchar(255)
declare @tmp_suffix as varchar(255)
open names_curr
fetch next from names_curr into @tmp_field
while (@@FETCH_STATUS = 0)
begin
 set @tmp_prefix = Replace(@FieldPrefix, '@', @tmp_field)
 set @tmp_suffix = Replace(@FieldSuffix, '@', @tmp_field)
 set @tmp_field = @tmp_prefix + @tmp_field + @tmp_suffix
 set @fields_list = CASE WHEN @fields_list='' THEN 
     @tmp_field ELSE @fields_list + ', ' + @tmp_field END
 fetch next from names_curr into @tmp_field
end
close names_curr
deallocate names_curr
RETURN @fields_list
END

The first Stored Procedure, GetDuplicates, will return all duplicate records of the table, ordered by the Primary Key fields list + the DuplicateCount field to indicate the number of duplicate records. This procedure has two string parameters: the table name and the Primary Key fields list. Note that Primary Key fields list must contain the field names only, without the table name.

Example to call:

SQL
EXEC GetDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

will return all duplicate records of the OrderDetails table, having the same CustomerID, OrderID, and ItemID fields.

SQL
CREATE    PROCEDURE GetDuplicates(@TableName AS VARCHAR(255), @PK AS VARCHAR(4096))
AS
declare @table_pk as varchar(5000)
declare @tmp_pk as varchar(5000)
set @table_pk = replace(@pk, ' ', '') --Remove blanks
set @tmp_pk = @table_pk
--Add table name prefix
set @table_pk = @TableName+'.'+replace(@table_pk, ',', ', '+@TableName+'.')
--Add table name prefix
set @tmp_pk = '#dup_table_tmp.'+replace(@tmp_pk, ',', ', #dup_table_tmp.')
declare @sql as varchar(5000)
set @sql = 'select ' + @PK
set @sql = @sql + ', count(*) AS DuplicateCount '
set @sql = @sql + 'INTO #dup_table_tmp from ' + @TableName
set @sql = @sql + ' group by ' + @PK
set @sql = @sql + ' having count(*) > 1; '
set @sql = @sql + 'select '
set @sql = @sql + dbo.GetTableFieldsList(@TableName, @TableName+'.', ' as @')
set @sql = @sql + ', #dup_table_tmp.DuplicateCount'
set @sql = @sql + ' from ' + @TableName + ' inner join #dup_table_tmp'
set @sql = @sql + ' on Checksum(' + @table_pk + ') = Checksum(' + @tmp_pk + ')'
set @sql = @sql + ' order by ' + @table_pk
--print @sql
exec (@sql)

The next Stored Procedure, DeleteDuplicates, will delete all duplicate records of the table, leaving only one record for each Primary Key fields combination. Assuming that duplicate records might have different fields other than the Primary Key, we will use the MAX aggregate function to select records with maximal values in all fields.

Example to call:

SQL
EXEC DeleteDuplicates 'OrderDetails', 'CustomerID, OrderID, ItemID'

will delete all duplicate records of the OrderDetails table, having the same CustomerID, OrderID, and ItemID fields, leaving only one record with the same CustomerID, OrderID, and ItemID.

SQL
CREATE PROCEDURE DeleteDuplicates(@TableName AS VARCHAR(255), @PK AS VARCHAR(4096))
AS
declare @sql as varchar(5000)
set @sql = 'begin transaction; '
set @sql = @sql + 'if exists (select ' + @PK + ' from ' + @TableName
set @sql = @sql + ' group by ' + @PK
set @sql = @sql + ' having count(*) > 1)'
set @sql = @sql + ' begin '
set @sql = @sql + 'select '
set @sql = @sql + dbo.GetTableFieldsList(@TableName, 'max(', ') as @')
set @sql = @sql + ' INTO #dup_table_tmp from ' + @TableName
set @sql = @sql + ' group by ' + @PK + '; '
set @sql = @sql + 'delete from ' + @TableName + '; '
set @sql = @sql + 'insert into ' + @TableName
set @sql = @sql + ' select * from #dup_table_tmp;'
set @sql = @sql + ' end '
set @sql = @sql + 'commit;'
--print @sql
exec (@sql)

During the work, this procedure copies the filtered contents of the table to a temporary table and then selects all the contents back into the original table.

The attached zip file includes both the Stored Procedures and the function creation script. There is also a sample script to create a table, fill in some test data, and execute the procedures.

History

  • May 13, 2007 - The first version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Chief Technology Officer Patrol-IT
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
QuestionUsage of replace in GetTableFieldsList function Pin
sirajcodeproject30-May-13 4:19
sirajcodeproject30-May-13 4:19 
AnswerRe: Usage of replace in GetTableFieldsList function Pin
Eugene Ochakovsky30-May-13 4:23
Eugene Ochakovsky30-May-13 4:23 
Generalmerge row values Pin
PeterSmith4-Dec-09 9:37
PeterSmith4-Dec-09 9:37 
QuestionWhat about NULL values? Pin
Petter Halseth30-Jun-09 22:26
Petter Halseth30-Jun-09 22:26 
QuestionDelete both duplicates [modified] Pin
plimon18-Dec-08 11:34
plimon18-Dec-08 11:34 
QuestionStored Procedure Pin
Pradeep Tiwari25-May-08 20:45
Pradeep Tiwari25-May-08 20:45 
Questionwhat happens with bit colums?? Pin
danielr207815-Nov-07 10:25
danielr207815-Nov-07 10:25 
AnswerRe: what happens with bit colums?? Pin
Eugene Ochakovsky17-Nov-07 21:00
Eugene Ochakovsky17-Nov-07 21:00 
QuestionThis is great Pin
SreevidyaK5-Jul-07 0:09
SreevidyaK5-Jul-07 0:09 
AnswerRe: This is great Pin
SreevidyaK5-Jul-07 0:42
SreevidyaK5-Jul-07 0:42 
AnswerRead this instead of previous post Re: This is great Pin
SreevidyaK5-Jul-07 0:55
SreevidyaK5-Jul-07 0:55 
GeneralRe: Read this instead of previous post Re: This is great Pin
Eugene Ochakovsky5-Jul-07 1:05
Eugene Ochakovsky5-Jul-07 1:05 
QuestionProduction environment? Pin
Håkan Nilsson (k)30-May-07 4:03
Håkan Nilsson (k)30-May-07 4:03 
AnswerRe: Production environment? Pin
Eugene Ochakovsky30-May-07 5:18
Eugene Ochakovsky30-May-07 5:18 
GeneralUnique Column Pin
Hamed J.I14-May-07 4:23
Hamed J.I14-May-07 4:23 
GeneralRe: Unique Column Pin
Eugene Ochakovsky14-May-07 7:49
Eugene Ochakovsky14-May-07 7:49 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.