Click here to Skip to main content
Click here to Skip to main content

SQL Wizardry Part Five - when not to use DISTINCT

, 6 Jan 2014 CPOL
Rate this:
Please Sign up or sign in to vote.
Just a quick explanation of why it's not always a good idea to use DISTINCT

 Introduction

I want to take a step back from the sorts of things I’ve been covering to talk about a common issue. A lot of the things I decide to write about, come out of questions I see in the forums. One thing I commonly see people doing, is throwing in the ‘DISTINCT’ keyword to try to solve their issues. Using DISTINCT For those who don’t know, here’s how it works. Run this SQL to create a pretty simple table:
create table testDistinct
( 
  id int
)

insert into testdistinct values (1), (2), (1), (3), (4), (1), (3), (2), (5)
Now run this:
select * from testDistinct
and then run this:
select distinct id from testdistinct
As you can see, it removes duplicate values. If I had a lot of columns, ALL the columns would need to be identical, in order for the row to be culled by DISTINCT. So, often people find they are getting too many results, and use DISTINCT to cull them down.

A more complex sample

Now run this SQL to create some test data. The data supposes that we are running a mailing server for auto repairers. We have tables for a repairer, a client, a client’s car, and also for mail settings for each repairer. Here is the SQL:
create table autoRepair
(
 id int identity,
 name varchar(50)
 )
 
 
 create table client
 (
   id int identity,
   repairerId int,
   name varchar(50) 
 )

 create table car
 (
   id int identity,
   clientId int,
   type varchar(50)  
 )
 
 create table repairMailSettings
 (
   repairId int,
   replyAddress varchar(100)
 )
 
 insert into autoRepair values('CO-OP Toyota'), ('Motors Holden')
 
 insert into client values (1, 'Fred Smith'), (1, 'Joe Kaputnuk'), (2, 'Bill Brown')
  
 insert into car values (1, 'Fred''s car'), (1, 'Fred''s other car'), (2, 'The Joe Mobile'),(3, 'Bill''s Brown Bomb')
 
 insert into repairMailSettings values (1, 'info@co-op.com'), (2, 'help@motors.com'), (1, 'info@co-op.com')
Now run this to get a list of all car owners, and their cars and repairer:
select ar.name as RepairerName, c.Name as ClientName, cr.Type as CarName from autoRepair ar 
 inner join client c on c.repairerId = ar.id
 inner join car cr on cr.clientId = c.id
This returns:
RepairerName	ClientName	CarName
CO-OP Toyota	Fred Smith	Fred's car
CO-OP Toyota	Fred Smith	Fred's other car
CO-OP Toyota	Joe Kaputnuk	The Joe Mobile
Motors Holden	Bill Brown	Bill's Brown Bomb 
Of course, if you wanted to send an email out to all clients, you’d need the mail settings too, so let’s include those:
select ar.name as RepairerName, c.Name as ClientName, cr.Type as CarName, ms.replyAddress from autoRepair ar 
 inner join client c on c.repairerId = ar.id
 inner join car cr on cr.clientId = c.id
 inner join repairMailSettings ms on ms.repairId = ar.id
This returns the following:
RepairerName	ClientName	CarName	replyAddress
CO-OP Toyota	Fred Smith	Fred's car	info@co-op.com
CO-OP Toyota	Fred Smith	Fred's car	info@co-op.com
CO-OP Toyota	Fred Smith	Fred's other car	info@co-op.com
CO-OP Toyota	Fred Smith	Fred's other car	info@co-op.com
CO-OP Toyota	Joe Kaputnuk	The Joe Mobile	info@co-op.com
CO-OP Toyota	Joe Kaputnuk	The Joe Mobile	info@co-op.com
Motors Holden	Bill Brown	Bill's Brown Bomb	help@motors.com
All the CO-OP records have been duplicated !!! So, the easy fix is to add DISTINCT:
select distinct ar.name as RepairerName, c.Name as ClientName, cr.Type as CarName, ms.replyAddress from autoRepair ar 
 inner join client c on c.repairerId = ar.id
 inner join car cr on cr.clientId = c.id
 inner join repairMailSettings ms on ms.repairId = ar.id
And we get:
RepairerName	ClientName	CarName	replyAddress
CO-OP Toyota	Fred Smith	Fred's car	info@co-op.com
CO-OP Toyota	Fred Smith	Fred's other car	info@co-op.com
CO-OP Toyota	Joe Kaputnuk	The Joe Mobile	info@co-op.com
Motors Holden	Bill Brown	Bill's Brown Bomb	help@motors.com
Looks great, right ? But have we solved an issue, or just hidden it ?

How joins work

Although tables are sometimes used to form 1 to 1 relationships ( as our repairMailSettings table is, in theory, the idea is to stop one table from becoming huge and full of disparate information ), most of the time, they define one to many relationships. For example, Fred is one person, but he owns two cars. His name appears in the client table once, but when we join clients to cars, because he has two cars, two rows are created, both with the same name, but with different car names.

Know what you’re asking

Sometimes people use DISTINCT because they have failed to ask the right question. For example, someone asked the other day why these two queries gave different results
Select count(a.id) from a inner join b on a.id = b.id
And
Select count(a.id) from a where a.id not in (select id from b)
The reason was, it was perfectly valid for b to have the same id more than once, but this created duplicate rows in the join, and increased the count. A join is always better than a subquery, but, sometimes a join is asking the wrong question.

This is a case where DISTINCT is perfectly valid. We know that the ids in a are unique but appear more than once in b. But we want the count of items in a that also appears in b. This is then how to do that:

Select count(distinct a.id) from a inner join b on a.id = b.id

Understand the result, don’t hide it

So, DISTINCT seemed to solve our car dealer issue, but all it did was fix the one query. A close look ( for those who missed it ) will reveal that there’s a duplicate row in repairMailSettings. This is the issue. Instead of using DISTINCT, we need to add a UNIQUE foreign key to our table, so the same repairerId can’t appear twice in our table. Had we gone with the DISTINCT issue, and had the user changed their settings, it’s possible we’d have ended up having two different sets of settings in our database. Even if they had never done that, every bit of SQL that used this table, would have needed DISTINCT to continue hiding the issue.

Conclusion

So, the moral of the story is, if you find you’re getting duplicate rows, understand why that is, don’t just hide it. DISTINCT is perfectly valid at times, but if you use it, you should first understand why you need it, and be sure there’s no other solution. Sometimes, what you’re doing is hiding an issue that is only going to resurface at a later date.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Christian Graus
Software Developer (Senior)
Australia Australia
Programming computers ( self taught ) since about 1984 when I bought my first Apple ][. Was working on a GUI library to interface Win32 to Python, and writing graphics filters in my spare time, and then building n-tiered apps using asp, atl and asp.net in my job at Dytech. After 4 years there, I've started working from home, at first for Code Project and now for a vet telemedicine company. I owned part of a company that sells client education software in the vet market, but we sold that and I worked for the owners for five years before leaving to get away from the travel, and spend more time with my family. I now work for a company here in Hobart, doing all sorts of Microsoft based stuff in C++ and C#, with a lot of T-SQL in the mix.

Comments and Discussions

 
QuestionFinish your car example PinprofessionalJV99995-Jan-14 22:46 
AnswerRe: Finish your car example PinprotectorChristian Graus6-Jan-14 11:37 
GeneralRe: Finish your car example PinprofessionalJV99996-Jan-14 23:34 
GeneralRe: Finish your car example PinprotectorChristian Graus6-Jan-14 23:47 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web03 | 2.8.141216.1 | Last Updated 6 Jan 2014
Article Copyright 2014 by Christian Graus
Everything else Copyright © CodeProject, 1999-2014
Layout: fixed | fluid