Click here to Skip to main content
15,879,095 members
Articles / Database Development

Efficiency of Subqueries and OR conditions in SQL

Rate me:
Please Sign up or sign in to vote.
0.00/5 (No votes)
25 May 2011CPOL3 min read 11.9K   3   3
A discussion of how joins significantly outperform subqueries and how this is more evident when OR conditions are involved.

A discussion of how joins significantly outperform subqueries and how this is more evident when OR conditions are involved.

SQL queries can be written in a number of ways and be functionally equivalent but significantly different speed-wise. A great example of this is the use of cross joins instead of a while loop for populating a numbers table. The cross join, being set-based, was far more efficient. The insert statements used later in this post depend on the existence of a table ‘numbers’, which is used in place of a Student table. Here’s the code for this – pulled from the cross joins post [this only takes a few seconds instead of over a minute to run like the while loop - cross joins do have a place in SQL development]:

SQL
CREATE TABLE numbers (num INT)
 
----- Use a table variable 
DECLARE @numbers TABLE
(
      num INT
) 
 
----- Populate a table with numbers 0 to 9
DECLARE @counter INT
SET @counter = 0
 
While @counter < 10
BEGIN
      INSERT INTO @numbers VALUES(@counter)
      SET @counter = @counter + 1
END
 
INSERT INTO numbers (num)
SELECT 
(
      n100000place.num*100000 +
      n10000place.num*10000 +
      n1000place.num*1000 +
      n100place.num*100 + 
      n10place.num*10 + 
      n1place.num
) AS 'number'
FROM @numbers n1place 
CROSS JOIN @numbers n10place
CROSS JOIN @numbers n100place
CROSS JOIN @numbers n1000place
CROSS JOIN @numbers n10000place
CROSS JOIN @numbers n100000place
WHERE n100000place.num < 8 -----  < 800000
ORDER BY NUMBER

Today, I’d like to talk about subqueries and OR conditions in a SQL statement. Before discussing the particular queries, let’s set up a test table with some dummy data. For this, I’m inserting a significant number of rows [4 million] into a table so any performance issues become more evident than they would with a trivial amount of data.

SQL
--Note: These inserts in total take 23 seconds to run 
CREATE TABLE StudentTestData (StudentNumber VARCHAR(8), _
	testid VARCHAR(8), TestScore VARCHAR(3))
 
INSERT INTO StudentTestData 
SELECT numbers.num, 'AAABBBCC', numbers.num % 55 FROM numbers 
 
INSERT INTO StudentTestData 
SELECT numbers.num, 'AAABBBDD', numbers.num % 44 FROM numbers 
 
INSERT INTO StudentTestData 
SELECT numbers.num, 'FFFBBBDD', numbers.num % 33 FROM numbers 
 
INSERT INTO StudentTestData 
SELECT numbers.num, 'FFFRRRDD', numbers.num % 66 FROM numbers 
 
INSERT INTO StudentTestData 
SELECT numbers.num, 'CCCMMMDD', numbers.num % 55 FROM numbers

This table is similar in structure and size to a table we have at Boston Public Schools for storing student test results. Using the StudentTestData table, let’s say we needed to write a query to get a list of students who met the following conditions:

Got over a 10 on test FFFBBBDD AND over a 20 on test FFFRRRDD
  AND
Got over a 16 on test CCCMMMDD AND over a 30 on test AAABBBCC
  AND
Got over a 25 on test AAABBBDD

In this case, all 5 test score requirements must be met.

Here are two queries that return the results, the first based on joins and the second based on subqueries. I’ve used left joins because of a slight modification I’m going to make to the requirements later in the post – for the following query, inner joins would work fine.

SQL
-- Join-based query.  3 seconds the first time, < 1 seconds after that.  
-- SQL Server was able to determine a good execution plan to speed up the query. 
SELECT DISTINCT Numbers.num FROM Numbers
LEFT JOIN StudentTestData TestAAABBBCC 
ON TestAAABBBCC.StudentNumber = Numbers.num AND TestAAABBBCC.testid = 'AAABBBCC'
LEFT JOIN StudentTestData TestFFFBBBDD 
ON TestFFFBBBDD.StudentNumber = Numbers.num AND TestFFFBBBDD.testid = 'FFFBBBDD'
LEFT JOIN StudentTestData TestFFFRRRDD 
ON TestFFFRRRDD.StudentNumber = Numbers.num AND TestFFFRRRDD.testid = 'FFFRRRDD'
LEFT JOIN StudentTestData TestCCCMMMDD 
ON TestCCCMMMDD.StudentNumber = Numbers.num AND TestCCCMMMDD.testid = 'CCCMMMDD'
LEFT JOIN StudentTestData TestAAABBBDD 
ON TestAAABBBDD.StudentNumber = Numbers.num AND TestAAABBBDD.testid = 'AAABBBDD'
WHERE
-- Condition 1
(
	(TestFFFBBBDD.TestScore > 10 AND TestFFFRRRDD.TestScore > 20)
	AND
	(TestCCCMMMDD.TestScore > 16 AND TestAAABBBCC.TestScore > 30)
)
-- Condition 2
AND
(
	TestAAABBBDD.TestScore > 25
)
 
-- Here's a query that returns the same result but uses subqueries instead of joins:
-- 5 seconds the first time, 5 seconds after that. 
-- No apparent improvement can be made from the execution plan.
SELECT DISTINCT Numbers.num FROM Numbers
WHERE 
(
     (
	num IN (SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 10 AND testid = 'FFFBBBDD')
	AND num IN(SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 20 AND testid = 'FFFRRRDD')
     )
     AND
     (
	num IN (SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 16 AND testid = 'CCCMMMDD')
	AND num IN(SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 30 AND testid = 'AAABBBCC')
     )
)
AND
(
              num IN (SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 25 AND testid = 'AAABBBDD')
)

As I hinted would be the case in my previous post, the results clearly are in favor of the join-based query. While the speed differential wasn’t too significant for the first run of the query [5 seconds compared to 3 seconds], the differential became much more important for subsequent query runs. SQL Server appears to be much better at saving the execution plan for join-based queries than subquery-based queries.

Let’s say we throw a few OR conditions into the requirements. OR conditions will really illustrate why the join-based version is superior. Here are the new requirements:

Provide a list of student numbers for students who met one of the following two conditions:

Condition 1:
  Got over a 10 on test FFFBBBDD OR over a 20 on test FFFRRRDD
    AND
  Got over a 16 on test CCCMMMDD OR over a 30 on test AAABBBCC
Condition 2:
  Got over a 25 on test AAABBBDD

From a SQL query perspective, the queries are almost identical to the ones above, except that we needed to change AND to OR in a few cases and add parentheses when needed so precedence rules match the requirements. Also, the left joins are used in case some students didn’t take a test [using inner joins would remove students from the result if they didn't take all 5 tests].

SQL
-- Join-based statement.  3 seconds initially, 1.5 seconds afterwards.
SELECT DISTINCT Numbers.num FROM Numbers
LEFT JOIN StudentTestData TestAAABBBCC 
ON TestAAABBBCC.StudentNumber = Numbers.num AND TestAAABBBCC.testid = 'AAABBBCC'
LEFT JOIN StudentTestData TestFFFBBBDD 
ON TestFFFBBBDD.StudentNumber = Numbers.num AND TestFFFBBBDD.testid = 'FFFBBBDD'
LEFT JOIN StudentTestData TestFFFRRRDD 
ON TestFFFRRRDD.StudentNumber = Numbers.num AND TestFFFRRRDD.testid = 'FFFRRRDD'
LEFT JOIN StudentTestData TestCCCMMMDD 
ON TestCCCMMMDD.StudentNumber = Numbers.num AND TestCCCMMMDD.testid = 'CCCMMMDD'
LEFT JOIN StudentTestData TestAAABBBDD 
ON TestAAABBBDD.StudentNumber = Numbers.num AND TestAAABBBDD.testid = 'AAABBBDD'
WHERE
-- Condition 1
(
	(TestFFFBBBDD.TestScore > 10 OR TestFFFRRRDD.TestScore > 20)
	AND
	(TestCCCMMMDD.TestScore > 16 OR TestAAABBBCC.TestScore > 30)
)
-- Condition 2
OR
(
	TestAAABBBDD.TestScore > 25
)
 
-- Stopped after 90 seconds - did not finish.
SELECT DISTINCT Numbers.num FROM Numbers
WHERE 
(   
     (
	num IN (SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 10 AND testid = 'FFFBBBDD')
	OR num IN(SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 20 AND testid = 'FFFRRRDD')
      )
      AND
     (
	num IN (SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 16 AND testid = 'CCCMMMDD')
	OR num IN(SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 30 AND testid = 'AAABBBCC')
     )
)
OR
(
	num IN (SELECT studentNumber FROM StudentTestData _
		WHERE TestScore > 25 AND testid = 'AAABBBDD')
)

Just by adding the OR conditions to the requirements, the time differential between the join-based query and the subquery version skyrocketed. While the join-based query was slightly slower, the subquery version didn’t finish. One of these nights, I’ll let the subquery version run overnight to see how long it takes.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
GeneralIndexes? Pin
Richard Brett30-May-11 18:20
Richard Brett30-May-11 18:20 
GeneralInteresting, but... Pin
oggenok6428-May-11 2:36
oggenok6428-May-11 2:36 
Your conclusions about joins vs. subqueries are valid for MS SQL Server only. I've replicated your example with 4 mio records on an Oracle database and, for the fun of it, a Firebird database. Neither suffered from the degrading performance you experienced with SQL Server.

I've seen this argument over and over again: "Use joins instead of subqueries because they scale better". In general, it's rubbish. The truth is that MS has a pretty sloppy implementation of subqueries.

Let me suggest a change in the title from "Efficiency of Subqueries..." to "Defeciency of Subqueries in MS SQL Server..."


Regards


BTW
The declaration of StudentTestData.TestScore as VARCHAR(3) seems an odd choice. Whats wrong with an INT?
GeneralRe: Interesting, but... Pin
Andrew Zwicker28-May-11 5:24
Andrew Zwicker28-May-11 5:24 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.