This is a discussion about SQL joins and subqueries with information about how to format join statements properly.
In any non-trivial task, developers need to join tables together. Joins are important when we want to:
- Display/get data from more than 1 table in a result set.
- Filter out records based on a particular restriction which requires using a column from another table.
- Connect to an intermediary [relationship] table that helps connect two primary [entity] tables which have a many to many relationship. For example, if
School are two entity tables, then
EnrollmentRecords might be the relationship table connecting the two.
There are certainly variations and extensions of why joins might be needed, but the above 3 cover most of the circumstances.
There is an alternative to using joins in some scenarios – subqueries. In instances where the goal is to filter on a restriction, a subquery can be used. Consider the following 2 SQL statements [using the
SELECT * FROM AdventureWorks.Person.Address
WHERE StateProvinceID IN
WHERE StateProvinceCode = 'CA'
FROM AdventureWorks.Person.Address addr
INNER JOIN AdventureWorks.Person.StateProvince state
ON addr.StateProvinceID = state.StateProvinceID
WHERE state.StateProvinceCode = 'CA'
These are both functionally equivalent. Which of these is better? In this trivial example, both are fine. However, when query speed becomes an important factor, joins are generally going to outperform subqueries. I'll explore this issue more in my next post.
Also, the queries, as-is, use
select *. This is generally not a good practice – instead, a column list should be provided in the
select statement. However, having the
select * raises one important difference between the 2 queries above – the columns from the joined table will be available in the result set whereas the columns from the table in the subquery will not. This is why I specified the table alias in the 2nd query [for the
select *] to make the results display equivalently.
join, notice there are two conditions – one that the
StateProvinceID must be equal and one that the state code needs to be '
CA'. Why did I put the
StateProvinceID check in the '
on' clause and the state code check in the '
where' clause? I think doing it this way is cleaner. When I join tables, I consider there to be three types of
- Those that are fundamental to the tables [these would normally be identified by foreign key relationships or, lacking those, identically named columns. These should always be in the '
- Those that aren't fundamental to the tables but that are almost always used as filters when joining on a table. For example, on the
student enrollment table at Boston Public Schools, we have a field for whether the record is the most recent one and whether the withdrawal code is set for that record. Almost all queries we use check to ensure the record is the most recent one and that the
student isn't withdrawn. While these two filters are not fundamental
join conditions when joining with
enrollment table, they are so frequent that it simply makes sense and is cleaner to have these in the '
on' clause instead of in the '
- Those that serve as potentially modifiable filters. For example, a check that the state code is
'CA' or that some test score is > 30 should probably be in the
where clause. Theoretically, if this logic is in a stored procedure, there should be an input parameter for the value to check/numerical threshold for filters in this category.
It is also possible to join the tables without any '
on' condition. For example, the following is functionally equivalent to the above two statements:
FROM AdventureWorks.Person.Address addr, AdventureWorks.Person.StateProvince state
WHERE addr.StateProvinceID = state.StateProvinceID
AND state.StateProvinceCode = 'CA'
This is an implicit
join with a comma separating the joined tables and all conditions in the
where clause. The original
join [with an '
on' condition] is an explicit
join. Is the implicit
join syntax recommended? Functionally and speed-wise, these two are identical. However, for queries with a number of tables being joined, the implicit syntax can become unmaintainable and difficult to debug/extend. I always prefer the explicit