Introduction
The Common Table Expression that was introduced in SQL Server 2005 has long been something of an issue for me. I could reproduce the syntax to get what I wanted, but I couldn't honestly say I understood what was going on there. I always need a mental picture to understand something, and that was just missing.
Recently, I think I figured out a model that helps me work with them and explain how, although different, they're actually analogous to recursion in normal programming.
Background
Recursion
Let's start with recursion as a general programming concept. Consider this piece of code:
public string Rec(int Level) {
if (Level == 0)
return "0";
else
return Level.ToString() + "_" + Rec(Level - 1);
}
This code repeatedly calls itself until Level
reaches 0. So this call:
Rec(3);
gives the following result:
3_2_1_0
And this is roughly what happens during execution:
RecursionExample(3) =
"3_" + RecursionExample(2) =
"3_2_" + RecursionExample(1) =
"3_2_1_" + RecursionExample(0) =
"3_2_1_0"
Each time the function gets called, it receives the Level
from the previous call, turns it into a string that is returned, and passes the Level
to the next call after subtracting 1.
Recursion in SQL
Now to SQL. Here's an example of a recursive CTE:
with cte as (
select 1 id, null ParentId
union all
select C.Id, C.ParentId
from SomeTable as C
inner join cte as P
on C.ParentId = P.Id
)
In its definition, there is a join to itself, making it a recursive CTE. This is the same principle as in the C# example where the function called itself, but with a few twists.
First, it doesn't really join against itself, it joins against the previous pass of itself. Just like the C# example called itself a couple of times and each time getting a different result ("3_", "3_2_" etc.), there are stages in the execution of the CTE. Each time it calls itself, a new stage is created.
Second, SQL is a set-based language. So instead of passing an int to the next call and receiving a string, each pass in a CTE takes a set, and returns a set (of the same columns). Here we're joining to the previous pass, and we're union-ing the result of the current pass to the result of the previous pass. Each pass in the CTE results in a set of rows, and the next call to the CTE takes that set as its input.
But only the rows in the current pass are used in the next pass, like the int that is decremented in the C# code and passed to the next call. The union with all previous passes is only after that set is returned, like the string that was added to in C#. It's important to understand the separation between those two.
Last, it kind of works the other way around from C#. In the C# example, each time the function called itself, it passed what it needed to work, Level - 1, to it. In SQL, we do not actually pass in a variable, but we work against the previous pass of the CTE. It kind of looks back.
Let's clarify that with a simple example like the one we did in C#. Take this table that represents a small parent child tree:
When we would apply the above CTE to it, these would be the steps in the execution, analogous to what we did with C#:
1, null
2, 1
3, 1
4, 2
The first pass is where we selected (1, null) the first line in the CTE.
Then comes the union that makes a new call to the CTE. In this first pass, we only had one row, so that's what the second pass 'sees' to join against. This is the key to understanding recursive CTEs: each pass in the CTE only 'sees' the results of the previous pass.
So the second pass joins against that single row and returns (2, 1) and (3, 1) because they have Parent ID = 1.
Now another CTE calls itself again to join rows from SomeTable against these two rows, and out comes (4, 2).
One more join is made, but no more rows are returned, so the recursion stops and the CTE returns all five records.
Simulating a CTE with old-school SQL
This behaviour can be simulated without using a CTE, that would look like this:
declare @cte table (
id int,
ParentId int,
Pass int
)
declare @Pass int
set @Pass = 0
insert @cte(id, ParentId, Pass)
values (1, 0, @Pass)
while @@ROWCOUNT > 0
begin
set @Pass = @Pass + 1
insert @cte (id, ParentId, Pass)
select C.Id, C.ParentId, @Pass
from @t C
inner join @cte P
on C.ParentId = P.id
and P.Pass = @Pass - 1
end
select * from @cte
WHILE @@ROWCOUNT < 0
keeps repeating that section until no more rows are affected. Each time a new set is added to the resultset, they are kept separate by adding the pass depth to them. That pass depth can be used in the join to identify only the rows that where added in the previous pass.
Simulating old-school with a CTE
The Pass
column from the previous example can be easily added to the CTE to help a user trace the separate steps in the creation of the resultset. The following listing does just that.
with cte as (
select 1 id, 0 ParentId, 0 Pass
union all
select C.Id,
C.ParentId,
P.Pass + 1
from @t as C
inner join cte P on C.ParentId = P.Id
)
select * from cte
We start with a single start row, and give it Pass = 0
. Then we join against it to find the children and add them with Pass = 1
, and so on until no more rows are added.
Using the code
A sample code is included for the user to see it for himself. Just open it up and execute it.
Points of interest
When I understood that each pass only looks back at the previous pass, the CTE was demystified for me and I was a lot more comfortable using them in production. Before that, I was just repeating code that I did not fully grasp. The passes provide me with a mental picture that helps me work with CTEs. I hope it will help you to do the same. CTEs lead to much clearer code, are easier to get right, and are much easier to understand for a future developer on your code (once he has read this article, of course ;-)
I believe they are a very welcome addition to SQL because they let you specify what you want, instead of having to tell SQL how it should go about getting that result. SQL has been one of the most successful declarative languages for a long time, and CTEs let you specify your intent in a much clearer way than was previously possible. Comparing the two samples should prove that point.
History
- May 25, 2010: Version 1.0.
Gert-Jan is a Senior Quantitative Risk Manager at Leaseplan Corporation. In that job he doesn't get to code much he does these little projects to keep his skills up and feed the inner geek.