Download SQL_Wizardry_08.zip - 1.2 KB

Introduction

In the SQL world, there is a common acronym. That is RBAR, which stands for 'row by agonising row'. SQL is set based, that means that you operate on sets of data by defining rules to make clear what those sets are. Any time you move away from this, and process data using procedural code, that works one row at a time, you are at great risk of slowing your processes down.

The alternative to this quite often, is what's called a 'tally table'. This is just a table that has a sequence of numbers in it. I've used such a table in previous articles, and today I'm going to dig in to the code that creates it, and talk about some additional uses for such a table, beyond the ones I've already shown you.

Creating the table

Here is the first block of SQL from the sample file:

WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*10^2 or 10000 rows
  E8(N) AS (SELECT 1 FROM E4 a, E4 b) -- 1*10^2 or 100000000 rows

  select 'E1', count(1) from e1 union all
  select 'E2', count(1) from e2 union all
  select 'E4', count(1) from e4 union all
  select 'E8', count(1) from e8 
;

It returns the following:

E1	10
E2	100
E4	10000
E8	100000000

The comma in SQL is short hand for a cross join. This is not a join in terms of connecting based on a common value, it simply takes every value in table a, and for each value, returns every value in table b. The effect of this is clearly exponential. And so, we start with a union all block that creates 10 rows. Then we do a cross join, which returns 10 * 10, or, 100 rows. A cross join on this table returns 100 * 100, or 10000 rows. Cross joining this final table returns 100000000 rows, which should be enough in most situations. The final select shows this, by returning the row names and the number of rows in each.

Note that this is SUPER fast. There's all sorts of stuff on the web about ways to create tally tables, please just trust me that this is the most efficient way of doing so.

The second block of SQL looks like this:

WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*10^2 or 10000 rows
  E8(N) AS (SELECT 1 FROM E4 a, E4 b), -- 1*10^2 or 100000000 rows
  E(N) as(select row_number() over (order by n) from e8)


  select distinct top 100 e8.n as 'E8', e.n as e from e, e8
;

The trouble with the first block, is that we ended up with a table that had 100000000 1s in it. Now we use row_number() to create a sequence of numbers, which is what we need to make this really useful. The select does select 100 values from e8, but usually, I just cut back on the number of tables created, if I need less than 10000 rows. So, in this case, getting rid of E4, E8 and the top statement would have been more efficient.

Generating a binary sequence

The simplest thing we can do with a tally table, is to use the sequence with some maths to generate a sequence other than 1,2,3,4, etc. Here is a binary sequence.

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E(N) as(select row_number() over (order by n) from e2)

  select power(2, n) from e where n < 30
;

Date ranges

Another thing we may need to do from time to time, is generate a date range. If you need a range that is not inclusive of all days, we can mix this example with the example about to create a sequence of numbers that does not increase by 1, or add a where clause to skip values we want to skip, if the rule is not purely numeric. Here's the simple version. Note, you don't NEED to return the day, month and year, I'm simply doing so here for the purpose of illustration

DECLARE @BeginDate DATE = '2011-01-01', @EndDate DATE = '2012-06-30'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*10^2 or 10000 rows
  E8(N) AS (SELECT 1 FROM E4 a, E4 b), -- 1*10^2 or 100000000 rows
  E(N) as(select row_number() over (order by n) from e8)

SELECT DATEADD(DD, N-1, @BeginDate) [Date]
,DAY(DATEADD(DD, N-1, @BeginDate)) [Day] 
,MONTH(DATEADD(DD, N-1, @BeginDate)) [Month]
,YEAR(DATEADD(DD, N-1, @BeginDate)) [Year]
FROM E
WHERE N <= DATEDIFF(DD, @BeginDate, @EndDate) + 1
;

Finding missing dates

So, what do we do now that we can generate a date range ? Anyone following these articles will know, I like to use the AdventureWorks database to get data to work against. Please use it now and run this:

DECLARE @DateStart DATETIME = '2005-07-01 00:00:00.000'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*100^2 or 10000 rows
  E(N) as(select row_number() over (order by n) from e4)

	select convert(date, dateadd(dd, t.N, @DateStart))
	from Sales.SalesOrderHeader o
	right join E t on dateadd(dd, t.N, @DateStart) = o.OrderDate
	where o.orderdate is null and dateadd(dd, t.N, @DateStart) < getdate()

We could pull the lowest date out in our SQL, but it probably won't change, so it's more efficient to get it and hard code it ( assuming this became a proc that was used regularly ). We do a couple of things here. We use an outer join to get null values to identify where the join failed. We use the tally table to generate dates. Adventureworks does use DateTime just for dates, and I am assuming that time component is always empty, this is easy to fix, but it would clutter the example. The only other thing we do, is make sure we don't count past 'today', because we don't care that we didn't get an order in the future. The end result is the days on which no orders were received

Another thing we can do to make this even more useful, is to cull values that are not relevant to us. This version only returns days we didn't get an order, that are not a Saturday or Sunday.

DECLARE @DateStart DATETIME = '2005-07-01 00:00:00.000'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*100^2 or 10000 rows
  E(N) as(select row_number() over (order by n) from e4)

	select convert(date, dateadd(dd, t.N, @DateStart))
	from Sales.SalesOrderHeader o
	right join E t on dateadd(dd, t.N, @DateStart) = o.OrderDate
	where o.orderdate is null 
        and dateadd(dd, t.N, @DateStart) < getdate() 
        and datepart(dw, dateadd(dd, t.N, @DateStart)) not in (1,7)

Find substring positions

Having to deal with a unit of data ( a string ) as a group of items, is always bad news in SQL and is best avoided. But, often we're called to work on existing systems and cannot change them. The following SQL will return the positions of a delimiter in a string. It would be trivial to make the delimiter multichar if we wanted to. Just change the 1 in the substring to len(@delimiter).

DECLARE @val varchar(max) = 'Christian;Donna;Hannah;Calvin;Jenny;Joe', @delimiter char(1) = ';'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*100^2 or 10000 rows
  E(N) as(select row_number() over (order by n) from e4)
   
	SELECT n AS [Index]
	FROM E  
	WHERE n <= LEN(@val)  
	AND SUBSTRING(@val, n, 1) = @delimiter
	ORDER BY N
go

Extracting substrings

Now it's simple to write code that returns those substrings, instead of positions. Here's the SS2012 version:

DECLARE @val varchar(max) = 'Christian;Donna;Hannah;Calvin;Jenny;Joe', @delimiter char(1) = ';'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*100^2 or 10000 rows
  E(N) as(select row_number() over (order by n) from e4),
  ind(N) as
  ( 
	SELECT n AS [Index]
	FROM E  
	WHERE n <= LEN(@val)  
	AND SUBSTRING(@val, n, 1) = @delimiter
   ),
   words(start, [end]) as
   (
     select lag(n, 1, -1) over (order by n) + 1, n - (lag(n, 1, -1)  over (order by n)) - 1 from ind
   )

   select substring(@val, start, [end]) from words as Names
GO

If you run this, you'll find it does not emit the last string. This is because it works by finding the delimiters. There's two ways around this that I can see. First, always put a delimiter on the end of the string. Second, treat the end of the string like a delimiter in the first place. This SQL takes the second approach, and returns the delimiter length past the last index as the final position ( so we skip back to remove the delimiter, and end up with the full string ). Thanks to Duke Carey for pointing this out )

DECLARE @val varchar(max) = 'Christian;Donna;Hannah;Calvin;Jenny;Joe', @delimiter char(1) = ';'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*100^2 or 10000 rows
  E(N) as(select row_number() over (order by n) from e4),
  ind(N) as
  ( 
	SELECT n AS [Index]
	FROM E  
	WHERE n <= LEN(@val) + len(@delimiter) 
	AND (
	SUBSTRING(@val, n, 1) = @delimiter or n = len(@val) + len(@delimiter)
	)
   ),
   words(start, [end]) as
   (
     select lag(n, 1, -1) over (order by n) + 1, n - (lag(n, 1, -1)  over (order by n)) - 1 from ind
   )

   select substring(@val, start, [end]) from words as Names
GO

and here's a version for older SQL Server versions

DECLARE @val varchar(max) = 'Christian;Donna;Hannah;Calvin;Jenny;Joe', @delimiter char(1) = ';'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*100^2 or 10000 rows
  E(N) as(select row_number() over (order by n) from e4),
  ind(N, r) as
  ( 
	SELECT n AS [Index], row_number() over (order by n) 
	FROM E  
	WHERE n <= LEN(@val)  
	AND SUBSTRING(@val, n, 1) = @delimiter
   ),
   words(start, [end]) as
   (
	select isnull(ind1.n, -1) + 1, isnull(ind2.n,len(@val) + 1)  from ind ind1 full join ind ind2 on ind1.r = ind2.r-1
   )

   select substring(@val, start, [end] - start) from words as Names

Counting occurences of a substring

Given that we were looking for a substring ( the delimiter ), counting them is easy:

DECLARE @val varchar(max) = 'Christian;Donna;Hannah;Calvin;Jenny;Joe', @find char(2) = ';J'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*100^2 or 10000 rows
  E(N) as(select row_number() over (order by n) from e4)

	SELECT COUNT(1) AS [COUNT]
	FROM (
	SELECT N AS POS
	FROM e  
	WHERE N <= LEN(@VAL)  
	AND SUBSTRING(@VAL, N, LEN(@FIND)) = @Find
	) T

In this case, our substring includes the delimiter, and it means that we've found two names that start with J. One weakness in this approach is, what if we wanted to count names starting in C ? The way around that, would by to just add the delimiter to the start, like this:

DECLARE @val varchar(max) = 'Christian;Donna;Hannah;Calvin;Jenny;Joe', @find char(2) = ';C'

;WITH 
  E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
           ),                          -- 1*10^1 or 10 rows
  E2(N) AS (SELECT 1 FROM E1 a, E1 b), -- 1*10^2 or 100 rows
  E4(N) AS (SELECT 1 FROM E2 a, E2 b), -- 1*100^2 or 10000 rows
  E(N) as(select row_number() over (order by n) from e4)

	SELECT COUNT(1) AS [COUNT]
	FROM (
	SELECT N AS POS
	FROM e  
	WHERE N <= LEN(@find + @VAL)  
	AND SUBSTRING(@find + @VAL, N, LEN(@FIND)) = @Find
	) T

Conclusion

I've kind of changed my mind what I'd like to have called this series. I think a better name is 'Thinking in SQL', because if you're used to writing procedural code, it can be a challenge to stop thinking that way, and to think in terms of set based logic. A tally table is simply a way of using the database engine to do repetitive tasks the way it does things best, by working in sets. The next time you're tempted to write code that increases a counter one step at a time, either to count how often something is done, or to apply a sequence of numbers to an operation, stop and think if perhaps a tally table is a better way to solve your problem. I have considered creating a table based function to create my tally tables, but, a table based function creates a table in tempdb, which is exactly the thing I am setting out to avoid, by creating them the way that I do. If you have any further insight you'd like to offer on this possibility, I'm all ears, but for now I'm playing it safe and creating them on the fly.