Understanding “Set based” and “Procedural” approaches in SQL

Al-Farooque Shubho

4.91/5 (60 votes)

Mar 15, 2009

CPOL

12 min read

267265

Exploring Set based and Procedural approaches to write better SQL in SQL Server.

Introduction

OK, you know that you should write SQL using "Set based approaches" rather than "Procedural approaches" while developing data access routines. But, what are these two approaches really? why is Procedural approach better? I searched for the answer on the web, and did not find enough information to satisfy myself (though I found lots of technical explanations that somehow touched this issue, they did not explain it fully). So I tried myself to explore and understand the actual "knowhow" and differences between the two, and to know the reasons why "Set based approaches" are better.

In this article, I would like to share my ideas and findings with you, using some simple observations and explanations. Along with that, I will present some simple techniques to convert "Procedural SQL" into "Set based SQL" that might help you write better SQL :).

Please note that I am writing this article with SQL Server in mind. But the principle is mostly true for any database server implementation.

Using the code

The article may not be published anywhere else, except with the author's permission. The code can be re-used anywhere for learning purposes.

What is "Procedural approach"?

Well, simply speaking, Procedural approach is actually the "programmatic approach" that we are used to working with in our daily programming life. In this approach, we tell the system "what to do" along with "how to do" it. We query the database to obtain a result set and we write the data operational and manipulation logic using loops, conditions, and processing statements to produce the final result. The runtime does whatever we want it to do, however we want it to do.

In TSQL, any User Defined Function (UDF) or Cursor that executes on a result set row by row is a procedural approach. For example, if you are querying your database to obtain a result set and using a Cursor to navigate the result set to do further processing row by row, you are using a procedural approach. Also, if you are using a User Defined Function in your SQL for processing each row in the result set to calculate a scalar output, you are using a procedural approach. Looking for example code? Continuing reading..

What is "Set based approach"?

Set based approach is actually an approach which lets you specify "what to do", but does not let you specify "how to do". That is, you just specify your requirement for a processed result that has to be obtained from a "set of data" (be it a simple table/view, or joins of tables/views), filtered by optional condition(s). Sometimes, the specification of the "sets" you like to retrieve data from may be obtained by using complex joins/subqueries/conditional case statements, but in the end, there is a set of data from which the resultant set has to be obtained. You never have to specify "how" the data retrieval operation has to be implemented internally. You never have to specify how to implement the "joining operation" internally either. Also, you don't have to specify how to apply a filter condition against the rows. The database engine determines the best possible algorithms or processing logic to do these.

For example, following is SQL that is written using "Set based approach":

SELECT ProductName,CategoryName
FROM Product INNER JOIN Category
ON Product.CategoryID = Category.ID 
WHERE Price > 100

In the above SQL, "Product INNER JOIN Caterogy ON Product.CategoryID = Category.CategoryID" is the "set" of data from where ProductName and CategoryName has to be displayed for those products which have a Price > 100. You just specify your requirements and conditions, and the SQL engine does the rest to produce the result.

Why is "Set based approach" better than "Procedural approach"?

Frankly speaking, the internal execution engines of databases are designed and optimized for taking "set based instructions" as input (SQL) and executing these instructions in the best possible way (that varies based on a lot of criteria) to produce an output. That's why "Set based approaches" are almost always the better option.

Let us try to understand this with an example.

We often need to do joins between two or more sets of data (say, two tables) to obtain a desired result set. While joining, we only specify the participating table/view names and join conditions. Now, it's the SQL engine which has to implement the actual "joining operation" based upon the join conditions that we provide.

In the SQL Server execution engine, there are three types of joining techniques. These are:

Merge join: Most optimized joining algorithm. Takes place when both tables are joined on indexed columns which are sorted. In a merge join operation, SQL Server scans once through the sorted inputs and merges the data together.
Nested loop join: Moderate cost joining algorithm. Takes place when one of the tables in the join clause contains a small number of records (inner table) compared to the number of records in the other participating table (outer table). Nested loops scan once through each inner input row and search for the corresponding row in the outer input.
Hash join: Most expensive joining algorithm. Takes place when large, unsorted, non-indexed columns are used in the join condition. The SQL Server engine prepares a hash table for the smaller relation by applying a hash function to the join attribute of each row. Then it scans the larger relation and finds the relevant rows by looking at the hash table.

Now, whenever we specify any join predicate in any SQL in SQL Server, based upon the participating columns, data volume, indexing structure, and the set of values in the participating columns, SQL Server determines and uses the best possible joining algorithm and implements the logic to perform the actual joining operation in memory. Note that we don't have to specify the type of joining algorithm in the SQL. SQL Server does that, and does its best to provide the result as fast as possible.

That's where the database engine is in love with "Set based" approaches.

Another important fact is, whatever SQL written in "Set based approach" is issued in the database, the query optimizer generates an execution plan first, and then the execution engine executes the plan to retrieve data from the physical storage and processes the output in an efficient manner. That is, there is a single execution plan tree for each single SQL statement, be it simple or complex. Executing that single execution plan tree is generally a faster operation.

But, when we specify our own way of processing a result set (that is obtained by executing a SQL) using another SQL that works on a row-by-row manner in the resultset (perhaps using a UDF), the database engine has to execute an execution plan for each and every row, even after obtaining the result set by executing an execution plan. Imagine a row-by-row operation that is executed for a result set containing 1 million rows. In this case, the initial data retrieval operation would require an execution plan to be executed, and later, 1 million times another execution plan has to be executed for processing each row. That's what happens when a User Defined Function (UDF) is executed for each row in a result set. An additional overhead of using UDFs is the amount of stack I/O that takes place for invoking the UDF.

On the other hand, if you use a Cursor to process a result set row-by-row, while executing, the Cursor locks the rows in the corresponding table, and unlocks the rows when processing is done. This involves lots of resource usage on the server, and in the case of large result sets, severely slows down performance.

This experiment shows that using a UDF for row-by-row operations involving up to 1000 rows may provide a performance that is within an acceptable range. But, as the number of rows are increased, using UDF would result in a dramatically slow performance. The same goes for Cursors.

That is why "Set based SQL" always outperforms "Procedural SQL", specially if the result set to process becomes large.

OK, understood. Now tell me how to get rid of "Procedural approaches"?

It's not hard. Just follow the following simple tricks and re-factor some of your SQL.

Use inline sub queries to replace User Defined Functions

Let's assume, for a self-referential table Employee(ID, Name,MgrID), there is a query written in "Procedural approach" using a User Defined Function. The query outputs employee names and corresponding manager names.

Here is the query:

SELECT Name AS [Employee Name],dbo.fnGetManagerName(MgrID) as [Manager Name] FROM Employee

Here, dbo.fnGetManagerName(MgrID) is a UDF that returns the manager's name (which is nothing but another employee in the same Employee table) as follows:

ALTER FUNCTION [dbo].[fnGetManagerName](@ID int) RETURNS VARCHAR(50) AS
BEGIN
          --Declare the variable to hold result 
          DECLARE @ManagerName varchar(50)
          --Determine the Employee name by the given ID 
          SELECT @ManagerName = Name FROM Employee WHERE ID = @ID
          --Return the result
          RETURN @ManagerName
END

The above "Procedural SQL" could be re-written using a sub query in "Set based approach" as follows:

SELECT E.Name AS [Employee Name],
(
    SELECT Name FROM Employee WHERE ID = E.MgrID
) AS [Manager Name] 
FROM Employee E

In one of the projects I worked on, we had a slow performing Stored Procedure in a moderate sized SQL Server 2000 database. The SP used to process around 20,000 records to produce a result set. All we needed to optimize it was replace a UDF with an inline sub query (because all other optimizations were done already). Believe me, that turned down the total execution time from 90 long seconds to just 1 second!

Please note that the above query could also be written using a self join, that would equally give you a faster result.

Try to rewrite your Cursor based TSQLs with correlated subqueries

You can rewrite your Cursor based code with a correlated subquery.

Generally, you use a Cursor to obtain a result set, and then process each row one by one to formulate the desired result. This kind of processing could be replaced with a correlated subquery (in most cases).

What is a correlated subquery?

A correlated subquery is a subquery where the outer query has to be executed first, and then for each row in the outer query, the inner query is executed. That means, before executing the inner query for a particular row, the outer query has to be processed for that particular row (and hence, the inner query is correlated with the outer one).

Take a look at the following query again (you've seen this already). This is a correlated subquery:

SELECT E.Name AS [Employee Name],
(
    SELECT Name FROM Employee WHERE ID = E.MgrID
) AS [Manager Name] 
FROM Employee E

Let's break this query. The outer query is:

SELECT E.Name as [Employee Name] FROM Employee E

And, the inner query is:

SELECT Name FROM Employee WHERE ID = E.MgrID

Note the WHERE clause in the inner query (ID = E.MgrID). In order to execute this inner query, the query processing engine needs E.MgrID to be available already. So, to determine each row in the overall result set, the query is processed in the following way:

Obtain the "Name" column value (as Employee Name) and "MgrID" column value from the Employee table.
Obtain the "Name" column value (as Manager Name) from the Employee table where ID = MgrID.

As you can see, to determine each row in the result set, the SQL execution engine has to execute two different SQLs. But, the execution of this query is far more optimized compared to the UDF and Cursor based query, because in the subquery way, the SQL Server decides the optimized and best way to implement the inner query in its execution plan (like deciding the best algorithm for implementation of a join), and hence the query executes faster (Set based approach).

However, if you already have some complex processing logic implemented using a Cursor that is executed for each row in the result set, and if you think that implementing the same logic using a Set based approach is hard or near to impossible, you can follow this approach:

Instead of a cursor, use a table variable to process the result set

What is a "Table" variable?

In TSQL (since SQL Server 2000), a "Table" variable is a special kind of variable that resembles more or less an actual table. But, the most important thing about a "Table" variable is, it resides in memory almost 100% of the time (unless the Table variable itself is too large; in this case, the Table variable could reside in the tempdb database).

Use of a Table variable is efficient (most of the time, in terms of memory and execution time) compared to temporary tables because of the following reasons:

Temporary tables reside in the tempdb database, and operating on temporary tables results in inter-DB communication. This is bound to be slow. But, Table variables are mostly "in memory" variables, so I/O in Table variables is bound to be fast.
Operating on temporary tables result in lots of disk activities and resource usage because:

The temporary table has to be created
Data has to be inserted on the temporary table
Often, temporary table has to be joined with a physical table to obtain a result
A lock has to be established on the temporary table while updating data on it
Temporary table has to be dropped

But, operating on table variables requires no locking on the resources. Moreover, data insertion on a table variable is a lot faster than on a temporary table as no disk I/O and inter DB communication takes place. Also, the table variable goes out of scope when the corresponding SQL block goes out of scope. Therefore, table variables need not be dropped. All these make table variables an excellent choice for implementing faster TSQL.

Well, now it's obvious that Table variables are better than the temporary tables in most cases. But, can you use Table variables in place of Cursors?

Yes, you can. Following is an example of processing a result set using a Table variable (the SQL that uses the Cursor to process the result set is not included here, because I don't like you to learn Cursors.. ha ha)

--Declare the Table variable 
DECLARE @Elements TABLE
(
    Number INT IDENTITY(1,1), --Auto incrementing Identity column
    ProductName VARCHAR(300) --The string value
)

--Decalre a variable to remember the position of the current delimiter
DECLARE @N INT 

--Decalre a variable to remember the number of rows in the table
DECLARE @Count INT

--Populate the TABLE variable using some logic
INSERT INTO @Elements SELECT Name FROM dbo.Products

--Initialize the looper variable
SET @N = 1

--Determine the number of rows in the Table
SELECT @Count=max(Number) from @Elements

--A variable to hold the currently selected value from the table
DECLARE @CurrentValue varchar(300);

--Loop through until all row processing is done
WHILE @N <= @Count

BEGIN
    --Load current value from the Table
    SELECT @CurrentValue = ProductName FROM @Elements WHERE Number = @N
    --Process the current value
    print @CurrentValue
    --Increment loop counter
    SET @N = @N + 1;
END

I can bet you will be surprised to see the performance benefits by replacing the Cursor based code that you might have written with a Table variable based code.

Please note that you still should try not to write TSQL using the "Procedural approach" (use of a Table variable is still a Procedural approach). But, if for some reason you really need to write your own way of processing a result set, you can at least use Table variables to avoid Cursors.

Have fun writing "Set based" SQL. Enjoy!

History

Initial version: 15 March 2009.
Modifications:

16 March 2009
17 March 2009
23 March 2009