Click here to Skip to main content
Email Password   helpLost your password?

Introduction

OK, you know that you should write SQLs using “Set based approaches”, rather than “Procedural approaches” while developing data access routines. But, what these two approaches really are? why Procedural approach is better? I searched for the answer on the web, and, did not find enough information to satisfy myself (Though, I found lots of technical explanations that somehow touch this issue, but, not explain it fully). So, I myself tried to explore and understand the actual "knowhow" and differences between the two, and to know the reasons why “Set based approaches” are better.

In this article, I would like to share my ideas and findings with you, using some simple observations and explanations. Along with, I will present some simple techniques to convert “Procedural SQL’s” into “Set based SQL’s” that might help you writing better SQLs :).

Please be noted that, I am writing this article having the SQL Server in my mind. But, the principle is mostly true for any database server implementation.

Using the code

The article may not be published in any where else, except the author's permission. The codes can be re-used any where for learning purpose.

What is a “procedural approach”?

Well, simply speking, procedural approach is actually the “programmatic approach” that we are used to work with, in our daily programming life. In this approach, we tell the system “what to do” along with telling “how to do” it. We query the database to obtain result set and we write the data operational and manipulation logics using looping, conditional and processing statements to produce the final result. The runtime does whatever we want it to do, however we want it to do.

In TSQL, any User Defined Function (UDF), or Cursor that executes on a result set row by row is a procedural approach. For example, if you are quering your database to obtain a result set and using a Cursor to navigate the result set to do further processings row by row, you are using a procedural approach. Also, if you are using a User Defined Function in your SQL for processing each row in the result set to calculate a scaler output, you are using a procedural approach. Looking for example codes? Continuing reading..

What is a “Set based approach”?

Set based approach is actually an approach which lets you specify “what to do”, but, does not let you specify “how to do”. That is, you just specify your requirement for a processed result that has to be obtained from a “set of data” (Be it a simple table/view, or, joins of tables/views), filtered by optional condition(s). Sometimes, the specification of the “sets” you like to retrieve data from may be obtained by using complex joins/SubQuery/conditional case statements, but, at the end, there is a set of data from which the resultant set has to be obtained. You never have to specify “how” the data retrieval operation has to be implemented internally. You never have to specify how to implement the “joining operation” internally either. Also, you don’t have to specify how to apply filter condition against the rows. The database engine determines the best possible algorithms or processing logics to do these.

For example, following is an SQL that is written in a "Set based approach"

SELECT ProductName,CategoryName
FROM Product INNER JOIN Category
ON Product.CategoryID = Category.ID 
WHERE Price > 100 

In the above SQL, the "Product INNER JOIN Caterogy ON Product.CategoryID = Category.CategoryID" is the "set" of data from where ProductName and CategoryName has to be displayed for those products which have Price > 100. You just specify your requirements and conditions, the SQL engine does the rest to produce the result.

Why “Set based approaches” are better than the “Procedural approaches”?

Frankly speaking, the internal execution engines of databases are designed and optimized for taking “set based instructions” as input (SQLs) and executing these instructions in the best possible way (that varies based on lots of criteria) to produce output. That's why, "Set based approaches" are almost always better options.

Let us try to understand this with an example.

We often need to do joins between two or more sets of data (Say, two tables), to obtain a desired result set. While joining, we only specify the perticipating table/view names and join conditions. Now, it's the SQL engine which has to implement the actual "joining operation" based upon the join conditions that we provide.

In SQL Server execution engine, there are three types of joining techniques. These are:

Merge join : Most optimized joining algorithm. Takes place when both tables are joined on indexed columns which are sorted. In a merge join operation SQL Server scans once through the sorted inputs and merges the data together

Nested loop join : Moderate cost joining algorithm. Takes place when one of the table in the join clause contains small number of records (Inner table) comparing to the number of records in the other participating table (Outer table). Nested loops scan once through the each inner input row and search for the corresponding row in the outer input

Hash join : Most expensive joining algorithm. Takes place when large, unsorted, non-indexed columns are used in the join condition. The SQL server engine prepares a hash table for the smaller relation, by applying a hash function to the join attribute of each row. Then scan the larger relation and find the relevant rows by looking in the hash table.

Now, whenever we specify any join predicate in any SQL in the SQL Server, based upon the participating columns, data volume, indexing structure and, set of values in the participating columns, the SQL server determines and use the best possible joining algorithm and implement the logic to perform the actual joining operation in memory. Note that, we don’t have to specify the type of joining algorithm in the SQL. SQL server does that, and, does its best to provide the result as fast as possible.

So, that's where the database engine is in love with the "Set based" approaches.

Another important fact is, whatever SQL written in “set based approach” is issued in the database, the query optimizer generates an execution plan first, and then, the execution engine executes the plan to retrieve data from the physical storage and to process output in an efficient manner. That is, there is a single execution plan tree for each single SQL statement be it simple or complex. Executing that single execution plan tree is generally a faster operation.

But, when we specify our own way of processing a result set (That is obtained by executing an SQL already) using another SQL that works on row-by-row manner in the resultset (Perhaps, using a UDF), the database engine has to execute an execution plan for each and every row, even after obtaining the result set by executing an execution plan. Imagine a row-by-row operation that is getting executed for a result set containing 1 million rows. In this case, the initial data retrieval operation would require one execution plan to be executed, and, later, 1 million times another execution plan has to be executed for processing each row. That’s what happens when a User Defined Function (UDF) is executed for each row in a result set. An additional overhead of using UDF is the amount of stack I/O that takes place for invoking the UDF.

On the other hand, if you use a Cursor to process a result set row-by-row, while executing, the Cursor locks the rows in the corresponding table and unlocks rows when processing done. This involves lots of resource usage on the server and in case of large result sets, w severely slows down performance.

Experiment shows that, using a UDF for row-by-row operation involving up to 1000 rows may provide a performance that is within an acceptable range. But, as the number of rows are increased, using UDF would result in dramatically slow performance. The same goes for cursors.

That is why, “set based SQLs” always should outperform the “procedural SQLs”, specially, if the result set to process becomes large.

OK, understood. Now tell me how to get rid of "Procedural approaches"?

It's not hard. Just follow these simple tricks and re-factor some of your SQLs:

Use inline sub queries to replace User Defined Functions.

Let’s assume, for a self-referential table Employee(ID, Name,MgrID), there is a query written in a “Procedural approach” using a User Defined Funciton. The query outputs the Employee names and corresponding Manager’s names.

Here is the query:

SELECT Name AS [Employee Name],dbo.fnGetManagerName(MgrID) as [Manager Name] FROM Employee

Here, the dbo.fnGetManagerName(MgrID) is a UDF that returns the Manager’s Name (Which is nothing but another employee in the same Employee table) as follows:

ALTER FUNCTION [dbo].[fnGetManagerName](@ID int) RETURNS VARCHAR(50) AS
BEGIN
          --Declare the variable to hold result 
          DECLARE @ManagerName varchar(50)
          --Determine the Employee name by the given ID 
          SELECT @ManagerName = Name FROM Employee WHERE ID = @ID
          --Return the result
          RETURN @ManagerName
END

The above “Procedural SQL” could be re-written using a Sub Query in the “Set based approach” as follows:

SELECT E.Name AS [Employee Name],
(
    SELECT Name FROM Employee WHERE ID = E.MgrID
) AS [Manager Name] 
FROM Employee E 

In one of my projects I worked on, we had a slow performing stored procedure in a moderate sized SQL server 2000 database. The SP used to process around 20,000 records to produce a result set. All we needed to optimize it was replacing a UDF with an inline sub query (Because, all other optimizations were done already). Believe me, that turned down the total execution time from 90 long seconds to 1 second only!

Please note that, the above query could also be written using a self join, that would equally give you faster result.

Try to rewrite your Cursor based TSQLs with Correlated Subqueries

You can rewrite your cursor based code with a correlated subquery.

Generally, you use a cursor to obtain a result set and then process each row one by one to formulate the desired result. This kind of processing could be replaced with a correlated subquery (Most of the cases).

What is a correlated subquery?

A correlated subquery is a subquery where the outer query has to be executed first and then for each row in the outer query, the inner query is executed. That mean, before executing the inner query for a particular row, the outer query has to be processed for that particular row (And hence, the inner query is correlated with the outer one).

Take a look at the following query again (You’ve seen this already). This is a correlated subquery.

SELECT E.Name AS [Employee Name],
(
    SELECT Name FROM Employee WHERE ID = E.MgrID
) AS [Manager Name] 
FROM Employee E

Let’s break this query. The outer query is:

SELECT E.Name as [Employee Name] FROM Employee E

And, the inner query is:

SELECT Name FROM Employee WHERE ID = E.MgrID

Note the WHERE clause in the inner query (ID = E.MgrID) . In order to execute this inner query, the query processing engine needs the E.MgrID to be available already. So, to determine each row in the overall result set, the query processing processes the query in the following way:

1. Obtain the “Name” column value (As Employee Name) and “MgrID” column value from the Employee table.

2. Obtain the “Name” column value (As Manager Name) from the Employee table where  ID = MgrID.

So, as you can see, to determine each row in the result set, the SQL execution engine has to execute two different SQLs. But, execution of this query is far more optimized comparing to the UDF and Cursor based query because, in the SubQuery way, the SQL Server decides the optimized and best way to implement the inner query in its execution plan (Like deciding the best algorithm for implementation of a joining) and hence, the query executes faster (Set based approach).

However, if you already have some complex processing logics implemented using a Cursor that is executed for each row in the result set, and, if you think that implementing the same logic using a set based approach is hard or near to impossible, you can follow this approach:

Instead of a cursor, use a table variable to process the result set.

What is a "Table" variable?

In TSQL (Since SQL Server 2000), a “Table” variable is a special kind of variable that resembles more or less an actual table. But, most important thing about “Table” variable is, it resides in memory almost 100% time (Unless the Table variable itself is too large, in this case, the Table variable could reside in the tempdb database).

Use of table variable is efficient (most of the times, In terms of memory and execution time) comparing to the temporary tables because of the following reasons:

1. Temporary tables resides in the tempdb database, and, operating on the temporary tables results in inter-db communication. This is bound to be slow. But, Table variables are mostly “in memory” variables, so, I/O in table variables is bound to be fast.

2. Operating on temporary tables results in lots of disk activities and resource usage because:

But, operating on table variables requires no locking on the resources. Moreover, data insertion on table variable is a lot faster than the temporary table as no disk I/O and inter db communication takes place. Also, the table variable goes out of scope when the corresponding SQL block goes out of scope. So, table variables need not to be dropped. All these make table variable excellent choice for implementing faster TSQLs.

Well, now it's obvious that, Table variables better than the temporary tables most of the cases. But, can you use Table variables in place of Cursors?

Yes you can. Following is an example of processing a result set using a Table variable (The SQL that uses Cursor to process the result set is not included here, cause, I don't like you to learn Cursors..ha ha)

 

--Declare the Table variable 
DECLARE @Elements TABLE
(
        Number INT IDENTITY(1,1), --Auto incrementing Identity column
        ProductName VARCHAR(300) --The string value
)

--Decalre a variable to remember the position of the current delimiter
DECLARE @N INT 

--Decalre a variable to remember the number of rows in the table
DECLARE @Count INT

--Populate the TABLE variable using some logic
INSERT INTO @Elements SELECT Name FROM dbo.Products

--Initialize the looper variable
SET @N = 1

--Determine the number of rows in the Table
SELECT @Count=max(Number) from @Elements

--A variable to hold the currently selected value from the table
DECLARE @CurrentValue varchar(300);

--Loop through until all row processing is done
WHILE @N <= @Count

BEGIN
    --Load current value from the Table
    SELECT @CurrentValue = ProductName FROM @Elements WHERE Number = @N
    --Process the current value
    print @CurrentValue
    --Increment loop counter
    SET @N = @N + 1;
END 

I can bet, you will be surprised to see the performance benefits by replacing the Cursor based codes that you might have written with the Table variable based codes.

Please note that, you still should try not to write TSQLs using the "Procedural approach" (Use of Table variable is still a procedural approach). But, if for some reason you really need to write your own way of processing a result set, you can at least use Table variables to avoid the Cursors.

Have fun writing "Set based" SQLs. Enjoy!

History

Initial version:15th March 2009

Modifications

16th March 2009

17th March 2009

23rd March 2009

You must Sign In to use this message board.
 
 
Per page   
 FirstPrevNext
GeneralGood insights
Donsw
9:11 28 Apr '09  
well written. good job.

cheers,
Donsw
My Recent Article : Ajax Calendar Control

GeneralRe: Good insights
M.M.Al-Farooque Shubho
18:08 28 Apr '09  
Thanks
GeneralGreat article
professordavos
18:17 30 Mar '09  
Thanks for explaining this. I suppose I have learned the correct way as I never make use of cursors or UDF's.

In the example about the employer and manager, wouldn't this return the same name twice? as you are essentially selecting name, name from employee.

Perhaps this table has a Manager_Name column that you intended to refer to? so the resulting query would return name, manager_name from employee.

Its inconsequential and doesn't detract from the very useful info.

Or did I totally miss something there and get it wrong?
GeneralRe: Great article
M.M.Al-Farooque Shubho
18:52 30 Mar '09  
I guess, you got it wrong.

The Employee(ID, Name,MgrID) table is a "Self referential" table which contains a MgrID column. This MgrID column values actually point to the ID column of the same table (Hence, self referential).

For example, an employee "John", has MgrID = 2. That means, John's manager is someone who has ID 2 (Who is employee "Tom" in the same table).

So, in this case, the mentioned queries would return the following result for "John"

Employee Name--Manager Name
John-----------------Tom
GeneralRe: Great article
professordavos
19:00 30 Mar '09  
Ah thanks for explaining that one.
GeneralRe: Great article
Paul Horstink
21:11 27 Apr '09  
However, this should be done much simpler, with a 'self-refential' join:

SELECT E.Name AS [Employee Name], M.Name AS [Manager Name]
FROM Employee E JOIN Employee M ON M.ID = E.MgrID
GeneralRe: Great article
M.M.Al-Farooque Shubho
21:32 27 Apr '09  
Correct. "Self referential join" is a good way in terms of "simplicity"
General?????
i.j.russell
4:09 23 Mar '09  
The example of set-based that you provide is not set-based but procedural because you are still looping through the data item by item not as a set.
GeneralRe: ?????
M.M.Al-Farooque Shubho
6:14 23 Mar '09  
You are right. Actually, looping through the item is still a procedural approach.

I think I tried to point out how to avoid the "traditional" procedural approach of writing data processing logic using Cursors and UDF's. Hence, using a Table variable is recommended in place of a Cursor.

But, still, I am going to edit the article for correctness. Many thanks for pointing this out.
GeneralRe: ?????
spoodygoon
8:53 23 Mar '09  
I have been lucky enough to study under a great database administrator. I agree while loops are slightly better than cursors but I would have to say loops are only needed about 2% to 5% of the time if you are careful. But you probably already know that.

I'm going to have to say the inclusion of the while loops distract from a really good article. You might consider removing the while loop section to keep the focus on sets, it's not easy but if you ignore that loops are there you can come up with some really great ideas.

But I still say its a good article.
GeneralRe: ?????
M.M.Al-Farooque Shubho
20:09 23 Mar '09  
I've modified the article a bit. Thanks
GeneralGreat
deno0
0:54 18 Mar '09  
A very useful article. Thank you.
GeneralRe: Great
M.M.Al-Farooque(Shubho)
1:23 18 Mar '09  
You are welcome


Last Updated 24 Mar 2009 | Advertise | Privacy | Terms of Use | Copyright © CodeProject, 1999-2010