![]() |
Database »
Database »
SQL Server
Intermediate
License: The Code Project Open License (CPOL)
Understanding “Set based” and “Procedural” approaches in SQLBy M.M.Al-Farooque ShubhoExploring the set based and procedural approaches to write better SQL in SQL server |
SQL, Windows, .NET, ASP.NET, SQL Server, ADO.NET, Architect, DBA, Dev
|
||||||||||||
|
Advanced Search Add to IE Search |
|
|
|
||||||||||||||||
OK, you know that you should write SQLs using “Set based approaches”, rather than “Procedural approaches” while developing data access routines. But, what these two approaches really are? why Procedural approach is better? I searched for the answer on the web, and, did not find enough information to satisfy myself (Though, I found lots of technical explanations that somehow touch this issue, but, not explain it fully). So, I myself tried to explore and understand the actual "knowhow" and differences between the two, and to know the reasons why “Set based approaches” are better.
In this article, I would like to share my ideas and findings with you, using some simple observations and explanations. Along with, I will present some simple techniques to convert “Procedural SQL’s” into “Set based SQL’s” that might help you writing better SQLs :).
Please be noted that, I am writing this article having the SQL Server in my mind. But, the principle is mostly true for any database server implementation.
The article may not be published in any where else, except the author's permission. The codes can be re-used any where for learning purpose.
What is a “procedural approach”?
Well, simply speking, procedural approach is actually the “programmatic approach” that we are used to work with, in our daily programming life. In this approach, we tell the system “what to do” along with telling “how to do” it. We query the database to obtain result set and we write the data operational and manipulation logics using looping, conditional and processing statements to produce the final result. The runtime does whatever we want it to do, however we want it to do.
In TSQL, any User Defined Function (UDF), or Cursor that executes on a result set row by row is a procedural approach. For example, if you are quering your database to obtain a result set and using a Cursor to navigate the result set to do further processings row by row, you are using a procedural approach. Also, if you are using a User Defined Function in your SQL for processing each row in the result set to calculate a scaler output, you are using a procedural approach. Looking for example codes? Continuing reading..
What is a “Set based approach”?
Set based approach is actually an approach which lets you specify “what to do”, but, does not let you specify “how to do”. That is, you just specify your requirement for a processed result that has to be obtained from a “set of data” (Be it a simple table/view, or, joins of tables/views), filtered by optional condition(s). Sometimes, the specification of the “sets” you like to retrieve data from may be obtained by using complex joins/SubQuery/conditional case statements, but, at the end, there is a set of data from which the resultant set has to be obtained. You never have to specify “how” the data retrieval operation has to be implemented internally. You never have to specify how to implement the “joining operation” internally either. Also, you don’t have to specify how to apply filter condition against the rows. The database engine determines the best possible algorithms or processing logics to do these.
For example, following is an SQL that is written in a "Set based approach"
SELECT ProductName,CategoryName
FROM Product INNER JOIN Category
ON Product.CategoryID = Category.ID
WHERE Price > 100
In the above SQL, the "Product INNER JOIN Caterogy ON Product.CategoryID = Category.CategoryID" is the "set" of data from where ProductName and CategoryName has to be displayed for those products which have Price > 100. You just specify your requirements and conditions, the SQL engine does the rest to produce the result.
Why “Set based approaches” are better than the “Procedural approaches”?
Frankly speaking, the internal execution engines of databases are designed and optimized for taking “set based instructions” as input (SQLs) and executing these instructions in the best possible way (that varies based on lots of criteria) to produce output. That's why, "Set based approaches" are almost always better options.
Let us try to understand this with an example.
We often need to do joins between two or more sets of data (Say, two tables), to obtain a desired result set. While joining, we only specify the perticipating table/view names and join conditions. Now, it's the SQL engine which has to implement the actual "joining operation" based upon the join conditions that we provide.
In SQL Server execution engine, there are three types of joining techniques. These are:
Merge join : Most optimized joining algorithm. Takes place when both tables are joined on indexed columns which are sorted. In a merge join operation SQL Server scans once through the sorted inputs and merges the data together
Nested loop join : Moderate cost joining algorithm. Takes place when one of the table in the join clause contains small number of records (Inner table) comparing to the number of records in the other participating table (Outer table). Nested loops scan once through the each inner input row and search for the corresponding row in the outer input
Hash join : Most expensive joining algorithm. Takes place when large, unsorted, non-indexed columns are used in the join condition. The SQL server engine prepares a hash table for the smaller relation, by applying a hash function to the join attribute of each row. Then scan the larger relation and find the relevant rows by looking in the hash table.
Now, whenever we specify any join predicate in any SQL in the SQL Server, based upon the participating columns, data volume, indexing structure and, set of values in the participating columns, the SQL server determines and use the best possible joining algorithm and implement the logic to perform the actual joining operation in memory. Note that, we don’t have to specify the type of joining algorithm in the SQL. SQL server does that, and, does its best to provide the result as fast as possible.
So, that's where the database engine is in love with the "Set based" approaches.
Another important fact is, whatever SQL written in “set based approach” is issued in the database, the query optimizer generates an execution plan first, and then, the execution engine executes the plan to retrieve data from the physical storage and to process output in an efficient manner. That is, there is a single execution plan tree for each single SQL statement be it simple or complex. Executing that single execution plan tree is generally a faster operation.
But, when we specify our own way of processing a result set (That is obtained by executing an SQL already) using another SQL that works on row-by-row manner in the resultset (Perhaps, using a UDF), the database engine has to execute an execution plan for each and every row, even after obtaining the result set by executing an execution plan. Imagine a row-by-row operation that is getting executed for a result set containing 1 million rows. In this case, the initial data retrieval operation would require one execution plan to be executed, and, later, 1 million times another execution plan has to be executed for processing each row. That’s what happens when a User Defined Function (UDF) is executed for each row in a result set. An additional overhead of using UDF is the amount of stack I/O that takes place for invoking the UDF.
On the other hand, if you use a Cursor to process a result set row-by-row, while executing, the Cursor locks the rows in the corresponding table and unlocks rows when processing done. This involves lots of resource usage on the server and in case of large result sets, w severely slows down performance.
Experiment shows that, using a UDF for row-by-row operation involving up to 1000 rows may provide a performance that is within an acceptable range. But, as the number of rows are increased, using UDF would result in dramatically slow performance. The same goes for cursors.
That is why, “set based SQLs” always should outperform the “procedural SQLs”, specially, if the result set to process becomes large.
OK, understood. Now tell me how to get rid of "Procedural approaches"?
It's not hard. Just follow these simple tricks and re-factor some of your SQLs:
Use inline sub queries to replace User Defined Functions.
Let’s assume, for a self-referential table Employee(ID, Name,MgrID), there is a query written in a “Procedural approach” using a User Defined Funciton. The query outputs the Employee names and corresponding Manager’s names.
Here is the query:
SELECT Name AS [Employee Name],dbo.fnGetManagerName(MgrID) as [Manager Name] FROM Employee
Here, the dbo.fnGetManagerName(MgrID) is a UDF that returns the Manager’s Name (Which is nothing but another employee in the same Employee table) as follows:
ALTER FUNCTION [dbo].[fnGetManagerName](@ID int) RETURNS VARCHAR(50) AS
BEGIN
--Declare the variable to hold result
DECLARE @ManagerName varchar(50)
--Determine the Employee name by the given ID
SELECT @ManagerName = Name FROM Employee WHERE ID = @ID
--Return the result
RETURN @ManagerName
END
The above “Procedural SQL” could be re-written using a Sub Query in the “Set based approach” as follows:
SELECT E.Name AS [Employee Name],
(
SELECT Name FROM Employee WHERE ID = E.MgrID
) AS [Manager Name]
FROM Employee E
In one of my projects I worked on, we had a slow performing stored procedure in a moderate sized SQL server 2000 database. The SP used to process around 20,000 records to produce a result set. All we needed to optimize it was replacing a UDF with an inline sub query (Because, all other optimizations were done already). Believe me, that turned down the total execution time from 90 long seconds to 1 second only!
Please note that, the above query could also be written using a self join, that would equally give you faster result.
Try to rewrite your Cursor based TSQLs with Correlated Subqueries
You can rewrite your cursor based code with a correlated subquery.
Generally, you use a cursor to obtain a result set and then process each row one by one to formulate the desired result. This kind of processing could be replaced with a correlated subquery (Most of the cases).
What is a correlated subquery?
A correlated subquery is a subquery where the outer query has to be executed first and then for each row in the outer query, the inner query is executed. That mean, before executing the inner query for a particular row, the outer query has to be processed for that particular row (And hence, the inner query is correlated with the outer one).
Take a look at the following query again (You’ve seen this already). This is a correlated subquery.
SELECT E.Name AS [Employee Name],
(
SELECT Name FROM Employee WHERE ID = E.MgrID
) AS [Manager Name]
FROM Employee E
Let’s break this query. The outer query is:
SELECT E.Name as [Employee Name] FROM Employee E
And, the inner query is:
SELECT Name FROM Employee WHERE ID = E.MgrID
Note the WHERE clause in the inner query (ID = E.MgrID) . In order to execute this inner query, the query processing engine needs the E.MgrID to be available already. So, to determine each row in the overall result set, the query processing processes the query in the following way:
1. Obtain the “Name” column value (As Employee Name) and “MgrID” column value from the Employee table.
2. Obtain the “Name” column value (As Manager Name) from the Employee table where ID = MgrID.
So, as you can see, to determine each row in the result set, the SQL execution engine has to execute two different SQLs. But, execution of this query is far more optimized comparing to the UDF and Cursor based query because, in the SubQuery way, the SQL Server decides the optimized and best way to implement the inner query in its execution plan (Like deciding the best algorithm for implementation of a joining) and hence, the query executes faster (Set based approach).
However, if you already have some complex processing logics implemented using a Cursor that is executed for each row in the result set, and, if you think that implementing the same logic using a set based approach is hard or near to impossible, you can follow this approach:
Instead of a cursor, use a table variable to process the result set.
What is a "Table" variable?
In TSQL (Since SQL Server 2000), a “Table” variable is a special kind of variable that resembles more or less an actual table. But, most important thing about “Table” variable is, it resides in memory almost 100% time (Unless the Table variable itself is too large, in this case, the Table variable could reside in the tempdb database).
Use of table variable is efficient (most of the times, In terms of memory and execution time) comparing to the temporary tables because of the following reasons:
1. Temporary tables resides in the tempdb database, and, operating on the temporary tables results in inter-db communication. This is bound to be slow. But, Table variables are mostly “in memory” variables, so, I/O in table variables is bound to be fast.
2. Operating on temporary tables results in lots of disk activities and resource usage because:
But, operating on table variables requires no locking on the resources. Moreover, data insertion on table variable is a lot faster than the temporary table as no disk I/O and inter db communication takes place. Also, the table variable goes out of scope when the corresponding SQL block goes out of scope. So, table variables need not to be dropped. All these make table variable excellent choice for implementing faster TSQLs.
Well, now it's obvious that, Table variables better than the temporary tables most of the cases. But, can you use Table variables in place of Cursors?
Yes you can. Following is an example of processing a result set using a Table variable (The SQL that uses Cursor to process the result set is not included here, cause, I don't like you to learn Cursors..ha ha)
--Declare the Table variable
DECLARE @Elements TABLE
(
Number INT IDENTITY(1,1), --Auto incrementing Identity column
ProductName VARCHAR(300) --The string value
)
--Decalre a variable to remember the position of the current delimiter
DECLARE @N INT
--Decalre a variable to remember the number of rows in the table
DECLARE @Count INT
--Populate the TABLE variable using some logic
INSERT INTO @Elements SELECT Name FROM dbo.Products
--Initialize the looper variable
SET @N = 1
--Determine the number of rows in the Table
SELECT @Count=max(Number) from @Elements
--A variable to hold the currently selected value from the table
DECLARE @CurrentValue varchar(300);
--Loop through until all row processing is done
WHILE @N <= @Count
BEGIN
--Load current value from the Table
SELECT @CurrentValue = ProductName FROM @Elements WHERE Number = @N
--Process the current value
print @CurrentValue
--Increment loop counter
SET @N = @N + 1;
END
I can bet, you will be surprised to see the performance benefits by replacing the Cursor based codes that you might have written with the Table variable based codes.
Please note that, you still should try not to write TSQLs using the "Procedural approach" (Use of Table variable is still a procedural approach). But, if for some reason you really need to write your own way of processing a result set, you can at least use Table variables to avoid the Cursors.
Have fun writing "Set based" SQLs. Enjoy!
Initial version:15th March 2009
Modifications
16th March 2009
17th March 2009
23rd March 2009
| You must Sign In to use this message board. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
General
News
Question
Answer
Joke
Rant
Admin
|
PermaLink |
Privacy |
Terms of Use
Last Updated: 23 Mar 2009 Editor: |
Copyright 2009 by M.M.Al-Farooque Shubho Everything else Copyright © CodeProject, 1999-2009 Web18 | Advertise on the Code Project |