SQL for Developers: Data Crunching

Tim Corey

Rate me:

4.61/5 (17 votes)

24 Jan 2011CPOL16 min read

32.6K

420

The developer's guide to getting the right information out of mountains of data. Learn how to boil down the data into useful information while being confident that you haven't left any data behind.

Introduction

In my last article on Basic Data Retrieval, I shared that sometimes software developers are called upon to retrieve data from SQL databases. These articles are designed to be an aid to those developers. This particular article focuses on crunching the data. Having a million points of data does a company no good if they don't know what it means. This guide will help you understand how to process these large amounts of data while ensuring that you aren't getting bad data. The worst thing you can do when querying large amounts of data is to be off by a little bit. You probably won't notice it until it is too late. This article will help you know how to be sure you will get the right information before you even run the query.

Notes

I have developed this article while using the AdventureWorks database provided by Microsoft. All of the following queries will work against that database. I have provided a tool that will run each query for you. Both the executable and the source code for the T-SQL Test Platform are provided to you. You can run my example queries directly or you can use this tool to test your own queries. I have tested everything against a Microsoft SQL Server 2008 R2, but I believe all of the included commands should work against even Microsoft SQL Server 2005 unless stated otherwise.

As part of my usual disclaimer, I want to state that it is outside the scope of this article to discuss the merits and drawbacks of calling SQL queries from code versus executing stored procedures on the server. It is my opinion that if you have the ability to do so, writing stored procedures is the way to go but that is not always an option. Either way, this article is not the place for that discussion.

Now that we have all of the administrative stuff out of the way, on to the good stuff.

Processing Data with SQL Functions

There are many functions that can be performed on a set of data. The ones that are well known include SUM, COUNT, AVG, etc. While these functions can be simple to understand, there are a few things that need to be understood about how these functions work so that you do not make a costly assumption that turns out to be faulty.

The COUNT operator is used quite often to figure out how many rows (or items) are in a particular set of data. For example, if we wanted to know how many records were in the Person.Contact table, we could run the following query:

SQL

SELECT COUNT(*) AS RecordCount
FROM Person.Contact

RecordCount
-----------
19972

(1 row(s) affected)

Perfect. With one simple query, I know how many rows are in my table. Before we dive deeper into what COUNT can do, I want to step back to look at this particular query. This method of getting the number of records in a table is a very common one. However, there is an alternative that is much faster. If you need to find the number of records in a particular table, use the following query:

SQL

SELECT rows AS RecordCount
FROM sysindexes
WHERE id = OBJECT_ID('Person.Contact') AND indid < 2

RecordCount
-----------
19972

(1 row(s) affected)

The speed difference between these two queries is incredible. On my machine, the COUNT(*) method takes 25 times as long as this method. That is a big deal difference, especially if you are executing the statement multiple times a day. This sysindexes table stores the information for all of the indexes in each table. The first Index ID (indid) indicates the Primary Key, which always indexes every single record in the table. Thus, we can pull the number of records it has indexed and find out how many records are in the table.

Getting back to the COUNT statement, we can also use COUNT on a specific column. This would seem like a better way to count records since it would only capture one column instead of the dreaded star select. We all know we shouldn't do SELECT * so it would make sense that we don't use COUNT(*) either, right? Actually no. Using COUNT against one column instead of star actually has a different purpose. Using COUNT with a column name counts the non-NULL entries in that column. If you use COUNT against a column that does not allow NULLs, you would not notice a difference in the resulting number. However, observe the following example:

SQL

SELECT COUNT(MiddleName) AS RecordCount
FROM Person.Contact

RecordCount
-----------
11473
Warning: Null value is eliminated by an aggregate or other SET operation.

(1 row(s) affected)

Notice that since I am running this query in the SQL Query window, it actually warned me of this little gem. If you were running this command in your code, you won't get the nice warning message to prompt you to consider what you are doing. The end result of this query is that we are "missing" about 8,500 records. That is a big deal. What can be even worse, however, is if you were missing only a few. You would probably notice if 40% of your records were not present but would you notice if 0.1% of them were missing? This is a prime example of why you need to know what every command does before using it.

I'm sure you are now wondering about the other SQL function statements such as SUM and AVG. The fact is that they work the same way. When you look at just the COUNT function, your first instinct is probably that this is a bug. However, when you look at this in light of SUM (and the other functions), it makes a lot of sense. NULL is not a number. In fact, NULL is the absence of anything. It is a vacuum. It does not mean the same thing as zero or empty string. It means "the absence of a value". Therefore, it should not be included in an average. Instead, those rows are skipped. If that isn't to your liking, the next section will tell you how to convert NULL values into default values.

Translating NULL Values with COALESCE

There will be times when you want to deal with NULL values as specific values. For example, if a student is required to turn in three assignments for a grade, you might want to treat any missing assignments as zeros after the assignment date comes due. You may not want to put a zero in their record, since this would indicate a completed assignment that received no credit (which is different than a missing assignment). When querying the data, you can use the COALESCE function to convert these entries to a default value. For example, in our above statement on using the COUNT(MiddleName) method for counting our rows, we found that there were a bunch of middle names that had NULL values. For the sake of demonstration (since this is just to show how to use the COALESCE keyword), we could change the query like so:

SQL

SELECT COUNT(COALESCE(MiddleName,'')) AS RecordCount
FROM Person.Contact

RecordCount
-----------
19972

(1 row(s) affected)

Notice that we now have the total record count again. That is because we used the COALESCE keyword and converted all of the NULL entries in the MiddleName column to be empty strings instead. If we were to use this to convert grades, we would have put a zero instead of an empty string. COALESCE can be a very useful tool as well when the system you are sending the results to does not play well with NULL values. You could use COALESCE on every column that allowed NULL values so that the resulting recordset was NULL-free.

The COALESCE operator can be used for even more interesting applications than just removing NULL values. As we will see below, we can use it to put new values into a column based upon what the NULL value means. We could even use COALESCE to combine two (or more) columns into one if we knew that only one column would have a value (or we only wanted the first value we found).

Collapsing Data with GROUP BY

One of the most common ways to get meaningful information out of a dataset is to collapse it. For example, we don't look at the results of every free throw a basketball player attempts. Instead, we look at the percentage that the player makes. When looking at multiple players inside of one table, you would use a GROUP BY statement to group each player's records together and then count how many shots were made and divide that by the total shots attempted. This would give us our percentage for each player.

A GROUP BY statement groups the records by each column specified in the GROUP BY statement. It can then perform aggregate functions against those grouped records. For example, here I have grouped the Contact records by Email Promotion and then counted how many records are in each group:

SQL

SELECT EmailPromotion, COUNT(*) AS RecordCount
FROM Person.Contact
GROUP BY EmailPromotion

EmailPromotion RecordCount
-------------- -----------
0              11158
1              5044
2              3770

(3 row(s) affected)

This query would be useful for finding out how many people were in each email promotion. While this is a simple use of GROUP BY, it does highlight the basic structure we need to use to form this command properly. When using GROUP BY, you need to include every column that is not an aggregate column in the GROUP BY statement. Each column that you put in this statement will be part of what is grouped on. Therefore, if you put just last name in this section, it will group on each last name. There might be 15 Smiths in your database. If you put first name and last name in the GROUP BY section, it would group by the combination of the two so there might be 12 entries with a last name of Smith. It is rare that you would put a large number of columns in the GROUP BY section since that will make the number of matches very few. The one exception would be if you were looking for duplicates. For example, if you wanted to create user names by combining the first and last name, you might first run a query to see how many duplicates you would have. You would do this by putting both the first and last name columns in the GROUP BY and doing a count of the records. You could then sort by the count descending to find the maximum number of duplicates for any given name. Here is an example of this type of query statement:

SQL

SELECT FirstName,LastName, COUNT(*) AS Duplicates
FROM Person.Contact
GROUP BY FirstName,LastName
ORDER BY Duplicates DESC

FirstName                                          LastName             Duplicates
------------------------------ ---------------------------------------  ----------
Laura                                              Norman                  5
Kim                                                Ralls                   4
Jean                                               Trenary                 4
Sheela                                             Word                    4
â€¦
(19516 row(s) affected)

Notice that we used the ORDER BY to sort descending so that we could have our duplicates on top. This query returned a lot of records, most of which were not duplicates. We could solve this by limiting our query using the TOP command but that wouldn't really give us what we want. Say, for example, we wanted to only find the records that had four or more duplicates. Logic would say that we would use a WHERE statement. The problem is that the WHERE statement operates on each record, not on the aggregate data. I've seen some people solve this by nesting the query inside of another that has the WHERE statement. It works but it is unnecessary. Instead, we should use the HAVING keyword. This keyword performs the filtering action on the resultant rows, not on the original rows. Here is an example of how we could use this in our query:

SQL

SELECT FirstName,LastName, COUNT(*) AS Duplicates
FROM Person.Contact
WHERE LastName = 'Miller' OR LastName = 'Martin'
GROUP BY FirstName,LastName
HAVING COUNT(*) > 3
ORDER BY Duplicates DESC

FirstName                                          LastName            Duplicates
------------------------------- -------------------------------------- -----------
Benjamin                                           Martin                 4
Mindy                                              Martin                 4
Dylan                                              Miller                 4
Frank                                              Miller                 4

(4 row(s) affected)

Notice that the HAVING statement does not recognize the alias from the SELECT statement so we need to put the expression in the HAVING statement as well. The reality is we wouldn't even need to have the expression in the SELECT statement if we didn't want to. For example, we might want only those who have four or more duplicates but we don't care about how many records they have. In that case, we would omit the duplicates column but would keep the statement in the HAVING section. Also note that I included a WHERE statement just to show it can be done. Remember, WHERE operates against each individual record while HAVING operates against the resultant row data.

Getting Sub-totals using the WITH ROLLUP Command

Using the GROUP BY statement starts you down the path of being able to summarize your data. Once you have a firm understanding of using GROUP BY, this next command makes life even better. The WITH ROLLUP command will allow you to get sub-totals inside of your query. For example, what if you had a bunch of purchase orders that had multiple items on each order. Your boss comes to you and asks for a report of what the total per purchase order was but you also need to provide a summary of how much was spent on each product on each purchase order (assuming there could be multiple lines on a purchase order that could reference the same product). One way to do this would be to do a GROUP BY on the purchase order number and the product number. You could do a SUM on the product cost column. That would give you the total per item per purchase order but it wouldn't give you the total for each purchase order. This is where the WITH ROLLUP command comes in. It will give you a sub-total per section and an overall total as well. Let's see what this would look like:

SQL

SELECT PurchaseOrderID ,
        ProductID ,
        SUM(LineTotal) AS Total
FROM Purchasing.PurchaseOrderDetail
WHERE PurchaseOrderID < 5
GROUP BY PurchaseOrderID,ProductID WITH ROLLUP

PurchaseOrderID ProductID   Total
--------------- ----------- ---------------------
1               1           201.04
1               NULL        201.04
2               359         135.36
2               360         136.7415
2               NULL        272.1015
3               530         8847.30
3               NULL        8847.30
4               4           171.0765
4               NULL        171.0765
NULL            NULL        9491.518

(10 row(s) affected)

If we ran this statement without the WITH ROLLUP command, it would have provided us with a similar list but the rows that contain NULL values would not be included. This is a simplistic example, but it shows how the WITH ROLLUP command works. Each row that contains a NULL value is a section summary. Look at the fifth row (PurchaseOrderID 2 ProductID NULL). This is a summary of the entire purchase order. Thus, the Total column is the SUM of the previous two sub-totals. At the bottom, the row that has NULL in both of the first two columns is the grand total row. The value in the Total column is the SUM of the entire query. As you can see, this can be a valuable command to keep in mind when you are looking to get aggregate information. You can use it with any of the commands like AVG, SUM, COUNT, etc. Please note that column order is important in the GROUP BY statement. I put my PurchaseOrderID column first so that I could get totals for each Purchase Order. If I had put ProductID first, I would have gotten totals for each product, which probably isn't what I want.

Advanced Sub-totals using the WITH CUBE Command

Once you understand the WITH ROLLUP command, you can go even further using the WITH CUBE command. This command allows you to get a summary each way. For instance, in our above example, we wanted the total per purchase order and the sum of each item under each purchase order. However, what if we wanted to also see how much we sold of each product as well. If you already have your query set up with the WITH ROLLUP command, just change it to be WITH CUBE instead. Here is an example:

SQL

SELECT PurchaseOrderID ,
        ProductID ,
        SUM(LineTotal) AS Total
FROM Purchasing.PurchaseOrderDetail
WHERE PurchaseOrderID < 5
GROUP BY PurchaseOrderID,ProductID WITH CUBE

PurchaseOrderID ProductID   Total
--------------- ----------- ---------------------
1               1           201.04
NULL            1           201.04
4               4           171.0765
NULL            4           171.0765
2               359         135.36
NULL            359         135.36
2               360         136.7415
NULL            360         136.7415
3               530         8847.30
NULL            530         8847.30
NULL            NULL        9491.518
1               NULL        201.04
2               NULL        272.1015
3               NULL        8847.30
4               NULL        171.0765

(15 row(s) affected)

If you compare this to the previous query results, you will see five extra rows. We have the sub total for each Product ID in addition to what we already had. The order is a bit different, where the totals per ProductID are intermixed in our list and the totals per purchase order are put at the bottom. If order is important to you, you can change the order given in the GROUP BY section since it won't affect the data given, just the order it is displayed in (unlike the WITH ROLLUP, where order is important). The WITH CUBE command allows us to get summary data for each column. Use this with care, however, since it can create a lot of overhead if you try to do too many columns in the GROUP BY section.

Identifying Summary Rows with the GROUPING Keyword

Once you start to become comfortable with the use of WITH ROLLUP and WITH CUBE, you might start to wonder if there is a way to identify the rows that are summary rows other than by the NULL value. SQL comes equipped with the GROUPING keyword that allows us to identify each summary row that the WITH ROLLUP or WITH CUBE command generates. Here is an example of how to use it:

SQL

SELECT PurchaseOrderID ,
        ProductID ,
        SUM(LineTotal) AS Total ,
        GROUPING(PurchaseOrderID) AS PurchaseOrderGrouping,
        GROUPING(ProductID) AS ProductGrouping
FROM Purchasing.PurchaseOrderDetail
WHERE PurchaseOrderID < 5
GROUP BY PurchaseOrderID,ProductID WITH CUBE

PurchaseOrderID ProductID   Total                 PurchaseOrderGrouping ProductGrouping
--------------- ----------- --------------------- --------------------- ---------------
1               1           201.04                0                     0
NULL            1           201.04                1                     0
4               4           171.0765              0                     0
NULL            4           171.0765              1                     0
2               359         135.36                0                     0
NULL            359         135.36                1                     0
2               360         136.7415              0                     0
NULL            360         136.7415              1                     0
3               530         8847.30               0                     0
NULL            530         8847.30               1                     0
NULL            NULL        9491.518              1                     1
1               NULL        201.04                0                     1
2               NULL        272.1015              0                     1
3               NULL        8847.30               0                     1
4               NULL        171.0765              0                     1

(15 row(s) affected)

As you can see, the GROUPING command will put a 1 in the column if that particular item is a sub-total or total. It will put a 0 if the record is one that was created by the GROUP BY statement. I used an alias to name these two new columns for easier reference. Note that you can now filter this query based upon these values if you use the HAVING statement.

Identifying Summary Rows with the COALESCE Keyword

We have already identified how the COALESCE command works. As we know, COALESCE gives us the first value that is not null in the list of values it has inside it. In the earlier examples, we used COALESCE to simply scrub out a NULL value and replace it with a constant as a method of providing a default. However, we can do much more advanced statements inside of our COALESCE. Let me show you what I mean via example and then I will explain what I did.

SQL

SELECT COALESCE(CAST(PurchaseOrderID AS NVARCHAR(10)),_
'Product Total: ' + CAST(ProductID AS NVARCHAR(10)), 'Grand Total') AS PurchaseOrder,
        COALESCE(CAST(ProductID AS NVARCHAR(10)),'PO Total: ' + _
        CAST(PurchaseOrderID AS NVARCHAR(10)), '') AS Product ,
        SUM(LineTotal) AS Total
FROM Purchasing.PurchaseOrderDetail
WHERE PurchaseOrderID < 5
GROUP BY PurchaseOrderID,ProductID WITH CUBE

PurchaseOrder             Product              Total
------------------------- -------------------- ---------------------
1                         1                    201.04
Product Total: 1          1                    201.04
4                         4                    171.0765
Product Total: 4          4                    171.0765
2                         359                  135.36
Product Total: 359        359                  135.36
2                         360                  136.7415
Product Total: 360        360                  136.7415
3                         530                  8847.30
Product Total: 530        530                  8847.30
Grand Total                                    9491.518
1                         PO Total: 1          201.04
2                         PO Total: 2          272.1015
3                         PO Total: 3          8847.30
4                         PO Total: 4          171.0765

(15 row(s) affected)

The first thing you will note is that this is a bit complicated. Because the columns are of type int, I had to CAST the value to type nvarchar or the statement would have thrown an error. This makes for an ugly query but the performance is no different than the statement where we used the GROUPING keyword (I checked it multiple times because I didn't believe it).

Getting beyond the fact that it looks scary, there isn't much to this query. Basically, I use a COALESCE statement with three options. The first option is the field itself. However, if that is NULL, we know that we are getting a total for the other column. In that case, I combine a static string with the other column value to get a total line. However, if the other column is NULL too, that means that we are on the grand total line. In this case, I alternated what I did based upon which column I was in. For the first column, I marked it as the grand total line. The second column I just put an empty string since the line was already designated as a grand total line. The only real complex thing to remember here is that if you find a NULL value in one column, it means that the other column is the one being totaled (not the current column).

Conclusion

In this article, we have learned how to compact and aggregate data into meaningful information. We covered a lot of information centered around the GROUP BY statement and how this can be used effectively beyond just the simple methods we normally see. There is a lot more that I have not yet covered about SQL. My plan is to keep addressing the needs of the software developer when it comes to writing T-SQL. I hope you have found something of interest in this article. I appreciate your constructive feedback and I look forward to your thoughts on what can be done to improve this article.

History

January 22, 2011: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Tim Corey

Software Developer (Senior) DeGarmo

United States

I am currently a Senior Software Developer at a company in Illinois called DeGarmo. My primary skills are in .NET, SQL, JavaScript, and other web technologies although I have worked with PowerShell, C, and Java as well.

In my previous positions, I have worked as a lead developer, professor and IT Director. As such, I have been able to develop software on a number of different types of systems and I have learned how to correctly oversee the overall direction of technology for an organization. I've developed applications for everything from machine automation to complete ERP systems.

I enjoy taking hard subjects and making them easy to understand for people unfamiliar with the topic.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.