Click here to Skip to main content
15,885,537 members
Articles / All Topics
Technical Blog

The Many Uses of ROW_NUMBER

Rate me:
Please Sign up or sign in to vote.
4.00/5 (2 votes)
14 Jun 2013CPOL3 min read 12.3K   6  
The many uses of ROW_NUMBER.

Without a doubt one of the most used SQL functions lately for me has been ROW_NUMBER. My two main uses for it?

  1. Providing a Ranking or an Order of Preference to rows.
  2. Identifying and Removing Duplicate Rows.
  3. Paginating results.

1. Providing a Ranking or an Order of Preference to Rows

The first thing to say about ROW_NUMBER is its basic syntax:

SQL
ROW_NUMBER() OVER (PARTITION BY Department
    ORDER BY SUM(SaleTotal) DESC) AS SalesRankingInDepartment

The PARTITION BY clause is entirely optional and can be used just to create sub-sets within the dataset based on the values in the column, whereas the ORDER BY defines the ordering we apply in order to generate the row numbers.

SQL
-- First work out the sales ranking for each employee based on their total sales
with salesCte AS
(
    SELECT emp.EmployeeName
        , emp.DepartmentId
        , SUM(s.SaleTotal) AS TotalSales
        -- We are looking to do this by department, so we want to partition
        -- by the employees department id
        , ROW_NUMBER() OVER (PARTITION BY emp.DepartmentId 
        -- And since the aim of the query is to get the top seller for each
        -- department, the ordering is based on the total sales for the employee
        -- in descending order
                ORDER BY SUM(s.SaleTotal) DESC) AS SalesRankingInDepartment
 
    FROM Employee emp
    LEFT OUTER JOIN Sales s
        ON emp.EmployeeId = s.EmployeeId
 
    GROUP BY emp.EmployeeName, emp.DepartmentId
)
 
SELECT dep.DepartmentName
    , sales.EmployeeName AS 'Department Best Seller'
    -- In the unfortunate case the department has no sales,
    -- the left outer join will still retrieve the department
    -- and in this case it's entirely acceptable to just leave
    -- the values as null
    , sales.TotalSales AS EmployeeTotalSales
 
FROM Department dep
LEFT OUTER JOIN salesCte sales
    ON sales.DepartmentId = dep.DepartmentId
-- We only want to see the top seller, so we only want 
-- the records with a rownumber of 1
WHERE SalesRankingInDepartment = 1
-- Or due to the left outer join, where the department 
-- has no top seller. i.e, they have made no sales!
    OR SalesRankingInDepartment IS NULL
ORDER BY TotalSales DESC

Now imagine you are wanting to get the number of sales each employee made, but for those times when two employees in a department have the same number of sales you want to control who gets the credit as the department’s top seller.. By default ROW_NUMBER will just number them based on the first record it comes to, this usually isn’t good enough so to look at making it more predictable we might want to prefer certain employee types, let’s say Managers, and let’s assume the database has no ordinal anywhere associated with each employee type so we have to do it ad-hoc in the query:

SQL
-- Declare temporary tables that contain the ordinals for each
-- different type of employee
DECLARE @employeePositions TABLE (EmployeePosition varchar(15), Ordinal int)
INSERT INTO @employeePositions VALUES ('Manager', 1)
INSERT INTO @employeePositions VALUES ('Associate', 2);
 
with salesCte AS
(
    SELECT emp.EmployeeName
        , emp.DepartmentId
        , COUNT(s.SaleTotal) AS SalesCount
        , ROW_NUMBER() OVER (PARTITION BY emp.DepartmentId 
        -- The second order by column here will specify that we want to prefer
        -- managers over associates when they have equal sales counts
                ORDER BY COUNT(s.SaleTotal) DESC
                    , pos.Ordinal) AS SalesCountRankingInDepartment
 
    FROM Employee emp
    LEFT OUTER JOIN Sales s
        ON emp.EmployeeId = s.EmployeeId
    LEFT OUTER JOIN @employeePositions pos
        ON pos.EmployeePosition = emp.EmployeePosition
 
    GROUP BY emp.EmployeeName, emp.DepartmentId
)
 
SELECT dep.DepartmentName
    , sales.EmployeeName AS 'Department Best Seller'
    , sales.SalesCount AS EmployeeSalesCount
 
FROM Department dep
LEFT OUTER JOIN salesCte sales
    ON sales.DepartmentId = dep.DepartmentId
WHERE SalesCountRankingInDepartment = 1
    OR SalesCountRankingInDepartment IS NULL
ORDER BY EmployeeSalesCount DESC

The use of the temporary table at the beginning of the script is to allow an ordinal/preference to be applied to each employee type.

2. Identifying and Removing Duplicate Rows

When duplicates creep in to a table, for whatever reason, while they are very easy to identify, they can sometimes be very time consuming to remove, for example let’s consider a table to hold addresses. You might identify duplicates using the COUNT function:

SQL
SELECT COUNT(*) AS [AddressCount]
        ,[Name]
        ,[HouseNumber]
        ,[HouseName]
        ,[Line1]
        ,[Line2]
        ,[City]
        ,[County]
        ,[PostCode]
FROM Addresses
GROUP BY [Name]
    ,[HouseNumber]
    ,[HouseName]
    ,[Line1]
    ,[Line2]
    ,[City]
    ,[County]
    ,[PostCode]
HAVING COUNT(*) > 1

But knowing that there are duplicates is only half the problem, they still need to be fixed.

Before we can just delete the duplicates however, it’s probably best to try and fix any tables referencing the rows using foreign keys! This script does just that.

SQL
-- If the temp table exists drop it before continuing with the script
IF OBJECT_ID('tempdb..#addressIdChanges') IS NOT NULL
    DROP TABLE #addressIdChanges;
 
-- Get the duplicate addresses, ranked arbitrarily using ROW_NUMBER()
WITH addressDuplicatesCte AS 
    (
    SELECT [ID]
        ,[Name]
        ,[HouseNumber]
        ,[HouseName]
        ,[Line1]
        ,[Line2]
        ,[City]
        ,[County]
        ,[PostCode]
        , ROW_NUMBER() OVER (PARTITION BY [Name]
                        ,[HouseNumber]
                        ,[HouseName]
                        ,[Line1]
                        ,[Line2]
                        ,[City]
                        ,[County]
                        ,[PostCode]
                ORDER BY [HouseNumber]) AS [RowNum]
    FROM Addresses
    )
-- Add all duplicates (ignoring parent-less records) to a temp table
-- with the duplicate IDs in one column and an arbitrary ID being elected as the new
-- 'none-duplicate' key
SELECT ID AS OldID
    , (
        SELECT ID 
        FROM addressDuplicatesCte d 
        WHERE d.HouseNumber = dupes.HouseNumber 
            AND d.PostCode = dupes.PostCode 
            AND d.RowNum = 1
        ) AS NonDupeID
INTO #addressIdChanges
FROM addressDuplicatesCte dupes
WHERE RowNum > 1
 
-- All foreign keys need to be fixed before we delete the duplicate records
UPDATE Orders
SET AddressID = #addressIdChanges.NonDupeID
FROM Orders
INNER JOIN #addressIdChanges
    ON Orders.AddressID = #addressIdChanges.OldID
 
-- And finally, the duplicate records can be deleted
DELETE
FROM Addresses
WHERE ID IN (SELECT OldID FROM #addressIdChanges)
 
DROP TABLE #addressIdChanges

A few notes on this script:

  • Duplicate rows won’t always be exact copies, it depends on the application domain. In this instance UK addresses can be uniquely identified from their HouseNumber, AddressLine1 (only useful for flats etc) and PostCode since these will correspond to one property only.
  • You can’t use the result of ROW_NUMBER() in either a WHERE or HAVING clause, which is why I’ve wrapped it up in a CTE.
  • I opted to use a CTE purely for readability, it’s not really needed but it reduces the annoyance of reading large nested sub-queries!
  • The reason for dropping the temp table at the start is for those times when the script has been edited and fails to run.
    • If the temp table hasn’t been dropped, the query will error complaining the temp table already exists!
  • This script doesn’t remove parent-less records, this is deliberate (The reason of this script is purely to remove duplicates!)

3. Paginating Results

This one would probably be the most common use case for ROW_NUMBER. Sometimes you might have hundreds of thousands of rows, and almost never will you want to show all of these at one point and so there’s little to no point in returning them all at once, this is where pagination comes in.

Pagination is basically splitting datasets into pages in a predictable way in which row numbers can be applied.

SQL
CREATE PROCEDURE [Orders_Fetch]
(
    @customerId UNIQUEIDENTIFIER
    , @pageNumber INT
    , @pageSize INT = 10
)
AS
-- we need to work out the first and last record index
-- Assuming the page numbers will be passed in starting from 1
-- the pages need to be 0 indexed
DECLARE @firstRecord INT = @pageSize * (@pageNumber - 1)
    -- The BETWEEN operator is an inclusive operator, which means
    -- the last record needs to be reduced by 1
    , @lastRecord = (@pageSize * @pageNumber) - 1;

    WITH ordersCTE AS
    (
        SELECT OrderDate
            , OrderId
            -- There's no nee to partition here, since we are only interested
            -- in applying the row numbers accross the entire data set as a whole
            , ROW_NUMBER() OVER (ORDER BY OrderDate DESC) AS OrderNumber
        FROM Orders
        WHERE CustomerId = @customerId
    )
    SELECT *
    FROM ordersCTE
    WHERE OrderNumber BETWEEN @firstRecord AND @lastRecord

Links

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Comments and Discussions

 
-- There are no messages in this forum --