Without a doubt one of the most used SQL functions lately for me has been ROW_NUMBER
. My two main uses for it?
- Providing a Ranking or an Order of Preference to rows.
- Identifying and Removing Duplicate Rows.
- Paginating results.
1. Providing a Ranking or an Order of Preference to Rows
The first thing to say about ROW_NUMBER
is its basic syntax:
ROW_NUMBER() OVER (PARTITION BY Department
ORDER BY SUM(SaleTotal) DESC) AS SalesRankingInDepartment
The PARTITION BY
clause is entirely optional and can be used just to create sub-sets within the dataset based on the values in the column, whereas the ORDER BY
defines the ordering we apply in order to generate the row numbers.
with salesCte AS
(
SELECT emp.EmployeeName
, emp.DepartmentId
, SUM(s.SaleTotal) AS TotalSales
, ROW_NUMBER() OVER (PARTITION BY emp.DepartmentId
ORDER BY SUM(s.SaleTotal) DESC) AS SalesRankingInDepartment
FROM Employee emp
LEFT OUTER JOIN Sales s
ON emp.EmployeeId = s.EmployeeId
GROUP BY emp.EmployeeName, emp.DepartmentId
)
SELECT dep.DepartmentName
, sales.EmployeeName AS 'Department Best Seller'
, sales.TotalSales AS EmployeeTotalSales
FROM Department dep
LEFT OUTER JOIN salesCte sales
ON sales.DepartmentId = dep.DepartmentId
WHERE SalesRankingInDepartment = 1
OR SalesRankingInDepartment IS NULL
ORDER BY TotalSales DESC
Now imagine you are wanting to get the number of sales each employee made, but for those times when two employees in a department have the same number of sales you want to control who gets the credit as the department’s top seller.. By default ROW_NUMBER
will just number them based on the first record it comes to, this usually isn’t good enough so to look at making it more predictable we might want to prefer certain employee types, let’s say Managers, and let’s assume the database has no ordinal anywhere associated with each employee type so we have to do it ad-hoc in the query:
DECLARE @employeePositions TABLE (EmployeePosition varchar(15), Ordinal int)
INSERT INTO @employeePositions VALUES ('Manager', 1)
INSERT INTO @employeePositions VALUES ('Associate', 2);
with salesCte AS
(
SELECT emp.EmployeeName
, emp.DepartmentId
, COUNT(s.SaleTotal) AS SalesCount
, ROW_NUMBER() OVER (PARTITION BY emp.DepartmentId
ORDER BY COUNT(s.SaleTotal) DESC
, pos.Ordinal) AS SalesCountRankingInDepartment
FROM Employee emp
LEFT OUTER JOIN Sales s
ON emp.EmployeeId = s.EmployeeId
LEFT OUTER JOIN @employeePositions pos
ON pos.EmployeePosition = emp.EmployeePosition
GROUP BY emp.EmployeeName, emp.DepartmentId
)
SELECT dep.DepartmentName
, sales.EmployeeName AS 'Department Best Seller'
, sales.SalesCount AS EmployeeSalesCount
FROM Department dep
LEFT OUTER JOIN salesCte sales
ON sales.DepartmentId = dep.DepartmentId
WHERE SalesCountRankingInDepartment = 1
OR SalesCountRankingInDepartment IS NULL
ORDER BY EmployeeSalesCount DESC
The use of the temporary table at the beginning of the script is to allow an ordinal/preference to be applied to each employee type.
2. Identifying and Removing Duplicate Rows
When duplicates creep in to a table, for whatever reason, while they are very easy to identify, they can sometimes be very time consuming to remove, for example let’s consider a table to hold addresses. You might identify duplicates using the COUNT
function:
SELECT COUNT(*) AS [AddressCount]
,[Name]
,[HouseNumber]
,[HouseName]
,[Line1]
,[Line2]
,[City]
,[County]
,[PostCode]
FROM Addresses
GROUP BY [Name]
,[HouseNumber]
,[HouseName]
,[Line1]
,[Line2]
,[City]
,[County]
,[PostCode]
HAVING COUNT(*) > 1
But knowing that there are duplicates is only half the problem, they still need to be fixed.
Before we can just delete the duplicates however, it’s probably best to try and fix any tables referencing the rows using foreign keys! This script does just that.
IF OBJECT_ID('tempdb..#addressIdChanges') IS NOT NULL
DROP TABLE #addressIdChanges;
WITH addressDuplicatesCte AS
(
SELECT [ID]
,[Name]
,[HouseNumber]
,[HouseName]
,[Line1]
,[Line2]
,[City]
,[County]
,[PostCode]
, ROW_NUMBER() OVER (PARTITION BY [Name]
,[HouseNumber]
,[HouseName]
,[Line1]
,[Line2]
,[City]
,[County]
,[PostCode]
ORDER BY [HouseNumber]) AS [RowNum]
FROM Addresses
)
SELECT ID AS OldID
, (
SELECT ID
FROM addressDuplicatesCte d
WHERE d.HouseNumber = dupes.HouseNumber
AND d.PostCode = dupes.PostCode
AND d.RowNum = 1
) AS NonDupeID
INTO #addressIdChanges
FROM addressDuplicatesCte dupes
WHERE RowNum > 1
UPDATE Orders
SET AddressID = #addressIdChanges.NonDupeID
FROM Orders
INNER JOIN #addressIdChanges
ON Orders.AddressID = #addressIdChanges.OldID
DELETE
FROM Addresses
WHERE ID IN (SELECT OldID FROM #addressIdChanges)
DROP TABLE #addressIdChanges
A few notes on this script:
- Duplicate rows won’t always be exact copies, it depends on the application domain. In this instance UK addresses can be uniquely identified from their HouseNumber, AddressLine1 (only useful for flats etc) and PostCode since these will correspond to one property only.
- You can’t use the result of
ROW_NUMBER()
in either a WHERE
or HAVING
clause, which is why I’ve wrapped it up in a CTE. - I opted to use a CTE purely for readability, it’s not really needed but it reduces the annoyance of reading large nested sub-queries!
- The reason for dropping the temp table at the start is for those times when the script has been edited and fails to run.
- If the temp table hasn’t been dropped, the query will error complaining the temp table already exists!
- This script doesn’t remove parent-less records, this is deliberate (The reason of this script is purely to remove duplicates!)
3. Paginating Results
This one would probably be the most common use case for ROW_NUMBER
. Sometimes you might have hundreds of thousands of rows, and almost never will you want to show all of these at one point and so there’s little to no point in returning them all at once, this is where pagination comes in.
Pagination is basically splitting datasets into pages in a predictable way in which row numbers can be applied.
CREATE PROCEDURE [Orders_Fetch]
(
@customerId UNIQUEIDENTIFIER
, @pageNumber INT
, @pageSize INT = 10
)
AS
DECLARE @firstRecord INT = @pageSize * (@pageNumber - 1)
, @lastRecord = (@pageSize * @pageNumber) - 1;
WITH ordersCTE AS
(
SELECT OrderDate
, OrderId
, ROW_NUMBER() OVER (ORDER BY OrderDate DESC) AS OrderNumber
FROM Orders
WHERE CustomerId = @customerId
)
SELECT *
FROM ordersCTE
WHERE OrderNumber BETWEEN @firstRecord AND @lastRecord
Links