A followup to a post about subqueries with an estimation for how long it would take for the query to run.
In a recent post about subqueries, I compared 4 queries and how they handled 4 million rows of data. The queries involved:
1) Joins with AND conditions.
2) Subqueries with AND conditions.
3) Joins with OR conditions.
4) Subqueries with OR conditions.
I compared the speed differential between all of them [on SQL Server], but the last of the queries never finished. As I indicated I might do, I left it running overnight – it still never finished!
Well, my curiosity about how long it would take for the query to finish wasn’t diminished – it was only enhanced. How could I find out and at least be able to estimate how long it would take? I needed to have some sort of understanding of how the performance was degrading as more records were added to the table. So I did that. I ran the query using a smaller number of records in the StudentTestData table. Here’s the results:
My anticipation is SQL Server was doing some sort of cross-join type comparison in the background, and the numbers above seem to support this. The efficiency, in Big-O notation, is around O(n^2). Since I want to work with actual numbers to estimate how long the query would run with N = 4 million rows, here’s a rough function:
Number of Seconds = 6*(N/25000)^2, where N is the number of rows.
This function underestimates the time requirement as you add more records, as it is slightly too low for N=100,000 and it is about 2 minutes off for N=200,000. Despite this, it gives me something to work with.
As a spot-check, I tried N = 80,000 students. The function expectation is 61 seconds, and the actual time the query ran was 60 seconds. That’s about as close as I can ask for.
But what about if N is 4 million rows as it was in the previous post? Based on the function, the expectation is over 43 hours. Also, this is an understatement given the results above for N=200,000. It’s a good thing I didn’t wait for it to finish!