The Product Aggregate in T-SQL Versus the CLR
An exercise in algorithm analysis and design.
Contents
- Introduction
- The Basic Aggregate in the CLR
- The Aggregate in T-SQL
- Choosing Invariance To Nulls
- Invariance: CLR
- Invariance: CLR SIMULATION
- Invariance: RECURSION
- Invariance: EXP LOG
- Option: Keyword DISTINCT
- Okay (oh no) Looping
- A Performance Comparison--and Caveats
- Execution Plans: CLR, CLR SIMULATION, and EXP LOG
- Execution Plans: RECURSION
- RECURSION, Looping, and Safety
- Grouping
- Accuracy
- T-SQL Solutions and Aggregate Functions
- The Report Card
- Finally
- Afterword
Introduction
Noticeably absent from the SQL Server built-in aggregate
functions is product
—the
multiplication of numbers in a set of values. Perhaps one reason is because it can so often fail; I demonstrate this
with a simple loop, meant to simulate the product over 309 values in a column:
DECLARE @product float = 1.79, @loopRowPosition int = 1;
while( @loopRowPosition <= 308 )
begin
    select @product *= 10, @loopRowPosition += 1;
end
The result is 1.79E+308, the maximum positive number for the float data type. But change the start value to 1.798 and you get an arithmetic overflow. Put another way: multiply even a modest set of unremarkable numbers in a column expression and you get a meltdown.
Still the aggregate is useful in limited situations, and so we’ll develop it the expected way—in the CLR—but then also several ways in T-SQL. The ground rule for T-SQL versions is simple: no loops.
To make the solutions in both environments more flexible, we’ll extend them to support a choice over invariance to nulls.
In our analysis, topics we’ll cover include the following:
- How they work with grouping
- How they rate as to accuracy, safety, and performance
- What happens when we try to make generic aggregate-simulation functions from ad hoc code
This article is as much about good algorithm development as it is about the product aggregate. As our tactics evolve, at any point a wrong turn can harm performance or introduce subtle errors.
The Basic Aggregate in the SQL CLR
Below are the four methods of the product solution in C#.
private SqlDouble product;
private SqlInt16 countOfMultipliers;

public void Init()
{
    // Multiplicative identity; the count lets Terminate distinguish "no inputs" from a product of 1.
    product = 1;
    countOfMultipliers = 0;
}

public void Accumulate(SqlDouble multiplier)
{
    // Nulls are replaced by 1, so they change neither the product nor the count.
    product *= (multiplier.IsNull) ? 1 : multiplier;
    countOfMultipliers += (SqlInt16)((multiplier.IsNull) ? 0 : 1);
}

public void Merge(ProductAggregate mergeProductAggregate)
{
    // Fold the partial product from another thread's partition into this one.
    Accumulate(mergeProductAggregate.product);
}

public SqlDouble Terminate()
{
    // All-null or empty input returns null, matching the built-in aggregates.
    return (0 == countOfMultipliers) ? SqlDouble.Null : product;
}
The Accumulate method is invoked once for each number in the input column and computes the product. The ternary operator, which tests for null values, gives us our invariance-to-nulls property—i.e., the ability to ignore nulls and return a result over the non-null numbers, as the built-in aggregates do.
The Merge method is called when the number set is partitioned and the product computed over multiple threads.
The result is returned in the Terminate method.
The data member countOfMultipliers is incremented during accumulation using the same test for nulls done for the product member, allowing Terminate to return null when all input values are null or the input is empty.
Simple, isn't it?
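Once compiled into an assembly, the struct is registered and the aggregate created in T-SQL. The sketch below is illustrative only: the assembly name, file path, and the schema-qualified name aggr.PRODUCT (the name used later in this article) are assumptions, and the EXTERNAL NAME must match your actual namespace and class.
CREATE ASSEMBLY ProductAggregate                        -- assumed assembly name
FROM 'C:\clr\ProductAggregate.dll'                      -- illustrative path
WITH PERMISSION_SET = SAFE;
GO
CREATE AGGREGATE aggr.PRODUCT ( @multiplier float )     -- maps to Accumulate's SqlDouble parameter
RETURNS float
EXTERNAL NAME ProductAggregate.[ProductAggregate];      -- assembly.[class]; adjust for your namespace
GO
-- Once created, it is invoked like any built-in aggregate:
-- select product = aggr.PRODUCT( multiplier ) from aggr.T_ProductTest;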
The Aggregate in T-SQL
Remember the rule for our T-SQL versions: no loops. We can devise quicker, more elegant solutions.
Here is the second and final rule: the code must always return a value, possibly null, just as the system aggregates do.
This is the sample table for all code examples; a sketch of its definition follows the column list. The first five columns are populated from system view sys.messages. Table population is limited to 2,000 rows—much more and arithmetic overflow occurs for my sample values. I added four columns, whose meanings are as follows:
- multiplier. Our multiplicands of random values in the range 0.214-2.382, scale 15.
- groupcol. A computed, persisted column of values either ‘a’, ‘b’ or ‘c’ for group testing.
- yearcol. A computed, persisted column of values ‘2001’ – ‘2004’ also for group testing.
- ID. An IDENTITY surrogate key that aids the performance of one of our query forms.
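Here is a minimal sketch of the table definition. The five sys.messages columns and the computed-column expressions shown are assumptions for illustration; only the multiplier, groupcol, yearcol, and ID columns matter to the examples that follow.
CREATE TABLE aggr.T_ProductTest
(
    message_id      int             NOT NULL,    -- populated from sys.messages
    language_id     smallint        NOT NULL,
    severity        tinyint         NOT NULL,
    is_event_logged bit             NOT NULL,
    [text]          nvarchar(2048)  NOT NULL,
    multiplier      float           NULL,        -- random values, roughly 0.214-2.382, scale 15
    groupcol        AS ( char( 97 + message_id % 3 ) ) PERSISTED,                  -- 'a', 'b', or 'c' (illustrative expression)
    yearcol         AS ( cast( 2001 + message_id % 4 as char( 4 ) ) ) PERSISTED,   -- '2001'-'2004' (illustrative expression)
    ID              int IDENTITY( 1, 1 ) NOT NULL
                    CONSTRAINT PK_T_ProductTest PRIMARY KEY CLUSTERED              -- surrogate key for RECURSION/looping
);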
Three T-SQL Solutions
The three basic T-SQL solutions follow. I’ll refer to them throughout by the labels given.
CLR Simulation
The first solution is a kind of mirror of the CLR code: where data member product accumulates the result, we’ll employ a scalar variable, also initialized to 1.
DECLARE @product AS float= 1;
SELECT @product *= multiplier FROM aggr.T_ProductTest;
Hardly a hat trick—until we consider that the product will be one when aggr.T_ProductTest has no rows (we want a null result). Where the CLR solution solves the problem by setting and later testing a second data member, countOfMultipliers, we'll substitute an outer join on a derived table, which I prefer over introducing another variable outside the central query:
select
    @product *= ( prod.multiplier * onz.one )
from
    aggr.T_ProductTest prod
    RIGHT OUTER JOIN
    (
        select 1 as one
    ) onz
    on
        1 = 1; -- any tautology will do
When the table has rows, each number is multiplied by one. When it doesn't, prod.multiplier isn't even a null value for the multiplication, so the action in the SELECT statement cannot be applied. In this case the right outer join forces the SELECT clause to be evaluated over one row, and @product becomes null because multiplicand prod.multiplier is now null.
RECURSION
Starting with our second technique we make a clean break from the CLR approach. The float variable is not needed, although we still need to accommodate an empty result set:
WITH cteRecursiveProduct( level, product ) as
(
    select level = ID, product = multiplier
    from aggr.T_ProductTest
    where ID = ( select max( ID ) from aggr.T_ProductTest )
    UNION ALL
    select level = prodCTE.level - 1, product = ( prodCTE.product * prod.multiplier )
    from cteRecursiveProduct prodCTE
         inner join
         aggr.T_ProductTest prod
         on
         prodCTE.level - 1 = prod.ID
)
select product = product * onz.one
from cteRecursiveProduct
     RIGHT OUTER JOIN
     (
         select 1 as one
     ) onz
     on
     isnull( level, 1 ) = onz.one
OPTION ( MAXRECURSION 0 );
What is often required is that we put more than 100 frames on the call stack—the maximum allowed by default—so we allow unlimited frames with the query hint MAXRECURSION 0.
The recursion returns one row for each intermediate product over the multiplicands seen so far, so with the reverse-ID-order strategy the final product appears at level 1. The join condition on onz is modified to return the level 1 value regardless of whether it exists.
EXP LOG
Our third solution involves an arithmetic trick, but shares with the recursive technique the advantage of being able to be placed wholly within a larger query.
Let f(x) be a function that takes the log of each multiplier in a column—re-expressing every value as a power of a common base—and sums the results:
f(x) = sum( log( multiplier ) )
In this case, SQL Server system function log() uses the number e as the base by default. If the multipliers are 8 and 10, for example, they would be represented as e^2.0794 and e^2.3026 to four decimal places, and f(x) would return 4.382.
Let's extend the composition:
g(f(x)) = e^f(x)
The number e raised to f(x) can of course be represented as a decimal number, and in our example, e^4.382 = 80 (adjusted for rounding error); system function exp(), which is the inverse of log() (exp(log(x)) = log(exp(x)) = x), does this:
product = exp( sum( log( multiplier ) ) )
This all works because of the laws of exponents (multiplication case):
x^a * x^b * … * x^n = x^(a + b + … + n)
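To see the trick on the 8-and-10 example, you can run it directly against a derived table of those two values (the values clause here is only for illustration):
-- exp( log(8) + log(10) ) = 8 * 10 = 80, within floating-point rounding
select product = exp( sum( log( multiplier ) ) )
from ( values ( 8e0 ), ( 10e0 ) ) as v( multiplier );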
If the explanation is a little dense, don’t worry. What we do need to worry about is finding a negative number or zero in the input column, because the log for these is undefined. Here is the error you get when you try a log(0) or log(-5) operation:
Msg 3623, Level 16, State 1, Line 2
An invalid floating point operation occurred.
We circumvent the problem by adding the nullif() function to substitute nulls for zeros and the abs() function to ensure that all numbers are positive, but we'll need additional code to get the correct answer—zero whenever zeros occur in the column expression and a negative value when the count of negative multipliers is odd:
exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )
*
iif( sum( iif( multiplier = 0, 1, null ) ) > 0, 0, 1 )
*
iif( sum( iif( multiplier < 0, 1, 0 ) ) % 2 = 1, -1, 1 )
Note that unlike the other solutions, EXP LOG doesn’t require the outer join or other strategy to return a null on null input.
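As a quick sanity check of the adjustments, the arbitrary value set {5, -5, 2} should yield -50, and adding a 0 to the set should yield 0:
select product =
       exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )
     * iif( sum( iif( multiplier = 0, 1, null ) ) > 0, 0, 1 )
     * iif( sum( iif( multiplier < 0, 1, 0 ) ) % 2 = 1, -1, 1 )
from ( values ( 5e0 ), ( -5e0 ), ( 2e0 ) ) as v( multiplier );   -- returns -50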
Now we'll explore an option not available to the built-in aggregates.
Choosing Invariance to Nulls
The popular built-in aggregates count, sum, min, max, avg, and the others are invariant to nulls, meaning that having nulls in the column expressions over which they operate does not affect the outcome. The user cannot change this property. We, however, can, and so let’s see how it would be done, starting with the CLR.
CLR
Here is the line in the Accumulate method from the sample code above that does the computation:
product *= (multiplier.IsNull) ? 1 : multiplier;
The ternary operator throws out nulls from the input, making the implementation invariant to nulls. Remove the operator and the product is null whenever the input has a null value, making it variant to nulls. Because aggregates—system or CLR—don't expose a parameterized constructor, a single struct cannot give the user the option; simply introduce a second struct.
Set the property IsInvariantToNulls on the required attribute class SqlUserDefinedAggregateAttribute to true on one and false on the other, keeping in mind that enforcing the property is up to you.
By contrast, were we to place the T-SQL implementations into aggregate-simulating functions—more on this later—a bit parameter specifying invariance would obviate the need for duplication of code/effort.
You may have realized that the code snippets from the previous section differ in variance: the first and second are variant, and the last is invariant (because sum() discards the nulls that log() propagates). Let's start with the code that flips the property for the CLR simulation:
CLR Simulation
DECLARE @product float = 1;
select @product *= ( isnull( prod.multiplier, 1 ) * niladj.adjustor )
from aggr.T_ProductTest prod
     RIGHT OUTER JOIN
     (
         select iif( count( * ) > 0, 1, null )
         from aggr.T_ProductTest
         where multiplier is not null
     ) niladj( adjustor )
     on
     1 = 1; -- any tautology will do
select @product;
This rewrite of the derived table is correct over two boundary cases: when aggr.T_ProductTest is empty—as before—and when the multiplier column in all rows has null values. But the code is not optimal, because the count() aggregate requires a full (table or index) scan.
DECLARE @product float = 1, @countOfMultipliers smallint = 0;
select @product *= isnull( multiplier, 1 ),
@countOfMultipliers += iif( multiplier is not null, 1, 0 )
from aggr.T_ProductTest;
select product = @product * iif( 0 = @countOfMultipliers, null, 1 );
The solution now more closely simulates the CLR implementation. Remove the isnull() function and it becomes an alternate solution for the variant case.
Recursion
Because the recursion strategy doesn't depend upon outside variables, it must use the derived table or equivalent CTE to achieve invariance, and pay the performance penalty.
select level = ID, product = isnull( multiplier, 1 )...                       -- anchor
select ..., product = ( prodCTE.product * isnull( prod.multiplier, 1 ) )      -- recursive
...
select product = product * niladj.adjustor
from cteRecursiveProduct
     RIGHT OUTER JOIN
     (
         select iif( count( * ) > 0, 1, null )
         from aggr.T_ProductTest
         where multiplier is not null
     ) niladj( adjustor )
     on
     level = 1
EXP LOG
Flipping the property in the opposite direction for EXP LOG means one more multiplier:
exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) ) -- no log( <= 0 )
*
iif( sum( iif( multiplier = 0, 1, null ) ) > 0, 0, 1 ) -- 0-adjust
*
iif( sum( iif( multiplier < 0, 1, 0 ) ) % 2 = 1, -1, 1 ) -- neg no adjust
*
iif( sum( iif( multiplier is null, 1, 0 ) ) > 0, null, 1 ) -- null-variant
DISTINCT Keyword
Finally, another important option we could implement is the DISTINCT keyword. I won't expand on it, but suffice it to say that the CLR would need more effort: e.g., a data member vector could buffer all numbers in Accumulate, and the vector could be sorted to bypass duplicates during multiplication in Terminate.
Extensions to the T-SQL strategies vary in complexity and performance. Do you see the error in this code? (Hint: let the column contain values 5 and -5.)
exp( sum( DISTINCT log( nullif( abs( multiplier ), 0 ) ) ) )
Okay (oh no) Looping
While the looping tactic is verboten, I reference its metrics as the baseline in the section on performance next, so here is one optimal form.
DECLARE @product float = 1.0, @countOfMultipliers smallint = 0,
        @next_row smallint = 1, @max_row smallint;
select @max_row = max( ID ) from aggr.T_ProductTest;
while( @next_row <= @max_row )
begin
    DECLARE @next_multiplier float;
    select @next_multiplier = multiplier
    from aggr.T_ProductTest
    where ID = @next_row;
    select @product *= isnull( @next_multiplier, 1 ),
           @countOfMultipliers += iif( @next_multiplier is null, 0, 1 );
    SET @next_row += 1;
end
select product = @product * iif( @countOfMultipliers = 0, null, 1 );
For your eyes only. Destroy the code after reading.
A Performance Comparison—and Caveats
The test I performed for all solutions in their invariant-to-nulls forms was a product over the multiplier column for all 2,000 rows in the aggr.T_ProductTest table. I ran each solution 100 times, 1/20 second apart, to get the logical disk reads and rough averages for CPU and duration from the Profiler.
The average times clearly show that the recursive and looping techniques are not viable. The first pair of numbers in their measurements are the values for the code as displayed above; the second pair are for their safer versions, to be explained.
The first cost for recursion is very good, and for looping, exceptional. So why the disconnect between the optimizer's estimates and actual performance?
Execution Plans: CLR, CLR SIMULATION, and EXP LOG
Above is the execution plan for the SQL CLR aggregate. If you add one Compute Scalar operator on each side of the Stream Aggregate, you essentially have the plan for the EXP LOG code; subtract the Stream Aggregate and you have the plan for the CLR SIMULATION. In all cases, the index scan on the clustered primary key is known to the optimizer to return a fixed number of rows—the Estimated Number of Rows = the Actual Number of Rows = 2,000—and so it can produce accurate estimated costs.
The graphic above depicts the operators that start one branch of the recursive part in the estimated execution plan followed by those that start in the actual execution plan. Where the other strategies employ a one-pass index scan, recursion and also looping must seek on the same index to get multipliers from successive IDs, once for each recursion/iteration. This accounts for more page touches in looping, and for recursion, seek must re-fetch all the multipliers seen so far plus the current one for each stack frame, so the reads skyrocket.
By visually inspecting the code, we can see that the anchor gets the multiplier at ID 2000 (the last row in the sample table), and each recursion operates at the next lower contiguous ID stopping at ID one for a total of 2000. In fact, in the actual plan, the Actual Number of Rows returned by the seek operator is 1999, as indicated by the much thicker outbound arrow.
But the optimizer can't deduce the row count from the recursive definition, and so it puts in a placeholder value of one for Estimated Number of Rows, as indicated by the thin arrow. It is for this reason that the optimizer cannot give a reasonable estimated cost for recursion or for looping as well. (The estimated branch cost may accurately reflect the effort to get the multipliers at IDs 2000 and 1999 or just one multiplier.)
Recursion, Looping, and Safety
When I introduced the T-SQL solutions, I noted that I added column ID as an int IDENTITY clustered primary key to aid the performance of one of the solutions (the compact natural key is message_id, which otherwise would be clustered). That solution is of course RECURSION (and looping along with it). But for this to work, we must guarantee two things: 1) that the minimum ID is one; and 2) that there are no gaps in the ID sequence, such as those resulting from row deletions and rolled-back transactions. And often this is not the case.
Let’s add a CTE that gives us our contiguous IDs starting at one:
WITH cteMultiplierRank( rankNo, multiplier ) as
(
    select CAST( ROW_NUMBER( ) OVER( ORDER BY ID ) as int ), multiplier
    from aggr.T_ProductTest
),
The recursive CTE is rewritten to reference this CTE instead of the sample table, and the execution plan shows that this CTE as well as the anchor and recursive parts of the recursive CTE all share the starting operators below. The graphic is the start of the recursive branch of the actual execution plan:
In the estimated plan, each operator for all branches outputs 2,000 rows, but run-time information shows that the recursive branch operators each produced 4,000,000 actual rows—2,000 sample table rows times 2,000. The operators essentially set up 2,000 groups of all (rankNo, multiplier) pairings, and a Filter operator to come applies the recursive condition to determine the set of multipliers to use for each group. The logical disk reads go from 22,010 to 156,147, the estimated cost balloons to an unacceptable (untrustworthy! but still...) 2.3517, and the user experience degrades proportionately. This next attempt fares better.
Performance Tactic: Table Variable
DECLARE @tblMultiplier TABLE( rowno int IDENTITY PRIMARY KEY, multiplier float NULL );
After we rewrite the recursive CTE to reference the table variable, the execution plan is identical to the original and the cost comes back to a healthy 0.016537. But in practice 85% of the total cost comes from populating the table variable from the sample table, bringing the cost to 0.1077. The second sets of numbers in RECURSION and Looping show the additional cost of this safety.
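Putting the pieces together, here is a sketch of the safer RECURSION form over the table variable; the population step is the part that carries most of the added cost.
DECLARE @tblMultiplier TABLE( rowno int IDENTITY PRIMARY KEY, multiplier float NULL );

INSERT INTO @tblMultiplier( multiplier )
select multiplier from aggr.T_ProductTest;            -- gap-free, 1-based rowno regardless of ID gaps

WITH cteRecursiveProduct( level, product ) as
(
    select level = rowno, product = multiplier
    from @tblMultiplier
    where rowno = ( select max( rowno ) from @tblMultiplier )
    UNION ALL
    select level = prodCTE.level - 1, product = ( prodCTE.product * prod.multiplier )
    from cteRecursiveProduct prodCTE
         inner join
         @tblMultiplier prod
         on
         prodCTE.level - 1 = prod.rowno
)
select product = product * onz.one
from cteRecursiveProduct
     RIGHT OUTER JOIN
     (
         select 1 as one
     ) onz
     on
     isnull( level, 1 ) = onz.one
OPTION ( MAXRECURSION 0 );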
Safer, certainly—but is it safe now? Our revised solutions share with the CLR SIMULATION a potential problem inherent in their not being expressible in one atomic statement—a problem to be addressed in the section T-SQL Solutions and Aggregate Functions.
Grouping
Success with EXP LOG
Built-in and user-defined CLR aggregates can be used in SELECT, HAVING, and ORDER BY clauses. Of our T-SQL solutions, only EXP LOG is a wholly self-contained expression, and so it too can be used in these clauses:
select groupcol, yearcol,
product = exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )
from aggr.T_ProductTest
group by groupcol, yearcol
having exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) ) > 0
order by exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) );
I’ve left off the part of the calculation that adjusts for zeros and negative numbers for brevity.
Aggregate Window Functions: A Simple Fix
product_by_year_exp = sum( exp( sum( log( multiplier ) ) ) )
OVER( PARTITION BY yearcol ),
product_by_year_clr = sum( aggr.PRODUCT( multiplier ) )
OVER( PARTITION BY yearcol )
If we add these column expressions to the select list in the query, exp is rejected by the compiler as a window function because it is not an aggregate or other acceptable function type. But sum() is, and we can use it as the outer function to get the intended result.
Function aggr.PRODUCT is also an aggregate—it is the local name for the CLR aggregate—but this too is rejected (for an unknown reason), and so we reuse the trick.
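For context, here is a sketch of the grouped query with the windowed columns added; the zero and negative adjustments are again omitted, as in the grouping example above.
select groupcol, yearcol,
       product             = exp( sum( log( multiplier ) ) ),
       product_by_year_exp = sum( exp( sum( log( multiplier ) ) ) )
                             OVER( PARTITION BY yearcol ),
       product_by_year_clr = sum( aggr.PRODUCT( multiplier ) )
                             OVER( PARTITION BY yearcol )
from aggr.T_ProductTest
group by groupcol, yearcol;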
Poor Grouping Choices
Neither of our remaining strategies, RECURSION or the CLR SIMULATION, is suitable for grouping. For our sample query, either we would need to know in advance the (groupcol, yearcol) paired values of interest—or employ more code to get the pairings—and windowing makes no sense. In particular, recursion is not a solution for grouping.
The CLR SIMULATION, with its individual-variable technique, is marginally better but not necessarily safe. Here products for several years are set in one SELECT clause in the invariant-to-nulls form; grouping is implied in the SELECT clause:
select
    @product2001 *= iif( '2001' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2001' = niladj.yearcol, niladj.adjustor, 1 ),
    @product2002 *= iif( '2002' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2002' = niladj.yearcol, niladj.adjustor, 1 ),
    @product2003 *= iif( '2003' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2003' = niladj.yearcol, niladj.adjustor, 1 ),
    @product2004 *= iif( '2004' = niladj.yearcol, isnull( multiplier, 1 ), 1 ) * iif( '2004' = niladj.yearcol, niladj.adjustor, 1 )
from
    aggr.T_ProductTest proTest
    RIGHT OUTER JOIN
    (
        select niladj.yearcol, cnt_year_non_null.cnt
        from
            (
                select yearcol, iif( count( * ) > 0, 1, null )
                from aggr.T_ProductTest
                where multiplier is not null
                group by yearcol
            ) cnt_year_non_null( yearcol, cnt )
            RIGHT OUTER JOIN
            (
                select yearcol, nilAdj = null
                from
                    (
                        select [2001] = 1, [2002] = 1, [2003] = 1, [2004] = 1
                    ) p
                    UNPIVOT
                    (
                        nilAdj FOR yearcol in( [2001], [2002], [2003], [2004] )
                    ) as unpvt
            ) niladj( yearcol, adjustor )
            on
            cnt_year_non_null.yearcol = niladj.yearcol
    ) niladj( yearcol, adjustor )
    on
    proTest.yearcol = niladj.yearcol;
Oh myyyyyyyyyy! The derived table must now do its own outer join on an unpivot relational operator or employ a similar strategy (think UNION ALL in the second derived table) to ensure that each year has its own adjustor row (with possibly null adjustor), not just those years having rows in the sample table. Of course, the more optimal second form for invariance should have been used, but any developer may decide against using one @countOfMultipliers variable per year while not thinking of the derived-table problem. Increment the failure-point column.
Accuracy
All product strategies except one agree that the total non-grouped product for the sample data is 6.04851066640848E-310. The exception is EXP LOG, which evaluates to 6.04851066640616E-310. This is a small difference over a tiny number—clearly a rounding error in exp or log or both. In other testing with very small numbers, including grouping, it sometimes matched the others exactly and sometimes did not. With limited testing over small samples and larger numbers, it always agreed. You make the call.
T-SQL Solutions and Aggregate Functions
An aggregate is a scalar function whose input is a column expression of a suitable data type. The CLR solution certainly qualifies; our T-SQL solutions do not, even if we wrap them in scalar functions.
CREATE TYPE aggr.tblMultiplier AS TABLE( multiplier float NULL );
select
product_clr = aggr.PRODUCT( multiplier ),
product_tsql = aggr.sf_PRODUCT( cast( multiplier as aggr.tblMultiplier ) )…
In the above, the compiler recognizes the first function, aggr.PRODUCT, as a user-defined aggregate built from a .NET object, but no sleight of hand can make the compiler accept the second function, aggr.sf_PRODUCT, written in T-SQL, as an aggregate, or allow a column to be cast as a user-defined table type. If we want to use the T-SQL function as a generic aggregate, we must, for each product desired, fill a table variable (of type aggr.tblMultiplier) and set a variable to its return in a separate statement. Aside from being inefficient and inelegant, this opens the door to problems arising from unrepeatable reads.
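For reference, here is a minimal sketch of what such a function might look like; it reuses the EXP LOG expression plus a bit parameter for invariance to nulls, and the real version in the download may differ.
CREATE FUNCTION aggr.sf_PRODUCT
(
    @tblMultiplier    aggr.tblMultiplier READONLY,    -- table-valued parameters must be READONLY
    @invariantToNulls bit
)
RETURNS float
AS
BEGIN
    RETURN
    (
        select
              exp( sum( log( nullif( abs( multiplier ), 0 ) ) ) )                -- no log( <= 0 )
            * iif( sum( iif( multiplier = 0, 1, null ) ) > 0, 0, 1 )             -- 0-adjust
            * iif( sum( iif( multiplier < 0, 1, 0 ) ) % 2 = 1, -1, 1 )           -- sign adjust
            * iif( @invariantToNulls = 0
                   and sum( iif( multiplier is null, 1, 0 ) ) > 0, null, 1 )     -- variant option
        from @tblMultiplier
    );
END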
A Read-Write Conflict Example
Let’s look at a basic problematic scenario, keeping in mind that the error is less likely to happen with aggregates, system or user-defined in the SQL CLR, because they can always be placed into larger (atomic) statements (and locks are held for the duration of the statement).
-- T1
begin tran;
DECLARE @tblMultiplier aggr.tblMultiplier, @product float, @count int;
INSERT INTO @tblMultiplier
select multiplier from aggr.T_ProductTest where groupcol = 'a';
select @product = aggr.sf_PRODUCT( @tblMultiplier, 1 ); -- 1 is 'invariant to nulls'
<context switch to T2: INSERT a row having groupcol value = ‘a’>
select @count = count( * ) from aggr.T_ProductTest where groupcol = 'a';
...
The non-serializable schedule represents a READ-WRITE transaction conflict, and the sample code demonstrates it. After T1 reads the rows in aggr.T_ProductTest falling under groupcol 'a', T2 commits a row to the group, making T1's second read a phantom read (a type of unrepeatable read). Without the context switch, (@product, @count) is <1.3322913590615E-104, 675>; with it, the consistent values would be <3.86364494127789E-104, 676>, but the scalar variables actually end up as <1.3322913590615E-104, 676>, out of sync with each other.
To prevent the phantom read, we could raise the transaction isolation level to SERIALIZABLE or force serializable access to aggr.T_ProductTest alone with an appropriate table hint (TABLOCKX, e.g.). But that tactic—pessimistic locking—potentially decreases concurrency and increases the likelihood of deadlocks. For this particular example, it would be better to get the count from the table variable; in practice, subtle errors are made.
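In this example, the safer choice is a one-line change that counts the snapshot the transaction already took:
select @count = count( * ) from @tblMultiplier;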
Extending the Module
If we persist in the code module strategy, we should optionally make it support grouping as well as the HAVING and ORDER BY clauses. For example, if we group by groupcol and yearcol, we would want the result set in one invocation rather than one for each (groupcol, yearcol) pairing, with possible constraints on the groupcol/yearcol groups. Reducing the number of calls lessens the risk of unrepeatable reads but doesn't eliminate it.
For it to be generic, it must work for grouping over any column list from any table with a numeric column. Another desideratum is that it determine the grouping columns without needing a parameter. You may have noticed: the function has morphed into a stored procedure using dynamic SQL.
Finally, as per the code sample above, it should have a parameter for specifying invariance to null behavior. My solution is in the download in the Product Aggregate Generic Procedure folder.
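To give a flavor of the approach, here is a hypothetical, stripped-down sketch of such a procedure; the object and parameter names are inventions for illustration, it is not hardened against SQL injection, and the version in the download is more complete.
CREATE PROCEDURE aggr.usp_PRODUCT
    @schemaAndTable    nvarchar(256),            -- e.g. N'aggr.T_ProductTest'
    @numericColumn     sysname,                  -- column to multiply
    @groupByColumnList nvarchar(max) = NULL,     -- e.g. N'groupcol, yearcol'
    @invariantToNulls  bit = 1
AS
BEGIN
    DECLARE @col nvarchar(260) = quotename( @numericColumn ),
            @sql nvarchar(max);

    SET @sql =
          N'select ' + isnull( @groupByColumnList + N', ', N'' )
        + N'product = exp( sum( log( nullif( abs( ' + @col + N' ), 0 ) ) ) )'
        + N' * iif( sum( iif( ' + @col + N' = 0, 1, null ) ) > 0, 0, 1 )'
        + N' * iif( sum( iif( ' + @col + N' < 0, 1, 0 ) ) % 2 = 1, -1, 1 )'
        + iif( @invariantToNulls = 1, N'',
               N' * iif( sum( iif( ' + @col + N' is null, 1, 0 ) ) > 0, null, 1 )' )
        + N' from ' + @schemaAndTable
        + isnull( N' group by ' + @groupByColumnList, N'' ) + N';';

    EXEC sys.sp_executesql @sql;
END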
Without a mechanism to pass columns as parameters, the problems—all puns intended—multiply, and our attempts to make aggregate-simulating functions from some good ad-hoc T-SQL code are for naught.
The Report Card
If the grading seems arbitrary, think of me as being some of the teachers you had when you were in school.
Finally
The MSDN library demonstrates the CLR SQL Server user-defined aggregates with an example that counts the number of vowels in a column of strings (http://msdn.microsoft.com/en-us/library/91e6taax(v=vs.90).aspx). I've written a T-SQL scalar function that counts the values for one input string; the code is in the download. Following is its invocation that matches the CLR functionality along with performance metrics when run over the text column from our sample table:
select cntVowels = sum( aggr.sf_CountTheVowels( <some_string_column> ) )...
Function aggr.sf_CountTheVowels approximates the Accumulate method, is easy to write, and doesn't try to be an aggregate—the sum() is the aggregate—making it safe by our standard. This time the optimizer knows up front to expect 2,000 rows from the index scan, but probably because of the logical disk reads involved, it gives a better cost to the T-SQL solution even though it runs 10 times slower. As a rule, CLR code gives better performance; in the product-aggregate example we may have hit a rare exception.
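A minimal sketch of such a per-row counter follows, assuming the function name used above; the version in the download may differ in its details.
CREATE FUNCTION aggr.sf_CountTheVowels( @inputString nvarchar(4000) )
RETURNS int
AS
BEGIN
    DECLARE @lowered nvarchar(4000) = lower( isnull( @inputString, N'' ) );
    -- Count each vowel by measuring how much the string shrinks when that vowel is removed;
    -- datalength (bytes) avoids the trailing-space trimming that len() would introduce.
    RETURN (   ( datalength( @lowered ) - datalength( replace( @lowered, N'a', N'' ) ) )
             + ( datalength( @lowered ) - datalength( replace( @lowered, N'e', N'' ) ) )
             + ( datalength( @lowered ) - datalength( replace( @lowered, N'i', N'' ) ) )
             + ( datalength( @lowered ) - datalength( replace( @lowered, N'o', N'' ) ) )
             + ( datalength( @lowered ) - datalength( replace( @lowered, N'u', N'' ) ) )
           ) / 2;
END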
Afterword
After the article appeared, I verified that the search engines would find it. They do—and they also list another Code Project article (a Tip/Trick) that discusses computing the product with logarithms. The tip, by Dr. Alexander Bell, references an earlier work of his that details his research (ours are independent); read the tip's referenced article for a second view of the problem.
In that latter article, he discusses the performance-universality dilemma, which means that the more cases a solution handles, the costlier it is. For EXP LOG that implies addressing the nulls, zeros, and negative numbers that may occur in the column expression, which we did. I'll call code that doesn't the bare-bones version.