With the advent of CLR integration into SQL Server 2005, it has become incredibly easy to extend the power of the T-SQL programming language. Two of the areas that can be improved upon by way of CLR integration are
string matching and
T-SQL has a handful of basic
string matching functions (e.g.
string matching operators (e.g.
LIKE). These are insufficient for tasks such as creating a check constraint on a column of email addresses where you want to ensure only valid email addresses exist. The most common way to validate
strings like this is by using regular expressions. This is where CLR integration comes into play.
SQL Server 2005 now allows you to create user defined functions (among other things) using your .NET language of choice. This article will demonstrate how to develop a set of general purpose, user-defined, regular expression functions for consumption in T-SQL using C#.
Using the Code
My objective here is to wrap some of the more commonly used
static methods of the
RegEx class in the .NET Framework into something useable in a T-SQL environment. I felt the best approach was to develop this wrapper as a set of user defined functions that closely mirror the inputs and outputs of these methods in the
All four of the functions listed in this article share the same first two parameters:
This is the
string to be analyzed. You may pass either a literal
string, or a column name. This is the
string on which the regular expression will be executed.
This is the regular expression which will be executed against the
In addition, all four functions share the same last parameter.
This is a boolean parameter which when set to "
1" will instruct the regular expression engine to ignore cases when executing the regular expression against the input
string. If this parameter is set to "
0", a case-sensitive analysis will be performed.
The purpose of this function is to mirror the functionality of the
Regex.IsMatch method. In short, if the pattern specified in the
@Pattern parameter is found within the
string specified by
@Input, the return value will be "
1". Otherwise, the return value will be "
ufn_RegExIsMatch will tell you if there is a match of
@Input, this function will tell you what the matches are, where they are located in the
string and how long each match is. This function wraps the functionality of the
Regex.Matches method in the .NET Framework. This function returns a table. The columns of this table are as follows:
This function mirrors the functionality of
Regex.Replace in the .NET Framework and closely resembles the functionality of the
REPLACE function in T-SQL. The primary difference between
REPLACE is that the matching in
REPLACE is done solely on literal
string match comparison while
ufn_RegExReplace matches based on the regular expression specified in the
ufn_RegExReplace also takes an additional parameter that none of the other functions in this library contain which is the
@Replacement parameter. This is another
NVARCHAR(MAX) parameter which specifies the literal
string used to replace matches identified in
@Pattern is executed.
This function returns an
NVARCHAR(MAX)which represents the input
string with the specified replacements in place.
Splitting a delimited
string into its elements is a common task among T-SQL developers. The
ufn_RegExSplit function does exactly that. This implementation uses the
Regex.Split method from the .NET Framework.
ufn_RegExSplit, much like the
ufn_RegExMatches function returns a table as its output. This table however has only one column which is of the data type
NVARCHAR(MAX). The column name is
Match and it contains the elements of the
string split out by the delimiter specified in
Note: I realize that for most implementations, simply delimiter splitting is sufficient and for such a situation, I would recommend a similar function that uses the
String.Split method rather than regular expressions. However, since the focus of this article is regular expressions in T-SQL, I decided to stick with the regular expression implementation.
SQL Server Implementation
Now that the assembly has been created, we need to implement it into SQL Server. The steps below outline the process.
Determine the Database where the Assembly will Reside
Assemblies are database level objects – not server level objects. You will need to select the database that will contain the assembly:
--Set the database to which these functions will be installed
Enable CLR Integration
CLR integration is disabled by default. To enable it, you must run
sp_configure and set the "
clr enabled" property to
1. You must then issue the "
RECONFIGURE" command for the setting to take effect. Otherwise, the setting will not take effect until SQL Server is restarted.
--Enable CLR Integration
exec sp_configure 'clr enabled', 1
Create the Assembly
Once you've enabled CLR integration, the next step is to "create the assembly" by importing it from the actual compiled DLL file. SQL Server stores the assembly as a stream of bytes inside of the database. Once the assembly has been imported using the "
CREATE ASSEMBLY" command, you no longer need the actual DLL file.
CREATE ASSEMBLY [SqlRegEx] FROM 'C:\SqlRegEx.dll' WITH PERMISSION_SET = SAFE
Create the Functions
Once the assembly has been created, you can create the actual functions. The syntax to do this is very similar to creating a T-SQL function. The primary difference is instead of writing out the function body, you simply specify "
AS EXTERNAL NAME" followed by [AssemblyName].[Namespace.Class].[Method] where [AssemblyName] is the name of the assembly in SQL Server and Namespace, Class and Method all refer to the namespace, class and method inside of the assembly.
CREATE FUNCTION [dbo].[ufn_RegExIsMatch]
(@Input NVARCHAR(MAX), @Pattern NVARCHAR(MAX), @IgnoreCase BIT)
AS EXTERNAL NAME SqlRegEx.[SqlClrTools.SqlRegEx].RegExIsMatch
Call the Functions
At this point, the functions are loaded and ready to go. To call them, you simply use the same syntax that you would to call any other T-SQL function.
SELECT dbo.ufn_RegExIsMatch('Hello World', 'w', 1) --Ignores Case
SELECT dbo.ufn_RegExIsMatch('Hello World', 'w', 0) --Case Sensitive
Points of Interest
When creating an assembly in SQL Server, you must specify the requested permission set for the assembly.
In most instances, you will select "
SAFE" as this article does. This means that you are not directly accessing system resources such as the disk, network, etc. This is the most restrictive and safest set of permissions.
If you require access to the disk or network or other resources you may need to use the "
EXTERNAL_ACCESS" permission set.
In the most rare of circumstances, you may need to use "
UNSAFE". This permission set grants everything that
EXTERNAL_ACCESS grants and also allows access to unmanaged code. Use this setting with great caution.
Unicode and NVARCHAR
strings are handled in Unicode. As such, all
string data into and out of these functions is of the
NVARCHAR data type. I chose to use the "
MAX" length as we have no idea how large the
strings are that are being passed to these functions. Of course, you may wish to limit this in your own implementation.
You may notice that each of the methods included in the
SqlRegEx class are decorated with certain attributes. The list below will outline the purpose of each.
DataAccess attribute signals to SQL Server whether this method will access any user data on the current instance of SQL Server. In the case of these functions, none of them do so directly, so the value for this attribute is
false for each.
To quote the SQL Server Books Online "Deterministic functions always return the same result any time they are called with a specific set of input values and given the same state of the database. Non-deterministic functions may return different results each time they are called with a specific set of input values even if the database state that they access remains the same." All of the functions listed in this article are deterministic and as such this property has been marked
IsPrecise attribute indicates if the result is precise or not. For example, a function calculating values using the
FLOAT data type would not be precise as the
FLOAT data type can lose information during calculations. The functions in this library are precise and as such, this attribute is marked as
true for each function.
Name attribute contains a
string indicating what this function should be called when registered in SQL Server. This field is not required and has no practical value in SQL Server, but I have set it for each of these functions for my own personal documentation.
SystemDataAccess attribute is much the same as the
DataAccess attribute with the exception that it refers to system data rather than user data.
FillRowMethodName is only set for the functions that return tables. You will notice that for each function that returns a table, the actual return type in the .NET code is
IEnumerable. That means that whatever you return from this function must implement the
IEnumerable interface. The
IEnumerable interface allows the .NET Framework to loop through your results and for each result in your result set, it will call the function listed in the
FillRowMethodName attribute. If you look at the functions that are listed in the
FillRowMethodName attribute, you will see that they each take one object input parameter which represents the current element in the above mentioned loop. The rest of the parameters are output parameters and they will represent the columns of table that is output to SQL Server.
One thing that bothers me about this particular setup is that it does not appear that you can use
System.Collections.Generic.IEnumerable<T> for a return type from your function. Instead you must use
System.Collections.IEnumerable. This means that you must accept an object for the first parameter in your fill row method instead of being able to specify the actual type. This of course means boxing and un-boxing operations each time you create a new row. Since generics are available in .NET 2.0, I think it would be appropriate to allow them here.
Deterministic vs. Non-Deterministic
It is worth noting that if you plan to use either of the scalar functions listed in this article in a computed column and you wish to persist that computed column, the function must be marked as "Deterministic". You cannot persist a computed column that contains anything that is not deterministic. If you cannot persist your column, you will not be able to apply any indexes to it and it will be re-computed every time it is called. If your function is truly deterministic, you would want to make sure that it is not recomputed every time since by definition deterministic functions always return the same information, you would waste CPU cycles.
The CLR is a powerful new tool available to SQL Server developers. It opens up nearly the entire .NET Framework to SQL Server and if used appropriately, can bring tremendous power and value to any SQL Server application.