T-SQL: Most Practical Split Function

Brad Joss

Rate me:

5.00/5 (6 votes)

11 Oct 2013CPOL5 min read

81.5K

1.9K

A SQL String Split function for the real world

Download FN_SPLIT.zip - 5.8 KB

Introduction

This SQL split function is intended for users who want a string split to function like a human wants it to, not how a computer thinks it should. It addresses several problems with split functions that have vexed me over the years, including:

Trims the input for each returned item, so a comma separated list of ints that looks like "1, 2, 3" (note the spaces after the commas) will be properly parsed down to ints.
Returns the input as nvarchar(max) or int with low overhead so items are not bound by varchar(8000) or forced to one datatype.
Supports delimiters of a size greater than 1 character (at the cost of performance).
Removes blank entries, so "1,,,2" will just return 1 and 2.

There are moderately rigorous unit tests included in the script that highlight use cases.

Background

The code should compile and run without modification, however, review the TODO comments for additional details. Specifically:

If you are on SQL2012, there is a better int parse that can be used.
The current code is case sensitive, if you want to make it insensitive, follow the steps in the TODO comments.
The concept of using a non-persistent computed column to pre-parse the data can be extended to other types, I just wired it up for ints since that was the most common use on our applications.

Using the Code

There are two functions, [dbo].[FN_SPLIT](@TextToSplit, @Delimiter) and an overloaded version to make splitting ints easier, [dbo].[FN_SPLIT_INT] (@TextToSplit). Both return a table containing the split values.

Example 1

Note in this example that items that can be parsed as an int are returned in the computed column of data type int [item_int]. Note that the length of things like " x " are trimmed so the returned length will be 1, and that the blank entries between 4 and 5 (,,,) are removed.

SQL

SELECT *, LEN(item) as [item_len] 
FROM [dbo].[FN_SPLIT]('1,  x  ,-2, y, 3.0, 4,,,5',',');

item	item_int	item_len
1	1	1
x	NULL	1
-2	-2	2
y	NULL	1
3.0	NULL	3
4	4	1
5	5	1

Example 2

Note in this example, using the overloaded int version, we do not need to pass in the delimiter, comma is assumed, and only values that are valid integers are returned.

SQL

SELECT * 
FROM [dbo].[FN_SPLIT_INT]('1,  x  ,-2, y, 3.0, 4,,,5');

item_int
1
-2
4
5

Example 3

In this example, we are splitting based on a multi-character delimiter.

SQL

select [item] FROM [dbo].[FN_SPLIT]('a[tag]b[tag][tag]d', '[tag]');

item
a
b
d

Example 4

In this example, we are double splitting on two separate delimiters, [tag] and [/tag] (imagine a case for pseudo-XML). Note the white space between the tags is properly trimmed off. While this type of string manipulation in SQL is not ideal, if you have to do it at the database layer, as opposed to the application or some other layer, this may be a practical solution.

SQL

select f2.item , len(f2.item) as item_len
FROM [dbo].[FN_SPLIT]('  [tag]a[/tag]  [tag]b[/tag]  ', '[tag]') f1
CROSS APPLY [dbo].[FN_SPLIT](f1.item, '[/tag]') f2

item	item_len
a	1
b	1

Explanation of the Code

There is a giant "IF" block that parses using 1 of 2 methods based on delimiter size. If the delimiter is 1 character, CHARINDEX is used, more than one and the slower PATINDEX is used.

SQL

DECLARE @del_len int -- Delimiter length
SELECT @del_len = LEN(@Delimiter)
IF @del_len = 1  -- CHARINDEX MODE (Much faster than PATINDEX mode)
    SELECT @ix = CHARINDEX(@Delimiter, @Text)
ELSE -- PATINDEX Mode (SLOW!) Try to pass in text that uses single char delimiters when possible
    SELECT @ix = PATINDEX(@del_pat, @Text)

The return type is always nvarchar(max) for each item, so there is not a data type dependency for what is returned.

In the returned table, there is a non-persistent computed column to cast the returned value to an int. This is low overhead in that if you do not select this column in the results, as would be the case if you do not need it cast to an int, that code is not executed. This is how we achieve the balance between flexibility, durability and convenience.

SQL

RETURNS @retTable TABLE 
(
    -- Output table definition
    [item] nvarchar(max) COLLATE DATABASE_DEFAULT NOT NULL,

    -- Since most of the times we split, we are splitting lists of ints, 
    -- this makes that process easier.
    -- Since the column is non persistent 
    -- it should only be evaluated when requested (low overhead).
    [item_int] as (
        -- SQL 2012 version, better, use if possible
        -- TRY_CONVERT([int], NULLIF([item],'')) -- SQL 2012 Format, 
        -- faster and safer, but pre-2012 code provided as well...

        -- Pre SQL 2012 syntax.  Not as reliable, so use 2012 when possible 
        -- by commenting out this CAST and using the TRY_CONVERT above        
        CAST(
            CASE 
                WHEN LEN(item) > 11 THEN NULL  -- LEN OF (-2147483648) is 11.  
                        -- Simple out of bounds checking.
                WHEN ISNUMERIC([item]) = 1 AND [item] NOT LIKE '%.%' _
                    THEN [item] -- Ensure value conforms to int
                ELSE null
            END 
        as int)
    )
)

For each returned item, the value is trimmed, so we can eliminate null/blank/whitespace values that are usually not wanted after a split operation.

SQL

SET @item = LTRIM(RTRIM(SUBSTRING(@Text,@pix,@text_len)))
IF @item <> '' AND @item IS NOT NULL  -- Only save non empty values
    INSERT INTO @retTable(item) VALUES (@item)

When doing the PATINDEX split, we escape the delimiter so if it contains [,% or _, those values will be treated as literals, not as wildcards.

SQL

-- Escape characters that will mess up the like clause, and wrap in wild cards %
SELECT @del_pat = '%' + REPLACE(REPLACE(REPLACE(@Delimiter
    , '[','[[]')
    , '%','[%]')
    , '_', '[_]') 
    + '%'

We do a pre-check, so if the item has no delimiters, it is returned with very low overhead, or if the input parameters are invalid, we return quickly with no results.
SQL
```
IF RTRIM(ISNULL(@Text,'')) = '' OR RTRIM(ISNULL(@Delimiter,'')) = ''
    RETURN
```
It may sound silly, but a big value-add is the unit tests at the bottom of the script. In all of the other split functions I ran through those tests, they would not return the results I was desiring, and in many cases, a flat-out error. Feel free to test your own functions against the cases outlined and see if the results are what your project would expect.

Points of Interest

There are quite a few unit tests included in the script. Some of the performance can be tested via unit tests as well, adjust @BigTextItemCount to examine how the function scales on your system. As noted above, feel free to extend the return table to support other data types depending on your application needs.

History

10 October 2013: Initial public draft
10 October 2013: Updated to include download link, link was missing
11 October 2013: Added two additional examples to highlight other features of the function
11 October 2013: As requested, added more of an explanation of the code
01 February 2015: Default to SQL 2012 version of TRY_CONVERT for item_int column
25 August 2015: Convert input into TRY_CONVERT to varchar(8000) to avoid a truncation error on [item] values greater than TRY_CONVERT's input max value.
12 November 2015: Bug Fix: Space delimiter is not honored correctly. Updated code and tests to support a delimiter of space and tab. Added tests for space, tab, new line and carriage return. Thanks to Jesse who found the bug!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Brad Joss

Software Developer (Senior)

United States

This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.