65.9K
CodeProject is changing. Read more.
Home

Case-Insensitive Sort of UTF8 Data Using System.Data.SQLite

starIconstarIconstarIconstarIcon
emptyStarIcon
starIcon

4.92/5 (19 votes)

Nov 22, 2009

CPOL

4 min read

viewsIcon

79212

downloadIcon

826

SQLite lacks case-insensitive sort of UTF8 data. In this article, you will see how to get rid of this limitation in .NET.

Introduction

If you are using SQLite in .NET, perhaps you could have the problem of case-insensitive sorting of UTF8 strings.

Background

This code is useful if you use System.Data.SQLite, which can be downloaded here: Download System.Data.SQLite. I have been using SQLite for a long time and stored Russian strings in it. As it's stated in SQLite FAQ, case-insensitive matching of Unicode characters does not work. This problem has become very significant when the amount of stored data has grown. System.Data.SQLite provides a cool solution for that but it's unclear how to implement it.

Using the Code

Using the code is pretty simple: just add this line in the beginning of your application:

SQLiteFunction.RegisterFunction(typeof(SQLiteCaseInsensitiveCollation));

The Database

Okay, let us start. First of all, SQLite has three built-in collations (if you need more information, you can read it on SQLite's site):

  • BINARY – All the characters are treated as bytes
  • NOCASE – Compares ASCII characters in case-insensitive manner; others are treated as bytes
  • RTRIM – Just skips trailing spaces

We want to create the following table:

ID Name Val (some value stored in the table) Comment
1 Foo 123  
2 foo 234  
3 Bar 345  
4 bar 456  
5 Буква 567 A sample Russian word starting from uppercase letter
6 буква 678 The same Russian word starting from lowercase letter
7 Тест 789 Another Russian word starting from uppercase letter
8 тест 890 The same Russian word starting from lowercase letter

Here's SQL to create it:

CREATE TABLE `testtbl` (
    `ID` INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    `Name` TEXT NOT NULL COLLATE NOCASE, 
    `Val` REAL)   

No we'll run a simple SELECT:

SELECT * FROM `testtbl` ORDER BY `Name`

We'll get the result:

ID Name Val (some value stored in the table) Comment
2 Bar 345  
3 bar 456  
1 Foo 123  
2 foo 234  
5 Буква 567 Incorrect sort!
7 Тест 789
6 буква 678
8 тест 890

As you see, data in Russian is sorted incorrectly. Correct would be the following order: Буква, буква, Тест, тест.

Now we'll fix this. It would be perfect if SQLite could sort by this expression (in C#):

string.Compare(x, y, CultureInfo.CreateSpecificCulture("ru-RU"), 
	CompareOptions.IgnoreCase);

Defining Case-Insensitive Collation in C# Code

I have created a simple class SQLiteCaseInsensitiveCollation that adds collation to SQLite engine:

using System.Data.SQLite;
using System.Globalization;

namespace SQLiteUTF8CIComparison {
    /// 
    /// This function adds case-insensitive sort feature to SQLite engine 
    /// To initialize, use SQLiteFunction.RegisterFunction() 
    /// before all connections are open 
    /// 
    [SQLiteFunction(FuncType = FunctionType.Collation, Name = "UTF8CI")]
    public class SQLiteCaseInsensitiveCollation : SQLiteFunction {
        /// 
        /// CultureInfo for comparing strings in case insensitive manner 
        /// 
        private static readonly CultureInfo _cultureInfo = 
			CultureInfo.CreateSpecificCulture("ru-RU");

        /// 
        /// Does case-insensitive comparison using _cultureInfo 
        /// 
        /// Left string
        /// Right string
        /// The result of a comparison
        public override int Compare(string x, string y) {
            return string.Compare(x, y, _cultureInfo, CompareOptions.IgnoreCase);
        }
    }
}   

Please pay your attention to the attribute:

[SQLiteFunction(FuncType = FunctionType.Collation, Name = "UTF8CI")]

Here we define that:

  • This is a function that must be added to SQLite engine
  • It's a collation function
  • Its name is UTF8CI which means UTF8 Case Insensitive

To activate the collation, you need register it in SQLite engine – put this line of code before you create all SQLite connections in your application:

SQLiteFunction.RegisterFunction(typeof(SQLiteCaseInsensitiveCollation));

Please note that the function must be called only once, e.g. when your application starts.

Using UTF8CI Collation in SQLite

How to use this collation in SQLite?

CREATE TABLE `testtbl` (
    `ID` INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    `Name` TEXT NOT NULL COLLATE UTF8CI, 
    `Val` REAL)

As you see, everything we need is just replacing the collation. Now we also need create an index with this collation:

CREATE INDEX `IDX_testtbl_Name` ON `testtbl` (`Name` COLLATE UTF8CI)

Performing a Test

Now the SELECT will give us the result:

ID Name Val (some value stored in the table) Comment
2 Bar 345  
3 bar 456  
1 Foo 123  
2 foo 234  
5 Буква 567 That's it!
6 буква 678
7 Тест 789
8 тест 890

Yep! The Russian data is sorted in case-insensitive way, as we wanted. Our index is used because it's case-insensitive too.

Compatibility 

What happens if you open SQLite file with UTF8CI encoding without adding our custom encoding? Let's see the result of SELECTs:

SELECT * FROM `testtbl`

It executes normally – neither sort collation, nor index is used.

SELECT * FROM `testtbl` ORDER BY `Name`

Oops. We get an error: No such collation: UTF8CI. To avoid this, you can use the following SQL:

SELECT * FROM `testtbl` ORDER BY `Name` COLLATE BINARY

This executes normally and gives the result as usual case-sensitive (binary) sort. Our index with UTF8CI encoding cannot be used: SQLite has no information about the collation of the index.

Points of Interest

There are several things we have learned:

  • It is possible to use case-insensitive sort for UTF8 data in SQLite.
  • You can also create case-insensitive UTF8 index in SQLite.
  • You can use my function (with your locale) for creating UTF8CI collation in SQLite.
  • Please note: if you will forget to register this function, you will get an exception when reading data.
  • The database will be usable without the code of the function; but sort by field will be unavailable only with built-in encodings.
  • Any index created with UTF8CI encoding will be disabled.

Revision History

  • Nov. 2009 - Initial revision