![]() |
Database »
SQL Reporting Services »
General
Intermediate
Implementation of Unicode in SQL ServerBy josekonoorInformation about unicode implementation in SQL Server |
SQL, Windows, Visual-Studio, SQL2000, Architect, DBA, Dev, Design
|
||||||||
|
Advanced Search Add to IE Search |
|
|
|
||||||||||||||||
On both Windows/DOS and UNIX systems, the 128 most commonly-used characters are each represented by a sequence of 7 bits known as the character's ASCII code. They are traditionally stored as bytes (8 bits), i.e. the 7-bit ASCII code plus a leading zero. The characters include letters, digits, punctuation marks, and nonprintable control characters such as the backspace, tab, carriage return, etc.
Traditional character sets (like the ANSI alphabet) base on 8 bit characters called a byte. A single byte can represent up to 256 different values and, thus, characters. This is good enough to represent Western scripts such as those used in the English, French or German language. However, if it comes to more complex languages like Japanese or Korean, 256 different characters are simply insufficient.
So, users of these languages have developed so-called double byte character sets, called DBCS. In DBCS, each character is represented by one or more bytes. Character encoding specifies how to interpret the byte values and whether or not a byte is a single character or just part of a larger set of bytes representing a multi-byte character.
Unfortunately, there are many different DBCS encodings for a given language. To make matters worse, different operating systems and different programming languages tend to use different DBCS encodings. Also, programming is relatively complex because of byte parsing operations.
Unicode's goal is to solve this issue by using more than one byte for each character. So with Unicode, all different characters can be stored in one string. As all characters have a fixed width, programming complexity is greatly reduced.
Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in any of the world's writing systems. Unicode consists of a repertoire of about 100,000 characters. The Unicode Consortium is based in California and the members include main computer software and hardware companies with any interest in text-processing standards, such as Adobe Systems, Apple, HP, IBM, Microsoft, Xerox and many others.
Several mechanisms have been specified for implementing Unicode. Which one implementers choose depends on available storage space, source code compatibility and interoperability with other systems.
Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed "code values." The Universal Character Set (UCS-2) format is a fixed-length character string where each character is represented by 2 bytes.
UTF encodings include:
UTF-7 � a relatively unpopular 7-bit encoding, often considered obsolete (not part of The Unicode Standard but rather an RFC)
UTF-8 � an 8-bit, variable-width encoding, which maximizes compatibility with ASCII.
UTF-EBCDIC � an 8-bit variable-width encoding, which maximizes compatibility with EBCDIC. (not part of The Unicode Standard)
UTF-16 � a 16-bit, variable-width encoding
UTF-32 � a 32-bit, fixed-width encoding
SQL Server 7.0 and SQL Server 2000 use a different Unicode encoding (UCS-2) and do not recognize UTF-8 as valid character data.
In a computer, characters are represented by different patterns of bits being either ON or OFF. There are 8 bits in a byte, and the 8 bits can be turned ON and OFF in 256 different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns.
The physical storage of character strings in SQL Server is controlled by collations. A collation specifies the bit patterns that represent each character and the rules by which characters are sorted and compared. SQL Server supports objects that have different collations being stored in a single database. Separate SQL Server 2000 collations can be specified down to the level of columns. Each column in a table can be assigned different collations.
Microsoft SQL Server collation settings depend on the type of installation. In general, choose a SQL Server collation that supports the Microsoft Windows system locale most commonly used at your site. To identify your Windows system locale on computers running Microsoft Windows 2000, click Regional Settings in Control Panel, and then click the General tab if necessary to display the current system locale.
In most cases, a computer runs the Windows system locale that matches the language requirements of the user, so Microsoft SQL Server Setup automatically detects the Windows system locale and chooses the appropriate collation.
One problem with data types that use 1 byte to encode each character is that the data type can only represent 256 different characters. This forces multiple encoding specifications (or code pages) for different alphabets such as European alphabets, which are relatively small. It is also impossible to handle systems such as the Japanese Kanji or Korean Hangul alphabets that have thousands of characters.
The Unicode specification addresses this problem by using 2 bytes to encode each character. There are enough different patterns (65,536) in 2 bytes for a single specification covering the most common business languages. Because all Unicode systems consistently use the same bit patterns to represent all characters, there is no problem with characters being converted incorrectly when moving from one system to another. You can minimize character conversion issues by using Unicode data types throughout your system.
Unicode support was introduced to SQL Server starting with SQL Server 7.0. Microsoft SQL Server allows the storage of Unicode data only in UCS-2 encoding.
In Microsoft SQL Server, these data types support Unicode data:
nchar nvarchar ntext Note: The n prefix for these data types comes from the SQL-92 standard for National (Unicode) data types. Use of nchar, nvarchar, and ntext is the same as char, varchar and text, respectively, except that:
N: N'A Unicode string' It returns the integer value, as defined by the Unicode standard, for the first character of the input expression. The Unicode function has the following syntax:
UNICODE ( 'ncharacter_expression' )
The 'ncharacter_expression' is an nchar or nvarchar expression and the return type is int.
The following options are available to store UTF-8 encoded data in SQL Server.
Sample code for this type of conversion is located at the Unicode Consortium's site. A high-level description of the algorithm to convert UCS-2 to UTF-8 can be found in the Internet Request For Comments document RFC2279.
On Windows NT or Windows 2000, you may use the Win32 functions MultiByteToWideChar and WideCharToMultiByte to convert UTF-8 to and from UCS-2 by passing the constant CP_UTF8 (65001) as the first parameter to the functions.
Store the actual UTF-8 data on the server using the BINARY/VARBINARY/IMAGE columns. Storing UTF-8 data on SQL Server means that you can not use SQL Server to sort or find ranges of these values as if the data were valid character data. The types of operations on columns containing UTF-8 data that would not return expected results include ORDER BY, greater-than ">" and less-than "<" comparisons, and the built-in SQL Server string manipulation functions such as SUBSTRING().
However, equality comparisons will work as long as the strings being compared are equivalent at a byte level. Note that if you store UTF-8 data in SQL Server, you should not use character columns (CHAR/NCHAR/VARCHAR and so forth). UTF-8 is not valid character data to SQL Server and by storing non-character data in character columns, you risk encountering problems such as the issues discussed in the following Microsoft Knowledge Base articles:
155723 INF: SQL Server Truncation of a DBCS String
234748 PRB: SQL Server ODBC Driver Converts Language Events to Unicode
If you are considering this option, keep in mind that if you ever need to access UTF-8 data stored inside SQL Server from any application other than a Web browser (for example, from a non-Web-based ODBC application), you will need to do a conversion from UTF-8 to UCS-2 within this application. This is because ODBC, OLEDB, COM, Win32 API calls, VB, and C runtime string manipulation functions do not work with UTF-8 data. This moves the burden of translation to a different application.
| You must Sign In to use this message board. | |||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||
General
News
Question
Answer
Joke
Rant
Admin
Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads.
|
PermaLink |
Privacy |
Terms of Use
Last Updated: 25 Oct 2007 Editor: |
Copyright 2007 by josekonoor Everything else Copyright © CodeProject, 1999-2010 Web09 | Advertise on the Code Project |