Posted 22 Oct 2013
Licenced CPOL

Simple Character Encoding Detection

Detecting character encoding in just 4 lines of code

Introduction

A very common programming question is how to detect the character encoding of a string. I'm going to share a method I came up with that can tell whether a string is UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE in just 4 lines of code.

Explanation

We'll be working with null-terminated strings, so the first rule is that every string must end with a quadruple null (four zero bytes), regardless of encoding. A C string literal already ends with one null, so appending three more is enough. You may wish to add a definition such as the following:

#define NT "\0\0\0" 
 
char *exampleString = "This is UTF-8" NT; 
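Strings in the other encodings have to be spelled out byte by byte to carry the same terminator. The sketch below is only an illustration: the array names and the `terminated_length` helper are hypothetical and not part of the original tip. It also shows a side effect worth knowing: the scan stops at the first run of four nulls, which in a little-endian string starts inside the last character.

```c
#include <stddef.h>

#define NT "\0\0\0"  /* three explicit nulls; the literal's own terminator is the fourth */

/* The text "Hi" written out byte by byte in each encoding (illustrative names). */
const char hi_utf8[]    = "Hi" NT;
const char hi_utf16be[] = "\0H\0i" NT;
const char hi_utf16le[] = "H\0i\0" NT;
const char hi_utf32be[] = "\0\0\0H\0\0\0i" NT;
const char hi_utf32le[] = "H\0\0\0i\0\0\0" NT;

/* Hypothetical helper: number of bytes scanned before four consecutive
   nulls are found, using the same stop condition as the detector. */
size_t terminated_length(const char *s)
{
    size_t i = 0;
    while (s[i] | s[i + 1] | s[i + 2] | s[i + 3])
        i++;
    return i;
}
```

For example, `terminated_length(hi_utf16le)` is 3 rather than 4, because the high byte of the final `i` merges with the terminator.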

Next is an explanation of how the checking works.

1.===== If a string doesn't contain nulls, it's UTF-8
 :
else
 :
2:===== If a string doesn't contain double nulls, it's UTF-16
 :--.
 : 3:== If the nulls are on odd numbered indices, it's UTF-16LE
 :  :
 : else
 :  :
 : 4'== The string defaults to UTF-16BE
 :
else
 :
5:===== If the index modulo 4 is 0 and the byte is greater than
 :      0x7F, the string is UTF-32LE. This is because UTF-32 values
 :      never exceed 0x7FFFFFFF (Unicode itself stops at 0x10FFFF),
 :      so the first byte of a big-endian unit is never above 0x7F.
 :      Only some characters (approximately 22% by my estimate)
 :      pass this test, but a BOM always does.
 :
else
 :
6'===== The string defaults to UTF-32BE 
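The tree above can also be written out one step at a time. This is a readable sketch rather than the compact version from this tip: the name `detect_encoding_verbose` and the flag variables are made up for illustration, and it assumes the input ends with four null bytes.

```c
#include <stddef.h>

/* Hypothetical, step-by-step version of the decision tree.
   Returns 0 = UTF-8, 1 = UTF-16BE, 2 = UTF-16LE, 3 = UTF-32BE, 4 = UTF-32LE.
   Assumes `s` ends with four null bytes. */
int detect_encoding_verbose(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int has_null = 0, has_double_null = 0;
    int null_at_odd = 0, high_byte_at_unit_start = 0;
    size_t i;

    for (i = 0; p[i] | p[i + 1] | p[i + 2] | p[i + 3]; i++) {
        if (p[i] == 0) {
            has_null = 1;
            if (i % 2 == 1)
                null_at_odd = 1;          /* step 3 evidence */
            if (p[i + 1] == 0)
                has_double_null = 1;      /* step 2 evidence */
        } else if (i % 4 == 0 && p[i] > 0x7F) {
            high_byte_at_unit_start = 1;  /* step 5 evidence */
        }
    }

    if (!has_null)
        return 0;                         /* step 1: UTF-8 */
    if (!has_double_null)
        return null_at_odd ? 2 : 1;       /* steps 3 and 4: UTF-16LE or UTF-16BE */
    return high_byte_at_unit_start ? 4 : 3; /* steps 5 and 6: UTF-32LE or UTF-32BE */
}
```

Keeping the evidence in separate variables makes it easier to see that the final answer depends only on which kinds of nulls were seen, not on the order in which they appeared.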

The Code

We check every byte until we reach a quadruple null:

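/* Returns 0..4 (see the output table). The string must end with four nulls. */
int String_GetEncoding(const char *string)
  {
    unsigned c, i = 0, flags = 0;
    while (string[i] | string[i + 1] | string[i + 2] | string[i + 3])
      /* Non-null byte: bit 3 if it is > 0x7F at an index divisible by 4.
         Null byte: bit 0, plus bit 1 at an odd index, plus bit 2 if the
         next byte is null as well. */
      flags |= (c = (unsigned char)string[i++])
        ? (!((i - 1) % 4) && c > 0x7F) << 3
        : 1 | (!(i & 1) << 1) | ((string[i] == 0) << 2);
    /* No nulls: UTF-8. No double nulls: UTF-16. Otherwise: UTF-32. */
    return !(flags & 1) ? 0
      : !(flags & 4) ? 1 + ((flags & 2) != 0)
      : 3 + ((flags & 8) != 0);
  }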

The output:

0  = UTF-8
1  = UTF-16BE
2  = UTF-16LE
3  = UTF-32BE
4  = UTF-32LE   
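Bare numbers are easy to mix up at the call site, so it may help to give the five results names. The following is a small sketch under that assumption; `enum Encoding`, `Encoding_Name`, and `Encoding_UnitSize` are hypothetical names that do not appear in the original code.

```c
/* Hypothetical names for the five return values listed above. */
enum Encoding { ENC_UTF8 = 0, ENC_UTF16BE, ENC_UTF16LE, ENC_UTF32BE, ENC_UTF32LE };

/* Human-readable label for a result code, or "unknown" if out of range. */
const char *Encoding_Name(int code)
{
    static const char *const names[] = {
        "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE"
    };
    return (code >= ENC_UTF8 && code <= ENC_UTF32LE) ? names[code] : "unknown";
}

/* Width of one code unit in bytes: 1, 2 or 4 (0 for an unknown code). */
int Encoding_UnitSize(int code)
{
    switch (code) {
    case ENC_UTF8:    return 1;
    case ENC_UTF16BE:
    case ENC_UTF16LE: return 2;
    case ENC_UTF32BE:
    case ENC_UTF32LE: return 4;
    default:          return 0;
    }
}
```

The enumerators keep the same numeric values as the table, so a result from the detector can be passed straight to either helper.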

Notes

Since UTF-32 text contains several null bytes per character, its byte order is checked with a different method that doesn't work 100% of the time. For example, if every character is within the ASCII range and there is no BOM, the function returns UTF-32BE even when the string is actually UTF-32LE.

This isn't really a big issue, since UTF-32 is rarely used for storage, so anyone who does use it will most likely already know the byte order without having to check. However, if you want to be thorough, you could perform an additional check by treating a UTF-32BE result as UTF-16 and determining that string's byte ordering.
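One possible reading of that extra check is to look at where the non-null bytes sit inside each 4-byte unit: ASCII text in UTF-32LE puts them first, UTF-32BE puts them last. The function below is a sketch of that interpretation only. The name `refine_utf32_byte_order` is hypothetical, and it assumes the string was already classified as UTF-32 and ends with four null bytes.

```c
#include <stddef.h>

/* Hypothetical follow-up for a string already classified as UTF-32.
   Counts non-null bytes at the first and last position of each 4-byte
   unit. Returns 4 (UTF-32LE) if the first position has more of them,
   otherwise 3 (UTF-32BE). */
int refine_utf32_byte_order(const char *s)
{
    size_t first = 0, last = 0, i = 0;

    while (s[i] | s[i + 1] | s[i + 2] | s[i + 3]) {
        if (s[i] != 0) {
            if (i % 4 == 0)
                first++;
            else if (i % 4 == 3)
                last++;
        }
        i++;
    }
    return first > last ? 4 : 3;
}
```

Counting instead of stopping at the first hit keeps the guess stable when a few characters outside the ASCII range put non-null bytes in both positions.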

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Ghosuwa Wogomon
United States

Comments and Discussions

 
This is interesting!
Chao Sun, 23-Oct-13 20:34

Last Updated 23 Oct 2013
Article Copyright 2013 by Ghosuwa Wogomon