Posted 25 Mar 2009
Posting Unicode Characters via AJAX

When the website only speaks Latin-1, how to make AJAX post like a real browser?


This article shows a very specific technique for submitting Unicode text via AJAX in the same manner as the web browser does it. If you thought that was already the case, you need to read this article.

It will not explain standard AJAX usage, since you can get that anywhere.


I frequent an old web forum that serves its pages with a charset of "iso-8859-1", otherwise known as Latin-1. It allows people all over the world to post messages using whatever language they want, but their HTML form is pure vanilla. There is a textbox and a submit button. Everything else is up to you.

It does accept a certain subset of HTML codes, but you have to type those in by hand, and mistakes are big trouble. Is this good for the site? Definitely not, but abuses are rare and almost always by mistake. The users are, by and large, decent sorts.

The site owner is not interested in changing things, so I decided to take matters into my own hands via Opera's UserJavaScript and Firefox's Greasemonkey plugins. In short, I wrote my own code to take user submissions and post them to the site, but my JavaScript forms include lots of little formatting buttons and safety checks.

Plus, it submitted via AJAX so I wouldn't have to reload the page after every comment. When your forum has 400+ posts with images and embedded videos, you really don't want to reload after every "funny pic, man" comment.

After much wrangling, I got it working pretty smoothly in both browsers. That lasted right up until the day I tried to post some Japanese text. Anybody know what "mojibake" means? Garbage characters. But why? I knew I could post the very same text in the regular browser form and it would display perfectly, so why was AJAX getting hosed? Every web reference I'd seen to that point said that AJAX worked just like the browser!

Using the Code

I wrote this code in a user JavaScript file, which is pure JavaScript with no HTML or CSS or other languages mixed in (save as embedded in JavaScript), and it ends with ".js". You can of course use the code anywhere you think it fits, but it was only my special case that led me to find out that a problem even exists, and later how to fix it.

We will assume an extremely simple HTML form as shown below:

<form method="post" action="">
  <textarea id="userTxt"></textarea>
  <input type="submit" value="Submit">
</form>

As I said, very simple. This is pretty much what that web forum uses. If you type Japanese text (e.g., "ハロー、ワールド!") in Opera/Firefox/IE/Safari/whatever browser, your comment is submitted and comes back on page reload exactly as you typed it.

Now let's write a JavaScript function to submit that same form via AJAX. There are a ton of details that are out-of-scope here, so they will be glossed over or omitted. First, assume the form has been altered to call JavaScript instead of submitting directly:

<form method="post">
  <textarea id="userTxt"></textarea>
  <input type="submit" value="Submit" onClick="sendByAJAX(); return false;">
</form>

I won't bother writing a perfect, cross-browser compatible AJAX function. Just pretend Internet Explorer follows the standards for this part, or adjust it on your own time. :-)

function sendByAJAX() {
   // get the user text and make it safe for HTTP transmission
   var userTxt = encodeURIComponent( document.getElementById('userTxt').value );
   // create the AJAX object
   var xmlhttp = new XMLHttpRequest();
   // assume successful response -- do NOT actually make this assumption in real code
   xmlhttp.onreadystatechange = function() {
      if (xmlhttp.readyState==4 && xmlhttp.status>=200 && xmlhttp.status<300) {
         // You'll probably want to do something more meaningful than an alert dialog
         alert('POST Reply returned: status=[' + xmlhttp.status +
               ' ' + xmlhttp.statusText + ']\n\nPage data:\n' + xmlhttp.responseText);
      }
   };'POST', '');
   // here we are overriding the default AJAX type,
   // which is UTF-8 -- this probably seems like a stupid thing to do
   xmlhttp.setRequestHeader('Content-Type',
      'application/x-www-form-urlencoded; charset=ISO-8859-1;');
   xmlhttp.setRequestHeader('User-agent', 'Mozilla/4.0 (compatible) Naruki');
   // the field name must match whatever the site's form actually posts
   xmlhttp.send('userTxt=' + userTxt);
}

The above code will fail when you add Japanese text, Russian script, special math symbols, etc. Well, technically it will succeed and no errors will be thrown, but you will get back garbage characters from the server.

The reason is Unicode. In fact, if you force your browser to change the page encoding to Unicode, your characters will now be readable. They aren't truly garbage characters at all, but the site-specified page encoding conflicts with the user supplied text, and there's nothing you can do about... Oh, wait, there is.
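A quick console experiment shows the mismatch. encodeURIComponent() always percent-encodes non-ASCII characters as their UTF-8 bytes, so a server decoding the body as Latin-1 sees several unrelated characters where one katakana should be. (The codes below are standard Unicode/UTF-8 facts, not anything site-specific.)

```javascript
// encodeURIComponent() emits the UTF-8 bytes of a character,
// no matter what charset the page or the server expects.
var encoded = encodeURIComponent("ハ");   // katakana HA, U+30CF
console.log(encoded);                     // "%E3%83%8F" -- three UTF-8 bytes
// A Latin-1 server decodes 0xE3 0x83 0x8F as three separate
// characters instead of one: instant mojibake.
```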

I knew something was funny because the browser itself could post the same exact text and it came back properly, without forcing a new encoding via browser settings. The trick is one that Google could not tell me, but a very helpful Opera employee could.

When a browser submits a form and the text contains characters that the page's charset cannot represent, it silently converts those characters to HTML numeric entities before posting.
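In other words, the request body the browser actually sends is plain ASCII. A sketch of the wire format (the field name userTxt is illustrative):

```javascript
// The browser turns "ハ" (U+30CF, decimal 12495) into the numeric
// entity "&#12495;", which is pure ASCII and survives any charset.
var ch = "ハ";
var entity = "&#" + ch.charCodeAt(0) + ";";          // "&#12495;"
var body = "userTxt=" + encodeURIComponent(entity);
console.log(body);   // "userTxt=%26%2312495%3B"
```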

As you may or may not know, the Unicode standard can hold 1,114,112 characters, which is just about enough to cover everything we humans will need. Code points up to 65,535 (0xFFFF) have 1-to-1 mappings to HTML numeric entities. Technically all the characters do, but there's a gotcha, so forget about that for now.

What does that mean to me? Well, suppose you have the burning desire to include some funky math characters in your post. You look them up in the Unicode books and find their number, but you cannot find an HTML named entity to match them. Say Unicode character 0x24EA, for example. How can you post that? Well, simply type in the numeric entity reference &#x24EA; and you're golden.

Since the characters in an HTML entity are all standard Latin-1 characters, there is no special encoding needed to transmit that text. But if I type Japanese text using my IME (Input Method Editor), it won't be generating HTML entities. It will be putting the Unicode characters into the form directly.

The browser knows how to pre-parse this, so let's teach our JavaScript the same thing. We'll need to add a new function, and then change the line where we read the userTxt field.

   // get the user text and make it safe for HTTP transmission
   var userTxt = encodeURIComponent(
         uni2ent( document.getElementById('userTxt').value ) );

function uni2ent1stTry(snippet) {
  var uSnip = '';
  for (var c=0, val; val = snippet.charCodeAt(c); c++) {
    if (val < 256) {
      uSnip += snippet.charAt(c);
    }
    else {
      uSnip += "&#" + val + ";";
    }
  }
  return uSnip;
}

The function uni2ent() [note the 1st Try in the code sample - it's not ready yet] parses the text character by character using JavaScript's built-in string function charCodeAt(). This gives the Unicode value of that character. 

If the value is below 256, then it is safe to use the raw character as-is, since Unicode and Latin-1 use the same set at that point. If the value is higher, then you need to make it an HTML numeric entity. The entities can be written in base 10 (&#21494;) or in base 16 (&#x53F6;).

JavaScript itself uses Unicode, not Latin-1, and you should be aware of this when dealing with interoperability issues. 
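You can verify both points from the console: charCodeAt() hands back Unicode values, and the decimal and hex entity forms are just two spellings of the same number.

```javascript
// charCodeAt() returns the Unicode code unit, not a Latin-1 byte
var leaf = "叶";                        // U+53F6
console.log(leaf.charCodeAt(0));        // 21494
console.log((21494).toString(16));      // "53f6", so &#21494; == &#x53F6;
// for code points below 256, Unicode and Latin-1 agree
console.log("A".charCodeAt(0));         // 65
```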

Now for the gotcha! The function is almost perfect, and indeed most people will never see a problem. But it's there, and it's called Surrogate Pairs.

Remember how I said characters up to 65,535 have 1-to-1 mappings to HTML numeric entities? In hex, that is 0xFFFF, which fits in two bytes. Once you go above that, UTF-16 needs a second two-byte code unit to make a single character. JavaScript and most Unicode-aware programs use UTF-16 internally, so they need those extra bytes for higher-numbered characters.
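JavaScript shows this directly: one astral-plane character reports a length of 2, and charCodeAt() returns the two surrogate halves rather than the real code point.

```javascript
// U+1D306 (TETRAGRAM FOR CENTRE) is one character, but two
// UTF-16 code units as far as JavaScript is concerned
var tetra = "\uD834\uDF06";
console.log(tetra.length);                      // 2
console.log(tetra.charCodeAt(0).toString(16));  // "d834" (high surrogate)
console.log(tetra.charCodeAt(1).toString(16));  // "df06" (low surrogate)
```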

There is a lot of web discussion available on this, but it's hard to find until you know the magic phrase "Surrogate Pairs". Essentially, when a Unicode character must be represented as surrogate pairs, you use a tricky little formula to go from the Unicode number to two UTF-16 numbers. It's complicated, and you need to put your brain to it for a while before it begins to make sense. 
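The encoding direction of that formula (real code point to surrogate pair) looks like this; the uni2ent() functions below just run the same arithmetic in reverse. The helper name toSurrogatePair is mine, not part of the article's code:

```javascript
// Split a code point above 0xFFFF into a UTF-16 surrogate pair
function toSurrogatePair(codePoint) {
  var v = codePoint - 0x10000;     // 20 bits remain after subtracting
  var hi = 0xD800 + (v >> 10);     // top 10 bits -> high surrogate
  var lo = 0xDC00 + (v & 0x3FF);   // bottom 10 bits -> low surrogate
  return [hi, lo];
}
var pair = toSurrogatePair(0x1D306);
console.log(pair[0].toString(16), pair[1].toString(16)); // "d834" "df06"
```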

I used two primary references that helped me finally crack this last nut. The first was Wikipedia's example UTF-16 encoding procedure. This went in the opposite direction I wanted to go, so I reverse engineered it into the following: 

function uni2ent2ndTry(srcTxt) {
   var entTxt = '';
   var c, hi, lo;
   var len = 0;
   for (var i=0, code; code=srcTxt.charCodeAt(i); i++) {
      // need to convert to HTML entity
      if (code > 255) {
         // values in this range are surrogate pairs
         if (0xD800 <= code && code <= 0xDBFF) {
            hi = code;
            lo = srcTxt.charCodeAt(i+1);
            lo &= 0x03FF;
            hi &= 0x03FF;
            hi = hi << 10;
            code = (lo + hi) + 0x10000;
            i++; // skip the low surrogate we just consumed
         }
         // wrap it up as a Hex entity
         c = "&#x" + code.toString(16).toUpperCase() + ";";
      }
      // smaller values can be used raw
      else {
         c = srcTxt.charAt(i);
      }
      entTxt += c;
   }
   return entTxt;
}

Don't get too fond of that function. It has no error checking, it assumes the input is perfect, and bit-shifting still gives me the willies.

Later, I discovered that Mozilla had published an interesting example of how to get whole Unicode characters from a surrogate pair, which was exactly what I needed! A few tweaks and...

function uni2ent(srcTxt) {
  var entTxt = '';
  var c, hi, lo;
  var len = 0;
  for (var i=0, code; code=srcTxt.charCodeAt(i); i++) {
    var rawChar = srcTxt.charAt(i);
    // needs to be an HTML entity
    if (code > 255) {
      // normally we encounter the High surrogate first
      if (0xD800 <= code && code <= 0xDBFF) {
        hi = code;
        lo = srcTxt.charCodeAt(i+1);
        // the next line will bend your mind a bit
        code = ((hi - 0xD800) * 0x400) + (lo - 0xDC00) + 0x10000;
        i++; // we already got low surrogate, so don't grab it again
      }
      // what happens if we get the low surrogate first?
      else if (0xDC00 <= code && code <= 0xDFFF) {
        hi = srcTxt.charCodeAt(i-1);
        lo = code;
        code = ((hi - 0xD800) * 0x400) + (lo - 0xDC00) + 0x10000;
      }
      // wrap it up as a Hex entity
      c = "&#x" + code.toString(16).toUpperCase() + ";";
    }
    else {
      c = rawChar;
    }
    entTxt += c;
  }
  return entTxt;
}

And that's the final version! You really should try to understand it, because I want other people to be as miserable as I was. But if you have any questions, don't look at me. I haven't got a clue.

If you looked at the Mozilla example carefully, you'll note I ditched the error checking, because I'm a bad programmer. Do what I say, not what I do.

Points of Interest

This is my first article here, so I'm sure I'll be making lots of edits (and getting lots of bug reports). Please bear with me.

I've not done much in the way of formal testing, so if you note any heinous errors or misstatements, please let me know. Thanks.


  • 26th March, 2009: Initial post


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

