Introduction
This is my first CodeProject article, a short and sweet answer to a problem I couldn't find the solution to anywhere else.
Background
It is not terribly difficult to find a way to take a string containing HTML and "unescape" it to its proper characters (in fact, HttpUtility.HtmlDecode() works perfectly). However, it's apparently another matter to take a complex string and escape it for HTML. At least, HttpUtility.HtmlEncode() doesn't fit the bill, as it only encodes the bare minimum of characters, ignoring such common things as curly quotes.
That said, there's probably a better solution out there, so please direct me to it if you know of one, and I'll gladly remove or update this article accordingly.
Using the Code
The logic is quite simple. Markup that contains only standard ASCII characters survives transmission no matter which character encoding the page ends up being served with, so extended characters are commonly represented by named or numbered entity references. For example, the common ampersand (&) symbol is "escaped" as &amp;. You can also use numbered entities; for example, &#174; results in the registered (®) symbol.
Anyhow, the idea is to figure out what these numerical values are and manually encode extended characters (characters beyond 127, which are not standard ASCII) using the numbered entity format above.
The trick (hack) I discovered is simply to take a char and pass it to Convert.ToInt32(), which results in an apparently Unicode* value, which we can then convert to a string and wrap with the &# and ; characters.
* See Points of Interest below.
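To make the trick concrete, here is a minimal sketch of the single-character step on its own. The class and method names (EntityDemo, ToNumericEntity) are made up for this illustration and are not part of the article's code or of the .NET Framework.

```csharp
using System;

public static class EntityDemo
{
    // Hypothetical helper: formats one char as a decimal numbered entity.
    public static string ToNumericEntity(char c)
    {
        // Convert.ToInt32(char) yields the same number as the cast (int)c.
        int value = Convert.ToInt32(c);
        return "&#" + value + ";";
    }

    public static void Main()
    {
        Console.WriteLine(ToNumericEntity('\u00AE')); // registered sign: &#174;
        Console.WriteLine(ToNumericEntity('\u201C')); // left curly quote: &#8220;
    }
}
```

The characters are written as \u escapes so the result does not depend on the encoding of the source file.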
Here is the resulting method:
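// Requires: using System; using System.Text; using System.Web;
public static string HtmlEncode( string text ) {
    // HttpUtility.HtmlEncode() returns null for null input, so pass that through.
    if ( text == null )
        return null;
    // Escape the markup-significant characters (&, <, >, ") first.
    char[] chars = HttpUtility.HtmlEncode( text ).ToCharArray();
    // Start with a 10% buffer on top of the input length.
    StringBuilder result = new StringBuilder( text.Length + (int)( text.Length * 0.1 ) );
    foreach ( char c in chars ) {
        int value = Convert.ToInt32( c );
        if ( value > 127 )
            // Beyond standard ASCII: write a numbered entity such as &#8220;
            result.AppendFormat( "&#{0};", value );
        else
            result.Append( c );
    }
    return result.ToString();
}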
Note that we call the built-in System.Web.HttpUtility.HtmlEncode() beforehand. This is a prerequisite, since it covers the encoding of the ASCII characters that have special meaning in markup, such as the ampersand (&) and the less-than/greater-than symbols (<, >).
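The order of the two passes matters, and a small sketch shows why. To keep it runnable without a reference to System.Web, this example substitutes System.Net.WebUtility.HtmlEncode() for HttpUtility.HtmlEncode(); that swap is an assumption made for the demo, and the helper name EncodeNonAscii is hypothetical.

```csharp
using System;
using System.Net;
using System.Text;

public static class OrderDemo
{
    // Hypothetical helper: replaces every char above 127 with a numbered entity.
    public static string EncodeNonAscii(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (c > 127)
                sb.Append("&#").Append((int)c).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string input = "a < \u201Cb\u201D";

        // Built-in encoder first, numbered entities second (the article's order).
        string right = EncodeNonAscii(WebUtility.HtmlEncode(input));

        // Reversed order: the built-in encoder escapes the '&' of each entity again.
        string wrong = WebUtility.HtmlEncode(EncodeNonAscii(input));

        Console.WriteLine(right); // a &lt; &#8220;b&#8221;
        Console.WriteLine(wrong); // a &lt; &amp;#8220;b&amp;#8221;
    }
}
```

In the reversed order a browser would display the literal text &#8220; instead of a curly quote, which is why the built-in call has to come first.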
Points of Interest
This code is admittedly very hackish. My only real concern with it is whether the numerical representation of each character is in fact its standard Unicode value. In my tests I found this to be the case, but I would like to guarantee it. If anyone knows the answer to this, please post a message in the comments and I'll update the article.
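On the Unicode question: a .NET char is a single UTF-16 code unit, so for any character in the Basic Multilingual Plane (which includes curly quotes and ®) the integer value is the Unicode code point. Characters above U+FFFF are stored as two chars (a surrogate pair), and encoding each half separately would produce two entities that do not name a real character. The following is a sketch of one way to handle that case, not a drop-in replacement for the method above; the class and method names are hypothetical, and it leaves out the HttpUtility.HtmlEncode() pass for brevity.

```csharp
using System;
using System.Text;

public static class CodePointDemo
{
    // Hypothetical helper: writes numbered entities using full code points.
    public static string EncodeNonAscii(string s)
    {
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                // Combine the pair into one code point above U+FFFF.
                int codePoint = char.ConvertToUtf32(c, s[i + 1]);
                sb.Append("&#").Append(codePoint).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else if (c > 127)
            {
                // BMP character: the char value is the code point.
                // An unpaired surrogate also lands here and is written as-is.
                sb.Append("&#").Append((int)c).Append(';');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EncodeNonAscii("\u201Chi\u201D")); // &#8220;hi&#8221;
        Console.WriteLine(EncodeNonAscii("\U0001F600"));     // &#128512;
    }
}
```

For typical Western text the simpler per-char loop gives the same output; the difference only shows up for supplementary characters such as emoji.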
History
- Aug 28, 2007 - First post
- Aug 29, 2007 - Thanks to J4amieC for the StringBuilder optimization. I decided to start it off with a buffer of 10% on top of the passed string length, which should be more than enough for typical usage.