Full HTML Character Encoding in C#

28 Aug 2007 | CPOL | 2 min read
This article shows how to take a String object and encode it as HTML using Unicode character entities for extended characters.

Introduction

This is my first CodeProject article, a short and sweet answer to a problem I couldn't find the solution to anywhere else.

Background

It is not terribly difficult to find a way to take a string containing HTML and "unescape" it to its proper characters (in fact HttpUtility.HtmlDecode() works perfectly). However, it's apparently another matter to take a complex string and escape it for HTML. At least, HttpUtility.HtmlEncode() doesn't fit the bill, as it only encodes the bare minimum of characters, ignoring such common things as curly quotes.

That said, of course, there's probably a better solution out there, so please direct me to it if you know of it, and I'll kindly remove or update this article accordingly.

Using the Code

The logic is quite simple. An HTML document that contains only standard ASCII characters displays correctly regardless of which ASCII-compatible character encoding it is saved or served with, so extended characters are instead represented by named or numbered entity references. For example, the common ampersand (&) symbol is "escaped" as &amp;. You can also use numbered entities; for example, &#174; results in the registered (®) symbol.
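To see that the named and numbered forms refer to the same characters, both can be decoded back. This is only a minimal sketch: it assumes System.Net.WebUtility.HtmlDecode() as a stand-in for HttpUtility.HtmlDecode() so that no reference to System.Web is needed, and the class and member names are made up for illustration.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the article's code.
public static class EntityDemo {
    // The same two characters (ampersand, registered sign) written both ways.
    public const string Named = "&amp; &reg;";
    public const string Numbered = "&#38; &#174;";

    // Turns entity references back into the characters they stand for.
    public static string Decode( string html ) {
        return WebUtility.HtmlDecode( html );
    }
}
```

Calling EntityDemo.Decode() on either constant should give the same two-character result, an ampersand and a registered sign separated by a space.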

Anyhow, the idea is to figure out what these numerical values are and manually encode extended characters (characters beyond 127, which are not standard ASCII) using the numbered entity format above.

The trick (hack) I discovered is simply to take a char and pass it to Convert.ToInt32(), which returns the character's numeric value, apparently its Unicode* value. We can then convert that number to a string and wrap it with the &# and ; characters.

* See Points of Interest below.
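As a small illustration of that step on its own, the sketch below converts a single char into a numbered entity. The class and method names are hypothetical and exist only for this example.

```csharp
using System;

// Hypothetical helper, not part of the article's code.
public static class NumberedEntity {
    // Wraps the numeric value of one char in the &#...; format.
    public static string FromChar( char c ) {
        int value = Convert.ToInt32( c ); // same number as the cast (int)c
        return "&#" + value + ";";
    }
}
```

For instance, NumberedEntity.FromChar( '\u201C' ) (a left curly quote) gives &#8220;, and the registered sign '\u00AE' gives &#174;.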

Here is the resulting method:

C#
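using System;
using System.Text;
using System.Web; // requires a reference to System.Web.dll

public static string HtmlEncode( string text ) {
    if ( text == null )
        return null;

    // Let the framework escape the markup characters (&, <, >, ") first.
    string encoded = HttpUtility.HtmlEncode( text );
    // Reserve roughly 10% extra capacity for the entities appended below.
    StringBuilder result = new StringBuilder( encoded.Length + (int)( encoded.Length * 0.1 ) );

    foreach ( char c in encoded ) {
        int value = Convert.ToInt32( c ); // numeric value of the char
        if ( value > 127 )
            result.AppendFormat( "&#{0};", value ); // extended character: numbered entity
        else
            result.Append( c );
    }

    return result.ToString();
}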

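Note that we call the built-in System.Web.HttpUtility.HtmlEncode() first. This is a prerequisite, since it covers the characters that have special meaning in HTML markup, such as the ampersand (&) and the less-than/greater-than symbols (<, >). It also has to run before our own loop: if it ran afterwards, it would escape the ampersands of the numbered entities we had just generated.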
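The effect of the call order can be shown with a short sketch. It assumes System.Net.WebUtility.HtmlEncode() in place of HttpUtility.HtmlEncode() so that it runs without System.Web; the class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical demo class, not part of the article's code.
public static class EncodeOrderDemo {
    // Replaces every char above 127 with a numbered entity.
    public static string EscapeNonAscii( string text ) {
        StringBuilder result = new StringBuilder( text.Length );
        foreach ( char c in text ) {
            if ( c > 127 )
                result.Append( "&#" ).Append( (int)c ).Append( ';' );
            else
                result.Append( c );
        }
        return result.ToString();
    }

    // Order used in the article: markup characters first, then extended characters.
    public static string MarkupFirst( string text ) {
        return EscapeNonAscii( WebUtility.HtmlEncode( text ) );
    }

    // Reversed order: the ampersands of the new entities are escaped a second time.
    public static string MarkupLast( string text ) {
        return WebUtility.HtmlEncode( EscapeNonAscii( text ) );
    }
}
```

For the input “Fish & Chips” (with curly quotes), MarkupFirst() produces &#8220;Fish &amp; Chips&#8221;, while MarkupLast() produces &amp;#8220;Fish &amp; Chips&amp;#8221;, which a browser would display as literal entity text rather than as quotes.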

Points of Interest

This code is admittedly very hackish. However, my only point of concern is whether the numerical representation of each character is in fact its standard Unicode value. In my tests I found this to be the case, but I would like to guarantee it. If anyone knows the answer, please post a message in the comments and I'll update the article.
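On that question: a C# char is a 16-bit UTF-16 code unit, so for characters in the Basic Multilingual Plane its numeric value is the Unicode code point. Characters beyond U+FFFF (many emoji, for example) are stored as a surrogate pair of two chars, and a per-char loop would emit two separate numbers for them. The sketch below is one possible way to combine such pairs using char.ConvertToUtf32(); the class and method names are hypothetical, and this is a variation on the article's loop rather than the original method.

```csharp
using System;
using System.Text;

// Hypothetical variation, not part of the article's code.
public static class CodePointEncoder {
    // Like the per-char loop, but merges a surrogate pair into one code point.
    public static string EscapeNonAscii( string text ) {
        StringBuilder result = new StringBuilder( text.Length );
        for ( int i = 0; i < text.Length; i++ ) {
            char c = text[i];
            if ( char.IsHighSurrogate( c ) && i + 1 < text.Length && char.IsLowSurrogate( text[i + 1] ) ) {
                // Two UTF-16 code units together encode one character above U+FFFF.
                result.Append( "&#" ).Append( char.ConvertToUtf32( c, text[i + 1] ) ).Append( ';' );
                i++; // skip the low surrogate that was just consumed
            }
            else if ( c > 127 )
                result.Append( "&#" ).Append( (int)c ).Append( ';' );
            else
                result.Append( c );
        }
        return result.ToString();
    }
}
```

With this version, the string "\uD83D\uDE00" (U+1F600) becomes the single entity &#128512; instead of &#55357;&#56832;. An unpaired surrogate still falls through to the per-char branch, so input that may contain malformed UTF-16 would need extra handling.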

History

  • Aug 28, 2007 - First post
  • Aug 29, 2007 - Thanks to J4amieC for the StringBuilder optimization. I decided to start it off with a buffer of 10% on top of the passed string length, which should be more than enough for typical usage.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
President, The Little Software Company
Canada
My name is Logan Murray and I'm a Canadian. I'm interested primarily in C# and Windows desktop application development (learning WPF at the moment and then hopefully LINQ), though I hope to branch out more to the web side of things eventually. I am the president and owner of The Little Software Company and am currently working on the Windows version of OrangeNote, to be released soon. Check out my RSS reader, FeedBeast™.

Comments and Discussions

 
Convert.ToInt32()
aprenot, 29-Aug-07 6:47

Re: Convert.ToInt32()
chaiguy1337, 29-Aug-07 9:49

Good to know. Do you think there's any overhead to calling Convert.ToInt32()? It is a static method after all.

Also, this still doesn't do anything to guarantee that the value is, in fact, the standard Unicode value for the character (unless this is stated somewhere in documentation that I am unaware of).