Posted 20 Dec 2008
Licenced CPOL

TypograFix: A Tool for Typographic HTML Beautification

Dmitri Nesteruk, 20 Dec 2008
Presents a script/tool for typographic HTML reprocessing


One of the problems with authoring HTML content – such as CodeProject articles – is the immediate loss of the typographic fidelity found in paper-based publishing. Supporting advanced typographic features is probably impossible without significantly altering the HTML standard, but some things are difficult to do even in existing HTML – such as getting the correct dashes and quotes. In this article, I want to show a general-purpose solution to this problem.

The best way to understand what this article is about is to download and run the sample app (requires .NET 3.5). Type some HTML into the top-left window and watch the other windows update.


Many typographic features in HTML are achieved using special entities. For example, the ampersand (&) is made using the &amp; entity. Typing these entities is inconvenient, and I doubt anyone (except professional Web designers) remembers them by heart. Let’s take a look at what these entities are used for.

  • Correct ‘single quotes’ and “double quotes” (instead of ugly vertical ones)

  • Proper en and em dashes (one – two — three)

  • Copyright signs – ©, ®, ™

  • The ‘times’ sign instead of the letter ‘x’ – 10×

  • Arrows ← here → there

  • Ellipsis (one of those hard-to-see features) …

  • Vertical ellipsis – something that’s useful for presenting code

    some code
    ⋮
    more code

    CodeProject has trouble rendering these.

  • Numeral superscripts – 1st 2nd 3rd 28,374th

  • Ordinals 1, 2, and 3… not that useful, but then you can write x³ + x² + x + 1 = 0

So basically I want to easily add these features into my CP articles (and other HTML documents) without having to remember any HTML entity names.
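As a rough illustration of what has to be remembered, the features listed above map to standard HTML named entities. The sketch below is in Python rather than the tool's own .NET code, and the choice of entity names in `ENTITIES` is my own selection to match the list, not something taken from the tool. It simply decodes each entity with the standard library's `html.unescape`:

```python
import html

# Standard HTML named entities for the characters listed above.
# The selection is illustrative; it is not taken from TypograFix itself.
ENTITIES = ["amp", "lsquo", "rsquo", "ldquo", "rdquo", "ndash", "mdash",
            "copy", "reg", "trade", "times", "larr", "rarr", "hellip",
            "vellip", "sup2", "sup3"]

def entity_table(names):
    """Return (entity, character) pairs by decoding each named entity."""
    return [(f"&{name};", html.unescape(f"&{name};")) for name in names]

# Print a small two-column cheat sheet: entity on the left, glyph on the right.
for entity, char in entity_table(ENTITIES):
    print(f"{entity:10} {char}")
```

Nobody wants to type `&ldquo;` and `&rdquo;` around every quotation, which is the inconvenience the rest of this article tries to remove.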


The solution to this problem is simple, and is already implemented (at least in part) by many blogging engines. Basically, we take the original HTML (the one with ugly dashes, quotes, etc.) and perform blanket reformatting on all content (except content in script, code and pre tags, of course). For example, a single dash surrounded by spaces becomes an en dash, and two dashes in a row become the em dash.
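The two dash rules just described can be sketched on a run of plain text as follows. This is a minimal Python illustration, not the tool's actual code. The function names are hypothetical, and the quote rule (a quote at the start of the text, or after whitespace or an opening bracket, is an opening quote; any other is a closing quote) is a common heuristic that I am assuming here; the article does not spell out how the tool decides.

```python
import re

def fix_dashes(text: str) -> str:
    """Apply the two dash rules to a run of plain text (no tags)."""
    # Two hyphens in a row become an em dash.
    text = text.replace("--", "\u2014")
    # A single hyphen with a space on each side becomes an en dash.
    return re.sub(r"(?<= )-(?= )", "\u2013", text)

def fix_quotes(text: str) -> str:
    """Replace straight quotes with curly ones (assumed heuristic)."""
    # Opening quotes: at the start of the text, or right after
    # whitespace or an opening bracket.
    text = re.sub(r'(?<![^\s(\[])"', "\u201c", text)
    text = re.sub(r"(?<![^\s(\[])'", "\u2018", text)
    # Everything left over is a closing quote or an apostrophe.
    return text.replace('"', "\u201d").replace("'", "\u2019")

print(fix_dashes("one - two -- three"))
print(fix_quotes("\"double\" and 'single', don't"))
```

Hyphens inside words such as "well-known" are left alone, because they have no space on either side.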

The above may sound easy, but it is actually quite complex and involves manipulating stacks. The source HTML is parsed character by character to ensure that reformatting is not applied where it should not be. The ‘improved’ HTML is then put together by a StringBuilder.
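To make the idea concrete, here is a much simplified sketch of such a character-by-character pass. It is written in Python, is not the tool's implementation, and all names in it (`SKIP_TAGS`, `beautify`, `reprocess`) are hypothetical. It assumes reasonably well-formed input and does not handle comments or a `>` inside an attribute value. A list of output pieces stands in for the StringBuilder, and a stack of open tag names decides whether a text run may be reformatted.

```python
SKIP_TAGS = {"script", "code", "pre"}   # content of these is copied verbatim

def beautify(text):
    """Typographic rules for one run of plain text (a small subset)."""
    text = text.replace("--", "\u2014")        # two hyphens -> em dash
    text = text.replace(" - ", " \u2013 ")     # spaced hyphen -> en dash
    return text.replace("...", "\u2026")       # three dots -> ellipsis

def reprocess(src):
    out = []      # output pieces, joined at the end (StringBuilder stand-in)
    stack = []    # names of the tags that are currently open
    run = []      # characters of the text run being collected

    def flush():
        chunk = "".join(run)
        run.clear()
        protected = any(name in SKIP_TAGS for name in stack)
        out.append(chunk if protected else beautify(chunk))

    i = 0
    while i < len(src):
        ch = src[i]
        end = src.find(">", i) if ch == "<" else -1
        inside_skip = bool(stack) and stack[-1] in SKIP_TAGS
        closes_skip = (inside_skip and
                       src[i:i + 2 + len(stack[-1])].lower() == "</" + stack[-1])
        # Plain character, stray '<', or markup inside a protected tag:
        # collect it as text and move on.
        if end == -1 or (inside_skip and not closes_skip):
            run.append(ch)
            i += 1
            continue
        flush()                      # text before this tag is complete
        tag = src[i:end + 1]
        out.append(tag)              # tags and attributes are never altered
        body = tag[1:-1].strip()
        if body.startswith("/"):     # closing tag: unwind the stack to it
            name = body[1:].strip().lower()
            if name in stack:
                while stack.pop() != name:
                    pass
        elif body and body[0].isalpha() and not body.endswith("/"):
            stack.append(body.split()[0].lower())
        i = end + 1
    flush()
    return "".join(out)

print(reprocess("<p>one - two -- three...</p><pre>a -- b</pre>"))
```

The paragraph text is reformatted while the content of the pre tag, and anything inside a tag's attributes, passes through unchanged.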


Before I even got the idea to use the script for CodeProject articles and the like, I wrote it as an extension to BlogEngine, so that posts and comments would be reprocessed with these typographic rules applied. Of course, it is a lot easier to preprocess the user’s posts and comments once, whereas BlogEngine’s idea is to post-process them every time a post or comment is served. Meanwhile, realizing the need for the capability in a stand-alone application, I coded up a tiny WPF app which would allow me to test the script and use it for my own benefit.


I’ve been using the app extensively, and I don’t know what else I can add to it that wouldn’t cause it to become a fully-featured HTML editor (which is definitely not the idea). I think that further syntactic shortcuts (such as using the backquote character to act as a <code> tag) would allow me to write even faster. However, in terms of HTML entities, I’m more or less satisfied with what’s implemented so far. If necessary, I can always add support for more.
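The backquote shortcut is not implemented in the tool. Purely as an illustration of what it might look like, here is a small Python sketch; the function name is hypothetical, and it assumes a backquoted span never crosses a line break.

```python
import re

def backquotes_to_code(text: str) -> str:
    """Turn `inline code` into <code>inline code</code>."""
    # A pair of backquotes on the same line wraps its content in a code tag.
    # A lone, unmatched backquote is left untouched.
    return re.sub(r"`([^`\n]+)`", r"<code>\1</code>", text)

print(backquotes_to_code("put it together with a `StringBuilder`"))
```

A rule like this would have to run before the typographic pass, so that the generated code tags protect their content from reformatting.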

This is it. This project is on Google Code. Thanks for reading!


  • 20th December, 2008: Initial post


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Dmitri Nеstеruk
Founder ActiveMesa
United Kingdom
I work primarily with the .NET technology stack, and specialize in accelerated code production via code generation (static or dynamic), aspect-oriented programming, MDA, domain-specific languages and anything else that gets products out the door faster. My languages of choice are C# and F#, though I’m open to suggestions.

I have been a Microsoft MVP (Visual C#) since 2009. I run a collective tech blog. I use my own editor called TypograFix to typeset articles and blog posts.

Like the article and want this implemented in your product? Got a project that can benefit from Microsoft .NET goodness? Then get in touch!

Article Copyright 2008 by Dmitri Nеstеruk