One of the problems with authoring HTML content – such as CodeProject articles – is the immediate loss of typographic fidelity present in paper-based publishing. Supporting advanced typographic features is probably impossible without significantly altering the HTML standard, but there are things which are difficult to do even in existing HTML – such as getting the correct dashes and quotes. In this article, I want to show a general-purpose solution to this problem.
The best way to understand what this article is about is to download and run the sample app (requires .NET 3.5). Type some HTML into the top-left window and watch the other windows update.
Many typographic features in HTML are achieved using special entities. For example, the ampersand (&) is made using the &amp; entity. Typing these entities is inconvenient, and I doubt anyone (except professional Web designers) remembers them by heart. Let’s take a look at what these entities are used for.
Correct ‘single quotes’ and “double quotes” (instead of ugly vertical ones)
Proper en and em dashes (one – two — three)
Copyright signs – ©, ®, ™
The ‘times’ sign instead of the letter ‘x’ – 10×
Arrows ← here → there
Ellipsis (one of those hard-to-see features) …
Vertical ellipsis – something that’s useful for presenting code (CodeProject has trouble rendering these)
Numeral superscripts – 1st 2nd 3rd 28,374th
Ordinals 1, 2, and 3… not that useful, but then you can write x³ + x² + x + 1 = 0
In short, I want an easy way to add these features to my CP articles (and other HTML documents) without having to remember any HTML entity names.
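To make the character-to-entity relationship concrete, here is a minimal Python sketch. It is not the code from the sample app; the function name `to_entities` is hypothetical. It relies on the standard library's `html.entities.codepoint2name` table, which covers the HTML 4 named entities, and falls back to a numeric reference for characters that have no name there (the vertical ellipsis is one of them).

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity.

    ASCII characters (including &, < and >) are copied unchanged,
    so this is not a general-purpose HTML escaper.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # plain ASCII: keep as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity, e.g. &mdash;
        else:
            out.append(f"&#{cp};")                # no HTML 4 name: numeric reference
    return "".join(out)


print(to_entities("10\u00d7 \u00a9 \u2026"))
```

Going in this direction, you can type or generate the real characters and leave the entity names to a lookup table.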
The solution to this problem is simple, and is already implemented (at least in part) by many blogging engines. We take the original HTML (the one with ugly dashes, quotes, etc.) and perform blanket reformatting on all content (except content inside <pre> tags, of course). For example, a single dash surrounded by spaces becomes an en dash, and two dashes in a row become an em dash.
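The two dash rules can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the app's actual code: `fix_dashes` is a hypothetical name, the regular expression treats anything between `<` and `>` as a tag, and attribute values that contain `>` are not handled.

```python
import re

# Spans that must be copied unchanged: whole <pre>...</pre> blocks,
# or any single tag. The capturing group makes re.split keep them.
_SKIP = re.compile(r"(<pre\b.*?</pre\s*>|<[^>]*>)", re.DOTALL | re.IGNORECASE)


def fix_dashes(html: str) -> str:
    parts = _SKIP.split(html)
    # With one capturing group, even indices are text between the
    # skipped spans and odd indices are the skipped spans themselves.
    for i in range(0, len(parts), 2):
        text = parts[i].replace("--", "\u2014")       # two dashes -> em dash
        parts[i] = text.replace(" - ", " \u2013 ")    # spaced dash -> en dash
    return "".join(parts)


print(fix_dashes("<p>one - two--three</p><pre>a - b--c</pre>"))
```

Doing the double-dash replacement first keeps the two rules from interfering with each other, and the content of the `<pre>` block comes out exactly as it went in.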
The above may sound easy, but the implementation is actually fairly complex and involves manipulating stacks. The source HTML is parsed character by character to ensure that reformatting is never applied where it does not belong. The ‘improved’ HTML is then put together by a
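As a rough idea of what character-by-character processing with a stack can look like, here is a small Python sketch that curls quotes. It is an assumption-laden illustration rather than the app's implementation: the name `smarten_quotes`, the choice of `pre` and `code` as untouched elements, and the "quote after whitespace or an opening bracket is an opening quote" heuristic are all mine. It also assumes tags contain no `>` inside attribute values, and it does not treat a preceding tag as a word boundary.

```python
import re

VERBATIM = {"pre", "code"}              # content of these elements is left alone
OPENERS = " \t\n([{\u201c\u2018"        # a quote right after one of these opens
_TAG_NAME = re.compile(r"/?\s*([A-Za-z][A-Za-z0-9]*)")


def smarten_quotes(html: str) -> str:
    stack = []    # names of the elements that are currently open
    out = []      # output fragments, joined once at the end
    prev = " "    # last text character emitted; start counts as whitespace
    i = 0
    while i < len(html):
        ch = html[i]
        if ch == "<":
            end = html.find(">", i)
            if end == -1:                    # unterminated tag: copy the rest
                out.append(html[i:])
                break
            body = html[i + 1:end].strip()
            m = _TAG_NAME.match(body)
            if m:                            # comments and doctypes do not match
                name = m.group(1).lower()
                if body.startswith("/"):
                    if name in stack:        # closing tag: pop back to its opener
                        while stack.pop() != name:
                            pass
                elif not body.endswith("/"):
                    stack.append(name)       # opening tag (not self-closing)
            out.append(html[i:end + 1])      # tags are copied unchanged
            i = end + 1
            continue
        if ch in "\"'" and not VERBATIM.intersection(stack):
            opening = prev in OPENERS
            if ch == '"':
                ch = "\u201c" if opening else "\u201d"
            else:
                ch = "\u2018" if opening else "\u2019"
        out.append(ch)
        prev = ch
        i += 1
    return "".join(out)


print(smarten_quotes("<p>\"Hi,\" she said. Don't touch <code>\"x\"</code>.</p>"))
```

The stack is what lets the scanner know, at any character, whether it is nested somewhere inside an element that must not be reformatted, even when other tags are open in between.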
Before I even got the idea to use the script for CodeProject articles and the like, I wrote it as an extension to BlogEngine.net, so that posts and comments would be reprocessed with these typographic rules applied. Of course, it is a lot easier to preprocess the user’s posts and comments once, whereas BlogEngine’s idea is to post-process them every time a post or comment is served. Meanwhile, realizing the need for the capability in a stand-alone application, I coded up a tiny WPF app which would allow me to test the script and use it for my own benefit.
I’ve been using the app extensively, and I don’t know what else I can add to it that wouldn’t cause it to become a fully-featured HTML editor (which is definitely not the idea). I think that further syntactic shortcuts (such as using the backquote character to act as a <code> tag) would allow me to write even faster. However, in terms of HTML entities, I’m more or less satisfied with what’s implemented so far. If necessary, I can always add support for more.
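The backquote shortcut is not implemented in the app, but one possible form of it can be sketched in Python. The function name `backquotes_to_code` is hypothetical, and the sketch assumes that backquotes always come in pairs on a single line and that the text between them needs no further escaping.

```python
import re

# A pair of backquotes on one line, with at least one character between them.
_BACKQUOTED = re.compile(r"`([^`\n]+)`")


def backquotes_to_code(text: str) -> str:
    """Wrap each backquoted span in <code>...</code>; lone backquotes stay."""
    return _BACKQUOTED.sub(r"<code>\1</code>", text)


print(backquotes_to_code("Call `Foo()` twice"))
```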
That’s it. The project is hosted on Google Code. Thanks for reading!
- 20th December, 2008: Initial post