
TypograFix: A Tool for Typographic HTML Beautification

Dmitri Nesteruk, 20 Dec 2008
Presents a script/tool for typographic HTML reprocessing

Introduction

One of the problems with authoring HTML content – such as CodeProject articles – is that you immediately lose the typographic fidelity of paper-based publishing. Supporting advanced typographic features is probably impossible without significantly altering the HTML standard, but some things are difficult to get right even in existing HTML – such as correct dashes and quotes. In this article, I want to show a general-purpose solution to this problem.

The best way to understand what this article is about is to download and run the sample app (requires .NET 3.5). Type some HTML into the top-left window and watch the other windows update.

Problem

Many typographic features in HTML are achieved using special entities. For example, the ampersand (&) is written using the &amp; entity. Typing these entities is inconvenient, and I doubt anyone (except professional Web designers) remembers them by heart. Let’s take a look at what these entities are used for.

  • Correct ‘single quotes’ and “double quotes” (instead of ugly vertical ones)

  • Proper en and em dashes (one – two — three)

  • Copyright signs – ©, ®, ™

  • The ‘times’ sign instead of the letter ‘x’ – 10×

  • Arrows ← here → there

  • Ellipsis (one of those hard-to-see features) …

  • Vertical ellipsis – something that’s useful for presenting code

    some code
    ⋮
    more code

    CodeProject has trouble rendering these.

  • Numeral superscripts – 1ˢᵗ 2ⁿᵈ 3ʳᵈ 28,374ᵗʰ

  • Ordinals ¹, ², and ³… not that useful, but then you can write x³ + x² + x + 1 = 0

So basically I want to easily add these features into my CP articles (and other HTML documents) without having to remember any HTML entity names.
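For reference, here is how the characters listed above map to entity names. This is only a sketch in Python (not the C# of the tool itself). The helper name to_entities is hypothetical, and the name table comes from Python's standard html.entities module, which knows the HTML 4 names only, so anything else (such as the vertical ellipsis) falls back to a numeric reference.

```python
from html.entities import codepoint2name

def to_entities(text):
    """Replace every non-ASCII character with an HTML entity reference."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            pieces.append(ch)                                # plain ASCII stays as typed
        elif code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")  # named entity (HTML 4 table)
        else:
            pieces.append("&#" + str(code) + ";")            # numeric fallback
    return "".join(pieces)

print(to_entities("one – two"))          # one &ndash; two
print(to_entities("© ® ™ 10×"))          # &copy; &reg; &trade; 10&times;
print(to_entities("← here → there …"))   # &larr; here &rarr; there &hellip;
```

Having to remember names like these is exactly the inconvenience described above.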

Solution

The solution to this problem is simple, and is already implemented (at least in part) by many blogging engines. Basically, we take the original HTML (the one with ugly dashes, quotes, etc.) and perform blanket reformatting on all content (except content in script, code and pre tags, of course). For example, a single dash surrounded by spaces becomes an en dash, and two dashes in a row become the em dash.
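The two dash rules just mentioned could be sketched roughly as follows, in Python rather than the tool's C#. The function name typogrify_text and the rule table are hypothetical. The ellipsis and (c) shortcuts are my own assumptions about what such rules might look like, not rules confirmed by the script. The sketch works on plain text only and knows nothing about tags.

```python
import re

# (pattern, replacement) pairs, applied in order to a run of plain text.
RULES = [
    (re.compile(r"(?<=\s)-(?=\s)"), "\u2013"),        # lone dash between spaces -> en dash
    (re.compile(r"--"), "\u2014"),                    # two dashes in a row -> em dash
    (re.compile(r"\.\.\."), "\u2026"),                # assumed shortcut: three dots -> ellipsis
    (re.compile(r"\(c\)", re.IGNORECASE), "\u00a9"),  # assumed shortcut: (c) -> copyright sign
]

def typogrify_text(text):
    """Apply every rule to a piece of text that contains no markup."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(typogrify_text("one - two -- three"))
print(typogrify_text("well-known (c) 2008..."))
```

A hyphen inside a word such as "well-known" is left alone, because the en dash rule only fires when the dash has whitespace on both sides.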

The above may sound easy, but the implementation is fairly complex and involves manipulating stacks. The source HTML is parsed character by character to ensure that reformatting is not applied where it does not belong. The ‘improved’ HTML is then assembled in a StringBuilder.
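To give a feel for this kind of parsing, here is a minimal sketch in Python (the real script is C#, and this is not its code). It assumes a single stack of open tag names and uses a list of pieces in place of a StringBuilder. The names reprocess, dashes and PROTECTED are hypothetical. The sketch deliberately ignores hard cases such as comments containing '>', a '<' inside a script body, or '>' inside attribute values.

```python
PROTECTED = {"script", "code", "pre"}   # content of these elements is left untouched

def dashes(text):
    """Example transform: the two dash rules only."""
    return text.replace(" - ", " \u2013 ").replace("--", "\u2014")

def reprocess(html, transform=dashes):
    out = []        # output pieces, joined at the end (StringBuilder stand-in)
    stack = []      # names of the tags that are currently open
    buf = []        # characters of the text run or tag being read
    in_tag = False

    def flush_text():
        chunk = "".join(buf)
        buf.clear()
        if chunk:
            protected = any(name in PROTECTED for name in stack)
            out.append(chunk if protected else transform(chunk))

    def flush_tag():
        tag = "".join(buf)
        buf.clear()
        out.append(tag)                      # markup is always copied verbatim
        inner = tag[1:-1].strip()
        if inner.startswith("/"):            # closing tag: unwind to its opener
            name = inner[1:].strip().lower()
            if name in stack:
                while stack.pop() != name:
                    pass
        elif inner and inner[0].isalpha() and not inner.endswith("/"):
            stack.append(inner.split()[0].lower())   # opening tag

    for ch in html:
        if in_tag:
            buf.append(ch)
            if ch == ">":
                flush_tag()
                in_tag = False
        elif ch == "<":
            flush_text()
            buf.append(ch)
            in_tag = True
        else:
            buf.append(ch)

    if in_tag:
        out.append("".join(buf))             # unterminated tag: emit as-is
    else:
        flush_text()
    return "".join(out)

print(reprocess("<p>a - b <code>x--</code> c--d</p>"))
```

In the example, the dashes in the paragraph text are converted while the x-- inside the code element comes through unchanged, because "code" is on the stack when that text run is flushed.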

Application

Before I even got the idea to use the script for CodeProject articles and the like, I wrote it as an extension to BlogEngine.net, so that posts and comments would be reprocessed with these typographic rules applied. Of course, it is a lot easier to preprocess the user’s posts and comments once, whereas BlogEngine’s idea is to post-process them every time a post or comment is served. Meanwhile, realizing the need for the capability in a stand-alone application, I coded up a tiny WPF app which would allow me to test the script and use it for my own benefit.

Futures

I’ve been using the app extensively, and I don’t know what else I can add to it that wouldn’t cause it to become a fully-featured HTML editor (which is definitely not the idea). I think that further syntactic shortcuts (such as using the backquote character to act as a <code> tag) would allow me to write even faster. However, in terms of HTML entities, I’m more or less satisfied with what’s implemented so far. If necessary, I can always add support for more.
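As an illustration of the backquote idea, such a shortcut might look something like the Python sketch below. This is purely hypothetical: the feature is not implemented, the function name backquotes_to_code is made up, and the sketch assumes the text between backquotes is literal (unescaped) text on a single line.

```python
import html
import re

# A pair of backquotes on one line with something between them.
_BACKQUOTED = re.compile(r"`([^`\n]+)`")

def backquotes_to_code(text):
    """Turn `like this` into <code>like this</code>, escaping the contents."""
    def wrap(match):
        return "<code>" + html.escape(match.group(1), quote=False) + "</code>"
    return _BACKQUOTED.sub(wrap, text)

print(backquotes_to_code("Use `a < b` here"))   # Use <code>a &lt; b</code> here
```

A lone backquote with no partner is left as typed.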

This is it. This project is on Google Code. Thanks for reading!

History

  • 20th December, 2008: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Dmitri Nesteruk
Founder, ActiveMesa
United Kingdom
I work primarily with the .NET technology stack, and specialize in accelerated code production via code generation (static or dynamic), aspect-oriented programming, MDA, domain-specific languages and anything else that gets products out the door faster. My languages of choice are C# and F#, though I'm open to suggestions.
 
I have been a Microsoft MVP (Visual C#) since 2009. I run a collective tech blog at DevTalk.net. I use my own editor called TypograFix to typeset articles and blog posts.
 
Like the article and want this implemented in your product? Got a project that can benefit from Microsoft .NET goodness? Then get in touch!

Comments and Discussions

 
Latest version (Dmitri Nesteruk, 7-Jan-09 7:09)
Re: Latest version [modified] (Kraeved, 16-Apr-09 23:46)
Re: Latest version (Dmitri Nesteruk, 18-Apr-09 22:45)
Hi, there was a mix-up with me compiling PostSharp 1.0-decorated assemblies in PostSharp 1.5, which is why the ClickOnce was complaining. I fixed it now (migrated to 1.5 completely), so everything should work. And no, you don't need to have PostSharp or EntLib on your machine. Just go to the ClickOnce page and grab the latest version. Or, if you have one installed already, just run it and it should update automatically.
 
Please let me know if things go wrong for some reason.
 
-- Dmitri
Cool. (alxxl, 22-Dec-08 23:32)
Re: Cool. (alxxl, 22-Dec-08 23:36)
Re: Cool. (Dmitri Nesteruk, 23-Dec-08 1:10)

Last Updated 20 Dec 2008
Article Copyright 2008 by Dmitri Nesteruk