TypograFix: A Tool for Typographic HTML Beautification

20 Dec 2008, CPOL
Presents a script/tool for typographic HTML reprocessing

Introduction

One of the problems with authoring HTML content – such as CodeProject articles – is that you immediately lose the typographic fidelity of paper-based publishing. Supporting advanced typographic features is probably impossible without significantly altering the HTML standard, but some things are difficult to do even in existing HTML, such as getting the correct dashes and quotes. In this article, I want to show a general-purpose solution to this problem.

The best way to understand what this article is about is to download and run the sample app (requires .NET 3.5). Type some HTML into the top-left window and watch the other windows update.

Problem

Many typographic features in HTML are achieved using special entities. For example, the ampersand (&) is written using the &amp;amp; entity. Typing these entities is inconvenient, and I doubt anyone (except professional Web designers) remembers them by heart. Let’s take a look at what these entities are used for.

  • Correct ‘single quotes’ and “double quotes” (instead of ugly vertical ones)

  • Proper en and em dashes (one – two — three)

  • Copyright signs – ©, ®, ™

  • The ‘times’ sign instead of the letter ‘x’ – 10×

  • Arrows ← here → there

  • Ellipsis (one of those hard-to-see features) …

  • Vertical ellipsis – something that’s useful for presenting code

    some code
    ⋮
    more code

    CodeProject has trouble rendering these.

  • Numeral superscripts – 1st 2nd 3rd 28,374th

  • Ordinals ¹, ², and ³… not that useful, but then you can write x³ + x² + x + 1 = 0

So basically I want to easily add these features into my CP articles (and other HTML documents) without having to remember any HTML entity names.
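To give a feel for what has to be typed by hand, here is a small Python sketch. It is not part of the article's .NET tool. It looks up the named entity for each non-ASCII character in the standard library's html.entities.codepoint2name table. The function name to_entities is made up for this example, and the table only covers HTML 4 names, so characters without one (such as the vertical ellipsis) pass through unchanged.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace non-ASCII characters with named HTML entities where one exists."""
    pieces = []
    for ch in text:
        code = ord(ch)
        # Only touch non-ASCII characters; markup characters such as '<' and '&'
        # are deliberately left alone in this sketch.
        if code > 127 and code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    print(to_entities("one – two — three, 10× ← here → there…"))
```

Running it on the list above shows how many distinct entity names even a short sentence can need.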

Solution

The solution to this problem is simple, and is already implemented (at least in part) by many blogging engines. Basically, we take the original HTML (the one with ugly dashes, quotes, etc.) and perform blanket reformatting on all content (except content in script, code and pre tags, of course). For example, a single dash surrounded by spaces becomes an en dash, and two dashes in a row become the em dash.
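The two dash rules just mentioned can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the function name typographize is hypothetical, and the ellipsis and double-quote rules are assumptions added to show how further rules might slot in. The function expects plain text with no tags in it.

```python
import re


def typographize(text: str) -> str:
    """Apply a few typographic substitutions to a run of plain text (no tags)."""
    # Rule from the article: two dashes in a row become an em dash.
    text = text.replace("--", "&mdash;")
    # Rule from the article: a single dash surrounded by spaces becomes an en dash.
    text = text.replace(" - ", " &ndash; ")
    # Assumed rule: three dots become an ellipsis.
    text = text.replace("...", "&hellip;")
    # Assumed rule: a pair of straight double quotes becomes curly quotes.
    text = re.sub(r'"([^"]*)"', r"&ldquo;\1&rdquo;", text)
    return text


if __name__ == "__main__":
    print(typographize('one - two -- three "four"...'))
```

The em dash rule runs first so that the en dash rule never sees the two-dash form.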

This may sound easy, but the implementation is fairly complex and involves manipulating stacks. The source HTML is parsed character by character to ensure that reformatting is never applied in the wrong place. The ‘improved’ HTML is then assembled in a StringBuilder.
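A minimal Python sketch of this kind of scanner follows. It is written under stated assumptions and is not the author's implementation: the names PROTECTED, fix_text and reprocess are hypothetical, the stack holds only the open script, code and pre tags, and a plain list joined at the end stands in for the StringBuilder. Tags are copied verbatim, and text runs are reformatted only while the stack is empty.

```python
PROTECTED = {"script", "code", "pre"}  # tags whose content is left untouched


def fix_text(run: str) -> str:
    """Typographic rules for one text run; only the two dash rules from the article."""
    return run.replace("--", "&mdash;").replace(" - ", " &ndash; ")


def reprocess(html: str) -> str:
    out = []    # output pieces, joined once at the end (StringBuilder stand-in)
    stack = []  # currently open protected tags, innermost last
    run = []    # characters of the text run being collected

    def flush() -> None:
        # Emit the pending text run, reformatting it only outside protected tags.
        if run:
            text = "".join(run)
            out.append(text if stack else fix_text(text))
            run.clear()

    i = 0
    while i < len(html):
        ch = html[i]
        end = html.find(">", i) if ch == "<" else -1
        if end == -1:
            # Ordinary character, or a stray '<' with no closing '>'.
            run.append(ch)
            i += 1
            continue
        flush()
        tag = html[i:end + 1]
        out.append(tag)  # tags are always copied verbatim
        body = tag[1:-1].strip()
        words = body.lstrip("/").split()
        name = words[0].lower() if words else ""
        if name in PROTECTED:
            if body.startswith("/"):
                if stack and stack[-1] == name:
                    stack.pop()
            elif not body.endswith("/"):
                stack.append(name)
        i = end + 1
    flush()
    return "".join(out)


if __name__ == "__main__":
    print(reprocess("<p>a - b -- c</p><pre>x -- y</pre>"))
```

A real implementation would also need to cope with comments, attribute values containing '>', and script bodies containing '<', which this sketch only handles by accident of copying tags verbatim.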

Application

Before I even got the idea to use the script for CodeProject articles and the like, I wrote it as an extension to BlogEngine.NET, so that posts and comments would be reprocessed with these typographic rules applied. Of course, it would be a lot simpler to preprocess the user’s posts and comments once, but BlogEngine’s approach is to post-process them every time a post or comment is served. Later, realizing that I needed the same capability in a stand-alone application, I coded up a tiny WPF app that lets me test the script and use it for my own benefit.

Futures

I've been using the app extensively, and I don't know what else I can add to it that wouldn't cause it to become a fully-featured HTML editor (which is definitely not the idea). I think that further syntactic shortcuts (such as using the backquote character to act as a <code> tag) would allow me to write even faster. However, in terms of HTML entities, I'm more or less satisfied with what’s implemented so far. If necessary, I can always add support for more.
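The backquote shortcut mentioned above is not implemented in the tool; the following Python sketch only shows one way such a rule could look. The function name backquotes_to_code is hypothetical, and escaping the enclosed text with html.escape is an assumption about what an author would want.

```python
import html
import re


def backquotes_to_code(text: str) -> str:
    """Turn `inline code` written with backquotes into <code> elements."""
    def wrap(match: re.Match) -> str:
        # Escape &, < and > so the code fragment shows up literally in the page.
        return "<code>" + html.escape(match.group(1), quote=False) + "</code>"

    # A backquoted span may not contain backquotes or line breaks.
    return re.sub(r"`([^`\n]+)`", wrap, text)


if __name__ == "__main__":
    print(backquotes_to_code("Use `List<int>` here"))
```

An unmatched backquote is left as-is, since the pattern only fires on a complete pair.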

That’s it. The project is hosted on Google Code. Thanks for reading!

History

  • 20th December, 2008: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
Founder ActiveMesa
United Kingdom
I work primarily with the .NET technology stack, and specialize in accelerated code production via code generation (static or dynamic), aspect-oriented programming, MDA, domain-specific languages and anything else that gets products out the door faster. My languages of choice are C# and C++, though I'm open to suggestions.
