General Purpose Colorizer






A rule-driven engine for colorizing HTML, CSS, JavaScript, etc.

Introduction
The General Purpose Colorizer (GPC) colorizes an input file so that its syntax is easier to read. Here's a simple example. It's also used extensively by AJAX-LWB. The overall project will be housed at SourceForge, so check this link for the latest version. GPC is released under the GNU Library or Lesser General Public License (LGPL).
In Brief
There are a lot of code colorizers out there. To help you choose, these are the most significant features of GPC:
- The appearance of Syntax Pieces is entirely customizable using a stylesheet
- The parsing is entirely customizable using rules stored in an XML file
- The parser takes any file as input and produces HTML as output
- In addition to syntax coloring, pieces can be highlighted so they can be referred to in documentation
- The parser itself should not need to change when support for another language is added, only the rules
Rule Definition
To give you a better idea how it operates, here is the set of rules that govern the colorizing of CSS (stylesheets):
<rule from="css" pattern="\{" to="cssAttribList" />
<rule from="cssAttribList" pattern=":" transient="cssAttribList" to="cssAttribValue" />
<rule from="cssAttribList" pattern="[_A-Za-z0-9\-]+" transient="cssAttribName" />
<rule from="cssAttribValue" pattern="[^;}]+"
transient="cssAttribValue" to="cssAttribList" />
<rule from="cssAttribList" pattern="\}" transient="cssAttribList" to="css" />
The colorizer itself is a state machine. These rules govern how transitions occur between the various states, as can be seen in the diagram below. A rule can be applied if its "From State" matches the current state and its pattern (a regular expression) matches the text at the current position.

The name of the current state is also used as a stylesheet class name in the output from the colorizer:
.css { color:Maroon }
.cssAttribName { color:Red }
.cssAttribList { color:Black }
.cssAttribValue { color:Blue }
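To make the state machine concrete, here is a minimal Python sketch of how the CSS rules and class names above could fit together. It is not the GPC engine (which is written in C#), and several behaviours are assumptions rather than facts from the article: a rule without a transient state is treated as a pure state change that consumes no text, text that no rule matches is emitted one character at a time under the current state, and each piece becomes a span whose class is the state name. The helper names emit, find_rule, colorize and to_html are hypothetical.

```python
import html
import re

# (from_state, pattern, transient, to) tuples transcribed from the CSS rules above.
CSS_RULES = [
    ("css", r"\{", None, "cssAttribList"),
    ("cssAttribList", r":", "cssAttribList", "cssAttribValue"),
    ("cssAttribList", r"[_A-Za-z0-9\-]+", "cssAttribName", None),
    ("cssAttribValue", r"[^;}]+", "cssAttribValue", "cssAttribList"),
    ("cssAttribList", r"\}", "cssAttribList", "css"),
]


def emit(pieces, class_name, text):
    """Append text, merging it into the previous piece when the class is the same."""
    if pieces and pieces[-1][0] == class_name:
        pieces[-1] = (class_name, pieces[-1][1] + text)
    else:
        pieces.append((class_name, text))


def find_rule(text, pos, rules, state):
    """Return the first rule (and its match) that applies at pos, or (None, None)."""
    for rule in rules:
        from_state, pattern, _transient, _to = rule
        if from_state != state:
            continue
        match = re.compile(pattern).match(text, pos)
        if match and match.end() > pos:
            return rule, match
    return None, None


def colorize(text, rules, start_state):
    """Split text into (class_name, text) pieces by running the rules as a state machine."""
    state, pos, pieces = start_state, 0, []
    while pos < len(text):
        rule, match = find_rule(text, pos, rules, state)
        if rule is None:
            # Assumption: unmatched text is emitted under the current state.
            emit(pieces, state, text[pos])
            pos += 1
            continue
        _from, _pattern, transient, to = rule
        if transient is None and to not in (None, state):
            # Assumption: no transient means "change state, consume nothing".
            # (A cycle of such rules would loop forever; the CSS rules have none.)
            state = to
            continue
        emit(pieces, transient or state, match.group(0))
        pos = match.end()
        if to is not None:
            state = to
    return pieces


def to_html(pieces):
    """Render each piece as a span whose class is the state name."""
    return "".join(
        f'<span class="{cls}">{html.escape(txt)}</span>' for cls, txt in pieces
    )


if __name__ == "__main__":
    print(to_html(colorize("p { color:red; }", CSS_RULES, "css")))
```

Under these assumptions both braces end up in the cssAttribList class, the property name in cssAttribName and the value in cssAttribValue, so the stylesheet above would color them black, red and blue respectively.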
Rule Attributes
In the example above, the from, pattern, to and transient attributes made an appearance. The complete list is as follows:
Name | Type | Description | Required? |
---|---|---|---|
Pattern | Regular Expression | The pattern that must be found for this rule to execute | Yes |
From | State | The initial state to which this rule applies | Yes |
To | State | The final state when this rule completes its transition | No |
Transient | State | The state that exists during this rule's transition | No |
Push | State | Push the specified state on the stack | No |
Pop | Flag | Pop a state from the stack | No |
Add | State | Add the specified state to the set of current states | No |
Remove | Flag | Remove this rule's initial state from the set of current states | No |
Debug | Flag | Invoke the debugger before this rule is executed | No |
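As an illustration of the attribute list, the following Python sketch loads rule elements into a small record type and enforces the two required attributes. It is only a sketch: the rules wrapper element, the Rule class, the parse_rules function and the convention that a flag is set by the literal value "true" are assumptions made for this example (the article's rules only show pop="true" and remove="true"), and the real engine is written in C#.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    # Required attributes
    pattern: str
    from_state: str
    # Optional state attributes
    to: Optional[str] = None
    transient: Optional[str] = None
    push: Optional[str] = None
    add: Optional[str] = None
    # Optional flags (assumed to be set by the literal value "true")
    pop: bool = False
    remove: bool = False
    debug: bool = False


def parse_rules(xml_text: str) -> List[Rule]:
    """Parse every <rule> element found under the (hypothetical) root element."""
    rules = []
    for el in ET.fromstring(xml_text).iter("rule"):
        missing = [name for name in ("pattern", "from") if name not in el.attrib]
        if missing:
            raise ValueError(f"rule is missing required attribute(s): {missing}")
        rules.append(
            Rule(
                pattern=el.attrib["pattern"],
                from_state=el.attrib["from"],
                to=el.get("to"),
                transient=el.get("transient"),
                push=el.get("push"),
                add=el.get("add"),
                pop=el.get("pop") == "true",
                remove=el.get("remove") == "true",
                debug=el.get("debug") == "true",
            )
        )
    return rules


# Two rules copied from this article, wrapped in an assumed <rules> root.
SAMPLE = r"""<rules>
  <rule from="css" pattern="\{" to="cssAttribList" />
  <rule from="jsHilite1" pattern="/\*\[/hilite1\]\*/" transient="noemit" remove="true" />
</rules>"""

if __name__ == "__main__":
    for rule in parse_rules(SAMPLE):
        print(rule)
```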
More Advanced Features
If you just want to use the colorizer, you can skip this section, but if curiosity gets the better of you, read on...
The rule engine is capable of managing multiple concurrent states. An additional state is used to handle the highlighting of marked sections of code. The two rules below operate as a pair. The first adds the hilite1 state (named jsHilite1 for JavaScript) to the set of current states when /*[hilite1]*/ is encountered. The second removes this state when /*[/hilite1]*/ is encountered. (noemit is a special state that produces no output for the syntax piece.) There are similar sets of rules to cover four different types of highlighting for the three languages, which bulks out the rule file a bit (24 of the 45 rules are for highlighting). As these rules are all very similar, a smarter definition could reduce this in the future.
<rule from="js" pattern="/\*\[hilite1\]\*/" add="jsHilite1" transient="noemit"/>
<rule from="jsHilite1" pattern="/\*\[/hilite1\]\*/" transient="noemit" remove="true" />
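The add and remove pair above can be sketched in Python as follows. This is an illustration under assumptions, not the C# engine: a rule is taken to apply when its from state is anywhere in the set of current states, text matched under noemit is consumed without output, and a piece is labelled with all current states joined by spaces (how the real colorizer combines class names is not stated in the article). The function names emit and highlight are hypothetical.

```python
import re

# The two highlighting rules above, transcribed as dictionaries.
HILITE_RULES = [
    {"from": "js", "pattern": r"/\*\[hilite1\]\*/",
     "add": "jsHilite1", "transient": "noemit"},
    {"from": "jsHilite1", "pattern": r"/\*\[/hilite1\]\*/",
     "transient": "noemit", "remove": True},
]


def emit(pieces, class_names, text):
    """Append text, merging it into the previous piece when the classes match."""
    if pieces and pieces[-1][0] == class_names:
        pieces[-1] = (class_names, pieces[-1][1] + text)
    else:
        pieces.append((class_names, text))


def highlight(text, rules, start_state):
    """Return (class_names, text) pieces while tracking a set of concurrent states."""
    states = [start_state]  # kept as a list so the output order is stable
    pieces, pos = [], 0
    while pos < len(text):
        applied = None
        for rule in rules:
            if rule["from"] not in states:
                continue
            match = re.compile(rule["pattern"]).match(text, pos)
            if match and match.end() > pos:
                applied = (rule, match)
                break
        if applied is None:
            # Assumption: unmatched text carries every current state as a class.
            emit(pieces, " ".join(states), text[pos])
            pos += 1
            continue
        rule, match = applied
        if rule.get("transient") != "noemit":
            emit(pieces, " ".join(states), match.group(0))
        if "add" in rule and rule["add"] not in states:
            states.append(rule["add"])
        if rule.get("remove"):
            states.remove(rule["from"])  # drop the rule's own initial state
        pos = match.end()
    return pieces


if __name__ == "__main__":
    print(highlight("a/*[hilite1]*/b/*[/hilite1]*/c", HILITE_RULES, "js"))
```

In this sketch the marker comments disappear from the output, and only the text between them carries the extra jsHilite1 class.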
It is possible to embed two other languages within HTML, namely CSS (stylesheets) and JavaScript. These are triggered by the tags <style> and <script> respectively. When these tags are encountered, HTML mode must continue in order to colorize any attributes properly. The switch to CSS or JavaScript should only take place when the tag is closed. This behaviour is managed by the following pair of rules. The first rule pushes a state corresponding to the embedded language that is about to be encountered. The second rule pops this state and makes it current when the tag is closed.
<rule from="htmlOpenTagName" pattern=" |>"
to="htmlTag" push="=style:css,script:js,html"/>
<rule from="htmlTag" pattern="/?>" transient="htmlEndTag" pop="true" />
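The push and pop pair can be illustrated with a short Python sketch. The article does not document the syntax of the push value, so the reading used here is a guess: a leading "=" selects a state by tag name from name:state pairs, and a final entry without a colon is the default. The functions resolve_push and state_after_tag are hypothetical names for this example, not part of GPC.

```python
PUSH_SPEC = "=style:css,script:js,html"  # copied from the first rule above


def resolve_push(spec, tag_name):
    """Pick the state to push for tag_name (assumed meaning of the push value)."""
    if not spec.startswith("="):
        return spec  # a plain state name is pushed as-is
    default = None
    for entry in spec[1:].split(","):
        key, sep, state = entry.partition(":")
        if not sep:
            default = key  # an entry without ":" acts as the fallback
        elif key == tag_name:
            return state
    return default


def state_after_tag(tag_name, spec=PUSH_SPEC):
    """Walk one opening tag through the two rules and return the resulting state."""
    stack = []
    state = "htmlOpenTagName"

    # First rule: the tag name has ended, so remember which language follows
    # and carry on in HTML mode so that attributes are colorized as HTML.
    stack.append(resolve_push(spec, tag_name))
    state = "htmlTag"

    # Second rule: the tag is closed, so the remembered state becomes current.
    state = stack.pop()
    return state


if __name__ == "__main__":
    for tag in ("style", "script", "div"):
        print(tag, "->", state_after_tag(tag))
```

Deferring the switch through a stack is what lets the attributes of a style or script tag stay in HTML mode until the closing angle bracket.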
Help Wanted
Currently, the rule file supports HTML, CSS and JavaScript (although XML can also be colorized with the HTML rules). I'll be extending it to address my own needs from time to time. However, if you add an additional language yourself, please email me the rules so I can maintain a master copy. Similarly, please let me know about any bugs or suggested features in the Tracker section located here.
The rule execution engine itself is currently C# only. However, it's a fairly concise piece of code (<300 lines excluding comments) that is crying out to be ported to other environments such as Java and JavaScript.
Of course, if you just want to use the colorizer as is, that's fine too.
History
- 7th October, 2007: Initial post