New languages supported: JScript, VBScript, C, XML !
Outline
This is the first of a two-article series. In this article, the techniques and ideas are discussed and a JavaScript solution is given. In Part 2, a … Unfortunately for …
Introduction
Have you ever wondered how the CP team highlights the source code in their edited articles? I suppose it is not done by hand and they must have some clever code to do it. However, if you look around the forums on the web, you will see that few, if any, have this feature. That is a pity, because colored source code is much easier to read. In fact, it would be great to have source code in forums automatically colored with your favorite coloring scheme. The last-but-not-least reason for writing this article was to learn regular expressions, JavaScript and the DOM in one project. The source code is entirely written in JScript, so it can be included server-side or client-side in your web pages. The techniques used are:
When reading this article, I will assume that you have a little knowledge of regular expressions, DOM and XSLT, although I am also a newbie in those 3 topics.
Live Demo
CP does not accept script or form tags in the article. To play with the live demo, download the "JScript" enabled page (see download section).
Transformation Overview
All the boxes will be discussed in detail in the next chapters. Here is a short overview of the process. First, a language syntax specification file is loaded (Language specification box). This specification is a plain XML file supplied by the user. To speed things up, this document is preprocessed (Preprocessing box). Let us suppose for simplicity that we have the source code to colorize (Code box); I will show how to apply the coloring to a whole HTML page later on. The parser, using the preprocessed syntax document, builds an XML document representing the parsed code (Parsing box). The parser works by splitting the code into a succession of nodes of different types: keyword, comment, literal, etc. Finally, an XSLT transformation is applied to the parsed code document to render it to HTML, and a CSS style is supplied to match the desired appearance.
Parsing Procedure
The philosophy used to build the parser is inspired by the Kate documentation (see [1]). The code is considered as a succession of contexts. For example, in C++,
For each context, we define rules that have 3 properties:
The rules have priorities among them. For example, we first look for a /* ... */ comment, then a // ... line comment, then a literal, etc. When a rule is matched using a regular expression, the matched string is labeled with the rule's attribute, the current context is updated to the rule's context, and the parsing continues. The diagram shows the possible paths between contexts. As one can see, some rules do not lead to a new context.
Let me explain the schema below a bit. Consider that we are in the … Once we find a match, we look for the rule that triggered that match (always following the priority of the rules). Therefore, a pathological case like the following is parsed correctly:
// a keyword while in a comment
while is not considered a keyword here since it is inside a comment.
Rules Available
There are 5 rules currently available:
regexp is by far the most powerful rule of all, since all the other rules are represented internally by regular expressions.
Language Specification
From the rules and contexts above, we derive an XML structure as described in the XSD schema below (I don't really understand XSD, but .NET generates this nice diagram...)
I will briefly discuss the language specification file here. For more details, look at the XSD schema or at …
Nodes
Preprocessing
In the preprocessing phase, we build the regular expressions that will be used later on to match the rules. This section makes extensive use of regular expressions. As mentioned before, this is not a tutorial on regular expressions, since I am also a newbie in that topic. A tool that I have found really useful is Expresso (see [3]), a regular expression test tool.
Keyword Families
Building the regular expression of a keyword family is straightforward. You just need to concatenate the keywords together using |:
<keywordlist ...>
<kw>if</kw>
<kw>else</kw>
</keywordlist>
will be matched by \b(if|else)\b
The generated regular expression is added as an attribute to the keywordlist node:
<keywordlist regexp="\b(if|else)\b">
<kw>if</kw>
<kw>else</kw>
</keywordlist>
When using function libraries, the functions often share a common prefix, as in OpenGL: glVertex2f, glPushMatrix, etc.
You can skip the hassle of rewriting gl in every kw item by using the attribute pre, which takes a regular expression as a parameter:
<keywordlist pre="gl" ...>
<kw>Vertex2f</kw>
<kw>PushMatrix</kw>
</keywordlist>
will be matched by \bgl(Vertex2f|PushMatrix)\b
You can also add a regular expression after the keyword using post. Still working on our OpenGL example, some methods have characters at the end that indicate the type of their parameters:
Using post and a regular expression, we can match them easily:
<keywordlist pre="gl" post="[2-4]{1}(f|v){1}" ...>
<kw>Vertex</kw>
<kw>Raster</kw>
</keywordlist>
will be matched by \bgl(Vertex|Raster)[2-4]{1}(f|v){1}\b
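The construction of a keyword family expression can be sketched in a few lines of JavaScript. This is only an illustration, not the article's actual preprocessing code: the function names buildKeywordRegExp and escapeRegExp are hypothetical, and escaping the keywords so that they match literally is an assumption of this sketch.

```javascript
// Hypothetical helper: escapes characters that have a special meaning in a
// regular expression, so that a keyword is matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds the pattern of a keyword family.
// `pre` and `post` are optional regular expression fragments placed before
// and after the alternation of keywords, as described in the text.
function buildKeywordRegExp(keywords, pre, post) {
    var alternatives = keywords.map(escapeRegExp).join("|");
    return "\\b" + (pre || "") + "(" + alternatives + ")" + (post || "") + "\\b";
}

// The two families used as examples in the text.
var plain = buildKeywordRegExp(["if", "else"]);
var openGl = buildKeywordRegExp(["Vertex", "Raster"], "gl", "[2-4]{1}(f|v){1}");
```

The resulting string can be stored in the regexp attribute of the keywordlist node, or passed to new RegExp(...) to test it against some code.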
String Literals
This is a little exercise in regular expressions: how do you match a literal string in C++? Remember that it must support … My answer (remember I'm a newbie) is
"(.|\\"|\\\r\n)*?((\\\\)+"|[^\\]{1}")
I tested this expression on the following strings:
"a simple string"
---
"a less \" simple string"
---
"a even less simple string \\"
---
"a double line\
string"
---
"a double line string does not work without
backslash"
---
"Mixing" string "can\"" become "tricky"
---
"Mixing \" nasty" string is \" even worst"
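For readers who want to experiment with the test strings above, here is a small JavaScript sketch. Note that it uses a different, alternative pattern rather than the expression given in the article: each character inside the quotes is either an ordinary character, or a backslash followed by a line break (line continuation) or by any other character (escape sequence). The name findStringLiterals is hypothetical.

```javascript
// Alternative pattern (not the article's expression):
//   "                      opening quote
//   [^"\\\r\n]             any character except quote, backslash or line break
//   \\(?:\r\n|\n|[^\r\n])  a backslash followed by a line break or by any other character
//   "                      closing quote
var stringLiteral = /"(?:[^"\\\r\n]|\\(?:\r\n|\n|[^\r\n]))*"/g;

// Hypothetical helper: returns all the string literals found in `code`.
function findStringLiterals(code) {
    return code.match(stringLiteral) || [];
}

// A few of the test strings from the article, written as JavaScript strings.
var simple = findStringLiterals('"a simple string"');
var escapedQuote = findStringLiterals('"a less \\" simple string"');
var trailingBackslash = findStringLiterals('"a even less simple string \\\\"');
var continued = findStringLiterals('"a double line\\\nstring"');
var mixed = findStringLiterals('"Mixing" string "can\\"" become "tricky"');
```

In this sketch, a string that spans two lines without a trailing backslash is not matched at all, which corresponds to the "does not work without backslash" test case.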
Contexts
The context regular expression is also built by concatenating the regular expressions of its rules. The value is added as an attribute to the context node:
<context regexp="(...|...)">
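As an illustration of this concatenation, here is a minimal JavaScript sketch. The helper name buildContextRegExp and the three rule patterns are assumptions made for the example; wrapping each rule in a non-capturing group is a design choice of the sketch, so that the alternatives inside one rule cannot interfere with the neighbouring rules.

```javascript
// Hypothetical helper: joins the rule patterns (highest priority first)
// into a single context pattern of the form (...|...).
function buildContextRegExp(rulePatterns) {
    var wrapped = rulePatterns.map(function (pattern) {
        return "(?:" + pattern + ")";
    });
    return "(" + wrapped.join("|") + ")";
}

// Assumed rules for a C++-like "code" context, in priority order.
var codeContextPattern = buildContextRegExp([
    "/\\*[\\s\\S]*?\\*/",        // block comment
    "//[^\\r\\n]*",              // line comment
    "\\b(?:if|else|while)\\b"    // keyword family
]);

// The leftmost match wins: the comment is found before the keyword inside it.
var firstMatch = new RegExp(codeContextPattern).exec("x = 1; // a keyword while in a comment");
```

Because the regular expression engine returns the leftmost match, and tries the alternatives in order at each position, the order of the rules in the list acts as their priority.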
Controlling Whether Preprocessing is Necessary
It is possible to skip the preprocessing phase or to save the "preprocessed" language specification file. This is done by specifying the following parameters in the root node highlight:
Javascript Call
The preprocessing phase is done through the JavaScript method loadAndBuildSyntax:
// language specification file
var sXMLSyntax = "highlight.xml";
// loading is done by loadXML
// preprocessing is done in loadAnd... It returns a DOMDocument
var xmlDoc = loadAndBuildSyntax( loadXML( sXMLSyntax ) );
Parsing
We are going to use the language syntax above to build an XML tree out of the source code. This tree will be made of successive context nodes. We can start parsing the string (pseudo-code below):
source = source code;
context = code; // current context
regExp = context.regexp; // regular expression of the current context
while( source.length > 0)
{
Here we follow the procedure:
match = regExp.execute( source );
// check if the rules matched something
if( !match)
{
// no match, creating node with the remaining source and finishing.
addChildNode( context, // name of the node
source ); // content of the node
break;
}
else
{
The source before the match has to be stored in a new node:
addChildNode( context, source before match);
We now have to find the rule that has matched. This is done by the method findRule:
// getting new node
ruleNode = findRule( match );
// testing if matching string has to be stored
// if yes, adding
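if (ruleNode.attribute != "hidden")
addChildNode( ruleNode.attribute, match);
// getting new context
context=ruleNode.context;
// getting new regular expression
regExp=context.regexp;
// dropping the part of the source that has already been processed
source = source after match;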
}
}
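The pseudo-code above can be turned into a small runnable JavaScript sketch. It is a simplified illustration under several assumptions, not the article's applyRules implementation: the language description is a plain object instead of an XML document, the result is an array of { name, text } objects instead of XML nodes, and the contexts, rules and helper names (contexts, contextRegExp, findRule, parse) are hypothetical.

```javascript
// Hypothetical language description: each context lists its rules in priority
// order. A rule has a pattern, an attribute (name of the node that receives the
// matched text, or "hidden" to drop it) and the context to switch to afterwards.
var contexts = {
    code: { rules: [
        { pattern: "//", attribute: "hidden", context: "linecomment" },
        { pattern: "\\b(?:int|char|return)\\b", attribute: "reservedkeyword", context: "code" }
    ] },
    linecomment: { rules: [
        { pattern: "\\r?\\n", attribute: "code", context: "code" }
    ] }
};

// Builds the regular expression of a context by joining its rules with "|".
function contextRegExp(context) {
    return new RegExp(context.rules.map(function (rule) {
        return "(?:" + rule.pattern + ")";
    }).join("|"));
}

// Finds the first rule (in priority order) that matches the whole matched text.
function findRule(context, matchedText) {
    for (var i = 0; i < context.rules.length; i++) {
        if (new RegExp("^(?:" + context.rules[i].pattern + ")$").test(matchedText)) {
            return context.rules[i];
        }
    }
    throw new Error("no rule found for: " + matchedText);
}

// Splits the source into nodes. Assumes every rule consumes at least one character.
function parse(source, startContextName) {
    var nodes = [];
    var contextName = startContextName;
    while (source.length > 0) {
        var context = contexts[contextName];
        var match = contextRegExp(context).exec(source);
        if (!match) {
            // no match: the remaining source belongs to the current context
            nodes.push({ name: contextName, text: source });
            break;
        }
        if (match.index > 0) {
            // the source before the match is stored under the current context
            nodes.push({ name: contextName, text: source.substring(0, match.index) });
        }
        var rule = findRule(context, match[0]);
        if (rule.attribute !== "hidden") {
            nodes.push({ name: rule.attribute, text: match[0] });
        }
        contextName = rule.context;
        source = source.substring(match.index + match[0].length);
    }
    return nodes;
}

var nodes = parse("int x; // my int\nreturn x;", "code");
```

In this example, the int inside the comment ends up in a linecomment node and not in a reservedkeyword node, because the keyword rule does not exist in the linecomment context.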
At the end of this method, we have built an XML tree containing the contexts. For example, consider the classic of classics, the "Hello world" program below:
int main(int argc, char* argv[])
{
// my first program
cout<<"Hello world";
return -1;
};
This sample is translated into the following XML structure:
<parsedcode lang="cpp" in-box="-1">
<reservedkeyword>int</reservedkeyword>
<code> main(</code>
<reservedkeyword>int</reservedkeyword>
<code> argc, …
Here is the specification of the resulting XML file:
Javascript Call
The algorithm above is implemented in the applyRules method:
applyRules( languageNode, contextNode, sCode, parsedCodeNode);
where
XSLT Transformation
Once you have the XML representation of your code, you can basically do whatever you want with it using XSLT transformations.
Header
Every XSL file starts with some declarations and other standard options:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet
Since source code indentation has to be preserved, we disable automatic indenting; the XML declaration is also omitted:
<xsl:output encoding="ISO-8859-1" indent="no" omit-xml-declaration="yes"/>
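Basic Templates
<xsl:template match="cpp-linecomment">
<span class="cpp-comment">//<xsl:value-of select="text()" disable-output-escaping="yes" /></span>
</xsl:template>
This template applies to the cpp-linecomment node: it writes the comment text, prefixed by //, inside a span with the CSS class cpp-comment, with output escaping disabled.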
The Parsedcode Template
It gets a little complicated here. As everybody knows, XSL quickly becomes really complicated once you want to write more advanced stylesheets. Below is the template for parsedcode:
<xsl:template match="parsedcode">
<xsl:choose>
<xsl:when test="@in-box[.=0]">
<xsl:element name="span">
Javascript Call
This is where you have to customize the methods a bit. The rendering is done in the method highlightCode:
highlightCode( sLang, sRootTag, bInBox, sCode)
where
The file names are hardcoded inside the …
Applying Code Transformation to an Entire HTML Page
So now you are wondering how to apply this transformation to an entire HTML page? Well, surprisingly, this can be done in... 2 lines! In fact, there exists the String method replace, which accepts a regular expression and a replacement function. For example, we want to match the code enclosed in pre tags:
// this is javascript
var regExp=/<pre>(.|\n)*?<\/pre>/gim;
// render xml
var sValue = sValue.replace( regExp,
function( $0 )
{
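// highlightCode takes 4 parameters (sLang, sRootTag, bInBox, sCode);
// the value true passed for bInBox is an assumption
return highlightCode("cpp", "cpp", true, $0.substring( 5, $0.length-6 ));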
}
);
In practice, some checks are made on the language name, and all these computations are hidden in the …
Using the Methods in your Web Site
ASP Pages
To use the highlighting scheme in your ASP web site:
Demonstration Application
The demonstration application is a hack of the CodeProject Article Helper. Type in code in …
Update History
References