A Quick Note on the Downloads
If you have previously seen this article, note that the v2 downloads contain a complete rewrite of the parser, which required a completely new model. The v1 downloads are provided for those who are dependent on the previous model (the v1 parser was last updated on 11/25/2008). If this is the first time you are seeing this article, use the v2 downloads that this article now describes.

Introduction
My motivation for this project came from an interest in importing color information. I needed to import from CSS files into an application I was working on for managing color sets of Web and other projects. However, on import I didn't just want the color, but the elements it is associated with as well. So, Regex was out. This left me with the need for a full CSS parser that could be used for lots of things, not just my nutty little color manager. So all I needed was a grammar that describes the structure of the CSS 2.1 language. Unfortunately, the only grammar definition for CSS I could find was the W3C grammar written for Lex/Yacc. Due to my inexperience with Yacc, and my apathy towards learning it, I had to write a grammar from scratch. Ultimately, I chose to write a Coco/R attributed grammar.

With version 2, the parser is now based on the 2.1 spec, with support for some 3.0 elements. Because it now handles "@" rules generically, most browser-specific extensions should also now be supported. The object model was changed completely for the second version of this parser. Hopefully this will cover most uses you may have for a CSS parser. If you have further requirements, it may be a good base to work from.

Parsers
I don't claim to be an expert in the fine art of deterministic finite automata. Having said that, the basic concept of parsing text, interpreting the contents, and building some useful result from it can be understood without a deep understanding of the nitty-gritty of how it is accomplished.
When working with a parser generator like Coco, it's important to understand the basic concepts.

What Parsers Do
A parser is a working piece of software that 'knows' how to recognize a set of language rules. While in the process of recognizing a language, a parser usually acts on the information recognized. A parser doesn't have to act on the information, but it wouldn't be very interesting if all it did was say "yep, that's my language all right..." Some parsers build an object graph as they parse. Some call methods on another piece of software, passing them parameters from the language. And some parsers just fire events as parts are recognized. With Coco, you are free to use any of these methods through attributed code. The beauty of tools like Coco is that you only need to specify your language in a format they recognize (the grammar), and they produce the parser for you.

Parsers usually don't work alone. Most parsers require a scanner (sometimes called a lexer or tokenizer) to supply them with lexical chunks or tokens. When a parser starts, it asks the scanner for the first token, which it checks against its rules to determine if that token was expected. If the token does fit a rule, the parser uses it to determine what should be expected next. If the grammar that produced the parser says:
Doc = Word [ ":" Word ] ";" .
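To make the scanner/parser hand-off concrete, here is a minimal hand-written sketch in C# for the Doc production. It is not Coco-generated code; the names TinyScanner, TinyParser, Kind, and Token are assumptions made up for this illustration, and a Word is assumed to be a run of letters.

```csharp
using System;
using System.Collections.Generic;

// Token kinds for the tiny grammar: Doc = Word [ ":" Word ] ";" .
public enum Kind { Word, Colon, Semicolon, EOF, Unknown }

public sealed class Token
{
    public Kind Kind;
    public string Text;
    public Token(Kind kind, string text) { Kind = kind; Text = text; }
}

// The scanner turns raw characters into tokens, one per call to Scan().
public sealed class TinyScanner
{
    private readonly string src;
    private int pos;
    public TinyScanner(string source) { src = source; }

    public Token Scan()
    {
        while (pos < src.Length && char.IsWhiteSpace(src[pos])) pos++;
        if (pos >= src.Length) return new Token(Kind.EOF, "");
        char c = src[pos];
        if (c == ':') { pos++; return new Token(Kind.Colon, ":"); }
        if (c == ';') { pos++; return new Token(Kind.Semicolon, ";"); }
        if (char.IsLetter(c))
        {
            int start = pos;
            while (pos < src.Length && char.IsLetter(src[pos])) pos++;
            return new Token(Kind.Word, src.Substring(start, pos - start));
        }
        pos++;
        return new Token(Kind.Unknown, c.ToString());
    }
}

// The parser keeps one lookahead token (la) and checks it against the rule.
public sealed class TinyParser
{
    private readonly TinyScanner scanner;
    private Token la;
    public readonly List<string> Errors = new List<string>();
    public readonly List<string> Words = new List<string>();

    public TinyParser(TinyScanner scanner)
    {
        this.scanner = scanner;
        la = scanner.Scan();   // ask the scanner for the first token
    }

    private void Expect(Kind kind)
    {
        if (la.Kind == kind)
        {
            if (kind == Kind.Word) Words.Add(la.Text);
            la = scanner.Scan();   // token fits the rule: move on to the next one
        }
        else
        {
            Errors.Add(kind + " expected, found " + la.Kind);
        }
    }

    // Doc = Word [ ":" Word ] ";" .
    public void Doc()
    {
        Expect(Kind.Word);
        if (la.Kind == Kind.Colon)   // the optional part can only start with ":"
        {
            Expect(Kind.Colon);
            Expect(Kind.Word);
        }
        Expect(Kind.Semicolon);
    }
}
```

Calling `new TinyParser(new TinyScanner("color: red;")).Doc()` should leave Errors empty and collect the two words, while an input that starts with ":" records an error for the missing first Word.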
The parser will report an error if the first token is not a Word. Understanding how the parser uses tokens from the scanner will help you better understand how to express your language's grammar.

Grammars
A grammar is the textual representation of a language, its pieces, its rules, and how they all work together. A grammar defines how text should be broken up into tokens (lexical analysis), with such details as whether or not to consider underscores as letters, or to treat multi-symbol operators as a single entity. It also defines more complex language structures called productions, which make up the rules for your language (what may follow what, how many times, and in what order). With Coco, you may also define attributed code that will be embedded in the parser code at the location it is defined in the grammar.

A simple example of a production might be like C# variable declarations, where you could say:
variableDecl means a TypeName followed by a Name
followed optionally by an Initialization.
In EBNF, which I'll detail more below, this would look more like:
variableDecl = TypeName Name [ "=" ( Literal | "new" TypeName ) ] ";" .
Coco/R
Coco is a relatively new parser generator that has been ported to several languages, including our favorite, C#. If you're not familiar with Coco, go download a copy and read the manual; it's a wonderful tool. Coco takes ATG (attributed grammar) files containing language grammar definitions written in EBNF, and generates a Scanner (sometimes called a Lexer) and a Parser that understand your language.

ATG EBNF
EBNF is a very simple language with a small number of simple rules. However, it does take some work to express a language with it. When doing this, there are several things to consider...

First of all, Coco generates an LL(1) recursive descent parser, which means that it only looks ahead one token at a time. Wherever the grammar offers a choice, each alternative must therefore start with a different token, so that Coco can determine at any given point which path to take when recognizing your language. If you define something that causes an LL(1) conflict, the compiler will tell you. Technically, Coco is LL(1), but it does provide a couple of mechanisms for you to resolve conflicts on your own, in effect making it an LL(n) parser. I won't get into too much detail with these, because this article is not specifically about how to use Coco. But you can do a manual multi-token look ahead in your attributed code (more on attributed code later), and you can also provide a call in your grammar to a custom method where you can analyze tokens to decide which path to take (you can find a few examples of this in the v2 grammar).

The rules for defining productions are simple, similar in fact to regular expressions but simpler. Most texts on the subject of parsers recognize three basic non-terminal structures:
- Sequence (one thing followed by another)
- Alternative (one thing or another)
- Repetition (a thing repeated any number of times)
Many parser discussions will also add:
- Option (a thing that may or may not appear)
A grammar, and by extension a parser, represents a tree of rules (an abstract syntax tree to be exact). It's a maze of rules through which there are many valid paths. Every step of the way, the parser has to determine if the current token fits one of its options for moving forward. In EBNF:
Decl = TypeName Name .
// means a Decl is a TypeName followed by a Name.
Init = ( Literal | "new" TypeName ) .
// means Init is a Literal or "new" followed by TypeName.
Sentence = { Word } "." .
// means a sentence is any number of Words followed by a period.
And:
Decl = Type Name [ Init ] .
// means a Decl is a Type followed by a Name followed
// optionally by an Init.
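As a rough illustration of how these four shapes tend to map onto recursive descent code, here is a hand-written C# sketch: a sequence becomes consecutive checks, an alternative becomes an if/else on the lookahead, a repetition becomes a while loop, and an option becomes a plain if. The class EbnfShapes and its helpers are hypothetical, the input is assumed to be already split into string tokens, and a Literal is assumed to start with a digit.

```csharp
using System;
using System.Collections.Generic;

public sealed class EbnfShapes
{
    private readonly IList<string> tokens;
    private int pos;
    public EbnfShapes(IList<string> tokens) { this.tokens = tokens; }

    // The single lookahead token; null once the input is used up.
    private string La { get { return pos < tokens.Count ? tokens[pos] : null; } }
    public bool AtEnd { get { return pos >= tokens.Count; } }

    // Names start with a letter; the keyword "new" is not a name.
    private static bool IsName(string t)
    {
        return t != null && t.Length > 0 && char.IsLetter(t[0]) && t != "new";
    }

    // Assumption for this sketch: a literal starts with a digit.
    private static bool IsLiteral(string t)
    {
        return t != null && t.Length > 0 && char.IsDigit(t[0]);
    }

    // Consumes the lookahead if it matched, otherwise reports what was expected.
    private void Expect(bool matched, string what)
    {
        if (!matched) throw new FormatException(what + " expected at token " + pos);
        pos++;
    }

    // Sequence plus option: Decl = Type Name [ Init ] .
    public void Decl()
    {
        Expect(IsName(La), "Type");
        Expect(IsName(La), "Name");
        // Option: only enter Init if the lookahead can start one.
        if (IsLiteral(La) || La == "new") Init();
    }

    // Alternative: Init = ( Literal | "new" TypeName ) .
    public void Init()
    {
        if (IsLiteral(La))
        {
            Expect(true, "Literal");
        }
        else
        {
            Expect(La == "new", "\"new\"");
            Expect(IsName(La), "TypeName");
        }
    }

    // Repetition: Sentence = { Word } "." .  Returns the number of words seen.
    public int Sentence()
    {
        int words = 0;
        while (IsName(La)) { pos++; words++; }
        Expect(La == ".", "\".\"");
        return words;
    }
}
```

Every decision in the sketch is made by looking at La alone, which is the LL(1) property described earlier: one token of lookahead has to be enough to pick a path.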
Coco's EBNF has some specific rules for the file's structure, how and where to include lexical structures and token definitions, and some other things like comment characters and pragmas. I won't get into the full requirements of a Coco ATG file, but I would suggest downloading Coco and reading its manual if you are interested in knowing more.

Attributed Code
As stated earlier, parsers usually do something with the information they recognize. At certain points in your grammar, you will want to fill the properties of an object or objects, fire events, or call external methods. That's where attributed code comes in. Coco's grammar files are named with the ATG extension, which is the abbreviation for Attributed Grammar. Attributed code is enclosed within booby tags, "(." and ".)". For example, given this production:
CSSDoc =
{ SelectorProd
(. Console.WriteLine("SelectorProd recognized"); .)
}
.
Coco would produce a method that would look similar to this...
void CSSDoc()
{
    while (la.kind == 5)
    {
        SelectorProd();
        Console.WriteLine("SelectorProd recognized");
    }
}
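Writing to the console is only the simplest possible action. As a hedged sketch of the other styles mentioned earlier (building up data, or firing events as parts are recognized), the same loop could carry a richer action. Everything here is hypothetical and hand-written rather than Coco output: the SketchParser class, the SelectorRecognized event, the Recognized list, and the use of a queue of token kinds in place of a real scanner.

```csharp
using System;
using System.Collections.Generic;

public sealed class SketchParser
{
    private const int SelectorKind = 5;   // assumed token kind, as in the method above
    private readonly Queue<int> kinds;    // stand-in for the scanner's token stream
    private int la;                       // kind of the lookahead token (0 = end of input)

    // Hypothetical event raised from the attributed code.
    public event Action<int> SelectorRecognized;

    // Hypothetical result built up while parsing.
    public readonly List<int> Recognized = new List<int>();

    public SketchParser(IEnumerable<int> tokenKinds)
    {
        kinds = new Queue<int>(tokenKinds);
        Get();
    }

    private void Get() { la = kinds.Count > 0 ? kinds.Dequeue() : 0; }

    // Stand-in production: consumes a single selector token.
    private void SelectorProd() { Get(); }

    // Same loop shape as above; the three lines after SelectorProd()
    // play the role of the attributed code copied in from the grammar.
    public void CSSDoc()
    {
        int n = 0;
        while (la == SelectorKind)
        {
            SelectorProd();
            Recognized.Add(n);
            if (SelectorRecognized != null) SelectorRecognized(n);
            n++;
        }
    }
}
```

A caller can subscribe to SelectorRecognized before calling CSSDoc(), or ignore the event and read Recognized afterwards; which style fits best depends on what the surrounding application needs.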
Coco turns each production into a method of the generated parser.

The Code
My entire language definition (minus attributed code) looks like this:
CSS2 =
{ ( ruleset | directive ) }
.
QuotedString =
( "'" {ANY} "'" | '"' {ANY} '"' )
.
URI =
"url" [ "(" ] ( QuotedString | {ANY} ) [ ")" ]
.
medium =
(
"all" | "aural" | "braille" | "embossed"
| "handheld" | "print" | "projection"
| "screen" | "tty" | "tv"
)
.
identity =
(
ident
| "n" | "url" | "all" | "aural" | "braille"
| "embossed" | "handheld" | "print"
| "projection" | "screen" | "tty" | "tv"
)
.
directive =
'@' identity
[ expr | medium ]
(
'{' [ {
(
declaration { ';' declaration } [ ';' ]
| ruleset
| directive
)
} ] '}'
|
';'
)
.
ruleset =
selector
{ ',' selector }
'{' [ declaration { ';' declaration } [ ';' ] ] '}'
.
selector =
simpleselector { [ ( '+' | '>' | '~' ) ] simpleselector }
.
simpleselector =
( identity | '*'
| ('#' identity | '.' identity | attrib | pseudo )
)
{ ('#' identity | '.' identity | attrib | pseudo ) }
.
attrib =
'[' identity [
( '=' | "~=" | "|=" | "$=" | "^=" | "*=" )
( identity | QuotedString )
] ']'
.
pseudo =
':' [ ':' ] identity [ '(' expr ')' ]
.
declaration =
identity ':' expr [ "!important" ]
.
expr =
term { [ ( '/' | ',' ) ] term }
.
term =
(
QuotedString
| URI
| "U\\" identity
| HexValue
| identity
[ { (
':' [ ':' ] identity
| '.' identity
| '=' ( identity | { digit } )
) } ]
[ '(' expr ')' ]
|
[ ( '-' | '+' ) ]
{ digit }
[ '.' { digit } ]
[ (
"n" [ ( "+" | "-" ) digit { digit } ]
| "%"
| identity
) ]
)
.
HexValue =
'#'
[ { digit } ]
[ ident ]
.
The object model that this parser builds is similar in structure to the grammar.

[Figure: class diagram of the object model]
Describing this model is a little more difficult than the original model. It's much easier to show some examples of how the CSS structure fits into the model. The comment above each example names the part of the model it illustrates.

/* RuleSet */
table tr td { color: Red; }

/* Selectors */
table.one, tr#head, td { top: 0px; }
span.one.two { color: #0066CC; }
table tr td { color: Red; }

/* SimpleSelectors */
table, tr, td { top: 0px; }
span.one.two { color: #0066CC; }
table tr td { color: Red; }

/* Declaration */
table tr td { border: 1px solid #FFFFFF; }

/* Expression */
table tr td { border: 1px solid #FFFFFF; }

/* Term */
table tr td { border: 1px solid #FFFFFF; }

The Demo Application
The demo application is very simple. Selecting "Open CSS" from the "File" menu allows you to browse for a CSS file. Once opened, it automatically uses the Coco-generated parser to parse the CSS file, building the object model described above.
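Once a file has been parsed into a model, a consumer such as the color manager from the introduction can walk it to pick up each color together with the selectors it belongs to. The sketch below assumes a heavily simplified, hypothetical model (RuleSet, Declaration, and ColorCollector are stand-ins invented for this example, not the classes from the downloads), with each expression flattened into a list of term strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified model types; the real classes in the downloads may differ.
public sealed class Declaration
{
    public string Name;
    public List<string> Terms = new List<string>();   // an expression as a flat list of terms
}

public sealed class RuleSet
{
    public List<string> Selectors = new List<string>();
    public List<Declaration> Declarations = new List<Declaration>();
}

public static class ColorCollector
{
    // Returns one "selectors | property | color" line per hex color term found.
    public static List<string> Collect(IEnumerable<RuleSet> ruleSets)
    {
        var found = new List<string>();
        foreach (RuleSet rs in ruleSets)
        {
            foreach (Declaration d in rs.Declarations)
            {
                foreach (string term in d.Terms)
                {
                    // Only hex values are treated as colors in this sketch.
                    if (term.StartsWith("#", StringComparison.Ordinal))
                    {
                        found.Add(string.Join(", ", rs.Selectors) + " | " + d.Name + " | " + term);
                    }
                }
            }
        }
        return found;
    }
}
```

For a rule set equivalent to `table tr td { border: 1px solid #FFFFFF; }`, Collect yields a single entry, "table tr td | border | #FFFFFF". Named colors such as Red are skipped here; recognizing them would need a lookup table of color names.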
Conclusion
If you've been looking for a simple CSS parser, this may fill your requirements. If you are looking for a more robust CSS parser, this may give you a good start to producing one. If you are just interested in seeing a simple example of using Coco/R, this project is a light introduction with a realistic example. If nothing else, at least now you have a valid reason to use booby tags. Like I said at the beginning of this article, it is a very simple representation of CSS, but it did what I needed. I hope you can find a use for it, or can make improvements to share with us.

History

09/18/2007
Changed the scanner spec in the CSS grammar to not ignore whitespace, to better handle selector names.

09/20/2007
At the end of the main production, I added
a:link { color: #DD79DD; text-decoration: none; } /*comment*/
Previously in this situation, a parser error was reported, halting parsing.

11/07/2007
I changed the
tr td * { color: #FF0000; }
Previously I was unaware of the wild card.

11/08/2007
I changed the

12/01/2007
There was an issue with URL properties. I was using the

12/02/2007
I originally omitted the

03/14/2008
Apparently I didn't test too thoroughly with URL properties. They should work now for most reasonable relative paths.

03/19/2008
The parser and the model have been updated to support two CSS 2.0 constructs not previously supported.

11/25/2008
As pointed out by Andrew Stellman, compound classes were originally overlooked. The parser looks at the various types of selectors that can be chained together (i.e. id, class, pseudo-class) recursively. However, since I wasn't expecting to see more than one of any of those in a single selector, with compound classes it was overwriting each previous class with the next, so only the last class in a compound would be displayed. CSS compound classes look like this:
.boldtext.redtext {
font-style: italic;
}
/*
<!-- Matches -->
<div class="boldtext redtext" />
*/
Since this wasn't accounted for in the original design, the model doesn't have a representation of such a construct. As a temporary fix, I made a small change in the attributed code of the SelectorName production in the ATG grammar file to append class names on subsequent discoveries. (This didn't require a change to the grammar, just the attributed code in the grammar.) So, when parsing the above example, you will have a selector named ".boldtext.redtext". This works, and renders correctly. I'm not completely satisfied with this solution since it doesn't really represent the structure, but it does solve the problem. When time permits, I will see about working up a new parser that will handle CSS 3.0 as well as possibly the Mozilla and Microsoft extensions.

12/05/2008
This update includes the release of an entirely rewritten grammar and model. This version is much closer to the actual CSS 2.1 spec, with some support for CSS 3.0 elements. If you are dependent on the previous model, the original v1.0 downloads are still available. The parser now supports CSS 2.1 & 3 combinators, attribute selectors, any functions, any @ rules, CSS 3 units, and most extensions.