Well, I finally did it. LALR(1) parser support is now in PCK.
This background assumes at least a passing familiarity with LL(1) parsing. If you need to develop that, you can fiddle around with my tutorial on creating an LL(1) parser. This also assumes some small familiarity with PCK.
LALR(1) parsing is a form of parsing that works almost "inside out" compared to LL(1) parsing; it builds the tree from the leaves to the root, guiding the parse using the input tokens. Contrast this with LL(1) parsers, which build the tree from the root down to the leaves, using the grammar to guide the parse. In the LALR(1)/LR case, the input tells us what rule to use which the parser compares to the grammar. In the LL(1)/LL case, the grammar tells us what rule to use, which the parser compares to the input.
Sound confusing? It is, a little at first. Luckily, the particulars of the LALR(1) algorithm aren't critical to using it. If it were, most people would be out of luck!
That having been said, we have to start somewhere. Here's the upshot of LALR(1) parsing compared to LL(1) parsing:
- LALR(1) parses a much larger subset of CFG grammars. It's significantly more powerful than LL(1) in terms of what it can recognize.
- LALR(1) can be left recursive or right recursive. Left recursion is encouraged even, since it doesn't take as much stack space.
- The grammars don't require factoring before they can be used
Here are the downsides of LALR(1) parsing:
- The learning curve is slightly higher than using an LL(1) parser, as it's more complicated.
- The algorithm to generate the tables is next to impossible for mortals to comprehend, no matter how many comments there are.
- It doesn't lend itself to natural tree traversal. Instead, it builds trees from the bottom up.
- Error recognition isn't quite as soon as it is in LL(1) due to the way the algorithm works.
Here's my recommendation, given the above: Use an LL(1) parser if you can, or an LALR(1) parser only if you need the extra parsing power that comes with it. PCK provides both.
Other parser generators that use the LALR(1) algorithm include YACC and Gold Parser.
Using the Code
First, we need to build the code before we can use it. Get
pckw (and the supporting assemblies) into your path and let's get started.
Create an XBNF grammar file:
expr = expr "+" term | term;
term= term "*" factor | factor;
factor= "(" expr ")" | int;
Convert it to a
pckw xlt expr.xbnf expr.pck
Now take the
pck specification and use it to generate a tokenizer/lexer:
pckw fagen expr.pck ExprTokenizer.cs /namespace Demo
Finally, take the same
pck specification file and use it to generate a LALR(1) parser:
pckw lalr1gen expr.pck ExprParser.cs /namespace Demo
Now include those files in your project, and reference the
Demo namespace. Add a reference to the assembly
pck. Add the following code somewhere, probably in your
Main() routine if this is a console app:
var parser = new ExprParser(new ExprTokenizer("3*(4+7)"));
That creates a new parser and tokenizer over the expression
If you want streaming access to the pre-transformed parse tree, you can call the parser's
Read() method in a loop, very much like Microsoft's
However, that doesn't take advantage of trimming or transformation on the parse tree. It's also harder to use in many situations that a parse tree itself.
Fortunately, it's really easy to get a parse tree:
var tree = parser.ParseReductions(); // pass true if you want the tree to be trimmed.
That returns a node with other nodes as children, representing the parse tree. Each node has all the information about the parsed element. Nodes that were collapsed in the grammar are not in the parse tree. Nodes that were hidden are not in the parse tree unless
true on the parser.
You can also create parsers at runtime, like you can with the LL(1) parser. You don't have to generate code.
To do so, you'll have to reference the
var cfg = CfgDocument.ReadFrom(@"..\..\..\expr.pck");
var lex = LexDocument.ReadFrom(@"..\..\..\expr.pck");
var tokenizer = lex.ToTokenizer("3*(4+7)", cfg.EnumSymbols());
var parser = cfg.ToLalr1Parser(tokenizer);
This code does the equivalent of the generated code, without actually generating any code. It can take some time to generate the LALR(1) tables and the FA tokenizer, so pre-generating is recommended. Furthermore, from the above, the set up is obviously a bit more complicated.
- 14th August, 2019 - Initial submission