Complex parsers often need to support backtracking, which is a way to revisit items you've already encountered. The trick with this is twofold, doing it efficiently, and doing it transparently. The
LookAheadEnumerator<T> class provides both.
Update: Bug fix, more robust, doc comments, and 90% slangified the code (removed
nameof and the like) so that Slang can cook it and I can use it in my generated parser code.
LookAheadEnumerator is now used extensively in my Parsley parser generator which can parse C# (an example of parsing a large subset is at the link)
It's often desirable in parser code to use some sort of "streaming" interface for its input like a
TextReader class or an class implementing
IEnumerator<T>. I prefer enumerators because of their ubiquity and simplicity. However, it can be difficult to backtrack on streaming sources without preloading it into memory before parsing. This is fine for small text, but not reams of say, bulk JSON.
Normally with an enumerator, all you can do is
MoveNext() and sometimes
Reset() if you're lucky. There is no way to seek back to a particular previous position, and even if there was, it probably wouldn't work of a true streaming source, like an HTTP response, or console input.
A backtracking parser on the other hand, needs to "bookmark" its current position, before trying several alternatives until it finds one that parses. That means revisiting the same sequence of text several times.
Backtracking parsers are inherently less efficient but far more flexible than non-backtracking parsers. I've made a fair effort to optimize this class to make it as efficient as possible for this purpose.
Conceptualizing This Mess
I've embedded an array backed queue into this class, which it uses to back the the lookahead buffer. The queue starts with 16 elements and grows as needed (almost doubling in size each time to avoid too many reallocations - heap in .NET is cheaper than CPU) depending on how much lookahead is needed. When a
LookAheadEnumeratorEnumerator<T> (the lookahead cursor) is advanced, it often requires the primary class to read more data into the queue in order to satisfy it. When the main cursor is advanced, it will discard items in the queue (simply incrementing
_queueHead which is really fast) . It's not a good idea to advance or reset the main cursor while using the lookahead cursor. The results in this case, are undefined, as I haven't implemented versioning in these enumerators.
Using This Mess
You use the code like a standard
IEnumerator<T> with an additional property -
LookAhead that allows you to
foreach from your current position without advancing the cursor. There's also
TryPeek() which look ahead a specified number of positions, and
DiscardLookAhead() which simply moves the cursor to the physical position and clears the buffer.
var text = "fubarfoobaz";
var la = new LookAheadEnumerator<char>(text.GetEnumerator());
foreach (var ch in la.LookAhead)
foreach (var ch in la.LookAhead)
This would print the following to the console:
As you can see, we're incrementing the primary cursor by one in each iteration, and then we're enumerating over
LookAhead from there. Enumerating over
LookAhead does not affect the primary cursor*.
* The underlying physical read cursor is advanced, as it must be, but a facade is presented using a queue that buffers the already read input for you, presenting it as the next input.
Typically in a parser, you'll use it over tokens, like
LookAheadEnumerator<Token> and then each time you need to do backtracking, you parse along
LookAhead instead of the primary cursor. When you find a match, you'll have to discard all the tokens you matched, either by reparsing along the primary cursor or by counting and advancing if you know how many tokens you parsed. If you're only parsing one alternative to the main parse, you can simply use
DiscardLookAhead() once you've matched the alternate.
That's about all. This is an extremely specialized class, but when you need it, you really need it.
- 19th December, 2019: Initial submission
- 23rd December, 2019: Update