Click here to Skip to main content
15,892,643 members
Articles / Programming Languages / C#

YAML Parser in C#

Rate me:
Please Sign up or sign in to vote.
4.83/5 (26 votes)
17 Feb 2011CPOL2 min read 330.2K   4.5K   70  
An almost feature complete YAML parser.
{- YAML Grammar -}

PrintableChar <- "\u9\uA\uD" / '\u20'...'\u7E' 
               / "\u85" / '\uA0'...'\uD7EF' 
               / '\uE000'...'\uFFFD' / '\u10000'...'\u10FFFF';

CommentIndicator <- '#';
DirectiveIndicator <- '%';
IndentationIndicator <- ' ';
SequenceIndicator <- '-';
MappingKeyIndicator <- '?';
MappingValueIndicator <- ':';

FlowCollectionEntryEnd <- ',';
FlowSequenceStart <- '[';
FlowSequenceEnd <- ']';
FlowMappingStart <- '{';
FlowMappingEnd <- '}';

AnchorIndicator <- '&';
AliasIndicator <- '*';
TagIndicator <- '!';

LiteralBlockIndicator <- '|';
FoldedBlockIndicator <- '>';

SingleQuoteFlowIndicator <- '\'';
DoubleQuoteFlowIndicator <- '\"';

ReservedIndicators <- '@' / '`';

Indicators <- "-?:,[]{}#&*!|>'\"%@`";

GenericLineBreaks  <- '\r\n' / '\r' / '\n' / '\u85';
LineSeparator      <- '\u2028';
ParagraphSeparator <- '\u2029';

BreakChar <- "\r\n\u85\u2028\u2029";
NonBreakChar <- -"\r\n\u85\u2028\u2029";

SpaceChar <- " \t";


{-
FlowContent
NodeProperty
NodeProperty Indentation FlowContent

FlowContent / NodeProperty (Indentation FlowContent)?
NodeProperty / (NodeProperty Indentation)? FlowContent


flow-in-block(n,c) ::=  ns-flow-node(n+1,flow-out) s-l-comments  

ns-l+block-in-block(n,c) ::=  ( c-ns-properties(n+1,c) s-separate(n+1,c) )? c-l+block-content(n,c)    

ns-l+block-node(n,c) ::=    ns-l+block-in-block(n,c)| ns-l+flow-in-block(n,c)    

s-l+block-node(n,c) ::=  s-separate(n+1,c) ns-l+block-node(n,c)  

-}

YAML Stream:

	A YAML character stream may contain several YAML documents, denoted by document boundary markers. 
	Each document presents a single independent root node and may be preceded by a series of directives. 
	Note that the stream may contain no documents, even if it contains a non-empty prefix. 
	In particular, a stream containing no characters is valid and contains no documents.

	YAML streams use document boundary markers to allow more than one document to be contained in the same stream. 
	Such markers are a presentation detail and are used exclusively to convey structure. 
	
	A line beginning with ��---�� may be used to explicitly denote the beginning of a new YAML document. 
	When YAML is used as the format of a communication channel, it is useful to be able to indicate the end of a document 
	without closing the stream, independent of starting the next document. 
	To support this scenario, a YAML document may be terminated by an explicit end line denoted by ��...��, followed by optional comments.

	The first document may be implicit (omit the document start marker). 
	In such a case it must not specify any directives and will be parsed using the default settings. 
	If the document is explicit (begins with an document start marker), it may specify directives to control its parsing. 

	Each following document must be explicit (begin with a document start marker). 
	If the document specifies no directives, it is parsed using the same settings as the previous document. 
	If the document does specify any directives, all directives of previous documents, if any, are ignored. 

YAML Document:

	A YAML document is a single native data structure presented as a single root node. 
	Presentation details such as directives, comments, indentation and styles are not considered part of the content information of the document. 

	An explicit document begins with a document start marker followed by the presentation of the root node. 
	The node may begin in the same line as the document start marker. 
	If the explicit document��s node is completely empty, it is assumed to be an empty plain scalar with no specified properties. 
	Optional document end marker(s) may follow the document. 

	An implicit document does not begin with a document start marker. 
	In this case, the root node must not be presented as a completely empty node. 
	Again, optional document end marker(s) may follow the document. 

Directives:
	Directives are instructions to the YAML processor. 
	Like comments, directives are presentation details and are not reflected in the serialization tree (and hence the representation graph).
	This specification defines two directives, ��YAML�� and ��TAG��, and reserves all other directives for future use. 
	A YAML processor should ignore unknown directives with an appropriate warning. 

	It is an error to specify more than one ��YAML�� directive for the same document, even if both occurrences give the same version number.
	A version 1.1 YAML processor should accept documents with an explicit ��%YAML 1.1�� directive, as well as documents lacking a ��YAML�� directive. 
	Documents with a ��YAML�� directive specifying a higher minor version (e.g. ��%YAML 1.2��) should be processed with an appropriate warning. 
	Documents with a ��YAML�� directive specifying a higher major version (e.g. ��%YAML 2.0��) should be rejected with an appropriate error message. 

	The ��TAG�� directive establishes a shorthand notation for specifying node tags. 
	Each ��TAG�� directive associates a handle with a prefix, allowing for compact and readable tag notation.
	It is an error to specify more than one ��TAG�� directive for the same handle in the same document, even if both occurrences give the same prefix. 

Scalar:
	The double-quoted scalar capable of expressing arbitrary strings, by using ��\�� escape sequences.
	Therefore, the ��\�� and ��"�� characters must also be escaped when present in double-quoted content. 

	A single line double-quoted scalar is a sequence of (possibly escaped) non-break Unicode characters. 
	All characters are considered content, including any leading or trailing white space characters.

	In a multi-line double-quoted scalar, line breaks are subject to line folding, and any trailing white space is excluded from the content. 
	However, an escaped line break (using a ��\��) is excluded from the content, while white space preceding it is preserved. 
	This allows double-quoted content to be broken at arbitrary positions.

	Line folding allows long lines to be broken for readability. Folding only applies to line breaks that end non-empty lines.
	If the following line is not empty, the line break is converted to a single space.
	If the following line is empty line, the line break is ignored, any line break ending an empty line is preserved.

	A single line single-quoted scalar is a sequence of non-break printable characters. 
	All characters are considered content, including any leading or trailing white space characters. 

	In a multi-line single-quoted scalar, line breaks are subject to line folding. 
	Leading white space in the first line is considered content only if followed by a non-space character. 
	All leading and trailing white space of inner lines is excluded from the content. 
	Single-quoted scalars lines can only be broken where a single space character separates two non-space characters.
	The leading prefix white space of the last line is stripped in the same way as for inner lines. 
	Trailing white space is considered content only if preceded by a non-space character. 

{- Not supported -}

ByteOrderMark is not detected before ImplicitDocument and ExplicitDocument.
Sepecial line break and paragraph break chars.
At most one docuement end marker is allowed.
Comma after last entry in flow collection is not allowed.


{- Yaml Document -}

	YamlStream
		<- Documents : (ImplicitDocument? ExplicitDocument*) <end>;

	YamlDocument ImplicitDocument
		<- Comment* IgnoredSpace @{currentDocument = yamlDocument; currentIndent = -1;} Root:BlockNode ('...' InlineComments)?;

	YamlDocument ExplicitDocument
		<- Comment* Directives:Directive* '---' @{currentDocument = yamlDocument; currentIndent = -1;} (SeparationLines Root:BlockNode / Root:EmptyBlock) ('...' InlineComments)?;

{- Directives -}

	Directive
		<- YamlDirective
		 / TagDirective
		 / ReservedDirective
		 ;

	ReservedDirective
		<- '%' Name:DirectiveName Parameters:(SeparationSpace DirectiveParameter)* InlineComments;

		string DirectiveName
			<- $NonSpaceChar+;

		string DirectiveParameter
			<- $NonSpaceChar+;

	YamlDirective
		<- 'YAML' SeparationSpace Version:YamlVersion InlineComments;

		YamlVersion
			<- Major:Integer '.' Minor:Integer;

	TagDirective
		<- 'TAG' SeparationSpace Handle:TagHandle SeparationSpace Prefix:TagPrefix InlineComments;

		TagHandle
			<- NamedTagHandle
			 / SecondaryTagHandle
			 / PrimaryTagHandle
			 ;

			PrimaryTagHandle
				<- '!';

			SecondaryTagHandle
				<- '!!';

			NamedTagHandle
				<- '!' Name:WordChar+ '!';

		TagPrefix
			<- LocalTagPrefix
			 / GlobalTagPrefix
			 ;

			LocalTagPrefix
				<- '!' Prefix:UriChar*;

			GlobalTagPrefix
				<- Prefix:UriChar+;

{- Node -}

	DataItem
		<- (Property:NodeProperty SeparationLines)? (Scalar / Sequence / Mapping);

	DataItem BlockNode
		<- { IncreaseIndent(); } (Indent FlowNodeInBlock InlineComments) @{ DecreaseIndent(); }
		 / BlockContent
		 / property:NodeProperty SeparationLines BlockContent { SetDataItemProperty(dataItem, property); }
		 ;

	DataItem FlowNodeInBlock
		<- AliasNode 
		 / FlowContentInBlock
		 / property:NodeProperty {dataItem = new Scalar();} (SeparationLines FlowContentInBlock)? { SetDataItemProperty(dataItem, property); }
		 ;

	DataItem FlowNodeInFlow
		<- AliasNode 
		 / FlowContentInFlow
		 / property:NodeProperty {dataItem = new Scalar();} (SeparationLinesInFlow FlowContentInFlow)? { SetDataItemProperty(dataItem, property); }
		 ;

	DataItem AliasNode 
		<- '*' name:AnchorName { return GetAnchoredDataItem(name); };

	DataItem FlowContentInBlock
		<- { IncreaseIndentIfZero(); } FlowScalarInBlock @{ DecreaseIndent(); }
		 / FlowSequence
		 / FlowMapping;

	DataItem FlowContentInFlow
		<- { IncreaseIndentIfZero(); } FlowScalarInFlow @{ DecreaseIndent(); }
		 / FlowSequence
		 / FlowMapping;

	DataItem BlockContent
		<- { IncreaseIndent(); } BlockScalar @{ DecreaseIndent(); }
		 / { IncreaseIndent(); } BlockSequence @{ DecreaseIndent(); }
		 / { IncreaseIndent(); } BlockMapping @{ DecreaseIndent(); };

	DataItem EmptyFlow <- <empty> { return new Scalar(); };
	DataItem EmptyBlock <- EmptyFlow InlineComments;

{- Node Property -}

	NodeProperty <- Tag:Tag (SeparationLines Anchor:Anchor)? / Anchor:Anchor (SeparationLines Tag:Tag)?;

	string Anchor
		<- '&' $AnchorName;

	string AnchorName
		<- $NonSpaceChar+;

	Tag
		<- VerbatimTag
		 / ShorthandTag
		 / NonSpecificTag
		 ;

	NonSpecificTag
		<- '!';

	VerbatimTag
		<- '!' '<' Chars:UriChar+ '>';

	ShorthandTag
		<- NamedTagHandle Chars:TagChar+
		 / SecondaryTagHandle Chars:TagChar+
		 / PrimaryTagHandle Chars:TagChar+
		 ;

{- Scalar -}

	Scalar
		<- FlowScalarInBlock / FlowScalarInFlow / BlockScalar;

	Scalar FlowScalarInBlock
		<- Text:PlainText/ Text:SingleQuotedText / Text:DoubleQuotedText;

	Scalar FlowScalarInFlow
		<- Text:PlainTextInFlow / Text:SingleQuotedText / Text:DoubleQuotedText;

	Scalar BlockScalar
		<- Text:LiteralText / Text:FoldedText;

	{- Plain Text -}

		string PlainText <- PlainTextMultiLine / PlainTextInFlow;

		string PlainTextMultiLine <- $PlainTextSingleLine $PlainTextMoreLine*;

		string PlainTextSingleLine <- !DocumentMarker $PlainTextFirstChar $(PlainTextChar / SpacedPlainTextChar)*;

		string PlainTextMoreLine <- IgnoredBlank $LineFolding Indent IgnoredSpace $(PlainTextChar / SpacedPlainTextChar)*;

		string PlainTextInFlow <- $PlainTextInFlowSingleLine $PlainTextInFlowMoreLine*;

		string PlainTextInFlowSingleLine <- !DocumentMarker $PlainTextFirstCharInFlow $(PlainTextCharInFlow / SpacedPlainTextCharInFlow)*;

		string PlainTextInFlowMoreLine <- IgnoredBlank $LineFolding Indent IgnoredSpace $(PlainTextCharInFlow / SpacedPlainTextCharInFlow)*;

		string PlainTextFirstChar <- $-"\r\n\t -?:,[]{}#&*!|>'\"%@`" / $"-?:" $NonSpaceChar;

		string PlainTextChar <- $':' $NonSpaceChar / $NonSpaceChar $'#' / { text.Length = 0; } ch2:$-"\r\n\t :#";
		
		string SpacedPlainTextChar <- $' '+ $PlainTextChar;

		string PlainTextFirstCharInFlow <- $-"\r\n\t -?:,[]{}#&*!|>'\"%@`" 
		                          / $"-?:" $NonSpaceSep;

		string PlainTextCharInFlow <- $":" $NonSpaceSep / $NonSpaceSep $'#' / { text.Length = 0; } ch2:$-"\r\n\t :#,[]{}";
		
		string SpacedPlainTextCharInFlow <- $' '+ $PlainTextCharInFlow;

		void DocumentMarker <- <sol> '---' (Space/LineBreak) 
                             / <sol> '...' (Space/LineBreak);
   
	{- Quoted Text -}

		string DoubleQuotedText <- DoubleQuotedSingleLine / DoubleQuotedMultiLine;

		string DoubleQuotedSingleLine <- '"' $(-"\"\\\r\n" / EscapeSequence)* '"';

		string DoubleQuotedMultiLine <- $DoubleQuotedMultiLineFist $DoubleQuotedMultiLineInner* $DoubleQuotedMultiLineLast;

		string DoubleQuotedMultiLineFist <- '"' $(-" \"\\\r\n" / EscapeSequence / ' ' !(IgnoredBlank LineBreak))* IgnoredBlank $DoubleQuotedMultiLineBreak;

		string DoubleQuotedMultiLineInner <- Indent IgnoredBlank $(-" \"\\\r\n" / EscapeSequence / ' ' !(IgnoredBlank LineBreak))+  IgnoredBlank $DoubleQuotedMultiLineBreak;

		string DoubleQuotedMultiLineLast <- Indent IgnoredBlank $(-"\"\\\r\n" / EscapeSequence)* '"';

		string DoubleQuotedMultiLineBreak <- LineFolding / EscapedLineBreak;

		string SingleQuotedText <- SingleQuotedSingleLine / SingleQuotedMultiLine;

		string SingleQuotedSingleLine <- '\'' $(-"'\r\n" / EscapedSingleQuote)* '\'';

		string SingleQuotedMultiLine <- $SingleQuotedMultiLineFist $SingleQuotedMultiLineInner* $SingleQuotedMultiLineLast;

		string SingleQuotedMultiLineFist <- '\'' $(-" '\r\n" / EscapedSingleQuote / ' ' !(IgnoredBlank LineBreak))* IgnoredBlank fold:$LineFolding;

		string SingleQuotedMultiLineInner <- Indent IgnoredBlank $(-" '\r\n" / EscapedSingleQuote / ' ' !(IgnoredBlank LineBreak))+  IgnoredBlank fold:$LineFolding;

		string SingleQuotedMultiLineLast <- Indent IgnoredBlank $(-"'\r\n" / EscapedSingleQuote)* '\'';

		string LineFolding <- LineBreak (IgnoredBlank LineBreak)*;
		
		char EscapedSingleQuote <- { char ch = default(char); } '\'\'' { return '\''; };

		void EscapedLineBreak <- '\\' LineBreak (IgnoredBlank LineBreak)*;

	{- Block Text -}

		string LiteralText <- '|' (modifier:BlockScalarModifier)? Comment $LiteralInner* (str2:$LiteralLast)?;

		string FoldedText <- '>' (modifier:BlockScalarModifier)? Comment EmptyLineBlock* $(FoldedLines / SpacedLines)*  $ChompedLineBreak;

		BlockScalarModifier <- indent:IndentIndicator chomp:ChompingIndicator? / indent:IndentIndicator? chomp:ChompingIndicator;

		string LiteralInner <- EmptyLineBlock* Indent $(NonBreakChar+ ReservedLineBreak);

		string LiteralLast <- EmptyLineBlock* Indent $(NonBreakChar+) ch2:$ChompedLineBreak;

		string FoldedLine <- Indent $NonSpaceChar $NonBreakChar*;
		string FoldedLines <- $(FoldedLine LineFolding)* str2:$FoldedLine;

		string SpacedLine <- Indent Blank $NonBreakChar*;
		string SpacedLines <- ($SpacedLine LineBreak)* str2:$SpacedLine;

		char IndentIndicator <- '1'...'9';

		char ChompingIndicator <- '-' / '+';

{- Sequence -}

	Sequence
		<- FlowSequence / BlockSequence;

	Sequence FlowSequence
		<- '[' SeparationLinesInFlow? Enties:(FlowSequenceEntry (',' SeparationLinesInFlow? FlowSequenceEntry)*) ']';

	DataItem FlowSequenceEntry
		<- FlowNodeInFlow SeparationLinesInFlow?
		 / FlowSingPair
		 ;

	Sequence BlockSequence
		<- Enties:BlockSequenceEntry+ Comments?;

	DataItem BlockSequenceEntry
		<- Indent '-' IndentedBlock;

	DataItem IndentedBlock
		<- InlineComments BlockNode
		 / { IncreaseIndent(); } InlineBlock @{ DecreaseIndent(); }
		 / EmptyBlock
		 ;

	DataItem InlineBlock
		<- SeparationSpace FlowNodeInBlock InlineComments
		 / SeparationSpaceAsIndent (InlineSequence / InlineMapping);

	Sequence InlineSequence
		<- Enties:('-' IndentedBlock BlockSequenceEntry*);

{- Mapping -}

	Mapping
		<- FlowMapping / BlockMapping;

	Mapping FlowMapping
		<- '{' SeparationLinesInFlow? Enties:(FlowMappingEntry (',' SeparationLinesInFlow? FlowMappingEntry)*) '}';

	MappingEntry FlowMappingEntry
		<- Key:ExplicitKey Value:ExplicitValue
		 / Key:ExplicitKey Value:EmptyFlow
		 / Key:SimpleKey Value:ExplicitValue
		 / Key:SimpleKey Value:EmptyFlow
		 ;

	DataItem ExplicitKey
		<- '?' SeparationLinesInFlow FlowNodeInFlow SeparationLinesInFlow?
		 / '?' EmptyFlow SeparationLinesInFlow
		 ;

	DataItem SimpleKey
		<- FlowKey SeparationLinesInFlow?;

	Scalar FlowKey 
		<- Text:PlainTextInFlowSingleLine
		 / Text:DoubleQuotedSingleLine
		 / Text:SingleQuotedSingleLine

	Scalar BlockKey 
		<- Text:PlainTextSingleLine
		 / Text:DoubleQuotedSingleLine
		 / Text:SingleQuotedSingleLine
		 ;

	DataItem ExplicitValue
		<- ':' SeparationLinesInFlow FlowNodeInFlow SeparationLinesInFlow?
		 / ':' EmptyFlow SeparationLinesInFlow
		 ;

	MappingEntry FlowSingPair
		<- Key:ExplicitKey Value:ExplicitValue
		 / Key:ExplicitKey Value:EmptyFlow
		 / Key:SimpleKey Value:ExplicitValue
		 ;

	Mapping BlockMapping
		<- Enties:(Indent BlockMappingEntry)+ Comments?;

	MappingEntry BlockMappingEntry
		<- Key:BlockExplicitKey Value:BlockExplicitValue
		 / Key:BlockExplicitKey Value:EmptyFlow
		 / Key:BlockSimpleKey Value:BlockSimpleValue
		 / Key:BlockSimpleKey Value:EmptyBlock
		 ;

	DataItem BlockExplicitKey
		<- '?' IndentedBlock;

	DataItem BlockExplicitValue
		<- Indent ':' IndentedBlock;

	DataItem BlockSimpleKey
		<- BlockKey SeparationLines? ':';

	DataItem BlockSimpleValue
		<- IndentedBlock;

	Mapping InlineMapping
		<- Enties:(BlockMappingEntry (Indent BlockMappingEntry)*);

{- Comment -}

	void Comment
		<- !<eof> IgnoredSpace ('#' NonBreakChar*)? (LineBreak / <eof>);

	void InlineComment
		<- (SeparationSpace ('#' NonBreakChar*)?)? (LineBreak / <eof>);

	void Comments
		<- Comment+;

	void InlineComments
		<- InlineComment Comment*;


string Integer <- chars:Digit+ { return new string(chars.ToArray()); };

char WordChar <- Letter / Digit / '-';

char Letter <- $'a'...'z' / $'A'...'Z';
	 
char Digit <- $'0'...'9';

char HexDigit <- $'0'...'9' / $'A'...'F' / $'a'...'f';

char UriChar
	<- WordChar 
	 / '%' char1:HexDigit char2:HexDigit { ch = Convert.ToChar(int.Parse(String.Format("{0}{1}", char1, char2), System.Globalization.NumberStyles.HexNumber));} 
	 / ";/?:@&=+$,_.!~*'()[]";

char TagChar
	<- WordChar 
	 / '%' char1:HexDigit char2:HexDigit { ch = Convert.ToChar(int.Parse(String.Format("{0}{1}", char1, char2), System.Globalization.NumberStyles.HexNumber));} 
	 / ";/?:@&=+$,_.~*'()[]";

void EmptyLinePlain <- IgnoredSpace NormalizedLineBreak;

void EmptyLineQuoted <- IgnoredBlank NormalizedLineBreak;

void EmptyLineBlock <- Indent NormalizedLineBreak;

char NonSpaceChar <- -" \t\r\n";
char NonSpaceSep <-  -"\r\n\t ,[]{}";

char NonBreakChar <- -"\r\n";

void IgnoredSpace <- ' '*;
void IgnoredBlank <- " \t"*;
void SeparationSpace <- ' '+;
void SeparationLines <- InlineComments / SeparationSpace;
void SeparationLinesInFlow <- InlineComments IgnoredSpace / SeparationSpace;

void SeparationSpaceAsIndent
	 <- (' ' { currentIndent++; })+;

void Indent <- {
            for (int i = 0; i < currentIndent; i++)
            {
                MatchTerminal(' ', out success);
                if (!success)
                {
                    return;
                }
            }
            if (parseAdditionalIndent)
            {
				int additionalIndent = 0;
				while (true)
				{
					MatchTerminal(' ', out success);
					if (success)
					{
						additionalIndent++;
					}
					else{ break; }
				}
				currentIndent += additionalIndent;
                parseAdditionalIndent = false;
            }
            success = true;};

char Space <- ' ';
char Blank <- " \t";
void LineBreak <- '\r\n' / '\r' / '\n';
string ReservedLineBreak <- $'\r\n' / $'\r' / $'\n';
char NormalizedLineBreak <- { char ch = default(char); } LineBreak { return '\n'; };
char ChompedLineBreak <- { char ch = default(char); } (LineBreak / <eof>) @{ return '\n'; };

char EscapeSequence 
        <- '\\' { return '\\'; }
         / '\'' { return '\''; }
         / '"'  { return '\"'; }
         / 'r'  { return '\r'; }
         / 'n'  { return '\n'; }
         / 't'  { return '\t'; }
         / 'v'  { return '\v'; }
         / 'a'  { return '\a'; }
         / 'b'  { return '\b'; }
         / 'f'  { return '\f'; };


Does EscapeSequence include escaped line break? No.
Does single line double-quoted scalar allow escaped line break? No.

The comment belong to which doc?
First doc
#comment
---
....
#comment
---

Oren Ben-Kiki <oren@ben-kiki.org>
Clark Evans <cce@clarkevans.com>
Brian Ingerson <ingy@ttul.org>

By viewing downloads associated with this article you agree to the Terms of Service and the article's licence.

If a file you wish to view isn't highlighted, and is a text file (not binary), please let us know and we'll add colourisation support for it.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect YunCheDa Hangzhou
China China
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions