Introduction

Most of the time, when it is needed to parse a file or stream, programmers tend to depend on "Tokenizer" or "StreamTokenizer" rather than create a parser. Of course, creating a parser is time consuming since it needs iterative testing of all possible states. However, it makes your application robust and error free, particularly when dealing files with a specific format. Once you get a good start with creating a simple parser, sure you will find it as a better alternative in many hectic situations. This article targets beginners to parser development. This will give you an idea of creating a parser through a suitable step by step example. At the end of the tutorial, we will parse a SQL file and extract table specifications (please note that this is for an illustrative purpose; complete SQL formats are not supported).

Prerequisites

JavaCC works fine with Java VM 1.2 or greater. You need to install JavaCC. For help on installation, refer https://javacc.dev.java.net/doc/installhelp.html. If you are using Eclipse, then there are free JavaCC plug-ins available. Download and install a plug-in (Google for easy-javacc-1.5.7.exe, a free Eclipse plug-in for JavaCC).

What is JavaCC?

Java Compiler Compiler is an Open Source parser generator for Java programs. Unlike YACC (Yet Another Compiler Compiler), JavaCC is a top down parser for LL type grammars. So strictly speaking, LR parsing (left recursion) is not possible in JavaCC. JavaCC creates LL parsers for context free grammars (a context free grammar contains production rules in the format NT -->T, where NT is a known terminal and t is a combination of terminals and or non-terminals). An LL parser parses the input from left to right, and creates the leftmost derivation of a sentence when compared to the LR parser where the right most derivation of a sentence is created. These kinds of parsers use next tokens to take the parsing decisions without any back tracing (Look Ahead). So these LL parsers are not much complicated, and hence widely used even if they are fairly restrictive.

Before Starting

A parser and lexer are software components that describe the syntax and semantics of any character stream. A lexical analyzer identifies and separates a set of predefined characters known as tokens. A parser specifies semantics of different tokens in a statement and/or different statements. So a parser can be easily used to check the structure file and to extract specific components from the file. For example, in the following Java statement:

String test = "Testing";

The lexical analysis finds the following tokens:

{ "String"|" "|"Test"|" "|"="|" "| "\"" |"Testing"| "\""|";"}

So after lexical analysis, we will get these group of tokens:

{STRING_TYPE|SPACE|VAR_NAME|SPACE|EQUAL|SPACE|D_QUATES|VAR_VAL|D_QUATES|SEMCOLN}

The parser checks the semantics of the tokens identified by the lexer as specified in the grammar file. In the case of JavaCC, which itself is not a lexical analyzer or parser, it generates the Java code for the lexical analyzer and parser according to the specifications in a context free grammar file. These grammar files are named with the extension .jj by convention. These grammar files yield more modularity, and are easy to read, modify, or write when compared to a handwritten Java parser, and hence it saves a lot of time and effort.

The association of tokens in a file is described by the BNF products. Backus-Naur Form (BNF) is a metasyntax notation used to express context-free grammars. Most parser generators support the BNF production rules for specifying the sequence of token kinds in an error free output. In the next sections, you will see how token associations are specified using BNF productions in a .jj file.

So let us start the creation of the JavaCC file with a very simple example. Suppose we have a file with a SQL Create statement. For this example, we are not considering the fields in the Create Table statement. We will add more items to the file in the coming steps. It is often found to be a good practice to create and modify a grammar in steps, with the increasing complexities of the files.

Create Table test()

In the above example, you have the tokens:

CTCMD :- "Create Table" //the create table command
             TNAME :- "test" //the table name followed
             OBRA :- "(" //Opening bracket
             CBRA: - ")" //Closing bracket
             EOF //End of the file

Structure of a Grammar File

To generate a parser, the only input to JavaCC is a context free grammar file. JavaCC produces 7 Java files in the output. The JavaCC grammar file (.jj files) are composed of four sections.

Options
Class declaration
Specification for lexical analysis
BNF notations

The Options section is to specify the optional parameters such as DEBUG, STATIC etc. In this example, STATIC is set to false to make the parser class non-static, so that multiple instances of the parser can exist at a time. STSTIC is true by default. The class declaration in the grammar file provides the main entry point in the generated parser. The other section in the grammar file is meant for the specification of a token for lexical analysis. The grammar may contain BNF production rules which specify the association of tokens that define the structure of the file. For parsing the above file, our grammar file looks like the following:

/* Sample grammar file*/ 
options 
{STATIC = false ;} 
PARSER_BEGIN(SqlParser) 
package sqlParserDemo; 
class SqlParser { 
{Start () ;} 
} 
PARSER_END (SqlParser) 

SKIP: { "\n" | "\r" | "\r\n" |"\\"|"\t"|" "} 

TOKEN [IGNORE_CASE]: 
{ 
     <CTCMD :("Create Table")> 
  | <TNAME :(["a"-"z"])+ > 
  | <OBRA :("(")>| <CBRA:(")")> 
} 
void Start (): {} 
{<CTCMD><TNAME><OBRA><CBRA><EOF>}

In the above grammar file, the class declaration section is enclosed in the PARSER_BEGIN and PARSER_END keywords. In this section, we define the package, all imports, and the parser class. Here, the "SqlParser" class has the method initParser which serves as the entry point. The parser throws the "ParseException" and "TokenMgrError" exceptions. TokenMgrError is thrown when the scanning encounters undefined tokens. If an undefined state or production is encountered, the parser will throw a "ParseException". By default, the parser code created should have a constructor which accepts a "reader" type.

In many cases, we need to ignore some characters in the file, formatting characters like Newline, Whitespace, Tabs etc. These sequence may not have a relation with the meaning. Such characters can be skipped while scanning if they are specified as SKIP terminals. In the above grammar file, new line, carriage (note that newline representation varies from OS to OS), and whitespace are specified as the SKIPJA terminals. The scanner reads these characters and ignores them. They are not passed to the parser.

The keyword TOKEN is for specifying tokens. Each token is a character sequence associated with a name. The "|" character is used to separate tokens. JavaCC provides a token qualifier, IGNORE_CASE. It is used to make the scanner case insensitive to the tokens. In this example, SQL is case insensitive and hence the "Create Table" command can be written in any case. If you have case sensitive and case insensitive tokens, then you can specify them in different TOKEN statements. Tokens are enclosed within angle brackets.

Here we have created the tokens CTCMD, TNAME, OBRA, and CBRA .The character sequence for the tokens are defined with Regular Expression syntax. (["a"-"z"])+ means a sequence of any number of characters form "a"-"z".

The BNF production rules are specified after the token declaration. These seem almost similar to the method syntax. We can add any Java code in the first set of curly braces. Usually, they contain declarations of variables that are used in the production rule. The return type is the intended return type from the BNF. In this example, we are checking the file structure only. So the return type is void. The Start function in the example initializes the parsing. You should call the Start method from your class declaration. In this example, it defines the structure of the file as:

{<CTCMD><TNAME><OBRA><CBRA><EOF>}

Any change in the file format will throw an error.

Generating the Parser

Once you create the grammar file, save it in a directory. The general naming convention follows a ".jj" extension, say, demogrammar.jj.
Go to the command prompt and change the directory to the demogrammar.jj files directory.
Type the command "javacc demogrammar.jj".

If you are using Eclipse with the JavaCC plug-in installed, then follow the steps below:

Create a new project with the package as specified in the grammar file.
For creating the grammar file, right click the package and go to New-->Other-->JavaCC-->JavaCC Template file. Create the grammar file and save it.
Build the project.

After invoking Javacc on the grammar file (demogrammar.jj) you will get following 7 classes in its own files.

Token: Class for representing a token. Each token is associated with a 'token kind' which represents the type of the token. A string 'image' represents the character sequence associated with the token.
TokenMgrError: Sub class of Error. Throws exceptions on errors in lexical analysis.
ParseException: Sub class of Exception. Throws exceptions on errors detected by the parser.
SimpleCharStream: Implementation of the interface character stream for the lexer.
SqlParserConstants: An interface for defining the token classes used in the lexer and parser.
SqlParserTokenManager: The lexical analyzer class.
SqlParser: The parser class.

Compile the generated classes (javac *.java).

Running the Parser

After successful compilation, you are ready to test a sample file. For testing the example, add a class that has a main function to the package.

 /*for testing the parser class*/ 
public class ParseDemoTest { 
  public static void main(String[] args) { 
    try{SqlParserparser = new SqlParser(new FileReader(FilePath)); 
      parser.initParser () ;} 
    catch (Exception ex) 
    {ex.printStackTrace() ;}}

When creating the parser object, you can specify a reader as the constructor argument. Note that the generated parser code contains a constructor that accepts the reader. The InitParser method initializes the parsing. Now you can build and run "ParseDemoTest". An exception is thrown if the file specified in the given file path is not confined to the grammar. As we have discussed an overall idea of JavaCC operation, now we can add some more items to the file to parse. In this example, we will extract the table specifications provided in the SQL file (note that the example is only for illustrative purposes and hence the grammar doesn't comply with all SQL syntax). The new SQL file is in the following format:

##JavaccParserExample##### 
CREATE TABLE STUDENT 
( 
    StudentName varchar (20), 
    Class varchar(10), 
    Rnum integer, 
) 

CREATE TABLE BOOKS
( 
    BookName varchar(10), 
    Edition integer, 
    Stock integer, 
) 
CREATE TABLE CODES
( 
    StudentKey varchar(20), 
    StudentCode varchar(20), 
)

It is clear from the file that it can have multiple Create statements and each Create statement may contain multiple columns. Also, there are comments enclosed in character "#". For getting the table spec, we need a structure which contains a list of tables and their details. Create a class in the package for representing a table:

public class TableStruct { 
  String TableName; 
  HashMap<String,String> Variables = 
              new HashMap<String, String> ();
}

This class represents the table with the table name, along with column names mapped to their data type. Below is the modified grammar file for the new file.

On examining the new grammar file, you can see a SPECIAL_TOKEN. As mentioned earlier, special tokens are those which don't contribute any meaning, but are still informative, such as comments. Here, the comment is defined by the special token. The lexer identifies and passes the special tokens to the parser. There are no BNF notations for special tokens. You can see that all the variable declarations and other Java code associated with a BNF notation are enclosed in '{}'. An expression may contain other expressions such as TType = DType(). It is a good practice to identify the reusable expressions and specify them separately. In the BNF notation for the variables:

( TName = <TNAME> 
   TType = DType() 
   <COMMA> 
   {var.put(TName.image,TType.image);} 
)*

The "*" specifies the meaning that any number of occurrence of the token sequence is possible. For running and testing this grammar file, change your main class as follows:

public class ParseDemoTest { 
public static void main(String[] args) { 
try{ 
SqlParser parser = new SqlParser (new FileReader("D:\\sqltest.txt"));  
        ArrayList<TableStruct> tableList = parser.initParser(); 
for(TableStruct t1 : tableList) 
{  System.out.println("--------------------------"); 
           System.out.println("Table Name :"+t1.TableName); 
           System.out.println("Field names :"+t1.Variables.keySet()); 
           System.out.println("Data Types :"+t1.Variables.values()); 
           System.out.println("--------------------------");
      } 
}catch (Exception ex) 
{ex.printStackTrace() ;} 
} 
}

Compile the grammar file and run the appilication. For the SQL test file, you will get the output as:

--------------------------
Table Name :STUDENT
Field names :[StudentName, Rnum, Clas
--------------------------
--------------------------
Table Name :BOOKS
Field names :[Stock, Edition, BookName]
Data Types :[integer, integer, varchar]
--------------------------
--------------------------
Table Name :CODES
Field names :[StudentCode, StudentKey]
Data Types :[varchar, varchar]
--------------------------

Conclusion

JavaCC is a widely used tool for lexical and parser component generation which follows Regular Expression and BNF notation syntax for lex and parser specifications. Creating a parser needs an iterative step. Never expect to get the desired output in one pass. After creating your first parser in the example, try to modify it and add other possibilities to the input file. A step by step approach is desired while dealing with a complex file. Also, try to make expressions common and reusable as far as possible. Once you are comfortable with .jj files, you can use more advanced tools like JJTree and JTB with JavaCC to automate the augmentation of a grammar.

/* demo grammar.jj*/
options
{
    STATIC = false ;
}
PARSER_BEGIN (SqlParser)
   package sqlParserDemo;
   import java.util.ArrayList;
   import java.util.HashMap;
   class SqlParser {
         ArrayList<TableStruct> initParser()throws ParseException, TokenMgrError
   { return(init()) ; }
   }
PARSER_END (SqlParser)

SKIP: { "\n" | "\r" | "\r\n" |"\\"|"\t"|" "}

TOKEN [IGNORE_CASE]:
{
 <CTCMD :("Create Table")>
|<NUMBER :(["0"-"9"])+ >
|<TNAME:(["a"-"z"])+ >
|<OBRA:("(")+>
|<CBRA:(")")+>
|<COMMA:(",")>
}

SPECIAL_TOKEN : {<COMMENT:("#")+(<TNAME>)+("#")+>}

ArrayList<TableStruct> init():
{
  Token T;
ArrayList<TableStruct> tableList = new ArrayList<TableStruct>();
TableStruct tableStruct;
}
{
  (
       <CTCMD>
       T =<TNAME>
       {    tableStruct = new TableStruct ();
       tableStruct.TableName = T.image ;}
   <OBRA>
  tableStruct.Variables = Variables()
  <CBRA>
     {tableList.add (tableStruct) ;}
  )*
  <EOF>
  {return tableList;}
}

HashMap Variables():
{
   Token TName;
   Token TType;
   HashMap<String,String> var = new HashMap<String, String>();
 }

(
TName = <TNAME>
TType = DType()
<COMMA>
{var.put(TName.image,TType.image);}
)*
{return var;}
}
Token DType():
{
   Token TDType;
}
{
    TDType=<TNAME>
    [<OBRA><NUMBER><CBRA>]
   {return TDType;}
 }Create Table test
        (
        )