Convert EBNF to BNF Using parsertl & lexertl

Ben Hanson, 1 Apr 2016, CPOL

Introduction

Converting EBNF to BNF by hand is not difficult, but it becomes tedious after a few iterations. Surely there must be a better way? As it turns out, there is!

Background

Having whipped parsertl into shape this year, I wanted to try out some real-world grammars. As we use IronPython as part of our product at work, its grammar seemed like an ideal candidate. The grammar is freely available online, but it is described using EBNF.

Using the code

In order to use the following code you will need both parsertl and lexertl:

http://www.benhanson.net/parsertl.html

http://www.benhanson.net/lexertl.html

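#include "../parsertl/parsertl/generator.hpp"
#include "../lexertl/lexertl/memory_file.hpp"
#include "../parsertl/parsertl/parser.hpp"

#include <cassert>
#include <cstddef>
#include <iostream>
#include <map>
#include <stack>
#include <stdexcept>
#include <string>
#include <utility>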

void read_ebnf(const char *start_, const char *end_str_)
{
    parsertl::rules grules_;
    parsertl::parser<lexertl::citerator> parser_;
    lexertl::rules lrules_;
    lexertl::state_machine lsm_;

    grules_.token("IDENTIFIER TERMINAL");
    grules_.push("start", "grammar");
    grules_.push("grammar", "rule "
        "| grammar rule");

    const std::size_t rule_idx_ = grules_.push("rule", "lhs rhs_or opt_semi");

    grules_.push("opt_semi", "%empty | ';'");

    const std::size_t lhs_idx_ = grules_.push("lhs", "IDENTIFIER rule_delim");

    grules_.push("rule_delim", "':' | '='");
    grules_.push("rhs_or", "opt_list");

    const std::size_t or_idx_ = grules_.push("rhs_or", "rhs_or '|' opt_list");

    grules_.push("opt_list", "%empty | rhs_list");
    grules_.push("rhs_list", "rhs");

    const std::size_t list_idx_ = grules_.push("rhs_list", "rhs_list opt_comma rhs");

    grules_.push("opt_comma", "%empty | ','");

    const std::size_t id_idx_ = grules_.push("rhs", "IDENTIFIER");
    const std::size_t terminal_idx_ = grules_.push("rhs", "TERMINAL");
    const std::size_t opt1_idx_ = grules_.push("rhs", "'[' rhs_or ']'");
    const std::size_t opt2_idx_ = grules_.push("rhs", "rhs '?'");
    const std::size_t zom1_idx_ = grules_.push("rhs", "'{' rhs_or '}'");
    const std::size_t zom2_idx_ = grules_.push("rhs", "rhs '*'");
    const std::size_t oom1_idx_ = grules_.push("rhs", "'{' rhs_or '}' '-'"); 
    const std::size_t oom2_idx_ = grules_.push("rhs", "rhs '+'");
    const std::size_t brack_idx_ = grules_.push("rhs", "'(' rhs_or ')'");

    parsertl::generator::build(grules_, parser_.sm);

    lrules_.insert_macro("NAME", "[A-Za-z][_0-9A-Za-z]*");
    lrules_.push("{NAME}", grules_.token_id("IDENTIFIER"));
    lrules_.push(":", grules_.token_id("':'"));
    lrules_.push("=", grules_.token_id("'='"));
    lrules_.push(",", grules_.token_id("','"));
    lrules_.push(";", grules_.token_id("';'"));
    lrules_.push("\\[", grules_.token_id("'['"));
    lrules_.push("\\]", grules_.token_id("']'"));
    lrules_.push("[?]", grules_.token_id("'?'"));
    lrules_.push("[{]", grules_.token_id("'{'"));
    lrules_.push("[}]", grules_.token_id("'}'"));
    lrules_.push("[*]", grules_.token_id("'*'"));
    lrules_.push("[(]", grules_.token_id("'('"));
    lrules_.push("[)]", grules_.token_id("')'"));
    lrules_.push("[|]", grules_.token_id("'|'"));
    lrules_.push("[+]", grules_.token_id("'+'"));
    lrules_.push("-", grules_.token_id("'-'"));
    lrules_.push("'(\\\\([^0-9cx]|[0-9]{1,3}|c[@a-zA-Z]|x\\d+)|[^'])+'|"
        "[\"](\\\\([^0-9cx]|[0-9]{1,3}|c[@a-zA-Z]|x\\d+)|[^\"])+[\"]",
        grules_.token_id("TERMINAL"));
    lrules_.push("#[^\r\n]*|\\s+|[(][*](.|\n)*?[*][)]", lrules_.skip());
    lexertl::generator::build(lrules_, lsm_);

    lexertl::citerator iter_(start_, end_str_, lsm_);
    lexertl::citerator end_;
    parsertl::parser<lexertl::citerator>::token_vector productions_;
    std::string lhs_;
    std::stack<std::string> rhs_;
    std::map<std::string, std::size_t> new_rule_ids_;
    std::stack<std::pair<std::string, std::string>> new_rules_;

    parser_.init(iter_);

    while (parser_.entry._action != parsertl::error &&
        parser_.entry._action != parsertl::accept)
    {
        if (parser_.entry._action == parsertl::reduce)
        {
            if (parser_.entry._param == lhs_idx_)
            {
                const parsertl::parser<lexertl::citerator>::parser::token &token_ =
                    parser_.dollar(0, productions_);

                lhs_ = token_.str();
            }
            else if (parser_.entry._param == id_idx_ ||
                parser_.entry._param == terminal_idx_)
            {
                const parsertl::parser<lexertl::citerator>::parser::token &token_ =
                    parser_.dollar(0, productions_);

                rhs_.push(token_.str());
            }
            else if (parser_.entry._param == opt1_idx_ ||
                parser_.entry._param == opt2_idx_)
            {
                // [x] or x?: a new rule "lhs_N: %empty | x" replaces x.
                std::size_t &counter_ = new_rule_ids_[lhs_];
                std::pair<std::string, std::string> pair_;

                ++counter_;
                pair_.first = lhs_ + '_' + std::to_string(counter_);
                pair_.second = "%empty | " + rhs_.top();
                rhs_.top() = pair_.first;
                new_rules_.push(pair_);
            }
            else if (parser_.entry._param == zom1_idx_ ||
                parser_.entry._param == zom2_idx_)
            {
                // {x} or x*: a new left-recursive rule
                // "lhs_N: %empty | lhs_N x" replaces x.
                std::size_t &counter_ = new_rule_ids_[lhs_];
                std::pair<std::string, std::string> pair_;

                ++counter_;
                pair_.first = lhs_ + '_' + std::to_string(counter_);
                pair_.second = "%empty | " + pair_.first + ' ' + rhs_.top();
                rhs_.top() = pair_.first;
                new_rules_.push(pair_);
            }
            else if (parser_.entry._param == oom1_idx_ ||
                parser_.entry._param == oom2_idx_)
            {
                // {x}- or x+: a new rule "lhs_N: x | lhs_N x" replaces x.
                std::size_t &counter_ = new_rule_ids_[lhs_];
                std::pair<std::string, std::string> pair_;

                ++counter_;
                pair_.first = lhs_ + '_' + std::to_string(counter_);
                pair_.second = rhs_.top() + " | " +
                    pair_.first + ' ' + rhs_.top();
                rhs_.top() = pair_.first;
                new_rules_.push(pair_);
            }
            else if (parser_.entry._param == brack_idx_)
            {
                // (x): the group becomes a rule of its own, "lhs_N: x".
                std::size_t &counter_ = new_rule_ids_[lhs_];
                std::pair<std::string, std::string> pair_;

                ++counter_;
                pair_.first = lhs_ + '_' + std::to_string(counter_);
                pair_.second = rhs_.top();
                rhs_.top() = pair_.first;
                new_rules_.push(pair_);
            }
            else if (parser_.entry._param == list_idx_)
            {
                std::string r_ = rhs_.top();

                rhs_.pop();
                rhs_.top() += ' ' + r_;
            }
            else if (parser_.entry._param == or_idx_)
            {
                const parsertl::parser<lexertl::citerator>::parser::token &token_ =
                    parser_.dollar(1, productions_);
                std::string r_ = token_.str() + ' ' + rhs_.top();

                rhs_.pop();

                if (rhs_.empty())
                {
                    rhs_.push(r_);
                }
                else
                {
                    rhs_.top() += ' ' + r_;
                }

            }
            else if (parser_.entry._param == rule_idx_)
            {
                assert(rhs_.empty() || rhs_.size() == 1);
                std::cout << lhs_ << ": ";
                
                if (!rhs_.empty())
                {
                    std::cout << rhs_.top();
                    rhs_.pop();
                }

                std::cout << ";\n";

                while (!new_rules_.empty())
                {
                    std::cout << new_rules_.top().first;
                    std::cout << ": ";
                    std::cout << new_rules_.top().second;
                    std::cout << ";\n";
                    new_rules_.pop();
                }
            }
        }

        parser_.next(iter_, productions_);
    }

    if (parser_.entry._action == parsertl::error)
        throw std::runtime_error("Syntax error");
}

int main()
{
    try
    {
        lexertl::memory_file mf_("grammars/python/python.ebnf");

        read_ebnf(mf_.data(), mf_.data() + mf_.size());
    }
    catch (const std::exception &e)
    {
        std::cout << e.what() << '\n';
    }

    return 0;
}
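
The rewriting done in the reduce handlers of read_ebnf() can be sketched in isolation, without parsertl or lexertl. This is only an illustration of the string transformations: the names new_rule, ebnf_op and make_rule are hypothetical and do not appear in either library, and counters_ stands in for new_rule_ids_ above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper type: a generated rule as (name, right-hand side).
using new_rule = std::pair<std::string, std::string>;

// The EBNF constructs handled by read_ebnf().
enum class ebnf_op { optional, zero_or_more, one_or_more, group };

// Builds the helper rule for one EBNF construct applied to body_.
// Rule names are numbered per left-hand side, as in read_ebnf().
new_rule make_rule(const std::string &lhs_, const std::string &body_,
    ebnf_op op_, std::map<std::string, std::size_t> &counters_)
{
    const std::string name_ = lhs_ + '_' + std::to_string(++counters_[lhs_]);

    switch (op_)
    {
    case ebnf_op::optional:
        // [x] becomes: name: %empty | x
        return new_rule(name_, "%empty | " + body_);
    case ebnf_op::zero_or_more:
        // {x} becomes: name: %empty | name x
        return new_rule(name_, "%empty | " + name_ + ' ' + body_);
    case ebnf_op::one_or_more:
        // {x}- becomes: name: x | name x
        return new_rule(name_, body_ + " | " + name_ + ' ' + body_);
    case ebnf_op::group:
        break;
    }

    // (x) becomes: name: x
    return new_rule(name_, body_);
}
```

With an empty counter map, make_rule("expr", "term", ebnf_op::optional, counters_) yields the pair ("expr_1", "%empty | term"). The repetition rules are left-recursive, which suits an LR-style parser generator.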

Points of Interest

Note that grammars using EBNF often use multi-character literals. parsertl accepts these because it defines ids for tokens automatically, but if you want to use the resultant BNF grammar with yacc or bison, you will have to convert them to normal tokens by hand.
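
As a rough idea of what that hand conversion involves, the sketch below picks out the multi-character literals from a list of grammar symbols and gives each one a placeholder name. The functions is_multi_char_literal and name_literals and the TOKEN_N naming scheme are assumptions made for illustration only; in practice you would choose meaningful names and add matching %token declarations and lexer rules yourself.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Returns true for a quoted literal with more than one character between
// the quotes, e.g. '==' or "**". Escape sequences are not treated
// specially, so an escaped single character such as '\n' also counts.
bool is_multi_char_literal(const std::string &sym_)
{
    return sym_.size() > 3 &&
        (sym_.front() == '\'' || sym_.front() == '"') &&
        sym_.back() == sym_.front();
}

// Assigns a hypothetical name (TOKEN_1, TOKEN_2, ...) to each distinct
// multi-character literal, numbered in order of first appearance.
std::map<std::string, std::string> name_literals(
    const std::vector<std::string> &symbols_)
{
    std::map<std::string, std::string> names_;

    for (const std::string &sym_ : symbols_)
    {
        if (is_multi_char_literal(sym_) &&
            names_.find(sym_) == names_.end())
        {
            // Build the name first so the numbering does not depend on
            // the evaluation order of the assignment below.
            const std::string name_ =
                "TOKEN_" + std::to_string(names_.size() + 1);

            names_[sym_] = name_;
        }
    }

    return names_;
}
```

Each entry in the returned map would then be substituted for its literal throughout the converted grammar.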

Note that most EBNF grammars actually describe LL grammars, and whilst every LL(1) grammar is also LR(1), not every LL(1) grammar is LALR(1). This is true of the Python grammar I mentioned at the beginning. Fortunately, some manual intervention can resolve the warnings given when running the converted grammar through parsertl or bison:

dot_seq: dot0m dotted_name
       | dot1m;

becomes

dot_seq: dotted_name
       | dot1m dotted_name
       | dot1m;

negating the need for the dot0m rule entirely.

varargslist: fpdef_list star_name
           | fpdef opt_equal_test comma_fpdef_opt_et opt_comma;

becomes

varargslist: star_name
           | fpdef_list star_name
           | fpdef opt_equal_test comma_fpdef_opt_et opt_comma;

and

fpdef_list: %empty
          | fpdef_list fpdef opt_equal_test ',';

becomes

fpdef_list: fpdef opt_equal_test ','
          | fpdef_list fpdef opt_equal_test ',';
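
The dot_seq and varargslist rewrites above follow the same pattern: an alternative containing a nullable symbol is split into one alternative without the symbol and one with a non-empty version of it. A minimal sketch of that single step follows. The names alternative and expand_nullable are hypothetical, the caller must already know which symbol is nullable and what replaces it, and only the first occurrence of the symbol is handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One alternative of a rule, as a sequence of grammar symbols.
using alternative = std::vector<std::string>;

// Splits alt_ on the first occurrence of nullable_: one copy with the
// symbol dropped and one with it replaced by non_empty_. If nullable_
// does not occur, alt_ is returned unchanged as the only alternative.
// If alt_ consists solely of nullable_, the first result is empty,
// i.e. it is still an %empty alternative.
std::vector<alternative> expand_nullable(const alternative &alt_,
    const std::string &nullable_, const std::string &non_empty_)
{
    std::vector<alternative> out_;
    const alternative::const_iterator iter_ =
        std::find(alt_.begin(), alt_.end(), nullable_);

    if (iter_ == alt_.end())
    {
        out_.push_back(alt_);
        return out_;
    }

    const std::size_t idx_ =
        static_cast<std::size_t>(iter_ - alt_.begin());
    alternative without_ = alt_;
    alternative with_ = alt_;

    without_.erase(without_.begin() + static_cast<std::ptrdiff_t>(idx_));
    with_[idx_] = non_empty_;
    out_.push_back(without_);
    out_.push_back(with_);
    return out_;
}
```

Applied to the alternative dot0m dotted_name with dot0m as the nullable symbol and dot1m as its replacement, this gives dotted_name and dot1m dotted_name, matching the first two alternatives of the rewritten dot_seq. It does not decide which symbols to expand or detect conflicts, so it is a long way from the automation discussed below.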

Ideally I would like such conversions to be automated as well. This will take more research, and if it is not reasonably achievable, it may well be worth supporting LL(1) in addition to LALR(1) in parsertl, given its popularity with modern real-world grammars such as Python.

History

01/04/16 First Release.

04/04/16 Now copes with empty rules.

05/04/16 Now showing manual conversion of BNF for Python to be LALR(1) compatible (without warnings).

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


About the Author

Ben Hanson
Software Developer (Senior)
United Kingdom
I started programming in 1983 using Sinclair BASIC, then moving on to Z80 machine code and assembler. In 1988 I programmed 68000 assembler on the ATARI ST and it was 1990 when I started my degree in Computing Systems where I learnt Pascal, C and C++ as well as various academic programming languages (ML, LISP etc.)

I have been developing commercial software for Windows using C++ for 22 years.

Last Updated 1 Apr 2016
Article Copyright 2016 by Ben Hanson
Everything else Copyright © CodeProject, 1999-2016