Monday, February 08, 2016

Testdriving Earley: Part 1 - Basics

So, I decided to write a parser. Silly me, I know. So if I am going to do this, I need to do it properly.

So I am going to try and develop it using test driven approach and documenting the whole process in this blog as I go.

This is the first article in the series of articles where I try to establish basics and bootstrap a project.

First things first. To write a parser, you need to know at least some basics of the theory.  Some of things you need to know at least on the basic levels is the basic theory and terminology around formal grammarsChomsky hierarchy of formal grammars and basics of automata theory. You don't have to become an expert (I am certainly not at this point), but the terms and concepts from these theoretical disciplines will crop up here and there and it

There is a lot of literature out there about Earley parsers. Some of it even readable for mere mortal developers not used to the academic vernacular.

The initial set of source information that helped me to conjure up the courage and bootstrap the project can be summed up by this list:
Armed with this information, I can start my project. As I am just learning this stuff, I will start out by first implementing he most naive version of the Earley parser algorithm in Java following the instructions in the Wikipedia article and the information available from the articles mentioned above.

The initial version will have no particular optimization beyond ones outlined in the algorithm itself and I do not expect it to be particularly fast or have a particularly usable API for interfacing with it.

So, with these goals and non-goals set up, I can set out for the initial leg of this journey.

All of the work is public and can be checked out on my GitHub repository. Feel free to browse around and give me feedback.

So after all of the boring administrative work (creating a project, setting up automated builds, etc) has been completed, we can now actually get on to the meat of this whole thing and start implementing a parser (see this commit on Github).

Grammar, Rules and Symbols

So, in order to parse some input we first need to be able to know the grammar of the input. Grammar is a set of rules that tell you what is the valid input.

Formal grammars are usually defined in terms on BNF (Bachus-Naur Form) or more often in EBNF (Extended Backus–Naur Form). Grammars are usually defined as sets of rules, mapping nonterminal symbols to one or more productions (a list of terminal and nonterminal symbols).

An example of a grammar for simple arithmetic expression might look something like this:

    Sum     := Sum     [+-] Product
             | Product
    Product := Product [*/] Factor
             | Factor
    Factor  := '(' Sum ')'
             | Number
    Number  := [0-9]+

(I've borrowed the example from the excellent tutorial by Loup Vaillant, I hope he does not mind)

So according to this definition, the central object of my parser implementation is a Grammar. So this is my first assertion:
  • Grammar is a set of rules (link)
This of course means that I have to define a Rule:
  • Rule maps single (nonterminal) Symbol to a Production of symbols (link)
Going down that whole rabbit hole, I also need to define Production and Symbol:
  • Production is a list of Symbols (link)
  • Symbol has a name*
* Symbol is an atomic unit of the grammar data model and (at least for now) it has no properties to speak of. I've arbitrarily chosen "name" of the symbol to uniquely identify it and differentiate it from other symbols. Symbol equality, hashCode and soString are based on it's name.

In the end, this is the basic class diagram of my Grammar model:

This basic structure will allow me to express any context free grammar as a very simple model. For example the above arithmetic expression grammar can be rewritten like this

    Sum     -> Sum [+-] Product
    Sum     -> Product
    Product -> Product [*/] Factor
    Product -> Factor
    Factor  -> '(' Sum ')'
    Factor  -> Number
    Number  -> [0-9] Number
    Number  -> [0-9]

And the equivalent Java code would look something like this:

Grammar g = new Grammar(
    new Rule(new Symbol("Sum"), 
             new Production(new Symbol("Sum"), new Symbol("[+-]"), 
                            new Symbol("Product"))),
    new Rule(new Symbol("Sum"), 
             new Production(new Symbol("Product"))),
    new Rule(new Symbol("Product"), 
             new Production(new Symbol("Product"), new Symbol("[*/]"),
                            new Symbol("Factor"))),
    new Rule(new Symbol("Product"), 
             new Production(new Symbol("Factor"))),
    new Rule(new Symbol("Factor"),
             new Production(new Symbol("'('"), new Symbol("Sum"),
                            new Symbol("')'"))),
    new Rule(new Symbol("Factor"), 
             new Production(new Symbol("Number"))),
    new Rule(new Symbol("Number"), 
             new Production(new Symbol("[0-9]"), new Symbol("Number"))),
    new Rule(new Symbol("Number"), 
             new Production(new Symbol("[0-9]")))

(A bit awkward if compared to EBNF or even the rewritten textual notation above, but this will have to do for now)

For now, I have completely ignored any semantics of Symbol elements – There is no difference between terminal and nonterminal symbols. Symbols are only defined in terms of their name and even that is only done because it makes debugging and test failures somewhat easier to comprehend.

As an implementation note, I've chosen to use immutable data classes for declaring a grammar. In my experience, this makes the code much easier to reason about and optimize in some scenarios. Also, there is no use case for mutating any of the data structures created so far after they've been created, so YAGNI.

Feel free to browse around and see the code in Github.

No comments: