Monday, February 08, 2016

Testdriving Earley: Part 1 - Basics

So, I decided to write a parser. Silly me, I know. So if I am going to do this, I need to do it properly.

So I am going to try and develop it using test driven approach and documenting the whole process in this blog as I go.

This is the first article in the series of articles where I try to establish basics and bootstrap a project.

First things first. To write a parser, you need to know at least some basics of the theory.  Some of things you need to know at least on the basic levels is the basic theory and terminology around formal grammarsChomsky hierarchy of formal grammars and basics of automata theory. You don't have to become an expert (I am certainly not at this point), but the terms and concepts from these theoretical disciplines will crop up here and there and it

There is a lot of literature out there about Earley parsers. Some of it even readable for mere mortal developers not used to the academic vernacular.

The initial set of source information that helped me to conjure up the courage and bootstrap the project can be summed up by this list:
Armed with this information, I can start my project. As I am just learning this stuff, I will start out by first implementing he most naive version of the Earley parser algorithm in Java following the instructions in the Wikipedia article and the information available from the articles mentioned above.

The initial version will have no particular optimization beyond ones outlined in the algorithm itself and I do not expect it to be particularly fast or have a particularly usable API for interfacing with it.

So, with these goals and non-goals set up, I can set out for the initial leg of this journey.

All of the work is public and can be checked out on my GitHub repository. Feel free to browse around and give me feedback.

So after all of the boring administrative work (creating a project, setting up automated builds, etc) has been completed, we can now actually get on to the meat of this whole thing and start implementing a parser (see this commit on Github).

Grammar, Rules and Symbols

So, in order to parse some input we first need to be able to know the grammar of the input. Grammar is a set of rules that tell you what is the valid input.

Formal grammars are usually defined in terms on BNF (Bachus-Naur Form) or more often in EBNF (Extended Backus–Naur Form). Grammars are usually defined as sets of rules, mapping nonterminal symbols to one or more productions (a list of terminal and nonterminal symbols).

An example of a grammar for simple arithmetic expression might look something like this:

    Sum     := Sum     [+-] Product
             | Product
    Product := Product [*/] Factor
             | Factor
    Factor  := '(' Sum ')'
             | Number
    Number  := [0-9]+

(I've borrowed the example from the excellent tutorial by Loup Vaillant, I hope he does not mind)

So according to this definition, the central object of my parser implementation is a Grammar. So this is my first assertion:
  • Grammar is a set of rules (link)
This of course means that I have to define a Rule:
  • Rule maps single (nonterminal) Symbol to a Production of symbols (link)
Going down that whole rabbit hole, I also need to define Production and Symbol:
  • Production is a list of Symbols (link)
  • Symbol has a name*
* Symbol is an atomic unit of the grammar data model and (at least for now) it has no properties to speak of. I've arbitrarily chosen "name" of the symbol to uniquely identify it and differentiate it from other symbols. Symbol equality, hashCode and soString are based on it's name.

In the end, this is the basic class diagram of my Grammar model:

This basic structure will allow me to express any context free grammar as a very simple model. For example the above arithmetic expression grammar can be rewritten like this

    Sum     -> Sum [+-] Product
    Sum     -> Product
    Product -> Product [*/] Factor
    Product -> Factor
    Factor  -> '(' Sum ')'
    Factor  -> Number
    Number  -> [0-9] Number
    Number  -> [0-9]

And the equivalent Java code would look something like this:

Grammar g = new Grammar(
    new Rule(new Symbol("Sum"), 
             new Production(new Symbol("Sum"), new Symbol("[+-]"), 
                            new Symbol("Product"))),
    new Rule(new Symbol("Sum"), 
             new Production(new Symbol("Product"))),
    new Rule(new Symbol("Product"), 
             new Production(new Symbol("Product"), new Symbol("[*/]"),
                            new Symbol("Factor"))),
    new Rule(new Symbol("Product"), 
             new Production(new Symbol("Factor"))),
    new Rule(new Symbol("Factor"),
             new Production(new Symbol("'('"), new Symbol("Sum"),
                            new Symbol("')'"))),
    new Rule(new Symbol("Factor"), 
             new Production(new Symbol("Number"))),
    new Rule(new Symbol("Number"), 
             new Production(new Symbol("[0-9]"), new Symbol("Number"))),
    new Rule(new Symbol("Number"), 
             new Production(new Symbol("[0-9]")))
);

(A bit awkward if compared to EBNF or even the rewritten textual notation above, but this will have to do for now)

For now, I have completely ignored any semantics of Symbol elements – There is no difference between terminal and nonterminal symbols. Symbols are only defined in terms of their name and even that is only done because it makes debugging and test failures somewhat easier to comprehend.

As an implementation note, I've chosen to use immutable data classes for declaring a grammar. In my experience, this makes the code much easier to reason about and optimize in some scenarios. Also, there is no use case for mutating any of the data structures created so far after they've been created, so YAGNI.

Feel free to browse around and see the code in Github.

Friday, January 15, 2016

Testdriving Earley - Introduction

So, I've been interested in parsers for a while. There are plenty of opportunities to use a parser in our every day jobs, but for some reason, it does not happen very often.  Most of the time, when some task presents us that requires us to recognize some user input, we almost instinctively reach for regular expressions.

Regular expressions are a very powerful tool, to be sure. Maybe even too powerful. With all the fancy additions like non-capturing lookahead and back references — regular expressions have become legendary source of obfuscation and subtle bugs in user input verification. 

Over the years, I've tried my hand with various parser generator and combinator libraries (ANTLR, JavaCC, JParsec to name the few) and hand-rolled my own in couple of instances. Most of the time when using parser generators, I've felt like I am invoking this big machinery just to get something simple up. And then there's specific parsing technology quirks that I just have to know or otherwise my grammar won't work at all. Not to mention adding a new dependency every time I add some new expression syntax (I've seen projects ending up with two or three versions of ANTLR on their classpath).

These are all small issues in the great scheme of things, but they add up, so I too reach for regular expressions and feel a little dirty for it.

I love it that Gavin King is pushing for including proper BNF grammar support in Ceylon language instead of supporting regular expressions. I tend to agree with him — proper context free grammars are more powerful and useful tool for validating and understanding complex input than regexes.

More recently I've come upon several articles describing Earley parser and what I've read sounds very promising. What I find interesting about it, is that it offers to get rid of some of the limitations of  more established parser technologies, allowing to parse grammars that might be considered ambiguous by some of the tools out there.

There is of course a possibility that parsing will take a significant performance hit in the worst case scenario, but it is mostly academical concern at this point, as most use cases I've encountered, have had neither overly complex grammars nor extremely long input strings to parse, so in all practical terms, I see the Earley parser algorithm as a net gain for the developer.

Now there is just this little wrinkle in all of this that there are scant few implementations of Earley parser for Java and none of those are really supported by either a community or their original developers. Of the three parsers two even have project names (PEN and Pep) and all three are terribly out of date in terms of modern coding practices. None of them are available as libraries for Maven, Gradle, Ivy, SBT or any other dependency manager.

So I decided to take up the challenge and write a clean implementation of Earley parser in Java, perhaps even learning something about theory of parsing along the way...

As it is as much a learning experience for me, I will try and share my progress along the way as a series of blog posts.