Friday, January 15, 2016

Testdriving Earley - Introduction

So, I've been interested in parsers for a while. There are plenty of opportunities to use a parser in our every day jobs, but for some reason, it does not happen very often.  Most of the time, when some task presents us that requires us to recognize some user input, we almost instinctively reach for regular expressions.

Regular expressions are a very powerful tool, to be sure. Maybe even too powerful. With all the fancy additions like non-capturing lookahead and back references — regular expressions have become legendary source of obfuscation and subtle bugs in user input verification. 

Over the years, I've tried my hand with various parser generator and combinator libraries (ANTLR, JavaCC, JParsec to name the few) and hand-rolled my own in couple of instances. Most of the time when using parser generators, I've felt like I am invoking this big machinery just to get something simple up. And then there's specific parsing technology quirks that I just have to know or otherwise my grammar won't work at all. Not to mention adding a new dependency every time I add some new expression syntax (I've seen projects ending up with two or three versions of ANTLR on their classpath).

These are all small issues in the great scheme of things, but they add up, so I too reach for regular expressions and feel a little dirty for it.

I love it that Gavin King is pushing for including proper BNF grammar support in Ceylon language instead of supporting regular expressions. I tend to agree with him — proper context free grammars are more powerful and useful tool for validating and understanding complex input than regexes.

More recently I've come upon several articles describing Earley parser and what I've read sounds very promising. What I find interesting about it, is that it offers to get rid of some of the limitations of  more established parser technologies, allowing to parse grammars that might be considered ambiguous by some of the tools out there.

There is of course a possibility that parsing will take a significant performance hit in the worst case scenario, but it is mostly academical concern at this point, as most use cases I've encountered, have had neither overly complex grammars nor extremely long input strings to parse, so in all practical terms, I see the Earley parser algorithm as a net gain for the developer.

Now there is just this little wrinkle in all of this that there are scant few implementations of Earley parser for Java and none of those are really supported by either a community or their original developers. Of the three parsers two even have project names (PEN and Pep) and all three are terribly out of date in terms of modern coding practices. None of them are available as libraries for Maven, Gradle, Ivy, SBT or any other dependency manager.

So I decided to take up the challenge and write a clean implementation of Earley parser in Java, perhaps even learning something about theory of parsing along the way...

As it is as much a learning experience for me, I will try and share my progress along the way as a series of blog posts.