Monday, February 08, 2016

Testdriving Earley: Part 1 - Basics

So, I decided to write a parser. Silly me, I know. So if I am going to do this, I need to do it properly.

So I am going to try and develop it using test driven approach and documenting the whole process in this blog as I go.

This is the first article in the series of articles where I try to establish basics and bootstrap a project.

First things first. To write a parser, you need to know at least some basics of the theory.  Some of things you need to know at least on the basic levels is the basic theory and terminology around formal grammarsChomsky hierarchy of formal grammars and basics of automata theory. You don't have to become an expert (I am certainly not at this point), but the terms and concepts from these theoretical disciplines will crop up here and there and it

There is a lot of literature out there about Earley parsers. Some of it even readable for mere mortal developers not used to the academic vernacular.

The initial set of source information that helped me to conjure up the courage and bootstrap the project can be summed up by this list:
Armed with this information, I can start my project. As I am just learning this stuff, I will start out by first implementing he most naive version of the Earley parser algorithm in Java following the instructions in the Wikipedia article and the information available from the articles mentioned above.

The initial version will have no particular optimization beyond ones outlined in the algorithm itself and I do not expect it to be particularly fast or have a particularly usable API for interfacing with it.

So, with these goals and non-goals set up, I can set out for the initial leg of this journey.

All of the work is public and can be checked out on my GitHub repository. Feel free to browse around and give me feedback.

So after all of the boring administrative work (creating a project, setting up automated builds, etc) has been completed, we can now actually get on to the meat of this whole thing and start implementing a parser (see this commit on Github).

Grammar, Rules and Symbols

So, in order to parse some input we first need to be able to know the grammar of the input. Grammar is a set of rules that tell you what is the valid input.

Formal grammars are usually defined in terms on BNF (Bachus-Naur Form) or more often in EBNF (Extended Backus–Naur Form). Grammars are usually defined as sets of rules, mapping nonterminal symbols to one or more productions (a list of terminal and nonterminal symbols).

An example of a grammar for simple arithmetic expression might look something like this:

    Sum     := Sum     [+-] Product
             | Product
    Product := Product [*/] Factor
             | Factor
    Factor  := '(' Sum ')'
             | Number
    Number  := [0-9]+

(I've borrowed the example from the excellent tutorial by Loup Vaillant, I hope he does not mind)

So according to this definition, the central object of my parser implementation is a Grammar. So this is my first assertion:
  • Grammar is a set of rules (link)
This of course means that I have to define a Rule:
  • Rule maps single (nonterminal) Symbol to a Production of symbols (link)
Going down that whole rabbit hole, I also need to define Production and Symbol:
  • Production is a list of Symbols (link)
  • Symbol has a name*
* Symbol is an atomic unit of the grammar data model and (at least for now) it has no properties to speak of. I've arbitrarily chosen "name" of the symbol to uniquely identify it and differentiate it from other symbols. Symbol equality, hashCode and soString are based on it's name.

In the end, this is the basic class diagram of my Grammar model:

This basic structure will allow me to express any context free grammar as a very simple model. For example the above arithmetic expression grammar can be rewritten like this

    Sum     -> Sum [+-] Product
    Sum     -> Product
    Product -> Product [*/] Factor
    Product -> Factor
    Factor  -> '(' Sum ')'
    Factor  -> Number
    Number  -> [0-9] Number
    Number  -> [0-9]

And the equivalent Java code would look something like this:

Grammar g = new Grammar(
    new Rule(new Symbol("Sum"), 
             new Production(new Symbol("Sum"), new Symbol("[+-]"), 
                            new Symbol("Product"))),
    new Rule(new Symbol("Sum"), 
             new Production(new Symbol("Product"))),
    new Rule(new Symbol("Product"), 
             new Production(new Symbol("Product"), new Symbol("[*/]"),
                            new Symbol("Factor"))),
    new Rule(new Symbol("Product"), 
             new Production(new Symbol("Factor"))),
    new Rule(new Symbol("Factor"),
             new Production(new Symbol("'('"), new Symbol("Sum"),
                            new Symbol("')'"))),
    new Rule(new Symbol("Factor"), 
             new Production(new Symbol("Number"))),
    new Rule(new Symbol("Number"), 
             new Production(new Symbol("[0-9]"), new Symbol("Number"))),
    new Rule(new Symbol("Number"), 
             new Production(new Symbol("[0-9]")))

(A bit awkward if compared to EBNF or even the rewritten textual notation above, but this will have to do for now)

For now, I have completely ignored any semantics of Symbol elements – There is no difference between terminal and nonterminal symbols. Symbols are only defined in terms of their name and even that is only done because it makes debugging and test failures somewhat easier to comprehend.

As an implementation note, I've chosen to use immutable data classes for declaring a grammar. In my experience, this makes the code much easier to reason about and optimize in some scenarios. Also, there is no use case for mutating any of the data structures created so far after they've been created, so YAGNI.

Feel free to browse around and see the code in Github.

Friday, January 15, 2016

Testdriving Earley - Introduction

So, I've been interested in parsers for a while. There are plenty of opportunities to use a parser in our every day jobs, but for some reason, it does not happen very often.  Most of the time, when some task presents us that requires us to recognize some user input, we almost instinctively reach for regular expressions.

Regular expressions are a very powerful tool, to be sure. Maybe even too powerful. With all the fancy additions like non-capturing lookahead and back references — regular expressions have become legendary source of obfuscation and subtle bugs in user input verification. 

Over the years, I've tried my hand with various parser generator and combinator libraries (ANTLR, JavaCC, JParsec to name the few) and hand-rolled my own in couple of instances. Most of the time when using parser generators, I've felt like I am invoking this big machinery just to get something simple up. And then there's specific parsing technology quirks that I just have to know or otherwise my grammar won't work at all. Not to mention adding a new dependency every time I add some new expression syntax (I've seen projects ending up with two or three versions of ANTLR on their classpath).

These are all small issues in the great scheme of things, but they add up, so I too reach for regular expressions and feel a little dirty for it.

I love it that Gavin King is pushing for including proper BNF grammar support in Ceylon language instead of supporting regular expressions. I tend to agree with him — proper context free grammars are more powerful and useful tool for validating and understanding complex input than regexes.

More recently I've come upon several articles describing Earley parser and what I've read sounds very promising. What I find interesting about it, is that it offers to get rid of some of the limitations of  more established parser technologies, allowing to parse grammars that might be considered ambiguous by some of the tools out there.

There is of course a possibility that parsing will take a significant performance hit in the worst case scenario, but it is mostly academical concern at this point, as most use cases I've encountered, have had neither overly complex grammars nor extremely long input strings to parse, so in all practical terms, I see the Earley parser algorithm as a net gain for the developer.

Now there is just this little wrinkle in all of this that there are scant few implementations of Earley parser for Java and none of those are really supported by either a community or their original developers. Of the three parsers two even have project names (PEN and Pep) and all three are terribly out of date in terms of modern coding practices. None of them are available as libraries for Maven, Gradle, Ivy, SBT or any other dependency manager.

So I decided to take up the challenge and write a clean implementation of Earley parser in Java, perhaps even learning something about theory of parsing along the way...

As it is as much a learning experience for me, I will try and share my progress along the way as a series of blog posts.

Thursday, June 20, 2013

Typesafe registry of key-value pairs

Have you ever seen or used this pattern of code:

  Map attributes = getAttributes();
  Object value = attributes.get("attr.key")
  if (object instanceof SomeType) {
    SomeType attrValue = (SomeType) value;
    // do something with the value

I've seen the above code and many variations of this pattern scattered all over the various java code bases I've read and it has always nagged me that although we have had the power of generics and the associated type-safety guarantees available at our fingertips for a long time now, this type of general registries of key-value pairs still involve this type of cast-heavy boilerplate.

It stroke me then how often I have I seen just this pattern appear and re-appear so many times that I have all but stopped thinking about it.- "Of course, if you need an extensible registry of values, you use a Map of key-value pairs (or a Dictionary)".

And usually you either engineer your registry around same type of objects or you throw the generic typesafety out with a proverbial bathwater and map your keys (usually of the type String) to values of type, well .. Object.

Would'nt it be nice if we could actually make key-value pairs themselves be type-safe?
One day I got thinking, what would be needed to make this possible.

Given the following general interface:

public interface Registry {
    public interface Key {
        T getValueType();

     V put(Key key, V value);
     V get(Key key)

We could re-write the snippet in the beginning as follows:

    Registry registry = getAttributes();
    SomeType value = registry.get(Keys.ATTR_KEY);
    if (value != null) {
      // do something with the value

Although the difference is not all that huge, the last snippet looks much cleaner and the meaning is much easier to read and understand.

Also, as I've learned to associate explicit type casts with a potential code smell, the latter snippet has a much cleaner feel to it. I know exactly what type of the value I am getting from the registry and the API contract ensures that I will always get the type of the value I am asking for.

To test out this idea, I've started a little project on GitHub. Check it out and let me know what you think...

Wednesday, June 13, 2012

When in doubt, Maybe can help you

I am slowly but firmly falling in love with the functional approach to programming.

I know, it is not a silver bullet and it cant solve all of your problems, but there are aspects of functional style programming that I am beginning to appreciate more and more.

Take for example Monads. There is a whole lot of talk about monads, monoids and whatnot floating around the intertubes, and through all that almost hard core mathematical chatter that is sometimes quite hard to see what one can benefit from using such a convoluted concept.

Specially you get this question a lot when talking to Java programmers. Functional style programming can feel quite awkward in Java, with all the character noise inner classes bring with them, but sometimes it is not all that bad.

Some of the functional style concept I've been exploring lately is what Haskell calls the Maybe monad. (Scala has it in it's Option class)

Essentially, the Maybe class is all about entirely elliminating null references out of existence and handling the optionality of values in an arguably much nicer way, making things that may or may not have a value much more explicit in code by explicitly stating this using a type system.

When in Java you have an object type (which is nearly everything floating around in an average program these days), anything that gets passed to you from an "untrusted source" may or may not reference an actual object.

The trouble is that the amount of "untrusted sources" is sometimes quite difficult to track. Apart from the explicit user input that you should almost never trust with your life to be of any useful format, there are much more subtle sources that mostly do what you expect but every now and then throw you a curveball, by throwing an occasional null where you the least expect it from.

So basically, when programming in Java you are constantly walking a minefield of possible null references. If you know what NPE stands for, you probably know what am I talking about.

One way to counter it, would be to use a Maybe monad. I know there are several variations of Maybe/Option implementations floating around, but I will silently ignore all of them and go with my own concoction.

First a little appetizer, that would hopefully also serve to explain some of the reasons I find functional style programming so nice:

An awful lot of code? Lets just strip away all the inner class creation boilerplate and see how would it look like if java had closures (as it probably will in version 8):

The boilerplate of inner class creation apart, the code is actually quite clean and easy to reason about. At least for me. Lets see what is going on here:

First I get an instance of a resource from some cache. Since I don't really know if the resource is in the cache, I wrap it in a Maybe instance.

Then I check for some conditions on a returned resource. See how I do not have to check if the resource is null? This is because the predicate function (I am using Google Guava api here) is only called if the resource instance was returned from cache, so by the contract, the resource passed to the when method can never be null.

The or method will return the value contained in the Maybe instance or evaluate the provided function in case the value is missing either by virtue of not being present in the presumed cache or not passing the predicate above.

As a last step, I use a transformation function that uses the Resource and turns it into an Entity, returning an instance of Maybe<Entity> to the caller.

What this all boils down to is a set of clearly defined instructions to retrieve a value from an untrusted source (or sources). If any of the steps fails, the following steps that need an input value will be gracefully skipped, or alternative methods of arriving at the result will be consulted.

Clean, easy to follow and null-safe.

Monday, July 11, 2011

Windows 7 hint of the day: Alt+Up is the new Backspace

Although in general I like what MS has done with Windows 7 (it genuinely is the best Windows so far), there is one constant gripe I have with Windows Explorer, that is they've "unified" the keybinding for Backspace to work just like in IE. That is - instead of navigating up one level in the filesystem hierarchy, it goes back in the navigation history.

It is highly confusing, because you might not notice it for a while, if you've navigated to your desired destination by drilling down folder by folder and then pressing Backspace to get back to the parent folder.

Where it breaks down though is with the nice new addressbar acting like a double for a navigatable breadcrumbs (a pretty nice and useful feature in it's own right) and an address input widget, when you've navigated in any other direction than one level down the hierarchy, you start getting into the apparently weird navigation behavior.

I can\t count how many times I've pressed Backspace multiple times only to finally notice that I'm alternating between two folders.

This particular "feature" has pissed me off countless times since I had an impression that the ability to navigate up in the hierarhy via keyboard was simply gone.

Until today that is. I absently pressed a keybinding while in Windows explorer that I use constantly for this operation on my Mac and found out to my pleasant surprise that it did work exactly as I had expected.

The new Backspace replacement in Windows Explorer for navigating up in the folder hierarchy is Alt+Up

Thursday, July 07, 2011

I've been playing around with a crazy idea ...

I've been playing around with a crazy idea of recreating Eclipse JDT tooling with XText.

While it does sound a fair bit more than a little crazy, it appeals to me for few reasons:
  1. It would give me great opportunity to learn XText and test it in a non-trivial setting.
  2. It would give me great opportunity to learn more about Java the language.
  3. Having Java source code parsed into a well defined EMF structure opens up JDT to all the  goodness of EMF tooling that has been developed over these past years.
It seems like an interesting idea. I wonder if I actually get around to giving it a try...

Thursday, February 03, 2011

Comparing Jenkins vs TeamCity - Part 2

This is a second installment in a series of posts that tries to compare two continuous integration and build automation server products. In my previous post I briefly introduced the contenders open source developed Jenkins and a commercial offering from JetBrains called TeamCity and compared their respective installation experience.

This time I'll try to cover the post-installation configuration steps I need to make before I can start putting some builds into them.


No server software is usually useful out of the box in and by itself. You almost always need to tweak and configure it before you can really reap the benefits. Continuous integration is just another type of server and it needs to be configured before it becomes really useful.


After you've installed your copy of TeamCity and opened up it's web interface for the first time, it greets you with a nice little license agreement, you need to agree to in order to be able to continue.

This is just a first page of the initial set-up wizard:
  1. License agreement: Check "agree" and press Continue
  2. Fill in details for admin user account and press "Create account" button
Initial Overview screen
After this really simple initial set-up wizard you're basically configured with a simple authentication scheme, have created one administrative account and are ready to go.

However, not everything is configured, as is also suggested by the projects initial project overview page.

    Email and IM notifications.

    One of the nicest things about CI services is that they can notify you proactively if you need to know of some important events like builds or test failing or some such.

    Email notification settings
    In Out of the box configuration you can set up your e-mail and Jabber notifications. Just follow the link that says "configure email and Jabber settings" in the Overview page.
    It leads you to Administration > Server Configuration section.
    Then click on the "EMail Notifier" tab, fill in your e-mail SMTP gateway details, press "Test connection" to ... well, test the connection. and if it checks out, press "Save".

    If you use internal (or external) Jabber service for instant communication, you can also set it up by filling out a similar form under the "Jabber Notifier" tab.


    By default, the TC installation comes with embedded Hypersonic SQL DB (HSQLDB for short). It is a good database in it's own right and has it's strengths in the situations where you would need a fast and tightly integrated in-process DBMS embedded into your Java application, but running a production server on HSQLDB is ... well lets just leave it at "it's just not done all that often".

    JetBrains nicely offeres a link to migration instructions on TC's server configuration administration screen.

    Basically there are two ways to migrate.
    1. Wipe TC data location, specify new db configuration and essentially re-configure your instance
    2. Full data migration
    First choice suits you perfectly if you have just installed your first instance of TC and want to store the data in something else than HSQLDB.

    The second one is what you'll do if you have run for a while, have some projects and users configured, etc. In short, if the current DBMS you has some data, you do not want to lose and you need to migrate it to another database for whatever the reason.

    TeamCity supports HSQLDB, MySQL, Oracle, PostgreSQL, Microsoft SQL Server 2005, 2008 and Sybase; the migration is possible between any of these databases.

    Migration in general is a tedious process, involving backing up data, stopping and re-starting server(s) and generally doing lots of  manual labor. The authors of TC have done a fairly good job of explaining it all.


    After you've installed Jenkins in the manner outlined in a previous article, you can open your browser, point it at the http://localhost:8080/ and you are presented with a Dashboard, where you can immediately start by adding new build projects.

    Out of the box configuration of Jenkins allows you to pull your sources from CVS, Subversion and local filesystem; build Ant or Maven driven java builds and execute shell/command line commands and scripts.

    So if all you need is a local build server instance in your machine, to build your java projects, you're pretty much done.

    Still there are some things missing from the default configuration that are sort of mandatory, if you are working in a team and building up your CI system in earnest.

    Basic security

    Out of the box installation of Jenkins has no security whatsoever. In essence it is a public free-for-all build server, where anyone has full access to alt the builds, full configuration and all of the operational details.

    It is probably just okay in a small company or department's internal closed network setting, or as a personal server, where everyone is fully trusted to do the right thing and not to mess up a production grade build automation. However as soon as your build server has to have public/internet availability you'll need to start turning down those knobs and that is where additional security measures come in handy.

     Setting up basic security isn't hard:
    1. From Dashboard: Manage Jenkins Configure System
    2. Check "Enable security"
    3. Select how your users will be authorized (e.g. LDAP or Jenkins's own database)
    4. Choose the level of security you'll feel comfortable with
    Hit "Save" at the end of the page (if that is all you needed to configure), and you are secured.

    I must however point out that the vanilla installation of Jenkins comes only with few user authentication options, but you can install additional methods from it's extensive set of authentication plug-ins

      Email notification

      As with TC you need to configure your e-mail gateway before it can start sending e-mail notifications.

      Like everything that has a configuration setting, Email notifications section is on the System Configuration page. Just open by navigating to Manage Jenkins Configure System, scroll to the bottom of the page to the "E-mail Notification" section, fill in your SMTP gateway details (if your SMTP server requires logon, do not forget to click on "Advanced" button and fill in those advanced e-mail settings)

      Test the  configuration and press "Save". Pretty much same as TC.


      From all that I've seen (not that I looked too deep), Jenkins does not use any SQL database for persisting it's state - using custom XML based persistence instead. This is good, because this means you can do ton of things with the data, like store it ins some VCS, clone settings into new Jenkins instance, back it up, script it up, etc.


      First difference I noticed is that the first thing TC did was to create an admin account and thus secure your TC server installation.One might argue if it is a good thing or not either way, but the fact is that the very first entry on Jenkins best practices page recommends "Always secure Jenkins.".

      At the very least what I would like to see is a nice friendly reminder in a dashboard, that says something to the effect "This copy of Jenkins is completely unprotected". You could dismiss it if it bothers you, but it should be visible at least the first time you visit unsecured Dashboard of a freshly installed/updated Jenkins instance.

      A nice touch of the TeamCity was the "unsaved changes" notification at the bottom of the browser window when you had modified something, but had not hit the "Save" button yet. It served a double purpose of reminding you that you had unsaved changes on the form and offered a nice easy access to the "Save" action.

      Overall, here is where TC commercial backing shows - the overall out of the box experience is somewhat smoother and nicer than Jenkins's.

      Enterprises will love the fact that TC keeps it's data in an SQL database and the lineup of supported databases should satisfy any corporation, but the simplicity of persisting data in XML files is just barely short of genius - Developers (and dare I say, sysadmins) will love this.

      The set of minimal post-installation configuration steps will vary by use case for either, but both products make it rather painless process and all things considered I'd like to give this step a tie.

      What's next

      In my next post I will configure and build my first project. Stay tuned

      Comparing Jenkins vs TeamCity - Part 1

      As some of my closest friends already know, I decided to bite the bullet and switched jobs recently.

      As the new company I now work for is really small compared to the one I left, it also means that as a developer I am also involved in more than just programming.

      One of the decisions I had to make and implement was selecting and setting up a CI server for continuous building and deployment of our next product.

      Having had extensive exposure to CruiseControl, I immediately discarded it as I've developed a particular dislike for this product.

      The next alternative I had had any exposure to was Hudson (now renamed to Jenkins) and JetBrains's TeamCity, which I had heard a lot of nice things about. Adding to the mix that first project to hit our CI server would be .NET project, I needed a solution that would run on Windows and build .NET projects using MSBuild.

      I settled on Jenkins and TeamCity as the two major contenders.

      I would like to point out that this post is full of personal opinions, by no means impartial as comparisons go and should as such be taken with a fairly grain of salt. You might not like it or not agree with my conclusions, and that is entirely your tragedy. I just record my experiences here - YMMV.


      First a little introduction of the two products.


      TeamCity (TC for short) is a commercial offering from JetBrains (the company who have brought to Java development community an excellent IDE called IntelliJ IDEA).
      JetBrains offeres it as a Professional and Enterprise editions. Professional edition is free and is limited to 20 user accounts and 20 build configurations. With the product you get to use up to 3 build agents and you can purchase more as you develop a need for these.

      20 users and 20 build configurations sounds plenty fine for our case, so this is no problem at the moment.


      Jenkins (originally called Hudson, until Oracle botched it with the community and got renamed) is an open source CI server implementation that started out as a hobby project by Kohsuke Kawaguchi while he was working as Sun, but has now grown into one of the most popular CI server implementation in the Java community. It has a vibrant community of core and extension developers, rich set of plug-ins that let it do almost anything that a CI build server needs to do.

      As an OSS project, it is completely free. No strings attached, no limitations, no hidden agendas.


      First things first. Although there are cloud offerings out there that offer CI server configurations, the requirement to build .NET projects, to really try both out, I need to install them side-by side.

      Installing TeamCity

      Here are the steps:
      1. Download the installer for free Professional version from JetBrains - 301 Mb
      2. Run the installer and follow the steps within
        Sit back and wait until TeamCity installer finishes setting itself up.
      3. Answer few more questions like operating port of the build server, set up build agent properties and some more stuff that seemed to have fairly sensible defaults.
      4. Open your browser and point it to http://localhost:80/ (or whatever port you chose to run your TeamCity server on)
      TC download was huge and it took about 30 minutes to download the thing.
      After installer finished running, the server and agent services were installed and running.
      Nothing more to do.

        Installing Jenkins

        1. Download the latest jenkins.war [1] - 35 Mb
        2. Run fom console: > java -jar jenkins.war
        3. Open your browser and point it to http://localhost:8080/
        All in all the whole process took me less than 30 minutes from the moment I started my download until I had Jenkins server up, running and ready for action.

        To be completely honest, these steps will only run Jenkins instance and as soon as you kill the process, or shut down your computer, Hudson will go down. To finish the installation of Jenkinbs you need to install it as a service on your host machine. The instructions for that vary by platform, but under Windows it is just as simple as following links from dashboard:
        1. From Dashboard: Manage Hudson Install as Windows Service
        2. Enter path to Jenkins service home (e.g. C:\Build\Jenkins)
        3. Press Install button
          Wait until Jenkins installs itself as a Window service
        4. Press "Yes" to stop currently running Jenkins instance and start the windows service.
        Alternatively, if you have a servlet container that supports Servlet 2.4/JSP 2.0, such as Glassfish v2, Tomcat 5 (or any later versions), then you can run them as services, and deploy jenkins.war as you would any other war file. Container specific documentation is available if you choose this route.
          [1] at the moment of this writing it was still called hudson.war, but has since been renamed to jenkins.war


          Installing both was relatively easy process.
          • With TeamCity you get nice rouunded up installer package which handles all the gritty details of setting up app server, deploying applications and wrapping it up into a system service. Just download, run the installer, answer to some fairly simple questions and you'r ready to go.
          • With Jenkins the process was even simpler, albeit a bit technical, requiring you to go into the command prompt to run the server.
          I personally liked Hudson way a bit more as it felt really light-weight and to some extent much simpler - "just run the .war file" and you're done.

          Additionally - installing it as Windows service just blew me away - Like it? Take it - Click, click, clickedy-click and Done!

          Definitely, Jenkins feels much friendlier towards "just trying it out" and it can not be bad to the developer community - using Hudson, it is not entirely impossible to have each developer running their own Hudson CI instance on their own machine - it is just so simple to set up!

          What's more, since it is so easy to set up (and you don't have to commit to it by installing it as a service), it has a real advantage of being the CI of choice for every developer. Every one can run their own Jenkins instance on their computer. And that is good...

          What's next

          As I started to write this article, broke it up into some chapters and wrote few of them, I realized that this one has a fair chance of becoming a monster article, so I decided to brake it up by topics.

          This one established some background and covered installation of the server.

          In the next post of the series I'll be covering post-install setup of both products