a little madness

A man needs a little madness, or else he never dares cut the rope and be free. – Nikos Kazantzakis

Zutubi

Archive for June, 2006

Pulse Continuous Integration Server 1.0.6

Pulse version 1.0.6 has been released. This release fixes some minor issues in 1.0.5, see the release notes for details.

Duck Typing vs Readability

Duck typing is getting a lot of attention lately, probably due to all the hype around Ruby/Rails. My first encounter with duck typing was in Python, which is very similar to Ruby in this respect.

The first thing that scares you about duck typing is the lack of the compiler safety net. You can write nonsensical code and have no idea until you run it. However, as you learn to take advantage of this newfound flexibility, you start seeing the static typing safety net as a straitjacket. Sure, it stops you from poking your eye out, but a lot of the time it just gets in your way. With disciplined testing, you can restore a large proportion of the safety net, and you feel comfortable again. It’s a classic tradeoff: safety/early error detection versus flexibility, and there’ll always be arguments for both sides1.

The next thing that I grappled with was the lack of precise contracts between callers and callees. In a statically-typed language, the function prototype specifies the contract with a high level of precision. The caller knows the contract that each of the parameters must satisfy, and that the returned value(s) will fulfill. In a duck-typed language, the prototype is much less informative. Without type information in the prototype, the only way to know the contracts for sure is to analyse the body of the function. Clearly this is unacceptable, especially when dealing with a third-party API. The usual way to mitigate this loss of information in the prototype is to provide the details in the API documentation. This approach, however, suffers from two major problems:

  • Verbosity: a static type (say, a Java interface) is a concise way to specify type requirements; a natural language description will tend to be less direct.
  • Inaccuracy: there is no way to ensure the documented requirements are correct. In particular, as the code evolves there is a real danger the documentation will be left behind.

This readability problem is my biggest issue with duck typing in practice today. A commonly-suggested solution to the problem is some form of optional static type checking2. However, this route tends to lead us back to something like interfaces, which as I say are a pretty concise way to specify a contract. This is giving away too many of the advantages of duck typing, in particular:

  • Granularity: a duck-typed function places the least possible requirements on the passed parameters. Interfaces, on the other hand, may carry extra requirements: methods that are not required for the function in question. Although you can break interfaces down into atoms and combine them, the resulting number of interfaces would be overwhelming.
  • Adaptability: related to the above, a duck-typed function can be adapted to contexts that the function author may never have considered with as little effort as possible on the part of the caller.
  • Sheer convenience: there is no extra baggage required of calling code, you can just get on with the job.

So how do we get the convenience and power of duck typing without this readability problem? What we need is a concise way to communicate the requirements on function parameters, without requiring them to be manually specified. Is this really so hard? Imagine a tool that analysed the body of a function (and potentially the functions it calls) to see how the parameters were used. Such a tool could extract a lot of useful information, such as what methods are called on the parameter. On the surface, it is not even a difficult tool to write3. Having this information available as you write code would be a huge plus. On the caller side, you know a lot more about the contract you need to fulfill. On the callee side, you no longer need to maintain this “type” information in the function documentation.
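To make the idea concrete, here is a rough sketch in Python (my own illustration, not an existing tool): it uses the standard ast and inspect modules to record which methods are called directly on each parameter of a function. The notify function and its parameter names are made up for the example.

import ast
import inspect

def parameter_methods(func):
    """Map each parameter name to the set of methods called on it
    in the function body. A sketch only: it misses indirect uses,
    such as the parameter being passed on to another function."""
    source = inspect.getsource(func)
    func_def = ast.parse(source).body[0]
    params = {arg.arg for arg in func_def.args.args}
    calls = {name: set() for name in params}
    for node in ast.walk(func_def):
        # Match expressions of the form <param>.<method>(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id in params):
            calls[node.func.value.id].add(node.func.attr)
    return calls

def notify(recipient, message):
    if recipient.wants(message):
        recipient.send(message.render())

print(parameter_methods(notify))
# e.g. {'recipient': {'wants', 'send'}, 'message': {'render'}}

Even this crude version tells a caller of notify most of what the documentation would otherwise have to spell out: recipient needs wants and send, message needs render.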

The idea is simple enough that I’m sure it has been thought of before. I wonder then, does such a tool exist? If not, are there some killer implementation difficulties I have overlooked?


1 C++ templates, although not without their own problems, get close to the best of both worlds: flexible contracts that are statically checked.
2 I suspect these suggestions often come from those who are more comfortable in a statically-typed world.
3 Famous last words, I suspect.


Into continuous integration? Want to be? Try pulse.

ANTLR By Example: Part 3: Parsing

Introduction

We finished Part 2: Lexical Analysis with a working lexer. The lexer splits character streams into streams of tokens such as “changed.by.me”, “(” and “not”. In this part, we will create a parser that can turn these token streams into an abstract syntax tree (AST).

Abstract Syntax Trees (ASTs)

An AST captures the structure of the expression being parsed. Expressions are recursive by nature: they are made up of one or more sub-expressions, which in turn have sub-expressions of their own. Such a structure is naturally modelled as a tree, where each sub-tree models a sub-expression. The tree also captures the precedence of operators: a higher-precedence operator will bind more tightly and become a lower node in the tree. For example, the expression “changed or failed and not state.change” may be represented in tree form:

or
    changed
    and
        failed
        not
            state.change

Parsers

Defining the Parser

ANTLR parsers are defined in grammar files, usually the same file as the associated lexer. From the ANTLR manual:

{ optional class code preamble }

"class" YourParserClass "extends Parser;"

options

tokens

{ optional action for instance vars/methods }

parser rules...

This is very similar to the syntax for lexers that we saw in Part 2. We will be focusing mostly on the parser rules, which define the sequences of tokens that the parser will accept, and how the AST should be constructed. Parser rules have the same high-level format as lexer rules:

rulename
    ":" alternative_1
    "|" alternative_2
    ...
    "|" alternative_n
    ";"

Parser rule names must begin with a lower case letter. Alternatives are expressed in terms of tokens (as opposed to characters for lexers). You can also use string literals – they are handled specially by ANTLR (the strings are placed in a literals table for the associated lexer, and literals are treated as tokens).

To parse our boolean expression language, we need to break down the language into ANTLR rules. When breaking the language down, we also encode the precedence of the operators. Referring back to the example tree above, you will recall that the higher precedence operators appear lower in the tree. So, for example, we can see that the general form of an “or” expression is:

orexpression
    :   andexpression ("or" andexpression)*
    ;

Note the use of the ‘*’ wildcard to indicate zero or more repetitions. Other wildcards, such as ‘+’ (one or more) and ‘?’ (zero or one) may also be used. Working our way from top to bottom, we can define the whole language in this way:

orexpression
    :   andexpression ("or" andexpression)*
    ;

andexpression
    : notexpression ("and" notexpression)*
    ;

notexpression
    : ("not")? atom
    ;

atom
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Lower precedence operators have their expressions defined in terms of the next-highest precedence operator. This language is missing something, however. It does not yet handle grouping (using parentheses “(…)”). Recall that the purpose of grouping is to override the normal precedence. In this way, a grouped expression is an indivisible element (atom) with regards to the surrounding operators. So, to introduce grouped expressions, we need to define an atom as either a primitive condition or a grouped expression:

atom
    : condition
    | LEFT_PAREN orexpression RIGHT_PAREN
    ;

condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Building the AST

Now we have a parser that will accept valid token streams for our language. The next step is to tell ANTLR to construct the AST, and to annotate the rules to describe how the tree should be built. Turning on AST generation is done with an option:

options {
    buildAST=true;
}

We guide the AST construction using postfix annotations on the tokens in our parser rules. The following annotations are available:

  • no annotation: a token without an annotation becomes a leaf node in the tree
  • ^: a token annotated with a caret becomes a sub-expression root
  • !: a token annotated with an exclamation point is not included in the tree

In our language, the operators form root nodes, so we will annotate them with the ‘^’ character. As the grouping operators are only used to override precedence, there is no need for them in the final AST, so we will omit them by using the ‘!’ annotation. This gives us our final parser definition:

class NotifyConditionParser extends Parser;

options {
        buildAST=true;
}

orexpression
    :   andexpression ("or"^ andexpression)*
    ;

andexpression
    : notexpression ("and"^ notexpression)*
    ;

notexpression
    : ("not"^)? atom
    ;

atom
    : condition
    | LEFT_PAREN! orexpression RIGHT_PAREN!
    ;

condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Generating the Parser

At last, we are ready to generate the parser code:

$ java -classpath antlr-2.7.6.jar antlr.Tool NotifyCondition.g

Now we get six files:

NotifyConditionLexer.java
NotifyConditionLexer.smap
NotifyConditionParser.java
NotifyConditionParser.smap
NotifyConditionParserTokenTypes.java
NotifyConditionParserTokenTypes.txt

Let’s take the parser for a test drive:

Reader reader = new StringReader(argv[0]);
NotifyConditionLexer lexer = new NotifyConditionLexer(reader);
NotifyConditionParser parser = new NotifyConditionParser(lexer);
       
try
{
    parser.orexpression();
    AST tree = parser.getAST();
    System.out.println(tree.toStringTree());
}
catch (RecognitionException e)
{
    System.err.println(e.toString());
}
catch (TokenStreamException e)
{
    System.err.println(e.toString());
}

All of the rules in our parser become methods. To parse an expression, we simply choose the highest-level expression: orexpression. The trees are printed using a lisp-like syntax. For example, the output when parsing the expression “changed.by.me or not (success and changed)” is:

( or changed.by.me ( not ( and success changed ) ) )

The tree nodes are shown in parentheses with the root first, followed by a list of children. We can also try an expression with valid tokens but an invalid structure, “success or or changed”:

line 1:12: unexpected token: or

The parser rejects the expression because there is no rule that will accept “or or”.
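The lisp-like string is handy for a quick look, but the tree can also be walked directly. As a rough illustration (this helper is my own, not part of the generated code), the ANTLR AST interface exposes each node’s text, first child and next sibling, which is enough to print the tree in the same indented style as the example earlier:

import antlr.collections.AST;

// Print an AST one node per line, indenting children below their parent.
// Children are reached via getFirstChild(), siblings via getNextSibling().
static void printTree(AST node, int depth)
{
    while (node != null)
    {
        for (int i = 0; i < depth; i++)
        {
            System.out.print("    ");
        }
        System.out.println(node.getText());
        printTree(node.getFirstChild(), depth + 1);
        node = node.getNextSibling();
    }
}

Calling printTree(parser.getAST(), 0) after a successful parse prints one node per line, with children indented beneath their operators.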

Wrap Up

We’re really getting there now. We have a fully functional parser that accepts valid expressions and produces ASTs to represent them. In the next part, we will see how to convert these ASTs into our own notification condition data structure using ANTLR tree parsers.

Pulse Continuous Integration Server 1.0 Final!

Zutubi is proud to announce the availability of the Pulse automated build (or continuous integration) server for sale from today. This is the culmination of many months of development and beta testing. We would like to thank all of our beta testers for their feedback during this period.

If you haven’t tried it yet, download pulse today and let us know what you think!

ANTLR By Example: Part 2: Lexical Analysis

Introduction

In Part 1: The language, we took a look at the simple boolean expression language we want to parse. In this part, we will start getting into ANTLR, and will tackle the problem of lexical analysis. This is the first phase of parsing, where the input stream of characters is processed to form a stream of tokens that the parser can understand.

Setup

Getting ANTLR

To actually try out the examples presented in this tutorial, you will need to download ANTLR. In our case we use the single ANTLR JAR file both to run the parser generation tool and as a runtime library. Standalone executables are also available, and can be a more convenient way to run the parser generator.

Running ANTLR

To run ANTLR via the JAR file, execute the following:

$ java -classpath antlr-2.7.6.jar antlr.Tool <input file>

Lexical Analysis

Grammar Files

Lexical analysis in ANTLR is treated similarly to parsing. You specify rules that correspond to tokens, and define which sequences of characters match those rules (and thus form those tokens) using an EBNF-like syntax. The rules are specified in “grammar” files which are text files with a “.g” extension (by convention). In this tutorial we will be using a single grammar file to describe the lexical analyser, parser and tree parser. Let’s start with an overview of the ANTLR syntax for specifying lexers (literal parts enclosed in quotes):

{ optional class code preamble }

"class" YourLexerClass "extends Lexer;"

options

tokens

{ optional action for instance vars/methods }

lexer rules...

In our simple example, we focus almost entirely on the lexer rules, and do not use any options, tokens or actions. The lexer rules themselves are specified using the following syntax:

rulename
    ":" alternative_1
    "|" alternative_2
    ...
    "|" alternative_n
    ";"

Lexer rule names must begin with a capital letter, and define tokens. In our example, we need to split the input characters into four types of token:

  1. WORD: sequences of alphabetical and period (‘.’) characters, used for both the primitives (e.g. “state.change”) and operators (e.g. “and”) in our language.
  2. WHITESPACE: linebreaks, tabs and spaces, which delimit other tokens but are not passed on to the parser
  3. LEFT_PAREN: the ‘(‘ character, used for grouping
  4. RIGHT_PAREN: the ‘)’ character, used for grouping

For each of these token types, we need to provide an ANTLR rule. Here is the corresponding grammar file (NotifyCondition.g):

header {
    package com.zutubi.pulse.condition.antlr;
}

class NotifyConditionLexer extends Lexer;

// Words, which include our operators
WORD: ('a'..'z' | 'A'..'Z' | '.')+ ;

// Grouping
LEFT_PAREN: '(';
RIGHT_PAREN: ')';

WHITESPACE
    : (' ' | '\t' | '\r' | '\n') { $setType(Token.SKIP); }
    ;

There are a few interesting things to note in this grammar:

  • ANTLR allows the use of a range operator ‘..’ to easily specify character ranges (e.g. ‘a’..’z’ includes all lowercase latin alphabetical characters)
  • You can also use wildcards to indicate repetition (e.g. the ‘+’ character is used to specify “one or more” in the WORD rule)
  • You can apply actions to the token, as illustrated in the WHITESPACE rule. This allows you to inject code into the method that processes the rule. In this case we set the token type to “SKIP” so that these tokens are not passed to the parser. The $setType directive is an ANTLR built-in that abstracts the actual code used to set the token type.

Of course, this is just the tip of the iceberg. The ANTLR manual has further details and examples for specifying lexer rules.

Generating the Lexer

Now we can fire up ANTLR and actually generate some code:

$ java -classpath antlr-2.7.6.jar antlr.Tool NotifyCondition.g

This generates four files:

NotifyConditionLexer.java
NotifyConditionLexer.smap
NotifyConditionLexerTokenTypes.java
NotifyConditionLexerTokenTypes.txt

The most interesting is the lexer itself, NotifyConditionLexer. We can drive this lexer ourselves to see how it works (forgive the lack of proper cleanup):

Reader reader;
NotifyConditionLexer lexer;
Token token;

reader = new StringReader(argv[0]);
lexer  = new NotifyConditionLexer(reader);

try
{
    while(true)
    {
        token = lexer.nextToken();
        if(token.getType() == Token.EOF_TYPE)
        {
            break;
        }

        System.out.println("Token: '" + token.getText() + "'");
    }
}
catch (TokenStreamException e)
{
    System.err.println(e.toString());
}

With input string “changed.by.me or not (success and changed)” this code prints:

Token: 'changed.by.me'
Token: 'or'
Token: 'not'
Token: '('
Token: 'success'
Token: 'and'
Token: 'changed'
Token: ')'

Excellent, ANTLR is breaking the string into the expected tokens! Note also that the whitespace is not reported, as we set those tokens to type “SKIP”. What if there is an error in the input? Running this code over the string “changed.by.me or 55” gives:

Token: 'changed.by.me'
Token: 'or'
line 1:18: unexpected char: '5'

Our analyser does not have any rule that matches numbers, so it stops when it sees the first ‘5’ character.

Wrap Up

Well, we’re getting somewhere now. We’ve created our first grammar file, generated a lexer and used it to successfully tokenise an input stream. In the next installment we’ll define the rest of the expression language as parser rules, and generate an Abstract Syntax Tree (AST).

Does Testing Slow You Down?

Earlier today I read Cedric Beust’s post Agile People Still Don’t Get It. In that post Cedric makes some fair criticisms of misguided agile evangelism. I find it particularly ironic that the closing sentence is:

So here is my advice to Agilists: get real now, or you will become irrelevant soon.

Pragmatism should be a cornerstone argument for agile development, so those evangelists who don’t emphasise it are surely doing more harm than good1.

Having said all that, there is an interesting undertone in some of Cedric’s comments that got me thinking. In his defense of pragmatism, Cedric makes repeated assertions that sometimes we need to cut corners to make a real-world deadline. One corner to be cut is testing:

But the truth is: at times, I don’t do TDD because implementing a feature quickly is more important than a fuzzy feeling.

This is what really got me thinking. Note that we are talking short-term here: in the long term, I think most would agree that cutting corners will catch up with us eventually. But why do we assume that cutting this corner will save time in the short term? I think the assumption is based purely on a shallow evaluation: writing less code this second means I can hit the looming deadline more easily. This seems valid enough: writing the extra code to perform the testing takes non-zero time after all. But the analysis is missing something: can we really implement a feature without testing it? We can try, but how will we know if it works at all? A broken feature is worse than no feature at all, so there is no way to invoke the “customer value” argument here.

I think most people agree that some level of testing is required: the risk of zero testing is just too great. A temptation, then, is to test the feature by hand, because that will be quicker than writing code to test it. Now the assumption is getting even shakier. Maybe a manual test is quicker as a once-off, but what about when you find a problem? Now you need to fix the code, then manually test again. You will be wishing you had just written test code in the first place, so you could repeat the test with very little extra effort. So, counter-intuitively, it may be quicker to code the feature and tests rather than just the feature. Not in all cases, for sure2, but probably more frequently than we realise.

In fairness to Cedric, I have taken this off on a tangent of my own. His post just got me thinking about a natural assumption that we tend to make which may actually be false. I agree with the assertion that sometimes a pragmatist has to make compromises to meet a short term goal, I just wonder about the short term effects of this particular compromise.


1 I suppose this is what happens when people become true believers in a methodology: they start to drift farther and farther from reality.

2 For example, when the feature is very hard to test in an automated fashion. Usually this can be remedied in the long term, when you can afford the extra time to refactor for testability.

ANTLR By Example: Part 1: The Language

Introduction

As promised, this is the first installment in a simple ANTLR tutorial “ANTLR By Example”. In this post, I’ll introduce the language to be implemented, and give an overview of the full tutorial.

Background

In case you missed the background, recently we added a feature to pulse that allows build notifications to be filtered using arbitrary boolean expressions. Rather than reinventing the wheel and writing a parser for the boolean expression by hand, we used ANTLR, a fantastic parser generator with support for Java (amongst several other languages). The purpose of this tutorial is to show you how easy ANTLR makes it to create your own little languages.

Overview

To make the tutorial manageable, it will be split over multiple parts:

  • Part 1: The Language
  • Part 2: Lexical Analysis
  • Part 3: Parsing
  • Part 4: Tree Parsing

The Language

Before we race into generating lexical analysers and parsers, it’s best that we nail down exactly what we need to parse. The language we need to process is a simple boolean expression language. The language consists of two types of “atoms” (indivisible elements): primitives and operators. These atoms are combined to form expressions.

Primitives

Primitives are simple elements that evaluate to true or false. To make the language readable, each primitive is represented as a descriptive string such as “true” or “changed.by.me”. Since we are filtering build notifications, most primitives in our language represent tests on the build result:

  • true: always evaluates to true
  • false: always evaluates to false
  • success: evaluates to true if the build succeeded
  • failure: evaluates to true if the build failed
  • error: evaluates to true if the build encountered an error
  • changed: evaluates to true if the build included a new change
  • changed.by.me: evaluates to true if the build included a new change by the owner of the filter
  • state.change: evaluates to true if the result of the build differs from the result of the previous build

The meanings of the primitives are not terribly important to the parsing, but this is a real world example after all :).

Operators

Our language supports standard boolean operators for disjunction (or), conjunction (and) and inversion (not). Again for readability, we use words to represent these operators. Additionally, a grouping operator is defined, allowing expressions to be treated as atoms (to override operator precedence). The operators are summarised below:

  • or: a binary operator, a “or” b evaluates to true if either a or b evaluates to true
  • and: a binary operator, a “and” b evaluates to true only if both a and b evaluate to true
  • not: a unary operator, “not” a evaluates to true only if a evaluates to false
  • grouping: “(” a “)” groups the expression a into an atom, overriding normal precedence rules

The operators are listed from lowest to highest precedence, i.e. later operators bind more tightly.

Expressions

Expressions are formed by composing strings of atoms. Not all atom strings are valid expressions. In particular, operators must have correct arguments (left and right for binary, right for unary, internal for grouping). The following are all examples of valid expressions:

  • changed
  • changed.by.me and failed
  • not (success or state.change) and changed

Operators with higher precedence bind more tightly. Thus:

a “and” b “or” c

is equivalent to:

“(” a “and” b “)” “or” c

as “and” has higher precedence than “or”.

Wrap Up

So there it is, a simple language for defining boolean expressions. In the next part, we’ll fire up ANTLR and start turning character streams into token streams.

Your Own Little Language

One of the most studied and best understood areas of computer science is lexical analysis and parsing. Fantastic tools (such as ANTLR) are also available to automatically generate parsers based on declarative input. Despite this, most developers rarely take the opportunity to design their own “little” languages. No, I’m not suggesting we should all start creating full-blown programming languages (although that might be fun…). I’m not talking about Domain Specific Languages (DSLs) either1, at least not in the strictest sense. When I say “little” I mean a language custom built to take care of a small task.

For example, in pulse you can filter build notifications using arbitrary boolean expressions. To allow this, we created a custom boolean expression language. This “little” language has the usual boolean operators (and, or, not) and primitives like “success” which evaluates to true if the build was successful. It was implemented in an afternoon, thanks to ANTLR writing most of the code for us. The result is a simple yet powerful way to specify filters, much more usable than a GUI of equal flexibility. To keep the simple cases simple we provide a GUI with pre-canned expressions for the most common scenarios. It’s the best of both worlds.

So keep your eyes open, there are opportunities for little languages everywhere. With a tool like ANTLR in your arsenal to do all the heavy lifting, there is no reason not to give little languages a go! To give you a head start, I’ll be presenting an “ANTLR By Example” series of posts that will show exactly how we used ANTLR to create the language described above. Stay tuned!


1 on second thought, that would make this post trendier…

Bash Tip: Tracing Execution

When your bash script is going haywire, and you’re having trouble sorting out why, one of the simplest ways to see what is going on is to use:

set -x

This turns on the xtrace shell option, which prints the commands the shell is executing with expanded arguments, as they are executed. From the man page:

-x After expanding each simple command, display the
expanded value of PS4, followed by the command and its
expanded arguments.

This shows you exactly what is going on as your script executes. Take the following, contrived shell script:

#! /usr/bin/env bash

set -x

for i in $(seq 0 3)
do
    if [[ $i -gt 1 ]]
    then
        echo $i
    fi
done

exit 0

When executed, the script will print the following:

$ ./example.sh
++ seq 0 3
+ [[ 0 -gt 1 ]]
+ [[ 1 -gt 1 ]]
+ [[ 2 -gt 1 ]]
+ echo 2
2
+ [[ 3 -gt 1 ]]
+ echo 3
3
+ exit 0

Note that the trace lines are prefixed with ‘+’ characters (in fact, whatever the value of $PS4 is), and regular output appears as normal. You can also see that the for loop and if conditions are displayed. This is all very handy for tracking down obscure problems.

One issue with tracing is that the output can be very verbose, making the useful bits hard to find. If you have a rough idea of where the problem is, you can alleviate this by turning tracing on and off selectively around the interesting parts of the script (set +x turns it off), as in the sketch below.
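For example (a contrived sketch along the same lines as the script above), tracing here is limited to just the loop under suspicion:

#! /usr/bin/env bash

# Uninteresting setup: no tracing here.
total=0

set -x          # start tracing the suspicious section
for i in $(seq 0 3)
do
    total=$((total + i))
done
set +x          # stop tracing again

echo "total: $total"
exit 0

Only the commands between set -x and set +x show up in the trace; the setup and the final echo run silently (apart from set +x itself being traced as it switches tracing off). Happy scripting!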


Automation crazy? Need a serious automated build server? Try pulse.