Archive for the ‘Java’ Category

If Java Could Have Just One C++ Feature…

Thursday, August 3rd, 2006

I have been immersed in Java for a while now, but having worked in C++ for years before, there is one big thing I miss: destructors. Especially in a language with exceptions, destructors are a massive time and error saver for resource management.

Having garbage collection is nice and all, but the fact is that we deal a multitude of resources and need to collect them all. How do we do this in Java? The Hard Way: we need to know that streams, database connections etc need to be closed, and we need to explicitly close them:

FileInputStream f = new FileInputStream(“somefile”);
// Do some stuff.
f.close();

Of course, with exceptions it gets worse. We need to guarantee that the stream is closed even if an exception is thrown, leading to the oft-seen pattern:

FileInputStream f = null;
try
{
    // Do some stuff
}
finally
{
    if(f != null)
    {
        try
        {
            f.close();
        }
        catch(IOException e)
        {
            // Frankly, my dear…
        }
    }
}

The noise is just incredible. A common way to reduce the noise is to use a utility function to do the null check and close, but noise still remains. Repeating the same try/finally pattern everywhere is also mind-numbing, and it can be easily forgotten leading to incorrect code.

In C++, this problem is solved elegantly using the Resource Acquisition Is Initialisation (RAII) pattern. This pattern dictates that resources should be acquired in a constructor and disposed of in the corresponding destructor. Combined with the deterministic destruction semantics for objects placed on the stack, this pattern removes the need for manual cleanup and with it the possbility of mistakes:

{
    std::ifstream f(“somefile”);
    // Do some stuff
}

Where has all the cleanup gone? It is where it should be: in the destructor for std::ifstream. The destructor is called automatically when the object goes out of scope (even if the block is exited due to an uncaught exception). The ability to create value types and place them on the stack is a more general advantage of C++, but Java can close the gap with smarter compilers1.

Interestingly, C# comes in half way between Java and C++ on this matter. In C#, you can employ a using statement to ensure cleanup occurs:

using (TextReader r = File.OpenText(“log.txt”)) {
    // Do some stuff
}

In this case the resource type must implement System.IDisposable, and IDispose is guaranteed to be called on the object at the end of the using statement. The using statement in C# is pure syntactic sugar for the try/finally pattern we bash out in Java every day.

What’s the answer for Java?2 Well, something similar to using would be a good start, but I do feel like we should be able to do better. If we’re going to add sugar why not let us define our own with a full-blown macro system? Difficult yes, but perhaps easier than always playing catch up? An alternative is to try and retrofit destructors into the language3. It is possible to mix both garbage collection and destructors, as shown in C++/CLI4. However, I don’t see an elegant way to do so that improves upon what using brings. If you do, then let us all know!


1 it appears that Mustang already has some of the smarts such as escape analysis.
2 if you’re the one down the back who shouted “finalizers”: you can leave anytime you want as long as it’s now!
3 I said NOW!
4See also Herb Suttor’s excellent post on the topic Destructors vs. GC? Destructors + GC!.

Execute the Future

Tuesday, July 18th, 2006

No matter how much multithreaded code you write, it will never be simple. The added complexity of races between threads makes it much harder to predict the possible combinations of events in your program. Even worse, testing the combinations is difficult, requiring thread-aware test cases with explicit synchronisation points. Thus it is a pleasure to see a package like java.util.concurrent offering more than just synchronisation primitives, but also higher level abstractions1.

One of the simplest but most useful abstractions in the package is the Future interface. As stated in the JavaDoc, this interface:

abstracts the result of an asynchronous computation

By encapsulating an operation in a Future, you can execute it asychronously (typically using an Executor). The Future may be queried, cancelled, and most importantly can return the result of the computation. When asking for the result (via one of the get methods), you can also specify a timeout indicating how long you are willing to wait for the result to become available.

To take a concrete example, recently I had a problem with some networking code. The purpose of the code was to ping a bunch of machines to see which of them were online and responding. Unfortunately, when a machine is unavailable it can take some time for the ping to fail, making the whole process take several minutes. In my case, such a long ping time indicates that the machine is essentially unusable anyhow, so I would prefer to fail the ping after a reasonable delay, say 10 seconds. I achieved this timeout without a single line of sychronisation code by using the FutureTask implementation of the Future interface:

public class DefaultAgentManager
{
    private ExecutorService pingService =
        Executors.newCachedThreadPool();

    private class Pinger implements
        Callable<SlaveStatus>
    {
        private SlaveAgent agent;

        public Pinger(SlaveAgent agent)
        {
            this.agent = agent;
        }

        public SlaveStatus call() throws Exception
        {
            SlaveStatus status;

            try
            {
                agent.getSlaveService().ping();
            }
            catch (Exception e)
            {
                status = new SlaveStatus(Status.OFFLINE,
                    e.getMessage());
            }

            return status;
        }
    }

    void pingSlave(SlaveAgent agent)
    {
        FutureTask<SlaveStatus> future =
            new FutureTask<SlaveStatus>(new Pinger(agent));
       
        pingService.execute(future);

        SlaveStatus status;

        try
        {
            status = future.get(10, TimeUnit.SECONDS);
        }
        catch (TimeoutException e)
        {
            status = new SlaveStatus(Status.OFFLINE,
                “Agent ping timed out”);
        }
        catch(Exception e)
        {
            status = new SlaveStatus(Status.OFFLINE,
                e.getMessage());
        }

        agent.updateStatus(status);
    }
}

Although I have simplified the code for clarity2, the most error-prone part of the implementation all lives in java.util.concurrent, which is tried and tested. There was no need for me to carefully synchronise and test for races, I was able to just get on with the job.

This is just one small part of the functionality offered by Futures. Combining them with CompletionServices, for example, allows an even higher level of abstraction for dealing with multiple tasks. If you write multithreaded code, and haven’t looked at the java.util.concurrent package, then do yourself a favour and dive in. The JavaDoc comes complete with usage suggestions and example code snippets, making the package very approachable. Package designer Doug Lea has also authored multiple books on the topic.


1 moving out of the imperative world we get into even more interesting abstractions to help exploit concurrency, but that’s a topic all of its own.
2 the most significant omission is how to handle the tasks that did not complete before the timeout; in some applications they may be a resource burden, in others it may be safe to let them complete in their own time

ANTLR By Example: Part 5: Extra Credit

Tuesday, July 11th, 2006

Introduction

Over the past four parts, I have illustrated how to parse and evaluate boolean expressions using ANTLR. The grammar presented is in those parts is based on real code in pulse. Although it works as presented, there are a couple of items to polish up, one of which I have solved, and the other of which I have not yet been able to solve.

Error Reporting

As pulse allows users to enter their own boolean expressions (to configure when they receive build notifications), decent error reporting is paramount. The first step is to turn off ANTLR’s default error handling, so that the errors can be handled by pulse. This is done by setting the defaultErrorHandler option to false:

class NotifyConditionParser extends Parser;
options {
        buildAST=true;
        defaultErrorHandler=false;
}

With that done, the ANTLR-generated code will throw exceptions on errors. Let’s take a look at the sorts of errors that are generated by the grammar as it stands.

Case 1: Unrecognised word:

$ java NotifyConditionParserTest "changed or tuer"
Caught error: unexpected token: tuer

Case 2: Unrecognised character:

$ java NotifyConditionParserTest "6 and false"
Caught error: unexpected char: '6'

Case 3: Illegal expression structure

$ java NotifyConditionParserTest "state.change or or success"
Caught error: unexpected token: or

Case 4: Unbalanced parentheses

$ java NotifyConditionParserTest "failure or (changed and success"
Caught error: expecting RIGHT_PAREN, found 'null'

Most of these messages are not too bad, at least they are on the right track. Case 4 is certainly the worst of the lot, although the information is accurate it is not exactly user friendly. We’ll get back to that later. One big thing missing in all cases is location information. I figured that ANTLR must have a way to retrieve the information, and a little digging uncovered it. All of the above messages are generated using the getMessage method of the exceptions thrown by ANTLR. To get the line and column number information (which is indeed stored in the exception), you can use the toString method instead:

try
{
    …
}
catch (Exception e)
{
    System.err.println(“Caught error: “ + e.toString());
}

Trying case 1 again:

$ java NotifyConditionParserTest "changed or tuer"
Caught error: line 1:12: unexpected token: tuer

Much better! Now the user knows where the error occured. That leaves us with case 4, which is still a little on the cryptic side:

$ java NotifyConditionParserTest "failure or (changed and success"
Caught error: expecting RIGHT_PAREN, found 'null'

It would be nice if we could not expose the raw token names (e.g. RIGHT_PAREN) and also explicitly say we hit the end of the input (instead of “found ‘null’”). To fix the former problem, we can add paraphrase options to our lexer tokens. This allows us to specify a phrase describing the token which will be used in error messages instead of the token name. The options are applied in the grammar file as part of the lexer rules, for example:

RIGHT_PAREN
options {
    paraphrase = "a closing parenthesis ')'";
}
    : ')';

Applying the paraphrases improves the error message considerably:

$ java NotifyConditionParserTest "failure or (changed and success"
Caught error: line 1:32: expecting a closing parenthesis ')', found 'null'

Unfortunately, we still have the pesky “found ‘null’” to deal with. In this case, I haven’t yet found a simple way to customise the error message. Instead, it is handled as a special case. I found that in this case the exception being thrown was a MismatchedTokenException, with the text of the found token set to null. This allowed the specific case to be handled with a custom message:

catch(MismatchedTokenException mte)
{
    if(mte.token.getText() == null)
    {
        System.err.println(“Caught error: line “ +
            mte.getLine() + “:” +
            mte.getColumn() +
            “: end of input when expecting “ +
            NotifyConditionParser._tokenNames[mte.expecting]);
    }
    else
    {
        System.err.println(“Caught error: “ + mte.toString());
    }
}

This is far from an ideal solution, and I am still looking for a better alternative. However, the user experience is king, and this hack improves it:

$ java NotifyConditionParserTest "failure or (changed and success"
Caught error: line 1:32: end of input when expecting a closing parenthesis ')'

DRY Violation

Those paying close attention would have noticed a wrinkle in the final ANTLR grammar: a violation of the DRY (Don’t Repeat Yourself) principle. Specifically, both the parser and tree parser share a common rule, which is repeated verbatim in the grammar file:

condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Despite scouring the ANTLR documentation, I am yet to find a way around this. I even took a look at some of the example grammars on the ANTLR website, and noticed that they suffer from a similar problem. If anyone knows a way to reuse a rule, let me know! I would love to remove the duplication.

Wrap Up

Well, that just about does it. I hope this series of posts has piqued your interest in ANTLR and parsing, and maybe even helped you to solve some of your own problems. Now go forth and parse!

ANTLR By Example: Part 4: Tree Parsing

Thursday, July 6th, 2006

Introduction

By the end of Part 3: Parsing, we had a working parser that could understand our expression language. The parser turns an input string into an Abstract Syntax Tree (AST) representing the expression. The final piece of the puzzle is evaluating the expression for a given build result (remember, these expressions are used to determine when to send build notifications). To do this we will make use of an ANTLR tree parser.

NotifyCondition’s

Before we get into the details of tree parsers, let’s first define the end goal. Currently we have a way to generate an AST for an expression. What we need, however, is a way to evaluate the expression for a given build result. Approaching from this angle, we can use a self-evaluating data structure that takes a build result and returns true if a notification should be sent for this result. The NotifyCondition interface serves as a basis for this data structure:

public interface NotifyCondition
{
    public boolean satisfied(BuildResult result, User user);
}

This interface defines just one method, satisfied, which returns true iff a notification should be sent for the given build result. The additional user argument is required to evaluate the “changed.by.me” condition. Implementations of this interface mirror the primitive conditions in our expression language:

  • ChangedByMeNotifyCondition
  • ChangedNotifyCondition
  • ErrorNotifyCondition
  • FailureNotifyCondition
  • FalseNotifyCondition
  • StateChangeNotifyCondition
  • SuccessNotifyCondition
  • TrueNotifyCondition

In addition, compound expressions (i.e. those involving operators) are represented using two further implementations:

  • CompoundNotifyCondition: combines a list of child NotifyConditions either conjunctively (and) or disjunctively (or).
  • NotNotifyCondition: inverts a single child NotifyCondition

Using these implementations of the NotifyCondition interface, we can create the self-evaluating data structure we need to test build results. The missing link is getting from the AST to the right combination of NotifyConditions. To help create NotifyConditions based on expressions, a simple factory is used:

class NotifyConditionFactory
{
    …

    public NotifyCondition createCondition(String condition)
    {
        …
    }
}

The factory takes a primitive condition string (e.g. “state.change”) and returns a new instance of the corresponding condition (e.g. StateChangeNotifyCondition).

Tree Parsers

A tree parser is a special kind of ANTLR parser that operates on an AST, rather than a token stream. This seems strange at first, as trees are a two-dimensional structure, not a single-dimansional stream. However, the principles are much the same: tree parsers just have a special rule syntax for matching input in two dimensions. In our case, we use the tree parser as a simple, declarative way to transform our AST into our own data structure representing the expression.

Defining the Tree Parser

Tree parsers are defined in grammar files, in our case in the same file: NotifyCondition.g. From the ANTLR manual:

{ optional class code preamble }
"class" YourTreeParserClass "extends TreeParser";
options
tokens
{ optional action for instance vars/methods }
tree parser rules...

The tree parser rule definitions should look very familiar:

rulename
    ":" alternative_1
    "|" alternative_2
    ...
    "|" alternative_n
    ";"

The difference is in how the alternatives are specified. ANTLR uses a special syntax that allows matching against trees:

"#(" root-token child1 child2 ... childn ")"

This will match against a tree (including subtrees) that has the given token as its root node, and has children matching the “childx” subrules. In a simple case, the child subrules are just token references, for example:

#( "and" "changed" "success" )

would match the tree:

    changed
and
    success

More generally, however, we can define the children recursively in terms of the tree parser rules themselves. This allows us to match against the ASTs generated by our parser:

class NotifyConditionTreeParser extends TreeParser;
cond
    : #("and" cond cond)
    | #("or" cond cond)
    | #("not" cond)
    | condition
    ;
condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Notice how the “cond” rule is defined recursively: children of an “and” expression can themselves be arbitrary conditions. In fact, this is more permissive than our parser, but that keeps the tree rules simpler.

Evaluating the Tree

The tree parser we defined does not yet do anything useful. To transform the AST into the corresponding NotifyCondition structure we need to embed actions into the tree rules. These actions can be used to insert code into the tree parser methods that conver the matched subtrees into corresponding NotifyCondition’s. Let’s dive right into the full tree parser definition:

class NotifyConditionTreeParser extends TreeParser;
{
    private NotifyConditionFactory factory;
    public void setNotifyConditionFactory(NotifyConditionFactory factory)
    {
        this.factory = factory;
    }
}
cond returns [NotifyCondition r]
{
    NotifyCondition a, b;
    r = null;
}
    : #("and" a=cond b=cond) {
        r = new CompoundNotifyCondition(a, b, false);
    }
    | #("or" a=cond b=cond) {
        r = new CompoundNotifyCondition(a, b, true);
    }
    | #("not" a=cond) {
        r = new NotNotifyCondition(a);
    }
    | c:condition {
        r = factory.createCondition(c.getText());
    }
    ;
condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

The first thing you’ll notice is a new function setNotifyConditionFactory that allows us to inject a factory into the parser. Next, you can see that the “cond” rule has been updated to return a NotifyCondition, from a variable named “r”. The syntax looks odd at first, but it is just a way to add Java code to the method that processes the rule. Likewise, we declare NotifyCondition variables “a” and “b”. Using the “cond=a” syntax, we can assign the result of the recursive invocations of the “cond” rule to these variables. Finally, we define the NotifyCondition to return for each alternative by embedding Java code to assign the value of “r”. I have skimmed over the details somewhat here, but you can always take a look at the NotifyConditionTreeParser code generated by ANTLR to help make the connection.

Generating the Tree Parser

OK, time to fire up ANTLR again:

$ java -classpath antlr-2.7.6.jar antlr.Tool NotifyCondition.g

We get a full 10 files:

NotifyConditionLexer.java
NotifyConditionLexer.smap
NotifyConditionParser.java
NotifyConditionParser.smap
NotifyConditionParserTokenTypes.java
NotifyConditionParserTokenTypes.txt
NotifyConditionTreeParser.java
NotifyConditionTreeParser.smap
NotifyConditionTreeParserTokenTypes.java
NotifyConditionTreeParserTokenTypes.txt

The tree parser itself is, of course, defined in NotifyConditionTreeParser.java. We can put the lexer, parser and tree parser together to transform an input string into a self-evaluating NotifyCondition:

public NotifyCondition getNotifyCondition(String expression)
{
    NotifyCondition notifyCondition = null;
    StringReader reader = null;

    try
    {
        reader = new StringReader(expression);
       
        NotifyConditionLexer lexer;
        NotifyConditionParser parser;
        NotifyConditionTreeParser tree;

        lexer = new NotifyConditionLexer(reader);
        parser = new NotifyConditionParser(lexer);
        tree = new NotifyConditionTreeParser();
        tree.setNotifyConditionFactory(notifyFactory);

        parser.orexpression();
        AST t = parser.getAST();       
        notifyCondition = tree.cond(t);
    }
    catch (Exception e)
    {
        System.stderr.println(e.getMessage());
    }
    finally
    {
        if(reader != null)
        {
            reader.close();
        }
    }

    return notifyCondition;
}

That’s it! We have reached the end goal. Now when a build completes, we can pass the build result to the NotifyCondition object to determine if a notification should be sent.

Wrap Up

Well, it took a few steps, but we have finally transformed our input into our desired output. Hopefully this has shown some of the power of ANTLR, and will encourage you to take a closer look. In the final part, we will look into minor improvements to add some polish to the implementation.

ANTLR By Example: Part 3: Parsing

Monday, June 19th, 2006

Introduction

We finished Part 2: Lexical Analysis with a working lexer. The lexer splits character streams into streams of tokens such as “changed.by.me”, “(” and “not”. In this part, we will create a parser that can turn these token streams into an abstract syntax tree (AST).

Abstract Syntax Trees (ASTs)

An AST captures the structure of the expression being parsed. Expressions are recursive by nature: they are made up of one or more sub-expressions, which in turn have sub-expressions of their own. Such a structure is naturally modelled as a tree, where each sub-tree models a sub-expression. The tree also captures the precedence of operators: a higher-precedence operator will bind more tightly and become a lower node in the tree. For example, the expression “changed or failed and not state.change” may be represented in tree form:

    changed
or
        failed
    and
        not
            state.change

Parsers

Defining the Parser

ANTLR parsers are defined in grammar files, usually the same file as the associated lexer. From the ANTLR manual:

{ optional class code preamble }
"class" YourParserClass "extends Parser;"
options
tokens
{ optional action for instance vars/methods }
parser rules...

This is very similar to the syntax for lexers that we saw in Part 2. We will be focusing mostly on the parser rules, which define the sequences of tokens that the parser will accept, and how the AST should be constructed. Parser rules have the same high-level format as lexer rules:

rulename
    ":" alternative_1
    "|" alternative_2
    ...
    "|" alternative_n
    ";"

Parser rule names must begin with a lower case letter. Alternatives are expressed in terms of tokens (as opposed to characters for lexers). You can also use string literals – they are handled specially by ANTLR (the strings are placed in a literals table for the associated lexer, and literals are treated as tokens).

To parse our boolean expression language, we need to break down the language into ANTLR rules. When breaking the language down, we also encode the precedence of the operators. Referring back to the example tree above, you will recall that the higher precedence operators appear lower in the tree. So, for example, we can see that the general form of an “or” expression is:

orexpression
    :   andexpression ("or" andexpression)*
    ;

Note the use of the ‘*’ wildcard to indicate zero or more repetitions. Other wildcards, such as ‘+’ (one or more) and ‘?’ (zero or one) may also be used. Working our way from top to bottom, we can define the whole language in this way:

orexpression
    :   andexpression ("or" andexpression)*
    ;
andexpression
    : notexpression ("and" notexpression)*
    ;
notexpression
    : ("not")? atom
    ;
atom
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Lower precedence operators have their expressions defined in terms of the next-highest precedence operator. This language is missing something, however. It does not yet handle grouping (using parentheses “(…)”). Recall that the purpose of grouping is to override the normal precedence. In this way, a grouped expression is an indivisible element (atom) with regards to the surrounding operators. So, to introduce grouped expressions, we need to define an atom as either a primitive condition or a grouped expression:

atom
    : condition
    | LEFT_PAREN orexpression RIGHT_PAREN
    ;
condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Building the AST

Now we have a parser that will accept valid token streams for our language. The next step is to tell ANTLR to construct the AST, and to annotate the rules to describe how the tree should be built. Turning on AST generation is done with an option:

options {
    buildAST=true;
}

We guide the AST construction using postfix annotations on the tokens in our parser rules. The following annotations are available:

  • no annotation: a token without an annotation becomes a leaf node in the tree
  • ^: a token annotated with a carat becomes a sub-expression root
  • !: a token annotated with an exclamation point is not included in the tree

In our language, the operators form root nodes, so we will annotate them with the ‘^’ character. As the grouping operators are only used to override precedence, there is no need form them in the final AST, so we will omit them by using the ‘!’ annotation. This gives us our final parser definition:

class NotifyConditionParser extends Parser;
options {
        buildAST=true;
}
orexpression
    :   andexpression ("or"^ andexpression)*
    ;
andexpression
    : notexpression ("and"^ notexpression)*
    ;
notexpression
    : ("not"^)? atom
    ;
atom
    : condition
    | LEFT_PAREN! orexpression RIGHT_PAREN!
    ;
condition
    : "true"
    | "false"
    | "success"
    | "failure"
    | "error"
    | "changed"
    | "changed.by.me"
    | "state.change"
    ;

Generating the Parser

At last, we are ready to generate the parser code:

$ java -classpath antlr-2.7.6.jar antlr.Tool NotifyCondition.g

Now we get six files:

NotifyConditionLexer.java
NotifyConditionLexer.smap
NotifyConditionParser.java
NotifyConditionParser.smap
NotifyConditionParserTokenTypes.java
NotifyConditionParserTokenTypes.txt

Let’s take the parser for a test drive:

Reader reader = new StringReader(argv[0]);
NotifyConditionLexer lexer = new NotifyConditionLexer(reader);
NotifyConditionParser parser = new NotifyConditionParser(lexer);
       
try
{
    parser.orexpression();
    AST tree = parser.getAST();
    System.out.println(tree.toStringTree());
}
catch (RecognitionException e)
{
    System.err.println(e.toString());
}
catch (TokenStreamException e)
{
    System.err.println(e.toString());
}

All of the rules in our parser become methods. To parse an expression, we simply choose the highest-level expression: orexpression. The trees are printed using a lisp-like syntax. For example, the output when parsing the expression “changed.by.me or not (success and changed)” is:

( or changed.by.me ( not ( and success changed ) ) )

The tree nodes are shown in parentheses with the root first, followed by a list of children. We can also try an expression with valid tokens, but an invalid structure “success or or changed”:

line 1:12: unexpected token: or

The parser rejects the expression because there is no rule that will accept “or or”.

Wrap Up

We’re really getting there now. We have a fully functional parser that accepts valid expressions and produces ASTs to represent them. In the next part, we will see how to convert these ASTs into our own notification condition data structure using ANTLR tree parsers.

ANTLR By Example: Part 2: Lexical Analysis

Monday, June 12th, 2006

Introduction

In Part 1: The language, we took a look at the simple boolean expression language we want to parse. In this part, we will start getting into ANTLR, and will tackle the problem of lexical analysis. This is the first phase of parsing, where the input stream of characters is processed to form a stream of tokens that the parser can understand.

Setup

Getting ANTLR

To actually try out the examples presented in this tutorial, you will need to download ANTLR. In our case we use the single ANTLR JAR file both to run the parser generation tool and as a runtime library. Standalone executables are also available, and can be a more convenient way to run the parser generator.

Running ANTLR

To run ANTLR via the JAR file, execute the following:

$ java -classpath antlr-2.7.6.jar antlr.Tool <input file>

Lexical Analysis

Grammar Files

Lexical analysis in ANTLR is treated similarly to parsing. You specify rules that correspond to tokens, and define which sequences of characters match those rules (and thus form those tokens) using an EBNF-like syntax. The rules are specified in “grammar” files which are text files with a “.g” extension (by convention). In this tutorial we will be using a single grammar file to describe the lexical analyser, parser and tree parser. Let’s start with an overview of the ANTLR syntax for specifying lexers (literal parts enclosed in quotes):

{ optional class code preamble }
"class" YourLexerClass "extends Lexer;"
options
tokens
{ optional action for instance vars/methods }
lexer rules...

In our simple example, we focus almost entirely on the lexer rules, and do not use any options, tokens or actions. The lexer rules themselves are specified using the following syntax:

rulename
    ":" alternative_1
    "|" alternative_2
    ...
    "|" alternative_n
    ";"

Lexer rule names must begin with a capital letter, and define tokens. In our example, we need to split the input characters into four types of token:

  1. WORD: sequences of alphabetical and period (’.’) characters, used for both the primitives (e.g. “state.change”) and operators (e.g. “and”) in our language.
  2. WHITESPACE: linebreaks, tabs and spaces, which delimit other tokens but are not passed on to the parser
  3. LEFT_PAREN: the ‘(’ character, used for grouping
  4. RIGHT_PAREN: the ‘)’ character, used for grouping

For each of these token types, we need to provide an ANTLR rule. Here is the corresponding grammar file (NotifyCondition.g):

header {
    package com.zutubi.pulse.condition.antlr;
}
class NotifyConditionLexer extends Lexer;
// Words, which include our operators
WORD: ('a'..'z' | 'A'..'Z' | '.')+ ;
// Grouping
LEFT_PAREN: '(';
RIGHT_PAREN: ')';
WHITESPACE
    : (' ' | '\t' | '\r' | '\n') { $setType(Token.SKIP); }
    ;

There are a couple of interesting things to note in this grammar:

  • ANTLR allows the use of a range operator ‘..’ to easily specify character ranges (e.g. ‘a’..’z’ includes all lowercase latin alphabetical characters)
  • You can also use wildcards to indicate repetition (e.g. the ‘+’ character is used to specify “one or more” in the WORD rule)
  • You can apply actions to the token, as illustrated in the WHITESPACE rule. This allows you to inject code into the method that processes the rule. In this case we set the token type to “SKIP” so that these tokens are not passed to the parser. The $setType directive is an ANTLR built-in that abstracts the actual code used to set the token type.

Of course, this is just the tip of the iceberg. The ANTLR manual has further details and examples for specifying lexer rules.

Generating the Lexer

Now we can fire up ANTLR and actually generate some code:

$ java -classpath antlr-2.7.6.jar antlr.Tool NotifyCondition.g

This generates four files:

NotifyConditionLexer.java
NotifyConditionLexer.smap
NotifyConditionLexerTokenTypes.java
NotifyConditionLexerTokenTypes.txt

The most interesting is the lexer itself, NotifyConditionLexer. We can drive this lexer ourselves to see how it works (forgive the lack of proper cleanup):

Reader reader;
NotifyConditionLexer lexer;
Token token;

reader = new StringReader(argv[0])
lexer  = new NotifyConditionLexer(reader);

try
{
    while(true)
    {
        token = lexer.nextToken();
        if(token.getType() == Token.EOF_TYPE)
        {
            break;
        }

        System.out.println(“Token: ‘” + token.getText() + “‘”);
    }
}
catch (TokenStreamException e)
{
    System.err.println(e.toString());
}

With input string “changed.by.me or not (success and changed)” this code prints:

Token: ‘changed.by.me’
Token: ‘or’
Token: ‘not’
Token: ‘(’
Token: ’success’
Token: ‘and’
Token: ‘changed’
Token: ‘)’

Excellent, ANTLR is breaking the string into the expected tokens! Note also that the whitespace is not reported, as we set those tokens to type “SKIP”. What if there is an error in the input? Running this code over the string “changed.by.me or 55″ gives:

Token: ‘changed.by.me’
Token: ‘or’
line 1:18: unexpected char: ‘5′

Our analyser does not have any rule that matches numbers, so it stops when it sees the first ‘5′ character.

Wrap Up

Well, we’re getting somewhere now. We’ve created our first grammar file, generated a lexer and used it to successfully tokenise an input stream. In the next installment we’ll define the rest of the expression language as parser rules, and generate an Abstract Syntax Tree (AST).

ANTLR By Example: Part 1: The Language

Monday, June 5th, 2006

Introduction

As promised, this is the first installment in a simple ANTLR tutorial “ANTLR By Example”. In this post, I’ll introduce the language to be implemented, and give an overview of the full tutorial.

Background

In case you missed the background, recently we added a feature to pulse that allows build notifications to be filtered using arbitrary boolean expressions. Rather than reinventing the wheel and writing a parser for the boolean expression by hand, we used ANTLR, a fantastic parser generator with support for Java (amongst several other languages). The purpose of this tutorial is to show you how easy ANTLR makes it to create your own little languages.

Overview

To make the tutorial manageable, it will be split over multiple parts:

The Language

Before we race into generating lexical anaylsers and parsers, it’s best that we nail down exactly what we need to parse. The language we need to process is a simple boolean expression language. The language consists of two types of “atoms” (indivisible elements): primitives and operators. These atoms are combined to form expressions.

Primitives

Primitives are simple elements that evaluate to true or false. To make the language readable, each primitive is represented as a descriptive string such as “true” or “changed.by.me”. Since we are filtering build notifications, most primitives in our language represent tests on the build result:

  • true: always evaluates to true
  • false: always evaluates to false
  • success: evaluates to true if the build succeeded
  • failure: evaluates to true if the build failed
  • error: evaluates to true if the build encountered an error
  • changed: evaluates to true if the build included a new change
  • changed.by.me: evaluates to true if the build included a new change by the owner of the filter
  • state.change: evaluates to true if the result of the build differs from the result of the previous build

The meanings of the primitives are not terribly important to the parsing, but this is a real world example after all :) .

Operators

Our language supports standard boolean operators for disjunction (or), conjuction (and) and inversion (not). Again for readability, we use words to represent these operators. Additionally, a grouping operator is defined, allowing expressions to be treated as atoms (to override operator precedence). The operators are summarised below:

  • or: a binary operator, a “or” b evaluates to true if either a or b evaluates to true
  • and: a binary operator, a “and” b evaluates to true only if both a and b evaluate to true
  • not: a unary operator, “not” a evaluates to true only if a evaluates to false
  • grouping: “(” a “)” groups the expression a into an atom, overriding normal precedence rules

The operators are listed from lowest to highest precendence, i.e. later operators bind more tightly.

Expressions

Expressions are formed by composing strings of atoms. Not all atom strings are valid expressions. In particular, operators must have correct arguments (left and right for binary, right for unary, internal for grouping). The following are all examples of valid expressions:

  • changed
  • changed.by.me and failed
  • not (success or state.change) and changed

Operators with higher precedence bind more tightly. Thus:

a “and” b “or” c

is equivalent to:

“(” a “and” b “)” “or” c

as “and” has higher precedence than “or”.

Wrap Up

So there it is, a simple language for defining boolean expressions. In the next part, we’ll fire up ANTLR and start turning character streams into token streams.

Your Own Little Language

Friday, June 2nd, 2006

One of the most studied and best understood areas of computer science is lexical analysis and parsing. Fantastic tools (such as ANTLR) are also available to automatically generate parsers based on declarative input. Despite this, most developers rarely take the opportunity to design their own “little” languages. No, I’m not suggesting we should all start creating full-blown programming languages (although that might be fun…). I’m not talking about Domain Specific Languages (DSLs) either1, at least not in the strictest sense. When I say “little” I mean a language custom built to take care of a small task.

For example, in pulse you can filter build notifications using arbitrary boolean expressions. To allow this, we created a custom boolean expression language. This “little” language has the usual boolean operators (and, or, not) and primitives like “success” which evaluates to true if the build was successful. It was implemented in an afternoon, thanks to ANTLR writing most of the code for us. The result is a simple yet powerful way to specify filters, much more usable than a GUI of equal flexibility. To keep the simple cases simple we provide a GUI with pre-canned expressions for the most common scenarios. It’s the best of both worlds.

So keep your eyes open, there are opportunities for little languages everywhere. With a tool like ANTLR in your arsenal to do all the heavy lifting, there is no reason not to give little languages a go! To give you a head start, I’ll be presenting an “ANTLR By Example” series of posts that will show exactly how we used ANTLR to create the language described above. Stay tuned!


1 on second thought, that would make this post trendier…

Wallet friendly RSS feeds

Thursday, May 4th, 2006

I have been subscribing to RSS feeds for some time now, but it was not until I had to implement one that I realised that there was more to it than just a structured XML response.

Below is a quick dissection of what I would consider the absolute minimum implementation for an RSS feed. I’m calling it “wallet friendly” not because it makes you money, but because it will save your users in bandwidth costs by not spitting out a full feed for every request. Whilst the example is in Java, using Rome and Webwork, the details apply equally well to other frameworks and languages.

This example is an extension of the WebWorkResultSupport class.


protected void doExecute(String format, ActionInvocation action....
{
    HttpServletResponse resp = ServletActionContext.getResponse();
    HttpServletRequest req = ServletActionContext.getRequest();
    OgnlValueStack stack = actionInvocation.getStack();  

Firstly, what do we do if there is no feed to display? The answer will depend on what that means in your system. If it’s an error, then it’s time for a 500 (Internal Server Error) response. If, on the other hand, it means that the feed request is invalid, it’s best to return a 410 (GONE) so that whoever is making the request knows that they should stop.

         
    SyndFeed feed = (SyndFeed) stack.findValue("feed");
    if (feed == null)
    {
        resp.sendError(HttpServletResponse.SC_GONE);
        return;
    }

Now set the content type and disposition. This will help the browser to deal with the response appropriately, and give it a friendly name if the user decides to ’save as’.


    resp.setContentType("application/rss+xml; charset=UTF-8");
    resp.setHeader("Content-Disposition", "filename=rss.xml");

When handling an RSS request, you should not return the feed unless it has changed since it was last requested. You can find a very good discussion on this by Charles in his fishbowl, and by Randy at the kbcafe. (For those in a hurry, the relevant details are that the “Last-Modified” and “ETag” headers are returned in the following request as “If-Modified-Since” and “If-None-Match” respectively)

There are two steps to this. The first is to always set the “Etag” and “Last-Modified” response headers. The “Last-Modified” details can be taken from the feed as so:


    // A happy default here is the If-Modified-Since header.
    // If we don't have any feed entries, then this will result
    // in a 304 Not modified
    Date lastModified =
            new Date(request.getDateHeader("If-Modified-Since"));
    List entries = feed.getEntries();
    if (entries.size() > 0)
    {
        // Get the latest feed entry - assuming the latest is
        // at the top and that you set a published/updated
        // date on the feed entries.
        SyndEntry entry = (SyndEntry) entries.get(0);
        lastModified = entry.getPublishedDate();
        Date updated = entry.getUpdatedDate();
        if (updated != null && lastModified.compareTo(updated) < 0)
        {
            lastModified = updated;
        }
    }

The Etag should uniquely identify this feed (read – the latest item in this feed). Unless you expect feed entries to be created at exactly the same time (or your database does not provide a high degree of accuracy in the time field) it’s sufficient to use the last modified timestamp for the Etag. If this is not unique enough in your case, you will need to create a unique hash from the content.


    String etag = Long.toString(lastModified.getTime());
    resp.setHeader("ETag", etag);

Before you can use the last modified date for the header, you may need to drop the milliseconds since they are not part of the date format used by HTTP.


    Calendar cal = Calendar.getInstance();
    cal.setTime(lastModified);
    cal.set(Calendar.MILLISECOND, 0);
    lastModified = cal.getTime();

Now we can set the “Last-Modified” header.


    // always set
    resp.setDateHeader("Last-Modified", lastModified.getTime());

That completes the first step (setting the “Last-Modified” and “Etag” headers on every response). The second step is to check the “If-None-Match” and “If-Modified-Since” on the request (remembering that they ’should’ contain what you sent our in the previous response). If they match the “ETag” and “Last-Modified” values we just set on the response then we do not need to return the feed. A 304 Not Modified will suffice.


    // Check the headers to determine whether or not a response
    // is required.
    if (TextUtils.stringSet(req.getHeader("If-None-Match")) ||
    TextUtils.stringSet(req.getHeader("If-Modified-Since")))
    {
        if (etag.equals(req.getHeader("If-None-Match")) &&
            lastModified.getTime() ==
                         req.getDateHeader("If-Modified-Since"))
        {
            // If response is not required, send 304 Not modified.
            resp.sendError(HttpServletResponse.SC_NOT_MODIFIED);
            return;
        }
    }

Now, let’s generate the feed data and send it on its way.


    // Render the feed in the requested format.
    WireFeed outFeed = feed.createWireFeed(format);
    outFeed.setEncoding("UTF-8");
    new WireFeedOutput().output(outFeed, response.getWriter());
    resp.flushBuffer();

Oh, and one last thing. If you want to return error details when things go wrong, do the RSS reader a favour and format the error response in valid RSS format. But be sure to set appropriate Last Modified and Etag header. For example, set an error token in the ETag header that you can check next time round. If your current error token matches the ETag in the request, respond with the 304.

——-
Into continuous integration? Want to be? Try pulse.

Ajax vs Caching

Wednesday, April 12th, 2006

Recently I’ve thrown some simple Ajax (well, Aj at least) into our app, to refresh content that changes frequently without refreshing the whole page. With a library like prototype, it really couldn’t be simpler, just throw in an Ajax.PeriodicalUpdater that pulls in a new HTML fragment and you’re set.

Almost.

Working with web UIs, we’ve all run into the standard browser caching problems. A meta tag or two:

<meta http-equiv=”Pragma” content=”no-cache”/>
<meta http-equiv=”Expires” content=”0″/>

solves most problems. When you start using Javascript to periodically modify the page, however, things get trickier. The first problem I ran into was with IE. Symptom: the periodical updates just didn’t work. Because these were just fragments of a page, they didn’t contain the meta tags above. IE was caching the HTML fragment rather than hitting the server again as desired. Solution: use the no-cache and expiration headers on the HTTP response itself. I implemented this in WebWork with a simple interceptor, keeping this detail out of my action classes and allowing it to be reused as necessary:

public class AjaxInterceptor extends AroundInterceptor
{
    protected void after(ActionInvocation actionInvocation, String string)
        throws Exception
    {
    }
    protected void before(ActionInvocation actionInvocation)
        throws Exception
    {
        HttpServletResponse response = ServletActionContext.getResponse();
        response.setHeader("Pragma", "no-cache");
        response.setDateHeader("Expires", 0);
    }
}

So we’re done right? Right? Wrong. This fixed the original IE problem, but after a while I noticed another major annoyance. I had committed a mortal web UI sin: the back button was broken.

If an updating page was left open long enough for a content refresh, navigating away from that page and then returning via the back button displayed the originally-loaded content, not the refreshed content. From a user’s perspective this is not only annoying, but downright confusing. The problem in this case was again caching: not caching of the HTML fragment, but caching of the whole page. “But wait!”, you ask, “That original page has the meta tags, right?”. Right! But the back button doesn’t care. In fact, the HTTP spec requires the back button not to care. When you hit that back button you get the originally-loaded content, from the browser’s cache, expiration rules be damned!

So what can we do? The answer is to stop the page being put in the cache in the first place. This can be done with a liberal sprinkling of Cache-Control headers, to cater for various browsers:

...
    protected void before(ActionInvocation actionInvocation)
        throws Exception
    {
        HttpServletResponse response = ServletActionContext.getResponse();
        response.setHeader("Pragma", "no-cache");
        response.setHeader("Cache-Control", "must-revalidate");
        response.setHeader("Cache-Control", "no-cache");
        response.setHeader("Cache-Control", "no-store");
        response.setDateHeader("Expires", 0);
    }
...

Firefox in particular is fussy, requiring “no-store”.

For more information on back button and caching behaviour in Firefox and IE, here’s a head start:

——-
Into continuous integration? Want to be? Try pulse.