a little madness

A man needs a little madness, or else he never dares cut the rope and be free -Nikos Kazantzakis

Zutubi

ANTLR By Example: Part 2: Lexical Analysis

Introduction

In Part 1: The language, we took a look at the simple boolean expression language we want to parse. In this part, we will start getting into ANTLR, and will tackle the problem of lexical analysis. This is the first phase of parsing, where the input stream of characters is processed to form a stream of tokens that the parser can understand.

Setup

Getting ANTLR

To actually try out the examples presented in this tutorial, you will need to download ANTLR. In our case we use the single ANTLR JAR file both to run the parser generation tool and as a runtime library. Standalone executables are also available, and can be a more convenient way to run the parser generator.

Running ANTLR

To run ANTLR via the JAR file, execute the following:

$ java -classpath antlr-2.7.6.jar antlr.Tool <input file>

Lexical Analysis

Grammar Files

Lexical analysis in ANTLR is treated similarly to parsing. You specify rules that correspond to tokens, and define which sequences of characters match those rules (and thus form those tokens) using an EBNF-like syntax. The rules are specified in “grammar” files which are text files with a “.g” extension (by convention). In this tutorial we will be using a single grammar file to describe the lexical analyser, parser and tree parser. Let’s start with an overview of the ANTLR syntax for specifying lexers (literal parts enclosed in quotes):

{ optional class code preamble }
"class" YourLexerClass "extends Lexer;"
options
tokens
{ optional action for instance vars/methods }
lexer rules...

In our simple example, we focus almost entirely on the lexer rules, and do not use any options, tokens or actions. The lexer rules themselves are specified using the following syntax:

rulename
    ":" alternative_1
    "|" alternative_2
    ...
    "|" alternative_n
    ";"

Lexer rule names must begin with a capital letter, and define tokens. In our example, we need to split the input characters into four types of token:

  1. WORD: sequences of alphabetical and period (‘.’) characters, used for both the primitives (e.g. “state.change”) and operators (e.g. “and”) in our language.
  2. WHITESPACE: linebreaks, tabs and spaces, which delimit other tokens but are not passed on to the parser
  3. LEFT_PAREN: the ‘(‘ character, used for grouping
  4. RIGHT_PAREN: the ‘)’ character, used for grouping

For each of these token types, we need to provide an ANTLR rule. Here is the corresponding grammar file (NotifyCondition.g):

header {
    package com.zutubi.pulse.condition.antlr;
}
class NotifyConditionLexer extends Lexer;
// Words, which include our operators
WORD: ('a'..'z' | 'A'..'Z' | '.')+ ;
// Grouping
LEFT_PAREN: '(';
RIGHT_PAREN: ')';
WHITESPACE
    : (' ' | '\t' | '\r' | '\n') { $setType(Token.SKIP); }
    ;

There are a couple of interesting things to note in this grammar:

  • ANTLR allows the use of a range operator ‘..’ to easily specify character ranges (e.g. ‘a’..’z’ includes all lowercase latin alphabetical characters)
  • You can also use wildcards to indicate repetition (e.g. the ‘+’ character is used to specify “one or more” in the WORD rule)
  • You can apply actions to the token, as illustrated in the WHITESPACE rule. This allows you to inject code into the method that processes the rule. In this case we set the token type to “SKIP” so that these tokens are not passed to the parser. The $setType directive is an ANTLR built-in that abstracts the actual code used to set the token type.

Of course, this is just the tip of the iceberg. The ANTLR manual has further details and examples for specifying lexer rules.

Generating the Lexer

Now we can fire up ANTLR and actually generate some code:

$ java -classpath antlr-2.7.6.jar antlr.Tool NotifyCondition.g

This generates four files:

NotifyConditionLexer.java
NotifyConditionLexer.smap
NotifyConditionLexerTokenTypes.java
NotifyConditionLexerTokenTypes.txt

The most interesting is the lexer itself, NotifyConditionLexer. We can drive this lexer ourselves to see how it works (forgive the lack of proper cleanup):

Reader reader;
NotifyConditionLexer lexer;
Token token;

reader = new StringReader(argv[0])
lexer  = new NotifyConditionLexer(reader);

try
{
    while(true)
    {
        token = lexer.nextToken();
        if(token.getType() == Token.EOF_TYPE)
        {
            break;
        }

        System.out.println("Token: ‘" + token.getText() + "’");
    }
}
catch (TokenStreamException e)
{
    System.err.println(e.toString());
}

With input string “changed.by.me or not (success and changed)” this code prints:

Token: ‘changed.by.me’
Token: ‘or’
Token: ‘not’
Token: ‘(‘
Token: ‘success’
Token: ‘and’
Token: ‘changed’
Token: ‘)’

Excellent, ANTLR is breaking the string into the expected tokens! Note also that the whitespace is not reported, as we set those tokens to type “SKIP”. What if there is an error in the input? Running this code over the string “changed.by.me or 55” gives:

Token: ‘changed.by.me’
Token: ‘or’
line 1:18: unexpected char: ‘5’

Our analyser does not have any rule that matches numbers, so it stops when it sees the first ‘5’ character.

Wrap Up

Well, we’re getting somewhere now. We’ve created our first grammar file, generated a lexer and used it to successfully tokenise an input stream. In the next installment we’ll define the rest of the expression language as parser rules, and generate an Abstract Syntax Tree (AST).

Liked this post? Share it!

6 Responses to “ANTLR By Example: Part 2: Lexical Analysis”

  1. June 15th, 2006 at 1:48 pm

    Technical Related Notes » Blog Archive » links for 2006-06-13 says:

    […] a little madness » Blog Archive » ANTLR By Example: Part 2: Lexical Analysis (tags: java antlr) […]

  2. August 8th, 2006 at 9:13 pm

    Ibrahim says:

    Hi!

    You made an error in the rule WHITESPACE of the lexer. I think you meant something like:

    (‘ ‘ | ‘\t’ | ‘\n’ | ‘\r’)

    Regards,
    Ibrahim

  3. August 8th, 2006 at 10:23 pm

    Jason says:

    Ibrahim,

    Thanks for the catch! I didn’t notice that the backslashes were eaten by wordpress. I have fixed it now.

  4. September 26th, 2006 at 2:37 am

    Lucio says:

    Thanks for the introduction to ANTLR. This helps a lot with something that I was about to do and write my own parser for, but this makes much more sense.

    Thank You

  5. November 18th, 2006 at 8:26 am

    Chitro Neogy says:

    Jason, can you post the source code of these examples in a downloadable link? This is very helpful, thanks. I’m trying to learn this language.

    Chitro Neogy

  6. April 23rd, 2011 at 3:52 am

    Khalid says:

    Hi,

    I am trying to learn ANTLR tool. Can you upload the source code for grammer file as I get errors generating source files

    Regards
    Khalid

Leave a Reply