Hari's Corner

Humour, comics, tech, law, software, reviews, essays, articles and HOWTOs intermingled with random philosophy now and then

Writing a Toy Calculator scripting language with Java and ANTLR 4 - Part 2

Filed under: Tutorials and HOWTOs by Hari
Posted on Sun, Apr 5, 2020 at 14:57 IST (last updated: Thu, Apr 9, 2020 @ 17:56 IST)


In this second part of the series (following on from the first part), we will set up NetBeans/Maven to work with ANTLR and build our first grammar. I will assume that you have installed Java and NetBeans and created a new, empty Maven project. From here on, all references to subfolders are relative to this project folder.

Setting up Maven

Add the following lines inside the top-level project element of pom.xml, which is found in Project Files under the Projects sidebar in NetBeans. They declare the ANTLR runtime as a dependency and set up automatic generation of Java classes from our ANTLR grammar file.

<dependencies>
    <dependency>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-runtime</artifactId>
        <version>4.7.2</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.antlr</groupId>
            <artifactId>antlr4-maven-plugin</artifactId>
            <version>4.7.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>antlr4</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

With these entries in place, building the project from NetBeans will download the ANTLR 4 runtime and run the plugin's antlr4 goal as part of the build.

Our Grammar

As per the specifications set out in the earlier part of this series, it is now time to define our grammar. If you haven't already, please read the previous part to understand the conceptualization of the grammar. That will make it easier to follow the remaining part of this series.

First, create a new folder named antlr4 under src/main in your project folder. This is also a good time to familiarize yourself with EBNF, since ANTLR describes grammars in an EBNF-like notation.

Create a new grammar file ToyCalc.g4 inside the src/main/antlr4 directory, under the full package subfolder, i.e. in our case org/harishankar/toycalc. Thus the grammar file will reside in src/main/antlr4/org/harishankar/toycalc. Because the grammar file is placed here, the Maven ANTLR plugin will automatically generate the sources for the parser in the generated sources folder of the project.

ToyCalc.g4 is the file where we will define our grammar. In this case it holds both the lexer rules and the parser rules, although ANTLR also allows them to be split into separate files.

This is our basic grammar.

grammar ToyCalc;

toycalc     : (statement TERMINATOR)+;
statement   : (OPERATION EXPR | PRINT STRING | GETVALUE);

TERMINATOR  : ';';
OPERATION   : 'SETVALUE' | 'ADD' | 'SUB' | 'MUL' | 'DIV' ;
PRINT       : 'PRINT';
GETVALUE    : 'GETVALUE';
EXPR        : INTEGER | FLOAT;
STRING      : '"'(.*?)'"';
INTEGER     : [0-9]+ | '-'[0-9]+;
FLOAT       : [0-9]+'.'[0-9]+ | '-'[0-9]+'.'[0-9]+;
COMMENT     : '/*'(.*?)'*/' -> skip;
WS          : [ \t\r\n]+ -> skip ;

Each identifier beginning with a lowercase letter is a parser rule and each identifier beginning with an uppercase letter is a lexer rule. The lexer rules are placed below the parser rules. The first line declares that this file is a combined ANTLR grammar (lexer and parser rules together) named ToyCalc. The grammar name must match the file name, ToyCalc.g4.

Let us analyze the parsing rules first:

toycalc     : (statement TERMINATOR)+;
statement   : (OPERATION EXPR | PRINT STRING | GETVALUE);

The first rule is the entry point of our grammar. It states that a program is one or more statements, each followed by a terminator token (defined by our lexer). The + sign stands for "one or more".

The next line defines what a statement is in our language. The pipe symbol is used to define a set of alternative constructs that can apply to the rule. In this case, the rule states that a statement can be an "OPERATION" followed by an "EXPR" (expression, which in turn can be a positive or negative integer or decimal number). A statement can also be a "PRINT" statement followed by a "STRING" to be printed (which is basically defined as any literal string enclosed in double quotes), or a simple "GETVALUE" statement which doesn't need any other token following it.
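To make the shape of these two rules concrete, here is a minimal hand-written sketch in Java that checks a sequence of token type names against them. It is only an illustration and is not the parser that ANTLR generates: the class name ParserRuleSketch and its methods are hypothetical, and the input is assumed to be a plain list of token names rather than real ANTLR tokens.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: a hand-written recognizer over token type names.
// This is NOT what ANTLR generates; all names here are hypothetical.
class ParserRuleSketch {

    // toycalc : (statement TERMINATOR)+ ;
    static boolean toycalc(List<String> types) {
        int pos = 0;
        int pairs = 0;
        while (pos < types.size()) {
            int afterStatement = statement(types, pos);
            // Every statement must be followed by a TERMINATOR token.
            if (afterStatement < 0 || afterStatement >= types.size()
                    || !"TERMINATOR".equals(types.get(afterStatement))) {
                return false;
            }
            pos = afterStatement + 1;
            pairs++;
        }
        return pairs >= 1; // "+" means at least one statement/terminator pair
    }

    // statement : OPERATION EXPR | PRINT STRING | GETVALUE ;
    // Returns the index just past the statement, or -1 if no alternative matches.
    static int statement(List<String> types, int pos) {
        if (startsWith(types, pos, "OPERATION", "EXPR")) return pos + 2;
        if (startsWith(types, pos, "PRINT", "STRING")) return pos + 2;
        if (startsWith(types, pos, "GETVALUE")) return pos + 1;
        return -1;
    }

    // True if the token names at pos match the expected sequence.
    private static boolean startsWith(List<String> types, int pos, String... expected) {
        if (pos + expected.length > types.size()) return false;
        for (int i = 0; i < expected.length; i++) {
            if (!expected[i].equals(types.get(pos + i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A valid program: an operation followed by GETVALUE.
        System.out.println(toycalc(Arrays.asList(
                "OPERATION", "EXPR", "TERMINATOR", "GETVALUE", "TERMINATOR")));
        // Invalid: the operation has no EXPR after it.
        System.out.println(toycalc(Arrays.asList("OPERATION", "TERMINATOR")));
    }
}
```

Each alternative separated by a pipe in the statement rule becomes one branch in the statement method, and the + in the toycalc rule becomes a loop that must run at least once.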

Recall the conceptualization of our toy language. A statement can be an operation (one of setting/resetting the calculator value, adding, subtracting, multiplying or dividing), a print statement to display a message, or a statement to display the current value of the calculator. This distinction is rather important, because the way we structure the grammar will influence how we implement the logic. That is why the grammar should be well structured before any code is written. Our grammar is quite trivial, so a mental map of it may not be essential here, but it is still a good idea to map out the entire grammar before defining it in code.
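As a rough illustration of how the grammar's structure can carry over into the logic, here is a small Java sketch of the calculator state. Everything in it is an assumption made for this example: the class name CalculatorStateSketch, the apply method and the starting value of 0 are hypothetical and are not code from this series.

```java
// Illustration only: hypothetical calculator state, not code from this series.
class CalculatorStateSketch {
    private double value = 0.0; // assumed starting value

    // Handles the "OPERATION EXPR" kind of statement.
    // Because the grammar groups the five keywords under one OPERATION token,
    // the logic can branch on the token text in a single place.
    void apply(String operation, double operand) {
        switch (operation) {
            case "SETVALUE": value = operand; break;
            case "ADD":      value += operand; break;
            case "SUB":      value -= operand; break;
            case "MUL":      value *= operand; break;
            case "DIV":      value /= operand; break;
            default:
                throw new IllegalArgumentException("Unknown operation: " + operation);
        }
    }

    // Handles the "GETVALUE" kind of statement.
    double getValue() {
        return value;
    }

    public static void main(String[] args) {
        CalculatorStateSketch calc = new CalculatorStateSketch();
        calc.apply("SETVALUE", 10);
        calc.apply("ADD", 5);
        calc.apply("DIV", 3);
        System.out.println(calc.getValue());
    }
}
```

Had the grammar defined a separate parser alternative for each of the five keywords, the logic would more naturally be split into five handlers instead of one switch.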

Note that each of the above uppercase identifiers found in the grammar file (i.e. TERMINATOR, OPERATION, EXPR, PRINT, STRING, GETVALUE etc.) is a token that is emitted by the lexer as it scans the plain text. These tokens are "terminals", i.e. they are the end points of the grammar, the basic building blocks on which we construct the rules.
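To give a feel for what "emitting tokens" means, here is a simplified Java sketch that turns a line of ToyCalc source into a list of token names using regular expressions. It is a hand-written approximation and not the lexer that ANTLR generates: the class name TokenStreamSketch is hypothetical, the INTEGER and FLOAT rules are folded into a single EXPR pattern, and the only matching behaviour imitated is that the longest match wins and a tie goes to the rule listed first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a regex approximation of the lexer rules in ToyCalc.g4.
// This is NOT the lexer ANTLR generates; all names here are hypothetical.
class TokenStreamSketch {
    // Rules are kept in the same order as in the grammar file.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("TERMINATOR", Pattern.compile(";"));
        RULES.put("OPERATION", Pattern.compile("SETVALUE|ADD|SUB|MUL|DIV"));
        RULES.put("PRINT", Pattern.compile("PRINT"));
        RULES.put("GETVALUE", Pattern.compile("GETVALUE"));
        // Simplification: INTEGER and FLOAT are merged into one EXPR pattern.
        RULES.put("EXPR", Pattern.compile("-?[0-9]+(\\.[0-9]+)?"));
        RULES.put("STRING", Pattern.compile("\"[^\"]*\""));
        RULES.put("COMMENT", Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer matches replace the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            // COMMENT and WS correspond to "-> skip": matched but not emitted.
            if (!bestName.equals("COMMENT") && !bestName.equals("WS")) {
                tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ADD 5;"));
    }
}
```

For the input ADD 5; this sketch prints [OPERATION(ADD), EXPR(5), TERMINATOR(;)]. The whitespace is consumed but never appears in the list, which is the effect of the skip command on the WS rule.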

In the next part, I will walk through the lexical analysis (which is the first stage of parsing) that is required to generate these tokens.
