Writing a Toy Calculator scripting language with Java and ANTLR 4 - Part 3

Filed under: Tutorials and HOWTOs by Hari
Posted on Mon, Apr 6, 2020 at 19:12 IST (last updated: Thu, Apr 9, 2020 @ 18:09 IST)


In the previous part, I introduced the grammar of our Toy Calculator language. If you haven't already, read part 1 and part 2 first. Here is a summary of the parser rules portion:

toycalc     : (statement TERMINATOR)+;
statement   : (OPERATION EXPR | PRINT STRING | GETVALUE);

Basically, a program is a series of statements, each of which can be one of three things: an operation followed by an expression, a print statement, or a statement to get and print the current calculator value.
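
To make this concrete, here is a small sample program in our Toy Calculator language (a made-up example; any statements that follow the rules above would do):

SETVALUE 10;
ADD 23.25;
PRINT "The current value is:";
GETVALUE;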

Here is the lexer definition that supports the parser rules above, analyzed line by line:

TERMINATOR  : ';';

The above is a token definition. Token names start with an upper-case character in ANTLR grammar files, and by convention the entire name is written in upper case for readability. Tokens are basically lexing rules: they describe how to read characters from the input stream and organize them into tokens. Lexer rules are mostly (or should be) simple and unambiguous, because problems in tokenizing can lead to frustrating problems in parsing the grammar.

The above rule defines our statement terminator (a literal semicolon, though we could use any other character; it is our grammar, after all!). Note that the terminator is used in the first rule of our grammar, following each statement.

OPERATION   : 'SETVALUE' | 'ADD' | 'SUB' | 'MUL' | 'DIV' ;

The next one defines what an operation is. Note that we use string literals (we could also use symbolic constants, but for simplicity's sake I have used literals). This is simple: we define a set of alternatives, i.e. an operation can be any one of: set a value, add, subtract, multiply or divide. In our Toy Calculator each operation is followed by exactly one number (an integer or a decimal number, represented by the token EXPR), and the operation is applied to the calculator's value; for example, the statement ADD 23.25; will add 23.25 to the current calculator value.
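
For example, this hypothetical two-statement sequence leaves the calculator value at 25 (10 multiplied by 2.5):

SETVALUE 10;
MUL 2.5;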

PRINT       : 'PRINT';
GETVALUE    : 'GETVALUE';

These are self-explanatory: PRINT is a token for the literal word PRINT in the input stream, and GETVALUE for the literal match GETVALUE. As already defined conceptually, the first should print a string enclosed in double quotes and the second should simply display the current calculator value.

EXPR        : INTEGER | FLOAT;

The above defines a token as either of two other tokens, i.e. INTEGER or FLOAT. This makes it easy to treat a class of tokens as one, or individually if required. In this case the token might as well have been named NUMBER, but I chose EXPR; you can always change it if you wish.

STRING      : '"'(.*?)'"';

This defines what a string is in our language. It is a very simple rule (most real-world languages have escaping rules for strings, which make this much less trivial): match the first double quote, read any character zero or more times, and match the next double quote. Note that the question mark makes the match non-greedy, i.e. the lexer will stop at the very next double quote rather than reading on for the longest possible match. This is important to note, because if the match were greedy, our grammar would fail: the lexer would hunt for the longest stretch of input between two double quotes.
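
To see why this matters, consider a hypothetical input like:

PRINT "first"; PRINT "second";

With the non-greedy match the lexer produces two STRING tokens, "first" and "second". With a greedy .* it would instead produce a single STRING token stretching from the opening quote of "first" all the way to the closing quote of "second", swallowing the semicolon and the second PRINT, and parsing would fail.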

Note: Our definition of STRING has severe limitations in a production scenario. You cannot have double quotes inside the string, as our lexer will stop at the first double-quote character it meets; you cannot have any escape sequences; and all whitespace inside the string is treated literally (which may or may not meet your requirements). But for our basic Toy Calculator, all we want is a way to display a short message to the user, and so this should do the trick.

INTEGER     : [0-9]+ | '-'[0-9]+;
FLOAT       : [0-9]+'.'[0-9]+ | '-'[0-9]+'.'[0-9]+;

The next two tokens define an integer and a float respectively. Here the form [0-9]+ uses a character range, i.e. any digit between 0 and 9, repeated one or more times. Note the alternative, i.e. a minus character before the [0-9]+: an integer can be positive or negative.

A FLOAT is defined similarly, except that there must be a decimal point between the digits, i.e. one or more digits, followed by a decimal point, followed by one or more digits. The alternative again covers negative numbers.
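
A few hypothetical statements showing what these rules match:

ADD 42;
ADD -7;
ADD 3.14;
SUB -0.5;

Here 42 and -7 are matched by INTEGER, while 3.14 and -0.5 are matched by FLOAT. A form like .5 would not match, because FLOAT requires at least one digit on each side of the decimal point.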

You might have noticed that we repeat the pattern [0-9]+ several times, even though it is not in itself a token to be recognized by the parser. In such cases you can use fragments, i.e. symbolic constants that are not emitted as tokens but are meant to be used to build tokens. Fragments are defined with the fragment keyword in ANTLR like this:

fragment DIGIT : [0-9];

and we could replace the corresponding lexer rules with:

INTEGER     : DIGIT+ | '-' DIGIT+;
FLOAT       : DIGIT+ '.' DIGIT+ | '-' DIGIT+ '.' DIGIT+;

However, I have chosen not to use fragments in my grammar. For larger and less trivial grammars, though, using fragments will arguably make the grammar more readable.

Finally, the last two lines deserve special mention.

COMMENT     : '/*'(.*?)'*/' -> skip;
WS          : [ \t\r\n]+ -> skip ;

The above defines two special lexer rules whose tokens never reach the parser: the -> skip directive instructs the lexer to discard them. Here, a comment is defined as any content that starts with /* and ends with */, i.e. C-style comments. Again, as with strings, we use non-greedy matching, so that the match stops at the first occurrence of the comment terminator */.

The next rule discards whitespace between other characters in the stream, i.e. spaces, tabs, carriage returns and newlines. This has an interesting effect on our grammar, since we are effectively stating that all whitespace is meaningless to the parser. With such an approach, we can combine multiple (or even all) statements on a single line, with the statement terminator character distinguishing individual statements. We also need not have a space between two distinct tokens: a statement like ADD 10 can just as well be written as ADD10 under our grammar, since 'ADD' unambiguously matches one token rule and 10 matches another, as the example below shows. But more on that in a later part.
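
To illustrate both skip rules, this contrived one-line program is perfectly valid under our grammar:

/* comments vanish */ SETVALUE 5;ADD10;GETVALUE;

The comment and all the whitespace are discarded by the lexer, and ADD10 still tokenizes cleanly as the OPERATION token ADD followed by the EXPR 10.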

One very important point about lexing rules: if two or more lexer rules are ambiguous, i.e. they match the same input, ANTLR gives precedence to the rule that appears first in the grammar. Hence the ordering of rules is important. A more general rule, if it appears above a more specific rule that matches a subset of the general rule, will always win, meaning the specific rule will never be matched. For example:

IDENTIFIER  : [a-z]+;
PRINT       : 'print';

In the above, a PRINT token will never be generated by the lexer, as the earlier rule IDENTIFIER, defined as one or more lower-case characters, will match the word 'print' (note the case sensitivity).
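
The usual fix is to put the more specific rule first so that it wins the tie:

PRINT       : 'print';
IDENTIFIER  : [a-z]+;

Now the input print produces a PRINT token, while a longer word such as printer still becomes an IDENTIFIER, because the ANTLR lexer prefers the longest match and falls back on rule order only when two rules match input of the same length.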

In the next part, we will generate the actual parser code for our grammar using ANTLR and do something useful with the generated classes. This is where all the magic happens.
