If you want to master a bison first you should learn to flex your muscles well.
Oh, all right. That was an extreme bad pun. But in any case, I'm starting to learn and understand how a lexical analyzer works and how to create a full-fledged lexer-scanner and parser for a fairly simple application.
For those who are not aware, a lexer is a program that reads a stream of characters and tries to match patterns in text. For example, a pattern can be any word, any number, a telephone number, etc. In fact, anything you can write as a regular expression is a pattern readable by a computer.
flex is the modern (and Free Software) implementation (not wholly compatible with the old Lex) of a scanner-generator software. Using flex you can generate a C/C++ program that scans input text, looks for patterns in them and generates "TOKENS" (which are nothing but the matched text tagged as an identifier) which can be used for further processing especially by a parser generator like
bison or
yacc. The parser grammar defines how to read the tokens, which types of tokens are valid and how to arrange tokens to ensure correct grammar.
As you might have guessed by now, using Flex, especially with a parser generator like bison is not at all easy for those not experienced with at least the basics of regexps. And writing the grammar for a parser generator is even more involved. There are precious few tutorials on the web out there which tell you how to write a decent parser and even then many of those grammars you find in tutorials are rather arcane and not wholly compatible with bison.
Yet it beats using only a regular expression engine which can lead to some extremely unreadable code. Yesterday as an exercise I wrote a simple HTML tag removal script entirely in Flex (without using bison).
It reads everything from <body> to </body> and ignores tags. The flex source code below (well commented) defines the pattern matching necessary to do this and also the actions performed on encountering the patterns (either ignore or output the text based on the conditions).
/* is_started - generic flag to detect start of any opening
or closing tag
is_body - flag to detect the start of the body opening tag */
int is_started = 0, is_body = 0;
/* body matches either <body> or <body attribte="value"...> etc.
we avoid complex regexps here as the purpose is merely to ignore
everything inside < and > */
BODY "<BODY"
/* once the end of the <body> is matched, we again keep ignoring the
rest of the text */
ENDBODY "</BODY>"
/* define the start and end of any tag or tag like element - here again
we keep it simple. We are not writing a validating program but merely
detecting the beginning and end of tags */
STARTTAG "<"
ENDTAG ">"
/* convert tabs */
TAB \t
/* match spaces, two or more */
WHITESPACE [ ]{2,}
/* match newlines */
NEWLINE [\r\n]
NEWLINES {NEWLINE}{2,}
%%
{BODY} { is_body = 1; /* body start tag started but not
necessarily closed as it might
contain attributes */
is_started = 0; /* don't output anything till
the end of the body start tag is
reached */
}
{STARTTAG} { is_started = 0; /* Flag the begin of any tag */ }
{ENDTAG} { is_started = 1; /* Flag the end of any tag */ }
{ENDBODY} { is_body = 0; /* Flag the end body tag */ }
{TAB} {
if (is_started == 1 && is_body == 1)
printf (" "); /* replace single tabs with 2 spaces */
}
{WHITESPACE} {
if (is_started == 1 && is_body == 1)
printf (" "); /* if you encounter two or more
whitespaces, replace with exactly
two spaces */
}
{NEWLINE} {
if (is_started == 1 && is_body == 1)
printf ("\n");
}
{NEWLINES} {
if (is_started == 1 && is_body == 1)
printf ("\n\n"); /* if you encounter two or more
newlines replace with exactly
two newlines */
}
. {
if (is_started == 1 && is_body == 1)
printf (yytext);
}
%%
To generate the C source from the above, you need flex installed (it should be available with most GNU/Linux distributions). Save the above source code in a file called
htmlstrip.l.
From a terminal/console, run the command
flex -s -i -o htmlstrip.c htmlstrip.l
The
-s flag states that Flex should be in "silent" mode, that is not echo the unmatched patterns directly to output. The
-i flag asks flex to run in case-insensitive mode, so any matches for alphabetic characters are done in a case-insensitive way. The resulting output is saved in a file called
htmlstrip.c
To compile the C source code, run the command
gcc htmlstrip.c -lfl -o htmlstrip
The
-lfl flag asks the compiler to link against the flex shared library. Now you can test your script by using an HTML file like this:
htmlstrip < file.html > file.txt
The above commands reads from a source file
file.html and outputs the result to a file called
file.txt. You can also run it interactively by not using the
< file.html to read from a source file and to echo the output to stdout you can avoid using the
>file.txt.
As you can imagine, it's pretty obviously very powerful and with a bit of practice easy to use. Combine it with a full fledged parser and you can do a lot more with it.
While this kind of thing is mostly taught in Computer Science courses in universities and for those writing programming language interpreters/compilers, text pattern matching and parsing is useful in the arsenal for any programmer looking for advanced text-manipulation tools.
2 comment(s)
Leave a comment »Comment by RT Cunningham (visitor) on 1 May 2009 @ 10:56:10 IST #
For general non-contextual parsing which also involves tokenizing strings and then matching tokens, you'll find that tools like flex and bison are irreplaceable as they simplify so many steps in the process of text analysis.However, for simple pattern matching in text which is context-driven, reg-exps do the job much more nicely.
Comment by Hari (blog owner) on 1 May 2009 @ 12:19:55 IST #