Hari's Corner

Humour, comics, tech, law, software, reviews, essays, articles and HOWTOs intermingled with random philosophy now and then

flex - learning a little bit of pattern matching

Filed under: Tutorials and HOWTOs by Hari
Posted on Thu, Mar 26, 2009 at 13:27 IST (last updated: Thu, Mar 26, 2009 @ 14:17 IST)

If you want to master a bison first you should learn to flex your muscles well.

Oh, all right. That was an extremely bad pun. But in any case, I'm starting to learn and understand how a lexical analyzer works and how to create a full-fledged lexer/scanner and parser for a fairly simple application.

For those who are not aware, a lexer is a program that reads a stream of characters and tries to match patterns in the text. For example, a pattern might be a word, a number or a telephone number. In fact, anything you can write as a regular expression is a pattern a computer can match.

flex is the modern (and Free Software) implementation of a scanner generator (not wholly compatible with the old lex). Using flex you can generate a C/C++ program that scans input text, looks for patterns in it and emits "tokens" (which are nothing but pieces of matched text tagged with an identifier). These tokens can then be processed further, especially by a parser generator like bison or yacc. The parser grammar defines how the tokens are read, which types of tokens are valid and which sequences of tokens constitute correct grammar.

As you might have guessed by now, using flex, especially with a parser generator like bison, is not at all easy for those not familiar with at least the basics of regexps, and writing the grammar for a parser generator is even more involved. There are precious few tutorials on the web that show you how to write a decent parser, and even then many of the grammars you find in tutorials are rather arcane and not wholly compatible with bison.

Yet it beats using only a regular expression engine, which can lead to some extremely unreadable code. Yesterday, as an exercise, I wrote a simple HTML tag removal script entirely in flex (without using bison).

It reads everything from <body> to </body> and ignores the tags. The flex source below (well commented) defines the pattern matching needed to do this and the actions performed on encountering each pattern (either ignore or output the text, depending on the flags).

	/* is_started - generic flag to detect start of any opening 
	or closing tag 
	is_body - flag to detect the start of the body opening tag */	
	int is_started = 0, is_body = 0;
			
/* BODY matches either <body> or <body attribute="value"...> etc.
   we avoid complex regexps here as the purpose is merely to ignore 
   everything inside < and > */
BODY		"<BODY"
/* once the end of the <body> is matched, we again keep ignoring the
   rest of the text */
ENDBODY	"</BODY>"
/* define the start and end of any tag or tag like element - here again 
   we keep it simple. We are not writing a validating program but merely 
   detecting the beginning and end of tags */
STARTTAG	"<"
ENDTAG	">"
/* convert tabs */
TAB		\t
/* match spaces, two or more */
WHITESPACE	[ ]{2,}
/* match newlines */
NEWLINE	[\r\n]
NEWLINES	{NEWLINE}{2,}
%%
		
{BODY}	{ is_body = 1; /* body start tag started but not 
				  necessarily closed as it might 
				  contain attributes */
		 is_started = 0; /* don't output anything till 
			  	    the end of the body start tag is
				    reached */
		}							
{STARTTAG}	{ is_started = 0; /* Flag the begin of any tag */ }
{ENDTAG}	{ is_started = 1; /* Flag the end of any tag */ }
{ENDBODY}	{ is_body = 0; /* Flag the end body tag */ }
{TAB}		{ 
			if (is_started == 1 && is_body == 1)
				printf ("  "); /* replace single tabs with 2 spaces */
		}
{WHITESPACE}	{
			if (is_started == 1 && is_body == 1)
				printf ("  ");	/* if you encounter two or more 
						whitespaces, replace with exactly
						two spaces */
		}
{NEWLINE} 	{
			if (is_started == 1 && is_body == 1)
				printf ("\n");
		}
{NEWLINES}	{
			if (is_started == 1 && is_body == 1)
				printf ("\n\n");  /* if you encounter two or more
						  newlines replace with exactly 
						  two newlines */
		}
.		{
			if (is_started == 1 && is_body == 1)
				printf ("%s", yytext); /* never pass yytext as a
							  format string */
		}
%%

To generate the C source from the above, you need flex installed (it should be available with most GNU/Linux distributions). Save the above source code in a file called htmlstrip.l.

From a terminal/console, run the command
flex -s -i -o htmlstrip.c htmlstrip.l

The -s flag puts flex in "silent" mode, that is, it does not echo unmatched text directly to the output. The -i flag asks flex to run in case-insensitive mode, so alphabetic characters are matched without regard to case. The resulting output is saved in a file called htmlstrip.c.

To compile the C source code, run the command
gcc htmlstrip.c -lfl -o htmlstrip

The -lfl flag tells the compiler to link against the flex library (libfl), which provides a default main() and yywrap(). Now you can test the script on an HTML file like this:
htmlstrip < file.html > file.txt

The above command reads from the source file file.html and writes the result to a file called file.txt. You can also run it interactively: leave out < file.html to read from standard input, and leave out > file.txt to echo the output to stdout.

As you can imagine, it's pretty powerful and, with a bit of practice, easy to use. Combine it with a full-fledged parser and you can do a lot more with it.

While this kind of thing is mostly taught in university Computer Science courses and to those writing programming language interpreters/compilers, text pattern matching and parsing is a useful addition to the arsenal of any programmer looking for advanced text-manipulation tools.

2 comment(s)

  1. I've been using Perl regular expressions because it's the simplest way to parse. I use it for link checking. I use a simple PHP string parsing routine to pull in page titles. I know a lexical analyzer would do the job for both much better, but I think it would be overkill for the way I would use it, basically generating more overhead than it's worth.

    Comment by RT Cunningham (visitor) on Fri, May 1, 2009 @ 10:56 IST #
  2. Hi RT,
    For general non-contextual parsing which also involves tokenizing strings and then matching tokens, you'll find that tools like flex and bison are irreplaceable as they simplify so many steps in the process of text analysis.

    However, for simple pattern matching in text which is context-driven, reg-exps do the job much more nicely.

    Comment by Hari (blog owner) on Fri, May 1, 2009 @ 12:19 IST #

Comments closed

The blog owner has closed further commenting on this entry.