Hari's Corner
Humour, comics, tech, law, software, reviews, essays, articles and HOWTOs intermingled with random philosophy now and thenflex - learning a little bit of pattern matching
Filed under:
Tutorials and HOWTOs by
Hari
Posted on Thu, Mar 26, 2009 at 13:27 IST (last updated: Thu, Mar 26, 2009 @ 14:17 IST)
To generate the C source from the above, you need flex installed (it should be available with most GNU/Linux distributions). Save the above source code in a file called/* is_started - generic flag to detect start of any opening or closing tag is_body - flag to detect the start of the body opening tag */ int is_started = 0, is_body = 0; /* body matches either <body> or <body attribte="value"...> etc. we avoid complex regexps here as the purpose is merely to ignore everything inside < and > */ BODY "<BODY" /* once the end of the <body> is matched, we again keep ignoring the rest of the text */ ENDBODY "</BODY>" /* define the start and end of any tag or tag like element - here again we keep it simple. We are not writing a validating program but merely detecting the beginning and end of tags */ STARTTAG "<" ENDTAG ">" /* convert tabs */ TAB \t /* match spaces, two or more */ WHITESPACE [ ]{2,} /* match newlines */ NEWLINE [\r\n] NEWLINES {NEWLINE}{2,} %% {BODY} { is_body = 1; /* body start tag started but not necessarily closed as it might contain attributes */ is_started = 0; /* don't output anything till the end of the body start tag is reached */ } {STARTTAG} { is_started = 0; /* Flag the begin of any tag */ } {ENDTAG} { is_started = 1; /* Flag the end of any tag */ } {ENDBODY} { is_body = 0; /* Flag the end body tag */ } {TAB} { if (is_started == 1 && is_body == 1) printf (" "); /* replace single tabs with 2 spaces */ } {WHITESPACE} { if (is_started == 1 && is_body == 1) printf (" "); /* if you encounter two or more whitespaces, replace with exactly two spaces */ } {NEWLINE} { if (is_started == 1 && is_body == 1) printf ("\n"); } {NEWLINES} { if (is_started == 1 && is_body == 1) printf ("\n\n"); /* if you encounter two or more newlines replace with exactly two newlines */ } . { if (is_started == 1 && is_body == 1) printf (yytext); } %%
htmlstrip.l
.
From a terminal/console, run the command
Theflex -s -i -o htmlstrip.c htmlstrip.l
-s
flag states that Flex should be in "silent" mode, that is not echo the unmatched patterns directly to output. The -i
flag asks flex to run in case-insensitive mode, so any matches for alphabetic characters are done in a case-insensitive way. The resulting output is saved in a file called htmlstrip.c
To compile the C source code, run the command
Thegcc htmlstrip.c -lfl -o htmlstrip
-lfl
flag asks the compiler to link against the flex shared library. Now you can test your script by using an HTML file like this:
The above commands reads from a source filehtmlstrip < file.html > file.txt
file.html
and outputs the result to a file called file.txt
. You can also run it interactively by not using the < file.html
to read from a source file and to echo the output to stdout you can avoid using the >file.txt
.
As you can imagine, it's pretty obviously very powerful and with a bit of practice easy to use. Combine it with a full fledged parser and you can do a lot more with it.
While this kind of thing is mostly taught in Computer Science courses in universities and for those writing programming language interpreters/compilers, text pattern matching and parsing is useful in the arsenal for any programmer looking for advanced text-manipulation tools.
Comments closed
The blog owner has closed further commenting on this entry.
2 comment(s)
Comment by RT Cunningham (visitor) on Fri, May 1, 2009 @ 10:56 IST #
For general non-contextual parsing which also involves tokenizing strings and then matching tokens, you'll find that tools like flex and bison are irreplaceable as they simplify so many steps in the process of text analysis.
However, for simple pattern matching in text which is context-driven, reg-exps do the job much more nicely.
Comment by Hari (blog owner) on Fri, May 1, 2009 @ 12:19 IST #