Hari's Corner

Humour, comics, tech, law, software, reviews, essays, articles and HOWTOs intermingled with random philosophy now and then

On logical markup, XML/SGML and documentation

Filed under: Software and Technology by Hari
Posted on Mon, Mar 23, 2009 at 10:59 IST (last updated: Thu, May 7, 2009 @ 20:58 IST)

This article probably won't have much appeal for non-geeks or geeks not very much interested in structured technical writing/documentation.

I have been fascinated by "logical" or structured documentation markup ever since I discovered LaTeX and used it quite productively for preparing reports. I always find debates about purely "logical" markup languages versus WYSIWYG word processors, which mix layout markup into the document itself, quite interesting because of the issues involved. Here I'll try to clarify my own thoughts on these tools: where they help one become more productive and where they hinder.

There are two issues here. The first is whether purely logical markup is really a more efficient way of writing portable documents than a mixture of logical and layout markup. The second is whether you can actually separate the layout completely from the logical structure of a document. Does it make sense, in many cases, to go to a level of abstraction that totally isolates the document author from the presentation of the content, or at least makes it very hard for the average author to make even trivial customizations to the final layout?

Out of my own interest, I read more on the topic in online newsgroups and mailing list discussions, and I have to say that I am not convinced of the practicality of a purely abstract markup language in many situations. This is especially so when one considers that the final presentation formats are inherently limited at the present stage of technology, being mainly confined to printed/printable documents (say PostScript or PDF) and online documents (for all practical purposes, HTML).

My answer to the first question is one of practicality. LaTeX, a macro package for the TeX typesetting engine, is probably the best compromise between a purely logical document preparation system and a pure layout descriptor. To compare TeX, an intelligent and sophisticated typesetting engine complete with its own algorithms and mechanisms for resolving layout decisions, with an abstract documentation markup like DocBook is to miss the point.

The LaTeX approach (if used correctly) is neither totally logical nor totally layout oriented. It is a comfortable compromise between the two, and gives the document author enough flexibility to provide "hints" to the TeX typesetting engine about layout without sacrificing the structure which is so important in large documents like books. It is possible to separate layout and logic in a well-written LaTeX document without much effort. Since TeX is primarily a tool for creating beautiful printed documentation, it is extremely focussed on that primary task, while there are adequate tools which allow writers to generate online (HTML) versions as well.
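To make the point about separating layout and logic a little more concrete, here is a minimal sketch in Python that assembles a LaTeX source from two pieces. Everything in it is an assumption made for illustration: the `\product` macro, the two preambles and the `build_document` helper are hypothetical names, not part of any real project. The idea is only that the body uses logical commands while all layout decisions sit in a swappable preamble.

```python
# The body contains only logical markup. \product is a hypothetical macro
# whose appearance is decided elsewhere (in the preamble).
BODY = r"""\section{Overview}
\product{Widget} ships with a printed manual.
"""

# Two alternative preambles. All layout decisions (font size, margins,
# how a product name is typeset) live here and nowhere else.
PRINT_LAYOUT = r"""\documentclass[11pt]{article}
\usepackage[margin=2.5cm]{geometry}
\newcommand{\product}[1]{\textbf{#1}}
"""

DRAFT_LAYOUT = r"""\documentclass[12pt]{article}
\usepackage[margin=4cm]{geometry}
\newcommand{\product}[1]{\textsc{#1}}
"""


def build_document(layout, body):
    """Join a layout preamble and a logical body into one LaTeX source."""
    return layout + "\\begin{document}\n" + body + "\\end{document}\n"


if __name__ == "__main__":
    # Same body, different layout: only the preamble changes.
    print(build_document(PRINT_LAYOUT, BODY))
```

Switching from `PRINT_LAYOUT` to `DRAFT_LAYOUT` changes the margins and the way product names are typeset without touching a single line of the body, which is roughly what I mean by a well-written LaTeX document.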

The other markup approach, which focusses on completely isolating the document writer from ANY layout decision whatsoever and in fact, does not even provide clues as to what the final medium of the document will be, is to use XML/SGML with a DTD/Schema like DocBook.

The problem is that, while this sounds great in theory, XML and SGML by themselves are pure dumb text markup. They are not self-contained. You are dependent on third-party tools for processing the sources and generating output from them, and most of those tools are not easy to understand or become productive with. Even minor layout decisions like changing page margins or background colour require one to poke around with esoteric markup in a non-trivial style-transformation language with its own learning curve.

Consider this: what happens to the actual markup is completely up to the parsing programs (or the toolchain, as it is popularly known) and the default style sheets provided by the toolchain distribution. Unless you happen to be a professional programmer with intricate knowledge of markup technology, it is very difficult to generate customized output. The typical approach is to first define a set of "style sheets" or "transformation sheets", written in XSLT (for XML) or DSSSL (for SGML), and then use a tool which applies the transformation to the pure markup and produces the output.
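The shape of that pipeline can be sketched in a few lines of Python. This is a toy, not XSLT or DSSSL: the "style sheets" here are plain dictionaries of templates that I made up for illustration (`HTML_STYLE`, `LATEX_STYLE`, `render` and `transform` are all hypothetical names), and the source is a DocBook-like fragment that is not validated against any DTD. It only shows the principle that the same logical source yields different output depending on which style sheet the tool applies.

```python
import xml.etree.ElementTree as ET

# A tiny DocBook-like fragment (illustrative only, not validated).
SOURCE = """<article>
  <title>Logical markup</title>
  <para>Structure comes first.</para>
  <para>Layout is decided later.</para>
</article>"""

# Hypothetical "style sheets": one output template per element name.
HTML_STYLE = {
    "article": "<html><body>{children}</body></html>",
    "title": "<h1>{text}</h1>",
    "para": "<p>{text}</p>",
}

LATEX_STYLE = {
    "article": "\\documentclass{{article}}\n\\begin{{document}}\n"
               "{children}\\end{{document}}\n",
    "title": "\\section*{{{text}}}\n",
    "para": "{text}\n\n",
}


def render(element, style):
    """Apply the template for this element, recursing into its children.

    No escaping of special characters is attempted, so this is only safe
    for plain text like the sample above.
    """
    text = (element.text or "").strip()
    children = "".join(render(child, style) for child in element)
    return style[element.tag].format(text=text, children=children)


def transform(source, style):
    """Parse the markup and render it with the chosen style sheet."""
    return render(ET.fromstring(source), style)


if __name__ == "__main__":
    print(transform(SOURCE, HTML_STYLE))
    print(transform(SOURCE, LATEX_STYLE))
```

Even in this toy, every change to the output means editing a template in a separate mini-language rather than the document itself. Real XSLT or DSSSL style sheets are the same idea with a far steeper learning curve.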

Obviously most third-party XML tools provide these style sheets, but will the default layouts always be preferable? In most cases, authors will get stuck at the stage where they need to customize their output and find that they have to learn a whole new complex or obscure markup language in addition to the source format (DocBook in this case). This is probably fine for people who work in teams where some members are dedicated full-time to creating and maintaining style sheets and know exactly what they need.

My argument is that DocBook is hardly a preferred format for an individual writer or even a small company with limited resources. The time and money invested in hiring and training document writers and developers to poke around with style sheets might simply not be worth it. Most people would be content with writing in MS-Word, let alone a typesetting tool like TeX/LaTeX.

In any case, even DocBook writers who need a printable version of their documents currently still depend on tools like OpenJade/JadeTeX to produce TeX markup in the final processing stage, as TeX is still the best typesetting engine in the business. Writing typesetting logic for non-trivial multi-page documents manually is just too complex a task to contemplate, even for a seasoned programmer with lots of time on their hands. Dumb style sheets are probably a lot easier to work with, but might produce unsatisfactory results in print and in any case might still be asking too much of a reasonably intelligent technical writer.

In such a case, would it not be better for an individual writer who needs the structure of logical markup without the total rigidity of DocBook to use LaTeX directly? Theoretically it is possible to use a different typesetting engine, but then again, you would need to write a whole new set of style sheets and customizations for it to be of any practical use in handling XML/SGML markup.

To me the verdict is quite clear. Pure logical markup is great in theory and probably an excellent way to store portable documents electronically for the long term, where you might have no idea what the final medium will be (currently limited, probably, to print and online versions). There might be a day when you can easily convert a book into audio or even video using only the logical markup and a back-end processing tool without any additional work, but that is still a long way away.

For day to day writing, I'll stick to OpenOffice, and for larger documentation requiring a bit of structure, LaTeX seems irreplaceable, at least with current technology.

2 comment(s)

  1. Interesting argument Hari. You seem to have covered everything.

    These days, I find more and more people picking up DocBook rather than LaTeX. Don't most of the distros require documentation contributors to use DocBook? Unless I'm mistaken, Fedora and Ubuntu are just two such distros.

    Plus, DocBook's Wikipedia page has a Used In section that mentions TLDP, Zend Framework and the FreeBSD Documentation project.
    http://en.wikipedia.org/wiki/DocBook#Used_in

    Seems like LaTeX just isn't getting marketed/promoted enough, wouldn't you say?

    Comment by Shashank Sharma (visitor) on Wed, Mar 25, 2009 @ 20:16 IST #
  2. Shashank, thanks for the comment. I agree with you. XML must be just about the most marketed technology on the planet today. And besides that, programmers seem to love parsing XML.

    XML/SGML just seems to be very difficult to pin down to any specific purpose. It's almost *too* flexible, but the parsing, validating and transformations (which arguably are the most important part of XML) are not necessarily simple and can sometimes be rather complex to model and implement for non-trivial DOCTYPEs.

    The other popular use of XML, I think, is to represent data trees in programming, where programmers write a specific small subset of XML (for example a configuration file or a data file for one specific purpose). The idea is that they can easily handle it within their program; but even there, I could argue that the cost of linking to a generic XML parser (through an API like DOM or SAX) and still having to implement the logic within the program (you still have to interpret the structure and what it means yourself) will be more than that of writing a simple, generic text file parser.

    So either way, as either a logical text markup language for documentation or as a way to store data in trees, it's not an optimized solution.

    DocBook naturally suffers from some of the shortcomings of XML itself. LaTeX/TeX has reached a stage of maturity where it's hard to find exciting new developments in its design, so naturally its news value is quite low. But that doesn't take anything away from an otherwise excellent, high quality piece of software which is highly tailored to writing reasonably structured documentation and producing high quality output in print and fairly good output for online viewing.

    Comment by Hari (blog owner) on Wed, Mar 25, 2009 @ 20:51 IST #
