Hari's Corner

Humour, comics, tech, law, software, reviews, essays, articles and HOWTOs intermingled with random philosophy now and then

On logical markup, XML/SGML and documentation

Filed under: Software and Technology by Hari
Posted on Mon, Mar 23, 2009 at 10:59 IST (last updated: Thu, May 7, 2009 @ 20:58 IST)

This article probably won't have much appeal for non-geeks or geeks not very much interested in structured technical writing/documentation.

I have had a fascination for "logical" or structured documentation markup since I discovered LaTeX and used it quite productively for preparing reports. I always find debates about purely "logical" markup languages versus WYSIWYG word processing tools, which also embed layout markup in the document, quite interesting because of the issues involved. Here I'll try to clarify my own thoughts regarding these tools: where they help one become more productive and where they hinder.

There are two issues here. The first is whether purely logical markup is really a more efficient method of writing portable documents than a mixture of logical and layout markup. The second is whether you can actually separate the layout completely from the logical structure of a document. In many cases, does it make sense to go to a level of abstraction which totally isolates the document author from the presentation of the content, or at least makes it very hard for the average author to make even trivial customizations to the final layout?

Out of my own interest, I read more on the topic in online newsgroups and mailing list discussions, and I have to say that I am not convinced of the practicality of a purely abstract markup language in many situations. This is especially so when one considers that the final presentation formats are inherently limited at the present stage of technology, being mainly confined to printed/printable (say PostScript, PDF) documents versus online (for all practical purposes, HTML) documents.

My answer to the first question is one of practicality. LaTeX, a macro package for the TeX typesetting engine, is probably the best compromise between a purely logical document preparation system and a pure layout descriptor. To compare TeX, an intelligent and sophisticated typesetting engine complete with its own algorithms and mechanisms for resolving layout decisions, to an abstract documentation markup like DocBook is to miss the point.

The LaTeX approach (if used correctly) is neither totally logical nor totally layout oriented. It is a comfortable compromise between the two and gives the document author enough flexibility to provide "hints" to the TeX typesetting engine about layout without sacrificing structure, which is so important in large documents like books. It is possible to separate layout and logic in a well-written LaTeX document without much effort. Since TeX is primarily a tool for creating beautiful printed documentation, it is extremely focussed on its primary task, while there are adequate tools which allow writers to generate online (HTML) versions as well.

The other markup approach, which focusses on completely isolating the document writer from ANY layout decision whatsoever and in fact, does not even provide clues as to what the final medium of the document will be, is to use XML/SGML with a DTD/Schema like DocBook.

The problem is that, while this sounds great in theory, XML and SGML by themselves are pure dumb text markup. They are not self-contained. You are dependent on third-party tools for processing and generating output from the sources, and most of these tools are not easy to understand or become productive with. Even minor layout decisions like changing page margins or background colour require one to poke around with esoteric markup in a non-trivial style-transformation language with its own learning curve.

Consider this: what happens to the actual markup is completely up to the parsing programs (or toolchain, as it is popularly known) and the default style sheets provided by the toolchain distribution. Unless you happen to be a professional programmer with intricate knowledge of markup technology, it is very difficult to generate customized output. The typical approach is to first define a set of "style sheets" or "transformation sheets" in a language like XSLT or DSSSL (in the case of SGML) and then use a tool which will apply the transformation to the pure markup and produce the output.
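To make the idea concrete, here is a toy sketch in Python of what such a toolchain does: one logical source, two interchangeable "style sheets", two outputs. It is only an illustration and not XSLT or DSSSL. The element names (article, title, para, emphasis) are real DocBook elements, but the HTML_STYLE and TEXT_STYLE tables and the render function are names I made up for this example.

```python
import xml.etree.ElementTree as ET

# Purely logical source: nothing here says how it should look.
SOURCE = """<article>
  <title>Logical markup</title>
  <para>Structure comes <emphasis>first</emphasis>, layout later.</para>
</article>"""

# Hypothetical "style sheets": element name -> (text before, text after).
HTML_STYLE = {
    "article": ("<html><body>", "</body></html>"),
    "title": ("<h1>", "</h1>"),
    "para": ("<p>", "</p>"),
    "emphasis": ("<em>", "</em>"),
}
TEXT_STYLE = {
    "article": ("", ""),
    "title": ("== ", " ==\n"),
    "para": ("", "\n"),
    "emphasis": ("*", "*"),
}

def render(element, style):
    """Walk the element tree and wrap each element as the style dictates.

    Whitespace-only text between elements is dropped. Text is copied
    as-is (no HTML escaping), which is enough for this sample only.
    """
    before, after = style.get(element.tag, ("", ""))
    parts = [before]
    if element.text and element.text.strip():
        parts.append(element.text)
    for child in element:
        parts.append(render(child, style))
        if child.tail and child.tail.strip():
            parts.append(child.tail)
    parts.append(after)
    return "".join(parts)

doc = ET.fromstring(SOURCE)
html_output = render(doc, HTML_STYLE)
text_output = render(doc, TEXT_STYLE)
print(html_output)
print(text_output)
```

Changing the look of the output means editing the style tables, never the source. That is the appeal of the approach, and also the catch: in a real toolchain those tables are written in a separate language that the author has to learn.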

Obviously most third-party XML tools provide these style sheets, but will the default layouts always be preferable? In most cases, authors will be stuck at the stage where they need to customize their output and find that they have to learn a whole new complex or obscure markup language in addition to the source format used (DocBook in this case). This is probably fine for people who work in teams where some members are dedicated full-time to creating and maintaining style sheets and know exactly what they need.

My argument is that DocBook is hardly a preferred format for an individual writer or even a small company with limited resources. The time and money invested in hiring and training document writers and developers to poke around with style sheets might simply not be worth the effort. Most people would be content with writing in MS-Word, let alone a typesetting tool like TeX/LaTeX.

In any case, even DocBook writers who need a printable version of their documents currently still depend on tools like OpenJade/JadeTeX to produce the TeX markup in the final processing stage, as TeX is still the best typesetting engine in the business. Writing typesetting logic for non-trivial multi-page documents manually is just too complex a task to contemplate even for a seasoned programmer with lots of time on his hands. Dumb style sheets are probably a lot easier to work with, but might produce unsatisfactory results in print and in any case might still be asking too much of a reasonably intelligent tech document writer.

In such a case, would it not be better for an individual writer who needs the structure of logical markup without the total rigidity of DocBook to use LaTeX in the first place? Theoretically it is possible to use a different typesetting engine, but then again, you will need to write a whole new set of style sheets and customizations for it to be of any practical use in handling XML/SGML markup.

To me the verdict is quite clear. Pure logical markup is great theory and probably an excellent way to store portable documents electronically in the long term where you might have no idea of what the final medium will be (currently limited probably to print and online versions). There might be a day when you can easily convert a book into audio or even video using only the logical markup and a back-end processing tool without any additional work but that is still a long way away.

For day to day writing, I'll stick to OpenOffice, and for larger documentation requiring a bit of structure, LaTeX seems irreplaceable, at least with current technology.
Comments (2)  

Papa Hari Foundation's gender neutrality initiative

Filed under: Humour and Nonsense by Hari
Posted on Sat, Mar 21, 2009 at 11:13 IST (last updated: Fri, Jun 8, 2012 @ 07:46 IST)

Papa Hari News Service

Tired of the constant discrimination and gender bias in language, a group of High Thinkers in the Papa Hari Foundation has executed a complete, comprehensive search and replace program in all major dictionaries of the world, replacing the words "MAN" with "PERSON", "MEN" with "PERSONS", "HIS" with "THEIR", "SON" with "OFFSPRING" and "HIM" with "THEM" and "HE" with "THAT". Fortunately or unfortunately tthat peroffspring who did it forgot tthat "whole words only" option while doing it. Strange results have followed as can be seen below.

As a result, peroffspringy people have found that tthaty are now reading about tthat Ropersoffspring civilization in tthatirtory books. People attending MBA courses are now studying about Peroffspringageperoffspringst tthatories instead of tthat otthatr word. Sportsperoffsprings now ascribe tthatir victories to extraordinary perforperoffspringces and efforts. "It's such a great achieveperoffspringst," said one super star. Anotthatr side effect of tthat complete search and replace algorithm is that peroffsprings of tthat female sex are now referred to as woperoffspring or woperoffsprings.

"Peroffspringgo is now in great deperoffspringd ttthatir seaoffspring," said one fruit-seller in India, ratthatr irrelevantly for merely highlighting tthat odd discrepancies that have crept into tthat English language as a result of ttthatir move. Politicians, especially peroffspringipulative ones are quite happy to turn tthat situation to tthatir advantage in various ways.

"Ttthatir move is extraordinary and totally unwarranted," said one peroffspring who can be identified as a male, "Tthat whole language is totally corrupted." On tthat otthatr hand peroffspringy people have supported ttthatir move saying that tthat language needed to be cleaned and sanitised to prevent female infanticide in rural parts of India. "Tthatre is no doubt in my mind that such a move will completely change attitudes all over tthat world," said one spokespersoffspring of tthat Papa Hari Foundation.

It is believed that tthatre are furtthatr efforts on to identify gender specific terminology in language and replace tthatm completely with gender neutral terms. Even tthat words male and female are now being sought to be replaced with something more neutral.

"Tthat language now reads more like Gerperoffspring," said one experienced linguist as a final comment.
Comments (6)  

PyTamEditor - a Tamil Unicode Editor

Filed under: My software by Hari
Posted on Fri, Mar 20, 2009 at 17:47 IST (last updated: Thu, May 7, 2009 @ 21:04 IST)

I've written a simple text editor with reasonably intuitive phonetic English key map for Tamil input in Python and Qt 4. I'd earlier written a simpler GUI in C using Gtk, but this version is slightly more enhanced and has printing support using the Qt printing subsystem.

Information, screenshot and download included in the software section.
Comments (2)  

More 32-bit 64-bit madness

Filed under: Software and Technology by Hari
Posted on Thu, Mar 19, 2009 at 22:17 IST (last updated: Thu, May 7, 2009 @ 18:19 IST)

I have a whole bunch of pet peeves. Sub-optimal 32-bit applications running on 64-bit architectures are one of them. The lack of interest shown by software developers in releasing native 64-bit binaries along with their 32-bit cousins is another. We are in 2009, and we've had 64-bit processors for years now, but we are still saddled with 32-bit applications, forced to use them sub-optimally and to sacrifice inter-operability with native 64-bit libraries and vice-versa.

I had written a while ago on flaky 64-bit native support for commercial (and often proprietary) software and the lack of interest that software vendors show for releasing 64-bit binary executables for their software.

With Free Software/Open Source, it is almost as bad if there are no 64-bit pre-compiled executables for Windows. What makes situations even worse is when you badly need 64-bit support for a program for which there is no pre-compiled version available.

It is maddening when you're stuck with a 32-bit DLL and need to call it from a 64-bit process. While compiling from source is probably a tolerable situation in *nix (maybe even desirable for some people - I'm not one of them), setting up a build environment in Windows just so that you can compile from source is almost as bad as visiting a dentist and having all your good teeth pulled out. Case in point: I had installed the 64-bit version of Python, but there is no 64-bit precompiled version of PyQt available. Net result: ImportError (due to binary incompatibility of 64-bit Python with 32-bit PyQt). After a lot of googling for a possible solution, I was forced to download the 32-bit version of Python instead. One would have thought that a problem like this would have a better solution, but no - there is no 64-bit pre-built binary installer for PyQt. So the more one wants to leave the 32-bit universe, the more one is trapped in it.
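Before hunting for binary installers, it helps to confirm which kind of interpreter is actually installed. A minimal sketch using only the standard library follows; the function name interpreter_bits is my own, and the check relies on the size of a C pointer, which is 4 bytes on a 32-bit build and 8 bytes on a 64-bit one.

```python
import struct
import sys

def interpreter_bits():
    """Return 32 or 64 depending on how this Python was built."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8

bits = interpreter_bits()
print("Python %s is a %d-bit build" % (sys.version.split()[0], bits))
```

Any extension module you install has to match this number, whatever the operating system itself is running.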

Those of us who were unlucky enough to land up with 64-bit versions of XP or Vista are the biggest sufferers in this department, whether it is the sub-optimal performance of games and other applications or the interprocess communication problems I mentioned above.
Comments (3)  

On Legal Opinions

Filed under: People and society by Hari
Posted on Mon, Mar 16, 2009 at 21:27 IST (last updated: Sat, Jun 6, 2009 @ 10:33 IST)

Being a student of Law, I find it extremely annoying that a majority of online writers and communities consider themselves authorities on Law, especially the law concerned with copyrights and patents. I see this particularly in discussions on piracy/copyright and sometimes in discussions of software licenses like the GNU GPL.

Some basics:

Those who want to run should first learn to walk. Understanding the difference between common morality and a legal system is the basis of any sensible discussion on legalities. When you mix the two, what you get is a hodge-podge of incomprehensible and often contradictory opinions which form the basis of lengthy and often painfully inconclusive discussions culminating in flame wars and personal attacks.

If somebody wants to talk Law, the first thing to do is to put sentiment aside. In fact, most Laws are quite often complex and have dozens of clauses and sub-clauses which render general conclusions useless.

When you want to discuss the morality/nature of a particular enactment of Law or a legal system, it becomes a subset of Jurisprudence, which is really an ocean in itself and requires systematic study.

I am not saying this out of arrogance. Indeed, the further I step into legal education, the more amazed I am at how much there is to learn and how little I know presently.
Comments (9)  

Python's unicode strings and QString gotcha

Filed under: Bits and Bytes by Hari
Posted on Fri, Mar 13, 2009 at 17:47 IST (last updated: Wed, Mar 18, 2009 @ 20:28 IST)

This is more a small personal note than anything else, but writing a UTF-8 string to a UTF-8 file is a bit tricky if you're using PyQt and doing implicit conversion between QString and Python's built in unicode string.

What I tried to achieve: To get the contents of a text box and save the Unicode contents to a file.

Here's what actually didn't work:
import codecs
from PyQt4 import QtGui

def exportFile (self, filename):
	"""Procedure to export the UNICODE contents to a file"""
	txtOutput = self.findChild (QtGui.QPlainTextEdit, "txtTamil")
	# Problem line: the QString is handed straight to unicode ()
	fcontents = unicode (txtOutput.toPlainText(), "utf-8")
	
	f = codecs.open (filename, "w", encoding="utf-8")
	f.write ( fcontents )
	f.close ()

Most confusing, as the output file ended up with a series of question marks instead of the actual Unicode characters. Surely I was doing everything right?

After investigating the Python side fully, I turned to Qt's QString class for inspiration. Turns out that you need to actually convert the QString first to a UTF-8 byte stream using QString's toUtf8 () function before calling the Python unicode () function.

Code which works as expected:
import codecs
from PyQt4 import QtGui

def exportFile (self, filename):
	"""Procedure to export the UNICODE contents to a file"""
	txtOutput = self.findChild (QtGui.QPlainTextEdit, "txtTamil")
	# Convert the QString to UTF-8 bytes first, then decode to unicode
	fcontents = unicode (txtOutput.toPlainText().toUtf8(), "utf-8")
	
	f = codecs.open (filename, "w", encoding="utf-8")
	f.write ( fcontents )
	f.close ()
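
For anyone who wants to check the file side of this without Qt installed, here is a small stand-alone sketch. The helper names export_text and import_text are made up for the example and are not part of the editor. It also reproduces the symptom: the question marks look like what a lossy conversion to a narrow encoding produces, although that is only my guess at what the implicit QString conversion was doing.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def export_text(text, filename):
    """Write a unicode string to a file encoded as UTF-8."""
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(text)

def import_text(filename):
    """Read a UTF-8 encoded file back into a unicode string."""
    with io.open(filename, "r", encoding="utf-8") as f:
        return f.read()

# The word "Tamil" in Tamil script: five Unicode code points.
sample = u"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# A lossy conversion replaces every character it cannot represent.
lossy = sample.encode("ascii", "replace")
print(lossy)  # five question marks, one per code point

# Written as UTF-8 and read back, the text survives unchanged.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
export_text(sample, path)
round_trip = import_text(path)
print(round_trip == sample)
```

Reading the file back and comparing it with the original string is a quick way to confirm that an export routine really preserved the characters.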
Comments (5)