Hari's Corner
Humour, comics, tech, law, software, reviews, essays, articles and HOWTOs intermingled with random philosophy now and then
Will your electronic documents be readable 30 years from now?
Software and Technology by
Posted on Wed, Feb 5, 2014 at 11:14 IST (last updated: Wed, Feb 5, 2014 @ 11:33 IST)
This is a question that occurred to me when I first found out that the legacy StarOffice binary format (SDW) is no longer supported by LibreOffice or OpenOffice from version 4 onwards. To me, it was a matter of concern because I had actively used the original proprietary (but freeware) StarOffice suite in the late 1990s and when it became open source in early 2000. As a result, I have quite a number of documents in the SDW format. Considering the implications, I recently spent quite a long time manually converting these files to ODT using LibreOffice 3.x, to make sure they remain readable in the future. Even though these documents might not be useful or relevant to me at present, they represent quite a bit of my early creative writing efforts, which I'd like to keep around.
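For anyone facing the same chore, the conversion need not be done file by file. The following is only a rough sketch, not the procedure I followed: it assumes a LibreOffice build that still ships the SDW import filter, that its `soffice` binary is on the PATH, and that it accepts the `--headless`, `--convert-to` and `--outdir` command-line options. The function names (`find_sdw_files`, `build_convert_command`, `convert_all`) are hypothetical helpers made up for illustration.

```python
import subprocess
from pathlib import Path


def find_sdw_files(root):
    """Return every .sdw file under root (extension matched case-insensitively)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".sdw"
    )


def build_convert_command(src, outdir, soffice="soffice"):
    """Build the command line that asks LibreOffice to convert one file to ODT.

    Assumption: the installed soffice supports --headless/--convert-to/--outdir
    and can still import SDW (LibreOffice 4.0+ dropped that filter).
    """
    return [
        soffice, "--headless",
        "--convert-to", "odt",
        "--outdir", str(outdir),
        str(src),
    ]


def convert_all(root, outdir, soffice="soffice"):
    """Convert every SDW file under root, returning the expected ODT paths."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    expected = []
    for src in find_sdw_files(root):
        # check=True raises CalledProcessError if soffice exits with an error.
        subprocess.run(build_convert_command(src, outdir, soffice), check=True)
        expected.append(Path(outdir) / (src.stem + ".odt"))
    return expected
```

Calling `convert_all("old_docs", "converted")` would walk the (hypothetical) `old_docs` folder and write the ODT files into `converted`. Files with the same name in different subfolders would overwrite each other in this sketch, and it is still worth opening a few results by hand to check that the formatting survived.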
The dropping of this particular legacy format in LibreOffice and OpenOffice 4.0+ suggests that maintaining an old code base is a serious concern among open-source developers. Ironically, Microsoft's legacy binary-only DOC format is still supported in OpenOffice and LibreOffice, mainly due to its immense popularity. To this day, I am forced to save some files in DOC format because I have to send them to a client or colleague without knowing which version of MS Office they use. A surprising number of people continue to use legacy versions of Microsoft Office in spite of the availability of modern, Open Source and Free office suites, and they have difficulty opening even DOCX (Microsoft's newer XML document format). Microsoft's legacy DOC is a bizarre example because, in spite of the availability of newer and better formats, it continues to hold its own.
Thinking about it, electronic documents do not have a very long history. Several document formats have come into existence and then dwindled into oblivion when the companies which introduced them folded or were acquired. Some document formats simply became less popular and slowly disappeared, while others have lived on and evolved. But the fact is, most of the early document formats were binary-only, opaque and proprietary, and this has led to considerable user lock-in to particular tools and technologies. Except in the relatively small open source community, proprietary tools and data formats have been the norm, not the exception. I'm sure that quite a large number of documents must have become unreadable for one reason or another over the years. I'm equally sure that in the history of IT, companies and organizations have spent a great deal of time, effort and money simply on data conversion and preservation while migrating between technologies.
The current trend towards standards-compliant, open and non-binary formats is welcome, because it shows that there is a definite interest in preserving electronic documents and preventing them from becoming unreadable in the future as a result of the whims, fancies or fortunes of one particular company, or when the underlying tools and technologies change. The popularity of the open source movement has contributed to this awareness. Yet a significant proportion of users continue to rely on proprietary and, in some cases, binary formats which are locked in to specific applications or families of software.
I would encourage everybody to use non-binary and standards-compliant document formats rather than popular proprietary formats which become difficult to read or decipher with different tools. My criteria are:
- Electronic documents should be readable by more than one tool, and preferably by open source applications.
- Data in these documents should be preserved losslessly when converted to other (hopefully open, non-binary, standards-compliant) formats.
- The document format should be well-documented, in the public domain and unencumbered by restrictive Intellectual Property rights, and easily accessible to any developer who wishes to write tools to access and modify the data.
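To illustrate what these criteria buy you in practice: an ODT file is an ordinary ZIP archive containing a `mimetype` entry and a `content.xml` file, so its text can be pulled out with nothing more than a standard library. The snippet below is a minimal sketch under that assumption, not a full ODF reader. The function name `read_odt_paragraphs` is my own invention, and it only collects `text:p` paragraphs, ignoring headings, tables, styles and everything else.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by ODF for text elements such as <text:p>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def read_odt_paragraphs(path):
    """Return the plain text of each <text:p> paragraph in an ODT file.

    Hypothetical helper for illustration: headings (<text:h>), tables and
    formatting are deliberately not handled.
    """
    with zipfile.ZipFile(path) as archive:
        mimetype = archive.read("mimetype").decode("ascii").strip()
        if mimetype != ODT_MIMETYPE:
            raise ValueError("not an ODT text document: %r" % mimetype)
        root = ET.fromstring(archive.read("content.xml"))

    paragraphs = []
    for para in root.iter("{%s}p" % TEXT_NS):
        # itertext() also walks nested elements such as <text:span>.
        paragraphs.append("".join(para.itertext()))
    return paragraphs
```

The point is not this particular script but the fact that it can be written at all: no office suite is involved, and the same few lines could be redone in any language that can read ZIP and XML.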
I certainly hope that the preservation of electronic documents and records is a concern among mainstream developers and users as well. Otherwise, we'll end up having to periodically resort to the potentially bug-ridden process of mass-converting files from an older format to a newer one, hoping that no data or formatting is lost in the process.