Will your electronic documents be readable 30 years from now?

Filed under: Software and Technology by Hari
Posted at 11:14 IST (last updated: Wed, Feb 5, 2014 @ 11:33 IST)

This is a question that occurred to me when I first discovered that the legacy StarOffice binary format (SDW) is no longer supported by LibreOffice or OpenOffice from version 4 onwards. It was a matter of concern to me because I had actively used the original proprietary (but freeware) StarOffice suite in the late 1990s and when it became open source in early 2000. As a result, I have quite a number of documents in the SDW format. Considering the implications, I recently spent a long time manually converting these files to ODT using LibreOffice 3.x, to ensure that they remain readable in the future. Even though these documents might not be useful or relevant to me at present, they represent quite a bit of my early creative writing, which I'd like to keep around.

The dropping of this particular legacy format in LibreOffice and OpenOffice 4.0+ suggests that maintaining an old code base is a serious concern among open-source developers. Ironically, Microsoft's legacy binary-only DOC format is still supported in OpenOffice and LibreOffice, mainly due to its immense popularity. To this day, I am forced to save some files in DOC format because I have to send them to a client or colleague without knowing what version of MS Office they use. A surprising number of people continue to use legacy versions of Microsoft Office in spite of the availability of modern, Open Source and Free office suites, and they have difficulty opening even DOCX (Microsoft's newer XML document format). Microsoft's legacy DOC is a bizarre example because, in spite of modern, better formats, it continues to hold its own.

Thinking about it, electronic documents do not have a very long history. Several document formats have come into existence and then dwindled into oblivion when the companies which introduced them folded or were acquired. Some document formats simply became less popular and slowly disappeared, while others have lived on and evolved. But the fact is, most of the early document formats were binary-only, opaque and proprietary. As a result, there has been considerable user lock-in to particular tools and technologies. Except in the relatively small open source community, proprietary tools and data formats have been the norm, not the exception. I'm sure that quite a large number of documents must have become unreadable for one reason or another over the years. I'm equally sure that in the history of IT, companies and organizations have spent a great deal of time, effort and money simply on data conversion and preservation while migrating between technologies.

The current trend towards standards-compliant, open and non-binary formats is welcome, because it shows that there is a definite interest in preserving electronic documents and preventing them from becoming unreadable in the future as a result of the whims, fancies or fortunes of one particular company, or when the underlying tools and technologies change. The popularity of the open source movement has contributed to this awareness. Yet a significant proportion of users continue to rely on proprietary and, in some cases, binary formats which are locked in to specific applications or families of software.

I would encourage everybody to use non-binary, standards-compliant document formats rather than popular proprietary formats that become difficult to read or decipher with other tools. My criteria are:

  • Electronic documents should be readable by more than one tool, and preferably by open source applications.
  • Data in these documents should be preserved losslessly when converted to other (hopefully open, non-binary, standards-compliant) formats.
  • The document format should be well-documented, in the public domain and unencumbered by restrictive Intellectual Property rights, and easily accessible to any developer who wishes to write tools to access and modify the data.
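
To make the first criterion concrete: an ODT file is an ordinary ZIP archive whose main text is stored as XML in a member named content.xml, so it can be read without any office suite at all. The following is only a rough sketch using Python's standard library. The function name extract_paragraphs is my own invention for illustration, and the code deliberately ignores styles, tables, images and nested text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI used for text elements in an OpenDocument content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_paragraphs(odt_file):
    """Return the plain text of each heading and paragraph in an ODT file.

    odt_file may be a file path or a binary file-like object.
    This helper is hypothetical and not part of any office suite.
    """
    # An ODT document is a ZIP container; the body text is in content.xml.
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    paragraphs = []
    for element in root.iter():  # walks the tree in document order
        if element.tag in wanted:
            # Join the text of the element and of any nested spans.
            paragraphs.append("".join(element.itertext()))
    return paragraphs
```

Calling extract_paragraphs("story.odt") (the file name is just an example) would return a list of strings, one per heading or paragraph. The point is not this particular script but that the format is open enough for anyone to write one.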

I certainly hope that the preservation of electronic documents and records is a concern among mainstream developers and users as well. Otherwise, we'll end up having to resort periodically to the potentially bug-ridden process of mass-converting files from an older format to a newer one, hoping that no data or formatting is lost in the process.

4 comment(s)

  1. I had a little ponder over this a while back - and we seem to have reached the same conclusions!

    So which "standards-based" ascii formats support semantic notation?

    - LaTeX
    - Markdown
    - yaml
    - Lilypond

    So that covers complex documents, simple documents, data relationships, and music :)

    Comment by Dion Moult (visitor) on Fri, Feb 7, 2014 @ 05:55 IST #
  2. I agree with you. However, the vast majority of users don't know even about LibreOffice, let alone LaTeX or other tools.

    I would be glad to see the day when I can share my ODT files with colleagues and clients. :-P

    Comment by Hari (blog owner) on Fri, Feb 7, 2014 @ 08:31 IST #
  3. Thankfully things seem to be getting better: ebooks are converging formats into XML-based epub (I believe, it's been a while since I've checked), MS Office are improving their open XML-based standards too, and yaml is growing. This combined helps improve the longevity of books (yes, XML is not quite LaTeX, but...), office docs (the open XML makes it easier for LibreOffice to be compatible) :)

    Comment by Dion Moult (visitor) on Fri, Feb 7, 2014 @ 10:37 IST #
  4. Yes, things are getting better. Hopefully there will be a day when there is one definitive Open Source Document format for cooked (processed) and uncooked (source) documents. Likewise for all other kinds of data as well.

    Comment by Hari (blog owner) on Fri, Feb 7, 2014 @ 12:15 IST #
