Smart Data

Copyright (c) 1999-2001 by Rich Morin
published in Silicon Carny, August 1999


By adding contextual information and/or executable code to your data files, you can make them more flexible, powerful, and portable.

In "A Lazy Afternoon" (Silicon Carny, March 1999), I described the use of Perl as a tool for encoding dbm data for transfer between machines, as:

    $divisions{AP} = 'Applications';
    ...
      

This format is easy to generate. Parsing it, of course, is trivial: we just let Perl do the work! And, because the data is represented as ASCII strings, byte-ordering is not an issue.
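As a rough sketch of the generating side, here is one way a script might emit such assignment statements. It is written in Python only to show that the producer need not be a Perl program. The names perl_quote and emit_perl_hash are made up for this illustration, and the hash keys are quoted as a precaution, even though the example above uses a bare key.

```python
def perl_quote(value):
    # Inside Perl single quotes, only the backslash and the single quote
    # need escaping; escape the backslash first so it is not doubled twice.
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def emit_perl_hash(name, data):
    # Emit one assignment per key, sorted so the output is reproducible.
    lines = []
    for key in sorted(data):
        lines.append("$%s{%s} = %s;" % (name, perl_quote(key), perl_quote(data[key])))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints: $divisions{'AP'} = 'Applications';
    print(emit_perl_hash("divisions", {"AP": "Applications"}), end="")
```

The receiving machine can then load the file with Perl, exactly as before.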

As long as we take care to "escape" any magic characters (e.g., quote marks), Perl is quite happy to ingest the file, loading specified locations with specified values. Because Perl is a full programming language, the assignment statements can be as tricky as desired:

    $a = 'to be';
    $b = "$a or not $a";
    foreach (1 .. 9) { $c[$_] = $_ * $_ }
      

This general technique (encoding data in a programming language) sees more regular use than you might think. Every PostScript file, for example, is a program. The PostScript program tells the printer how to render and print the desired image, as:

    %!
    100 100 moveto
    300 300 lineto
    stroke showpage
      

Several attempts have been made to use this technique in windowing environments. Sun's ill-fated NeWS environment used raw PostScript both for imaging and user interface programming. Although this was an elegant and powerful system, industry backlash against Sun dominance and licensing fees allowed X11 to win the Unix desktop battle.

The general idea has not, however, gone away. Display PostScript, working in conjunction with X11, still provides imaging support for some programs. Apple's Mac OS X system will use Adobe Systems' PDF (Portable Document Format) as a basis for imaging. PDF is more formally defined than raw PostScript, allowing documents to be merged a bit more easily.

Getting away from PostScript, there are a number of other systems that generate programs and ship them around as data. Sun's Java and Microsoft's macros are revealing examples, showing both the power and dangers inherent in this technique.

Importing programs into your computer environment is always an iffy proposition. If you don't have reason to trust both the code's originator and the transmission channel, you may be subjecting your computer to unknown dangers.

Fortunately, many of the benefits of "smart data" can be realized without incurring these sorts of dangers. For instance, it is quite possible to use declarative languages (e.g., BoulderIO and XML) that support variable definitions but not arbitrary programming constructs.
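One way to picture the difference: rather than executing a file of assignment statements, a cautious reader can accept only lines that look like plain assignments and refuse everything else. The sketch below does this in Python for lines of the form shown at the top of this article. The pattern and the name load_assignments are assumptions made for this illustration, not part of any existing tool, and the pattern handles only single-quoted values.

```python
import re

# Accepts lines such as:  $divisions{AP} = 'Applications';
# (the key may optionally be wrapped in single quotes)
ASSIGNMENT = re.compile(r"^\$(\w+)\{'?(\w+)'?\}\s*=\s*'((?:[^'\\]|\\.)*)';\s*$")


def load_assignments(text):
    """Read plain hash assignments as data, without executing anything."""
    tables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ASSIGNMENT.match(line)
        if match is None:
            # Loops, function calls, etc. are rejected rather than run.
            raise ValueError("not a plain assignment: " + line)
        name, key, raw = match.groups()
        # Undo the two escapes that Perl single-quoted strings recognize.
        value = re.sub(r"\\([\\'])", r"\1", raw)
        tables.setdefault(name, {})[key] = value
    return tables
```

The price, of course, is that "tricky" statements such as the foreach loop shown earlier are no longer accepted.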

BoulderIO

Lincoln Stein's "BoulderIO" is a very simple system, supporting only two data types (strings and records). Strings are terminated by a newline character; records are enclosed by a pair of braces:

    a=aaaaa...
    b={
      ba=aaaaa...
      bb=bbbbb...
      bc=ccccc...
    }
      

To make these structures easy to use, Lincoln has created a library of Perl modules that parse and/or emit BoulderIO files. Of course, the syntax is so simple that BoulderIO files are absolutely trivial for scripts to generate.
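To give a feel for how little code is involved, here is a minimal sketch of an emitter and a parser, written in Python. It follows the simplified notation shown above and makes no claim to match the exact grammar or behavior of Lincoln's Perl modules. The function names emit and parse are hypothetical, values may not contain newlines, repeated keys simply overwrite one another, and there is no error handling.

```python
def emit(record, indent=0):
    """Render a dict of strings and nested dicts as a list of text lines."""
    pad = " " * indent
    lines = []
    for key, value in record.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}={{")           # open a nested record
            lines.extend(emit(value, indent + 2))
            lines.append(f"{pad}}}")                 # close it
        else:
            lines.append(f"{pad}{key}={value}")
    return lines


def parse(lines):
    """Rebuild the nested dict from lines in the same notation."""
    root = {}
    stack = [root]                                   # innermost record is last
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "}":
            stack.pop()
            continue
        key, _, value = line.partition("=")
        if value == "{":
            child = {}
            stack[-1][key] = child
            stack.append(child)
        else:
            stack[-1][key] = value
    return root
```

Feeding the output of emit back into parse should reproduce the original structure.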

BoulderIO has its roots in genetic sequence manipulation, so it has some specialized features (e.g., a Blast interface) that would be of interest only to geneticists. More usefully to the rest of us, however, BoulderIO's data format and supporting code are convenient, general, and surprisingly powerful.

Although BoulderIO fits well into the Unix "filter and pipe" metaphor, it does so in a somewhat peculiar manner. BoulderIO programs can filter an incoming data stream, using and/or modifying only those elements which are relevant. Any items which are not specifically processed are passed to following programs in the pipeline.

This allows the entire body of data to increase in size and complexity, adding annotations, intermediate results, etc. The resulting file can thus be a complete record of the pipeline's activities, allowing easy and productive analysis, if need be.
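The pass-through behavior can be sketched with a pair of Python generators, each standing in for one program in the pipeline. Records are modeled here as plain dictionaries, and the field names (sequence, length, long, note) are invented for this illustration. Each stage touches only the fields it understands and hands everything else along unchanged.

```python
def add_length(records):
    # Stage 1: annotate any record that carries a "sequence" field.
    for record in records:
        out = dict(record)                 # copy, so the input is not modified
        if "sequence" in out:
            out["length"] = str(len(out["sequence"]))
        yield out


def flag_long(records, threshold=10):
    # Stage 2: uses the annotation added by stage 1, if it is present.
    for record in records:
        out = dict(record)
        if "length" in out:
            out["long"] = "yes" if int(out["length"]) > threshold else "no"
        yield out


def run_pipeline(records):
    # Chain the stages, much as a shell pipe would chain two programs.
    return list(flag_long(add_length(records)))
```

A record that enters with a note field leaves with the same note, plus whatever annotations the stages have added.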

I have found myself using BoulderIO for a variety of tasks, some of which are probably rather different from anything Lincoln had in mind when he designed the system. For instance, I find it to be a very useful format for writing log files (e.g., for CGI scripts).

Log entries can be of arbitrary size, but they are always tied together as records. New elements are trivial to add, without concerns about "breaking" the data format. Finally, the files are easily readable by both humans and computers.
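Here is a small sketch of the logging idea, in Python. Each log entry is written as a record in the brace notation shown earlier, and the reader picks up whatever fields it finds. The record name (entry), the field names, and the functions write_entry and read_entries are all assumptions made for this example.

```python
import io


def write_entry(stream, fields):
    # One record per log entry; the fields may differ from entry to entry.
    stream.write("entry={\n")
    for key, value in fields.items():
        stream.write(f"  {key}={value}\n")
    stream.write("}\n")


def read_entries(stream):
    # Collect each record as a dict, keeping fields the reader has never seen.
    entries, current = [], None
    for raw in stream:
        line = raw.strip()
        if line == "entry={":
            current = {}
        elif line == "}" and current is not None:
            entries.append(current)
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return entries


# An in-memory stream stands in for the log file.
log = io.StringIO()
write_entry(log, {"script": "search.cgi", "status": "200"})
write_entry(log, {"script": "search.cgi", "status": "500", "error": "timeout"})
log.seek(0)
entries = read_entries(log)
```

The second entry carries an extra field, but a reader that asks only for script and status is not disturbed by it.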

XML

I haven't studied XML (eXtensible Markup Language) in depth, but it appears to have many of BoulderIO's useful attributes, along with several of its own. Like BoulderIO, XML can encode both strings and records. It is supported in this, however, by strong tools for enforcing standardization.

Using XML DTDs (Document Type Definitions), it is quite possible to "publish" standards for given sorts of information (e.g., bibliographic entries). Because each corresponding XML document references the DTD, the receiving program can ensure that all of the elements in the document are well-defined, complete, etc.
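As an illustration, here is a made-up bibliographic entry with a small inline DTD, along with a Python sketch of the kind of check a receiving program might perform. Python's standard library parses the DTD but does not validate against it, so the rule (an entry contains author, title, and year, in that order) is simply restated by hand. A validating parser would enforce it directly from the DTD. The element names and the function check_entry are assumptions of this example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE entry [
  <!ELEMENT entry (author, title, year)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
]>
<entry>
  <author>Rich Morin</author>
  <title>Smart Data</title>
  <year>1999</year>
</entry>
"""

# Hand-written restatement of the DTD's content model for <entry>.
REQUIRED = ["author", "title", "year"]


def check_entry(xml_text):
    """Return True if the document is an <entry> with the required children."""
    root = ET.fromstring(xml_text)
    found = [child.tag for child in root]
    return root.tag == "entry" and found == REQUIRED
```

A document that omits the author or the year fails the check, so the receiving program can reject it before doing any real work.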

If, at some later time, it becomes necessary to add elements, a revised form of the DTD can be published. As long as the old DTD is still available, documents that use it can still be interpreted unambiguously. This gives us a chance to write robust systems of programs that will be able to exchange information for years to come.

XML and its related standards are language-independent; support software is thus available for Java, Perl, Python, Tcl, etc. The programming models for the support software vary wildly, including procedural, event-oriented, and object-oriented interfaces. Nonetheless, it is clear that a great deal of support software is on the way, and much of it will be very good indeed.
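To make the difference in programming models concrete, here is a sketch that lists the element names in a tiny document in two ways, using modules from Python's standard library. The first is event-oriented (the parser calls a handler as it meets each element). The second builds a tree of objects and then walks it. The sample document and the names TagCollector, tags_by_events, and tags_by_tree are invented for this example.

```python
import xml.sax
import xml.etree.ElementTree as ET

XML_TEXT = "<entry><author>A</author><title>T</title></entry>"


class TagCollector(xml.sax.ContentHandler):
    """Event-oriented style: the parser calls us for each start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def tags_by_events(text):
    handler = TagCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.tags


def tags_by_tree(text):
    # Object-oriented style: build the whole tree, then walk it in
    # document order, starting at the root element.
    return [element.tag for element in ET.fromstring(text).iter()]
```

Both functions report the same names in the same order. The event style never holds the whole document in memory, while the tree style is usually easier to navigate.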

The principal impact of HTML and the Internet has been to facilitate "many to many" publishing. Unlike telephones or the media, the Internet allows Joe Sikspak to reach a mass audience (happily refuting A.J. Liebling's comment that "Freedom of the press belongs to those who own one.").

Unfortunately, HTML pages are designed (in theory :-) to look nice and convey information to human viewers. In addition, their format changes whenever the webmaster gets a new idea. Consequently, most web pages are not well-suited for parsing by computer programs.

The impact of XML will thus be the facilitation of "many to many" data exchange among computers. As organizations define DTDs and publish XML documents, more and more information will be cleanly available for use by random computer programs. I can't predict how this will all turn out, but it should be pretty interesting!

Resources

Adobe Systems
http://www.adobe.com
        
BoulderIO
http://stein.cshl.org/software/boulder/
                        
Java
http://www.javasoft.com
                        
Mac OS X
http://www.apple.com/macosx
                        
Perl
http://www.perl.com
                        
Sun
http://www.sun.com
                        
XML
http://www.xml.com
      

About the author

Rich Morin (rdm@cfcl.com) operates Prime Time Freeware (www.ptf.com), a publisher of books about Open Source software. Rich lives in San Bruno, on the San Francisco peninsula.