Ostensible Markup Language

San Francisco Perl User Group
May 28, 2002

Rich Morin <rdm@cfcl.com>
Vicki Brown <vlb@cfcl.com>
 

Extensible Markup Language (XML) is a highly buzzword-compliant standard. It is being adopted in a variety of areas and may even be appropriate for some of these. At heart, however, it is simply a text-based system for encoding hierarchical data structures.

Vicki Brown and I use an informal variant of XML, which we refer to as Ostensible Markup Language (OML). OML is syntactically consistent with XML, allowing us to use tools such as XML::Simple, but it tends not to have DTDs or other formalities.

Definitions

XML is a text-based, self-descriptive, tagged system for encoding hierarchical data structures.
  • text-based — informally, ASCII; really, Unicode.
  • self-descriptive — the description of the data’s format is bound in some manner to the data.
  • tagged — each data item is bound to a "tag" (i.e., name).
  • hierarchical data structures — basically, anything that looks like a "tree", but arbitrary graphs can be encoded, with some effort.

XML is standardized, but the languages built with it may not be. XML solves only the "syntax" problem. The "meaning" of a given field, or even its internal structure, is not defined by XML.

Self-descriptive data

There are many forms of self-descriptive data. What they have in common is the fact that the program gets told how to read the data, rather than being "hard-coded" to use a particular format. Some self-descriptive data is designed to be human-readable; some isn’t. Here’s a self-descriptive format that is occasionally used in Fortran circles:

(I4,I2,I2)
1776 7 4
...
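
As a rough illustration of "the program gets told how to read the data", here is a minimal Python sketch that reads the format line first and then uses it to slice the data line. The function name read_with_format is made up for this example, and only integer (Iw) descriptors are handled; real Fortran formats are far richer.

```python
import re

def read_with_format(fmt, record):
    """Split a fixed-width record using the integer widths named in fmt.

    Hypothetical helper: only Fortran-style Iw descriptors are understood.
    """
    widths = [int(w) for w in re.findall(r"I(\d+)", fmt)]
    fields, pos = [], 0
    for width in widths:
        # Each field occupies exactly `width` columns; int() ignores padding.
        fields.append(int(record[pos:pos + width]))
        pos += width
    return fields

lines = ["(I4,I2,I2)", "1776 7 4"]
fmt, records = lines[0], lines[1:]
for record in records:
    print(read_with_format(fmt, record))
```

Nothing about the column layout is hard-coded; changing the first line changes how every later line is read.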

Here’s a tagged format that is used by some spreadsheets, etc.:

yyyy,mm,dd
1776,7,4
...
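
A header line like this is easy to exploit from most languages. The sketch below uses Python's standard csv module (our choice for illustration, not something the format requires) to bind each value to the tag given in the first line.

```python
import csv
import io

data = "yyyy,mm,dd\n1776,7,4\n"

# DictReader takes the field names from the first line of the input.
rows = list(csv.DictReader(io.StringIO(data)))
for row in rows:
    print(row["yyyy"], row["mm"], row["dd"])
```

Note that every value arrives as a string; the header supplies names, not types.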

Here’s Lincoln Stein’s Boulder Data Interchange Format:

date={
  yyyy=1776
  mm=7
  dd=4
}
...
 

Here’s a bit of XML:

... assorted DTD headers, possibly referencing some URI ...
<date>
  <yyyy>1776</yyyy>
  <mm>7</mm>
  <dd>4</dd>
</date>
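
To show what "encoding a hierarchical data structure" means in practice, here is a small Python sketch that turns the element above into nested dictionaries, loosely in the spirit of what simple tree-building tools produce. The to_dict helper is hypothetical and deliberately naive: it ignores attributes, and a repeated tag would overwrite the earlier value.

```python
import xml.etree.ElementTree as ET

DATE_XML = """
<date>
  <yyyy>1776</yyyy>
  <mm>7</mm>
  <dd>4</dd>
</date>
"""

def to_dict(elem):
    """Recursively convert an element into nested dicts of strings."""
    children = list(elem)
    if not children:
        # Leaf element: its value is just the enclosed text.
        return (elem.text or "").strip()
    return {child.tag: to_dict(child) for child in children}

date = to_dict(ET.fromstring(DATE_XML.strip()))
print(date["yyyy"], date["mm"], date["dd"])
```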

And finally, here’s a bit of OML:

<date>
  <yyyy>1776</>
  <mm>7</>
  <dd>4</>
</date>
 

One of the things that becomes apparent from looking at these formats is that each of them imposes some overhead. In XML, for instance, each tag is given twice, along with some angle brackets and the occasional slash. If the tags are fairly long and the data fairly short, most of the bandwidth (storage, processing, ...) could easily be used for overhead.

The XML example above uses ridiculously short tags, but the per-item overhead is still 89%; the OML example is only a little better, at 87%. In short, there is a cost to using this method.
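
These figures can be reproduced with a few lines of Python. The sketch assumes (our reading, not stated above) that overhead means every character that is not data, counting one newline per line and ignoring indentation.

```python
XML_SAMPLE = """
<date>
  <yyyy>1776</yyyy>
  <mm>7</mm>
  <dd>4</dd>
</date>
"""

OML_SAMPLE = """
<date>
  <yyyy>1776</>
  <mm>7</>
  <dd>4</>
</date>
"""

DATA_ITEMS = ["1776", "7", "4"]

def overhead(sample, data_items):
    """Fraction of the encoded text that is not payload data."""
    lines = [line.strip() for line in sample.strip().splitlines()]
    total = sum(len(line) + 1 for line in lines)   # +1 for each newline
    payload = sum(len(item) for item in data_items)
    return 1 - payload / total

print(f"XML: {overhead(XML_SAMPLE, DATA_ITEMS):.0%}")
print(f"OML: {overhead(OML_SAMPLE, DATA_ITEMS):.0%}")
```

Under those assumptions the XML sample is 55 characters and the OML sample 47, each carrying 6 characters of data.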

On the other hand, if the data is relatively small, the total difference may be insignificant. Also, the benefits brought by self-describing data may be so large as to justify the overhead. In the Perl and Unix communities, the prevailing wisdom tends to run in the direction of using text-based encoding; adding some tags and associated syntax simply extends that approach.

SGML, HTML, and XML

SGML (Standard Generalized Markup Language) was designed as a way to create "markup languages", à la troff and TeX, but with stronger capabilities for analysis, semantic markup, etc. Happily, as shown above, it ended up solving a much larger class of problems.

Unfortunately, although SGML is extremely powerful, many users find it to be overwhelming. As a result, it has never received broad acceptance, even in the text processing community.

HTML (Hypertext Markup Language) bears only a passing resemblance to SGML. The basic syntax is similar (e.g., use of angle brackets and tags), but:

  • HTML is not user-extensible (i.e., new tags can’t be added).

  • Some tags (e.g., <P>) don’t need to be "matched" by closing tags (</P> is officially recommended, but not generally required).

  • HTML allows some tags to stand on their own without flagging them as such. For example,

      <IMG SRC="photo.jpg" ALT="picture of me">

    as opposed to the XML-style

      <IMG SRC="photo.jpg" ALT="picture of me" />

  • HTML has tags (e.g., <B>, <I>, <TT>) that are specifically intended to control the displayed appearance of an item, as opposed to defining its fundamental nature.

  • The browser wars have added many goodies and allowed even more looseness of expression than the original HTML definition did.

XML (Extensible Markup Language) is often described as an attempt to recapture some of the power and formality of SGML, without the complexity. It has succeeded in this role, spawning dozens of markup languages for specific uses. One of these, XHTML, is being promoted as the successor to HTML.

XML is, however, far more than a markup language. Many important uses of XML have nothing at all to do with "document production". Just think of it as a way to exchange information between computer programs (as one article put it, XML is "digital Tupperware").

Be careful, however, to remember that XML (like SGML) defines only the syntax of the data encoding. The meaning of a particular item (and even its internal format) is still unspecified. Assuming that XML encoding assures easy interchange of data can be dangerously naive.
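
A small Python sketch makes the point concrete. The one-element document below is invented for this example; it is perfectly well-formed, yet the parser can only hand back a string, and two programs that disagree about the field's internal format will read two different dates from it.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical document: well-formed, but the date layout is unspecified.
doc = "<date>07/04/1776</date>"

text = ET.fromstring(doc).text           # the parser yields only "07/04/1776"

as_us = datetime.strptime(text, "%m/%d/%Y").date()   # month first
as_eu = datetime.strptime(text, "%d/%m/%Y").date()   # day first

print(as_us.isoformat())
print(as_eu.isoformat())
```

Both readings succeed without complaint, which is exactly the problem: agreement on syntax says nothing about agreement on meaning.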

OML

OML (Ostensible Markup Language) is our simplified variant of XML, used for applications where schemas and standardization are not appropriate. We also feel free to simplify the syntax, add warts and blemishes, and generally customize the format to our own needs. The only "rule" is that we should be able to use XML tools to process the resulting data. Even there, as discussed below, we’re willing to bend the rules on occasion.
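
One way such rule-bending can coexist with standard tools is a small preprocessing pass. The Python sketch below is purely illustrative and is not our actual tooling: the expand_short_end_tags function is a made-up name, and it assumes the only deviation from XML is an empty end tag (</>), which it replaces with the name of the most recently opened element. It does not cope with comments or CDATA sections, and it fails with an IndexError on unbalanced input.

```python
import re
import xml.etree.ElementTree as ET

OML_SAMPLE = "<date>\n  <yyyy>1776</>\n  <mm>7</>\n  <dd>4</>\n</date>"

# Matches a start or end tag: optional slash, tag name, then anything up to ">".
TAG = re.compile(r"<(/?)([^<>\s/]*)([^<>]*)>")

def expand_short_end_tags(oml):
    """Rewrite each empty end tag </> as a full end tag (hypothetical helper)."""
    stack = []  # names of the elements that are currently open

    def fix(match):
        closing, name, rest = match.groups()
        if name.startswith(("?", "!")):
            return match.group(0)          # declarations pass through untouched
        if closing:
            opened = stack.pop()
            return match.group(0) if name else f"</{opened}>"
        if not rest.endswith("/"):         # <tag/> opens nothing
            stack.append(name)
        return match.group(0)

    return TAG.sub(fix, oml)

try:
    ET.fromstring(OML_SAMPLE)
except ET.ParseError:
    print("a strict XML parser rejects the short end tags")

root = ET.fromstring(expand_short_end_tags(OML_SAMPLE))
print({child.tag: child.text for child in root})
```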

OML is only one of many data encoding possibilities. It’s quite possible, for instance, to write configuration files in Perl! Alternatively, Lincoln Stein’s Boulder package is convenient and quite powerful. OML’s "buzzword compliance" (and the concomitant toolsets), however, tends to push us hard in its direction.

Events vs. Trees

As discussed at http://www.saxproject.org/?selected=event , there are two popular ways to handle XML (or SGML):

    Tree-based APIs

    These map an XML document into an internal tree structure, then allow an application to navigate that tree. The Document Object Model (DOM) working group at the World-Wide Web Consortium (W3C) maintains a recommended tree-based API for XML and HTML documents, and there are many such APIs from other sources.

    Event-based APIs

    An event-based API, on the other hand, reports parsing events (such as the start and end of elements) directly to the application through callbacks, and does not usually build an internal tree. The application implements handlers to deal with the different events, much like handling events in a graphical user interface. SAX is the best-known example of such an API.

Although event-based APIs are handy (and sometimes necessary) for dealing with extremely large chunks of XML, they don’t tend to be needed for the kind of things we do. So, we use tree-based APIs (e.g., XML::Simple).
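
To make the contrast concrete, here is the same small document handled both ways. The sketch uses Python's standard library (ElementTree for the tree, xml.sax for the events) as a stand-in for the Perl modules; the DateHandler class is invented for this example and only collects the text of leaf elements.

```python
import xml.etree.ElementTree as ET
import xml.sax

DATE_XML = "<date><yyyy>1776</yyyy><mm>7</mm><dd>4</dd></date>"

# Tree-based: parse everything, then navigate the finished structure.
root = ET.fromstring(DATE_XML)
tree_result = {child.tag: child.text for child in root}

# Event-based: the parser calls us back as it meets each piece of markup.
class DateHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None   # name of the element we are inside, if any
        self._text = []

    def startElement(self, name, attrs):
        self._current = name
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        # Only a leaf element closes while it is still "current".
        if name == self._current:
            self.fields[name] = "".join(self._text).strip()
        self._current = None

handler = DateHandler()
xml.sax.parseString(DATE_XML.encode("utf-8"), handler)

print(tree_result)
print(handler.fields)
```

The tree version is shorter because the library keeps all the state; the event version must track its own position in the document, but never holds more than one field in memory at a time.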