Ostensible Mark-up Language

Copyright (c) 2001 by Rich Morin
published in Silicon Carny, February 2001


The Meta Project's file-tree browser is supposed to recognize path names and supply descriptive information, but in cases like /dev/*, this can be a real challenge. Using Perl and OML (an informal variant of XML), however, Rich Morin has pieced together a solution, and in this month's Silicon Carny, he shares it with you.

As I mentioned last month, I'm working on a file-tree browser for the Meta Project. One of the interesting sub-problems I've encountered has to do with characterizing the device names in /dev. As mentioned in the demo, the idea is that a user should be able to enter an arbitrary device name (e.g., /dev/ersa0.1) and receive some useful information, such as:

/dev/rwd0a is a member of the wd (Generic WD100x/IDE disks) family. This device node has the attributes "raw, unit 0, partition a". Related device nodes have names that match the Perl regular expressions /^r?wd[0-9]+$/, /^r?wd[0-9]+[a-z]$/, and /^r?wd[0-9]+s[0-9]+.*$/.

There are literally thousands of possible device names, so a brute-force approach is out of the question. Even when the name space is folded down by unit numbers and such, there are hundreds of device families (e.g., /dev/*sa*). Although many of these families have similar name-formation rules, there are over a dozen sets of rules, all told.

My solution to this nightmare is based on three components: a set of device family descriptions, a set of parsing macros, and some supporting Perl code. Both the descriptions and the macros use XML syntax.

By matching each family's base name (e.g., sa) against the name in question, I can find out whether that family has even a chance of matching. If this initial test succeeds, I can use the family's specific set of parsing macros to see if I really have a match.
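A minimal sketch of the first, cheap stage of this test is shown here. The %families table and the candidate_families helper are hypothetical names used for illustration; only the sa and wd entries come from this column, and a plain substring test is assumed as the filter.

```perl
use strict;
use warnings;

# Hypothetical family table: base name => description. Only "sa" and "wd"
# appear in this column; a real table would hold hundreds of entries.
my %families = (
    sa => '(SCSI) Sequential Access devices',
    wd => 'Generic WD100x/IDE disks',
);

# Return the base names that occur somewhere in the query name.
# This is only a cheap filter: each candidate still has to pass
# its family's parsing macro before it counts as a match.
sub candidate_families {
    my ($qname) = @_;
    my @hits = sort grep { index($qname, $_) >= 0 } keys %families;
    return @hits;
}

print join(', ', candidate_families('ersa0.1')), "\n";
print join(', ', candidate_families('rwd0a')),   "\n";
```

A substring test can produce false candidates (a short base name may turn up inside an unrelated device name), which is why the second, regexp-based stage is still needed.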

Device families

Here's the XML description for the sa device family:

    <driver>
      <name>sa</name>
      <desc>(SCSI) Sequential Access devices</desc>
      <man_page section="4" status="primary">sa</man_page>
      <parse>[EN]?R?BU.M</parse>
    </driver>

The parse entry looks pretty complex, but it's actually just a mnemonic name for the parsing macro. Any unique text string would serve to identify the macro, but this one gives a hint to the nature of the required parsing. The rest of the description should be pretty self-explanatory.
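To illustrate the point that the parse entry is only a label, here is a small sketch in which it serves as a plain hash key. The %driver and %macros structures are hand-written stand-ins, not the layout an XML parser would necessarily produce, and the macro body is the regexp given later in this column.

```perl
use strict;
use warnings;

# Hand-written stand-in for the parsed family description.
my %driver = (
    name  => 'sa',
    desc  => '(SCSI) Sequential Access devices',
    parse => '[EN]?R?BU.M',
);

# Macros keyed by their mnemonic. The key is an opaque label and is
# never interpreted as a regular expression itself.
my %macros = (
    '[EN]?R?BU.M' => {
        regexp => '([en]?)(r?)$name([0-9]+)(?:\.([0-9]+))?',
    },
);

my $macro = $macros{ $driver{parse} }
    or die "no parsing macro named '$driver{parse}'\n";
print "Family $driver{name} uses regexp: $macro->{regexp}\n";
```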

I should note, in passing, that the description above is written in something I call Ostensible Mark-up Language (OML). That is, it looks enough like XML to pass muster, but it doesn't have a style sheet or other niceties. It may also contain things, such as Perl regular expressions, that aren't really kosher by normal XML standards.

Parsing macros

Assuming that the entered device name contains the device family's base name (sa), we look at the contents of the specified parsing macro(s) (e.g., [EN]?R?BU.M):

    <macro parse="[EN]?R?BU.M">
      <regexp>([en]?)(r?)$name([0-9]+)(?:\.([0-9]+))?</regexp>
      <redisp>[en]?r?$name[0-9]+</redisp>
      <redisp>[en]?r?$name[0-9]+\.[0-9]+</redisp>
      <substr>rewind,root,unit,mode</substr>
    </macro>

If the entered device name matches the regular expression specified in regexp, Perl will fill four numbered variables ($1 through $4) with the captured substrings. We can then interpret these substrings, in order, based on the names in the substr entry.
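The matching step can be sketched as follows. The regexp and substr strings are copied verbatim from the macro above; parse_device is a hypothetical helper, and returning label/value pairs is an assumption made for illustration rather than the column's own approach.

```perl
use strict;
use warnings;

# Values copied from the sa family description and its parsing macro.
my $dname  = 'sa';
my $regexp = '([en]?)(r?)$name([0-9]+)(?:\.([0-9]+))?';
my $substr = 'rewind,root,unit,mode';

# Match a query name and return [label, captured text] pairs,
# or an empty list if the name does not fit the family's pattern.
sub parse_device {
    my ($qname) = @_;
    (my $re = $regexp) =~ s/\$name/\Q$dname\E/;   # plug in the base name
    my @caps = $qname =~ /^$re$/ or return;
    my @names = split /,/, $substr;
    # An optional group that took no part in the match is undef.
    return map { [ $names[$_], $caps[$_] // '' ] } 0 .. $#names;
}

for my $pair (parse_device('ersa0.1')) {
    printf "%-6s = '%s'\n", @$pair;
}
```

For ersa0.1 the four captures are e, r, 0, and 1. For a bare name such as sa3, the first, second, and fourth come back empty.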

Supporting Perl code

Fortunately, the really hard parts of the job are accomplished by some handy Perl utility modules. For instance, the XML text is stored in a tied hash, using BerkeleyDB::Btree. Parsing the XML into Perl data structures is accomplished by XML::Simple; printing the structures (for debugging) is accomplished by Data::Dumper.
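As a small illustration of the debugging step, the sketch below dumps a family description with Data::Dumper. The $driver hash is a hand-built stand-in; the structure that XML::Simple actually returns may be laid out differently, and the tied-hash storage is not shown.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-built stand-in for a parsed family description.
my $driver = {
    name  => 'sa',
    desc  => '(SCSI) Sequential Access devices',
    parse => '[EN]?R?BU.M',
};

$Data::Dumper::Sortkeys = 1;   # stable key order makes dumps easy to compare
$Data::Dumper::Indent   = 1;   # compact indentation

my $dump = Dumper($driver);
print $dump;
```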

With these nasty parts under control, we only need to fiddle the returned values into English. Here's a simplified version of the relevant code. The device family name ($dname) gets plugged into the regular expression ($regexp), and is then matched against the query name ($qname). The resulting substrings are folded into a parenthesized expression ($paren), which is then tidied up into passable English format.

    $regexp =~ s/\$name/$dname/;
    if ($qname =~ /^$regexp$/) {
      print "\n  /dev/$qname is a device node ";
      @substr = split(/,/, $substr);

      $paren =  '';
      $paren .=  "($substr[0] $1" if ($#substr >= 0);
      $paren .= ", $substr[1] $2" if ($#substr >= 1);
      ...
      $paren .= ") " if ($#substr >= 0);

      if ($parse eq '[EN]?R?BU.M') {
        $paren =~ s/\(rewind , /(rewind on close, /;
        ...
      }
      ...
      print $paren if ($paren ne '');

      print "\n  for device family $dname ($desc).\n";
      ...

I won't try to pretend this is elegant code, but it gets the job done in a small and reasonably simple way. Part of the reason for this brevity lies in Perl and its very handy modules.

Another part, however, comes from using XML as a tool to build a little language. By creating XML-based parsing macros (complete with embedded regular expressions), I was able to encode some fairly complex notions in a very compact form.

I'm not sure what other applications could benefit from this approach, but I think that it is one that will stay in my coding arsenal. Here's hoping it will find a place in yours...

About the author

Rich Morin (rdm@cfcl.com) operates Prime Time Freeware (www.ptf.com), a publisher of books about Open Source software. Rich lives in San Bruno, on the San Francisco peninsula.