A Prime Example

Rich Morin <rdm@cfcl.com>


Prime Time Freeware uses OML in several parts of the book publishing process. The DOSSIER series had been available in printed form for several months, but we wanted to make the volumes available in PDF (Acrobat) form, as well. So, we set up some "PDF Download Areas" on Prime Time Freeware’s web server (aka sf.pm.org :-), keyed to the email addresses of authorized users:

PDF Download Areas

The "subscription download area" for PDFs needs to be secure (not letting unauthorized parties browse around) and simple to administer. It should also support a pleasant and convenient user interface.

Here is the directory layout:

.../ptf/dossier/PP/               # note 1

  Files/                          # note 1
    <sys_kwd>/                    # note 2
      <file>.pdf  <--+
      ...            |
                     |
  Users/             |            # note 1
    <user_id_kwd>/   |            # note 3
      index.shtml    |
      info.xml       |
      <link>.pdf  >--+            # note 4
      ...

  titles.xml


Notes:

1 Use an empty index.html file; set mode 750 on directory.

2 <sys_kwd> is where the PDF files actually "live", although we create links to them from the users’ directories. The real name of <sys_kwd> is a closely-held secret. Also, we can rename it (first updating all the user links!) whenever we need to do so.

3 Each <user_id_kwd> is generated, then sent to the user at his or her "official" email address (used in placing the order). If need be (e.g., we see too many downloads for the user), <user_id_kwd> can be regenerated and sent again.

4 We use hard <link>s, rather than symbolic links. This saves space, but requires us to rebuild the links if we replace the original <file>s.
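The setup these notes describe can be sketched in a few lines of Perl. The directory names follow the layout above, but gen_user_kwd and add_user are hypothetical helpers, not PTF's actual code:

```perl
use strict;
use warnings;
use File::Path qw(mkpath);

# Generate a keyword of the form u_<email>_<random> (note 3).
# This generator is made up; the real one is PTF's own secret sauce.
sub gen_user_kwd {
    my ($email) = @_;
    my @chars = ('a'..'z', '0'..'9');
    my $rand  = join '', map { $chars[rand @chars] } 1 .. 9;
    return "u_${email}_$rand";
}

# Create a user's directory and hard-link PDFs into it.  Perl's
# built-in link() makes hard links (note 4), so replacing the
# original <file>.pdf means re-linking afterward.
sub add_user {
    my ($email, @pdfs) = @_;
    my $kwd = gen_user_kwd($email);
    my $dir = "Users/$kwd";
    mkpath($dir);
    chmod 0750, $dir;                   # note 1
    open my $fh, '>', "$dir/index.html" or die "index.html: $!";
    close $fh;                          # empty index.html (note 1)
    for my $pdf (@pdfs) {
        my ($base) = $pdf =~ m|([^/]+)$|;
        link $pdf, "$dir/$base" or die "link $pdf: $!";
    }
    return $kwd;
}
```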

Buried in all of this, we see a couple of XML (well, OML :-) files:

info.xml          # information about this user
titles.xml        # information about all titles

Here is an info.xml file we set up for this example:

% cat Users/u_ima@cfcl.com_asdfghjkl/info.xml

<email>ima@cfcl.com</> <uname>Ima Perlie</>
# Local Perlies
<item patt="Emai_MaS" eff="0205" cat="demo" />

There are three things about this file that obviously aren’t kosher XML:

The closing tags (e.g., </> ) don’t echo the opening tags.

XML doesn’t allow comments to start with sharp signs.

There isn’t any enclosing item for the XML as a whole.

All of these variations can be handled by the script before it hands the incoming string to the XML parser, however, and they certainly make the file easier to work with! Less obviously, there isn’t any Document Type Definition (DTD) in sight, and the patt attribute can have a regular expression as its value. Again, not particularly kosher, but quite convenient.

After we set up the download area, we needed to send email to all of the folks who had directories in it. Here are the essentials from the code:

use Data::Dumper;
use XML::Simple;

# Walk through the user directories.

for $dir (<Users/u_*>) {

  # Load info.xml (information on user's access rights)

  $file = "$dir/info.xml";
  $r_info = loadit($file);
# print "$file:\n"; dumpit($r_info);               # DEBUG

  $email = $r_info->{email}[0];
  $uname = $r_info->{uname}[0];
  $cat   = $r_info->{item}[0]{cat};

  if      ($cat eq 'comp') {
    $tmp = "... comp-specific message text ...";
  } elsif ($cat eq 'demo') {
    $tmp = "... demo-specific message text ...";
  } else {
    print "!!! cat=$cat, dir=$dir\n";
    next;
  }

  $tmp .= "...";

  # Now email or print the message...

  if ($run eq 'production') {
    open(MAIL, '|/usr/sbin/sendmail -t')
      or die "cannot fork sendmail: $!";
    print MAIL "$tmp\n";
    close(MAIL);
  } else {
    print "$tmp\n\n";
  }
}

##
# dumpit                # (Terse) structure dump
##

sub dumpit {

  my ($data) = @_;

  my $tmp;

  local $Data::Dumper::Indent = 1;

  $tmp =  Dumper($data);
  $tmp =~ s|\[\n\s+({})\n\s+]|[ $1 ]|g;
  $tmp =~ s|\[\n\s+('[^']+')\n\s+]|[ $1 ]|g;
  $tmp =~ s|\[\n\s+([0-9]+)\n\s+]|[ $1 ]|g;
  $tmp =~ s|\n\s+{| {|g;
  $tmp =~ s|\n\s+],| ],|g;

  print "=====\n$tmp=====\n";
}
##
# loadit                # load XML from file
##

sub loadit {

  my ($file) = @_;
  my $ref;
  my $xml;

  local $/;             # slurp mode: read the whole file at once

  open(FILE, $file) or die "can't open $file: $!";
  $xml = <FILE>;
  close(FILE);
 
  my $lgl =  '[-a-zA-Z0-9_\.:]+';   # legal tag-name characters
  $xml =~ s|<($lgl)([^>]*)>([^<>]*)</>|<$1$2>$3</$1>|g;
  $xml =~ s|^#[^\n]*$||gm;
  $xml =  "<x>\n$xml</x>\n";

# print "xml=|||$xml|||\n";                       # DEBUG
  $ref = eval
    { XMLin($xml, 'forcearray'=>1, 'keyattr'=>[]) };
  if ($@) {
    print "<P><B>XML ERROR</B> - please report!\n";
    print "xml=||\n$xml||\n";
    exit;
  }
  return $ref;
}

Well, that’s a lot of code, but the important parts, for our purposes, are:

dumpit — We like Data::Dumper, but we think it uses too much white space. So, we edit the white space on the way out. Here’s some sample output:

$VAR1 = {
  'uname' => [ 'Ima Perlie' ],
  'item' => [ {
      'cat' => 'demo',
      'eff' => '0205',
      'patt' => 'Emai_MaS'
    } ],
  'email' => [ 'ima@cfcl.com' ]
};
 
loadit — This wrapper calls XMLin, handling our "local preferences". It expands our shorthand tags, strips out comments, and puts an enclosing item around the body of the XML.

The XMLin options tell it that we want even single XML items to be stored in arrays and that we don’t want the "array folding" feature. These make the resulting data structures consistent, at the cost of some complexity.
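With those options, the sample info.xml parses into exactly the structure dumped above; a hand-built copy of that structure shows why every lookup ends in a [0] subscript:

```perl
use strict;
use warnings;

# The structure XMLin('forcearray'=>1, 'keyattr'=>[]) returns for the
# sample info.xml, written out by hand:
my $r_info = {
    email => [ 'ima@cfcl.com' ],
    uname => [ 'Ima Perlie' ],
    item  => [ { patt => 'Emai_MaS', eff => '0205', cat => 'demo' } ],
};

# Every element is an array, even singletons, so the accessors
# are uniform:
my $email = $r_info->{email}[0];
my $uname = $r_info->{uname}[0];
my $cat   = $r_info->{item}[0]{cat};

print "$uname <$email>: $cat\n";    # Ima Perlie <ima@cfcl.com>: demo
```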

Finally, the wrapper traps and reports on any parsing errors, without letting XMLin crash the program (:-).

main — The main routine is pretty simple, once you get used to the way XMLin builds up its data structures. Compare these to the dump above:

$uname = $r_info->{uname}[0];

$cat = $r_info->{item}[0]{cat};


There’s a somewhat larger routine (u_update) which generates the users’ index.shtml files and links, but (although I am quite proud of it :-) it doesn’t really demonstrate much more about OML than the example above does. Here, however, are a few interesting code snippets:

# Index entries by volume title.
# <title code="Cetc_Est" name="C, etc.: Essential Tools">
#   <version date="May 2002">1.0.2</>
# </title>

for $r_title ( @{ $r_titles->{title} } ) {
  $name          = $r_title->{name};
  $titles{$name} = $r_title;
}

#####
# Build hash of lists of items.
# <item code="...._..." eff="0205" cat="comp" />

$r_items = $r_info->{item};
for $r_item ( @{ $r_items } ) {
  $patt = $r_item->{patt};
  push(@{ $items{$patt} }, $r_item);
}
 

#####
# Walk items, by name.
for $name ( sort( keys(%titles) ) ) {
  $r_title = $titles{$name};
  $code    = $r_title->{code};

  for $patt (keys(%items)) {
    if ($code =~ m|$patt|) {
      for $r_item ( @{ $items{$patt} } ) {
        $eff = $r_item->{eff};
        ...
} } } }
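Pulled together with some hand-built stand-ins for the XMLin results (the title names and one of the patterns below are made up for illustration), the walk looks like this; note that each patt is a regular expression, so it can match one title code exactly or a whole family of codes:

```perl
use strict;
use warnings;

# Stand-ins for the structures built from titles.xml and info.xml:
my %titles = (
    'C, etc.: Essential Tools' => { code => 'Cetc_Est' },
    'Email Mastery'            => { code => 'Emai_MaS' },
);
my %items = (
    'Emai_MaS' => [ { eff => '0205', cat => 'demo' } ],  # exact match
    'Cetc_.*'  => [ { eff => '0204', cat => 'comp' } ],  # regex match
);

# Walk titles by name; collect the items whose patterns match.
my @matched;
for my $name ( sort keys %titles ) {
    my $code = $titles{$name}{code};
    for my $patt ( keys %items ) {
        next unless $code =~ m|$patt|;
        push @matched, "$name ($_->{cat}/$_->{eff})"
            for @{ $items{$patt} };
    }
}
print "$_\n" for @matched;
```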
 

With all these hashes of arrays (of hashes of arrays ...), OML-derived data structures can get a bit baroque. When you add hashes of references, things can get even weirder. So, take a deep breath and remember: Data::Dumper (aka dumpit) is your friend!

Little Languages, OML-style

Jon Bentley ("Programming Pearls", "More Programming Pearls", etc.) is a big promoter of "little languages". The idea is that you write up an interpreter (or whatever) for an application-specific language, then use that to do the job.

Those of us who aren’t comfortable writing lexers and parsers may find this idea a bit intimidating. Fortunately, OML provides a solution. If your "little language" is basically declarative, why not encode it in OML? That way, all you have to worry about is the semantics of the language!
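As a (made-up) taste of what that might look like: here is a declarative spec for a fixed-width flat file, shown as the structure XMLin would hand back, plus the few lines of Perl that "interpret" it. All the names and field layouts are invented for illustration:

```perl
use strict;
use warnings;

# What XMLin might return for a little-language spec like:
#   <field name="id"    start="0"  len="4"  />
#   <field name="name"  start="4"  len="10" />
#   <field name="total" start="14" len="6"  />
my $spec = {
    field => [
        { name => 'id',    start => 0,  len => 4  },
        { name => 'name',  start => 4,  len => 10 },
        { name => 'total', start => 14, len => 6  },
    ],
};

# The "interpreter": all of the semantics, none of the lexing.
sub parse_record {
    my ($spec, $line) = @_;
    my %rec;
    for my $f ( @{ $spec->{field} } ) {
        my $val = substr($line, $f->{start}, $f->{len});
        $val =~ s/^\s+|\s+$//g;         # trim the field
        $rec{ $f->{name} } = $val;
    }
    return \%rec;
}

my $rec = parse_record($spec, "0042Ima Perlie 19.95");
```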

In the next part of this talk, Vicki will show how she created an OML-based "little language" to parse a variety of flat file formats and load the data into a DBMS. Her "interpreter", written entirely in Perl, is currently in use by BGI.