Refactoring an Inefficient Dataflow Process

Vicki Brown <vlb@cfcl.com>

Problem Statement

The Data

New data comes from a wide variety of vendors, in a wide variety of formats. Files also (often) contain more data than we actually use. Somehow, we need to extract the data that we want and load it into the database.

Unfortunately, we rarely have control over the incoming data format. Input files are usually "flat" data files with an (optional) human-readable header, e.g.

 
Last Name,First Name,Address Line 1,Address Line 2,City,State,Zip,...
Jones,Fred,Box 18,512 Maine St.,Hamburg,PA,16000,...
Baker,Betty,1733 State Dr.,,Philmore,CT,06516,...
...

# Fruit Barrel Data file #1
# 20020515
# separator |
# fields
# ID|Date|Company|State|...|Tag|Item|Price|Unit
097632|20020501|FruitCo|CA|...|FC|apples|3.95|pound
064513|20020502|FruitCo|CA|...|FC|pears|.95|each
...

The "human-readable" header may be anywhere from zero to several lines long. It may begin with a "comment character"... or not. The names given to the data fields by the vendor do not necessarily correspond to our database tables; we may not want to load all of the data as provided. It is not necessarily simple to determine what data should be loaded, given only the header in the file.

The Process

Our dataflow process, from vendor to database, was neither easy to understand nor easy to modify. Modifications were risky and often meant cloning an existing program rather than changing it.

Much special knowledge was encoded inside the various scripts, including the path to the data, the name of the data file, which "family" (data type) is represented, which fields contain what data, and so on. Often the name of the file contains special knowledge as well, e.g.

 
    /ag4/data/incoming/MyCo_Contacts_20020515.csv

which contains date and family information in the filename.

Sometimes the filename contains information that can allow us to derive other information (such as the company name), if we know the right mapping, e.g.

      file                                            Company
      ==============================================  ==============
      /ag4/data/incoming/MyCo_Contacts_20020515.csv   MyCo Inc.
      /ag4/data/incoming/VBC_Contacts_20020515.csv    Very Big Corp.
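
Captured as code (which is where this knowledge actually lived), the mapping might look like the following sketch; the hash and the regular expression are ours, invented for illustration:

#!/usr/bin/perl -w
use strict;

# Hypothetical mapping from filename prefix to company name.
my %company_for = (
    MyCo => 'MyCo Inc.',
    VBC  => 'Very Big Corp.',
);

my $path = '/ag4/data/incoming/MyCo_Contacts_20020515.csv';
if ($path =~ m{([^/_]+)_([^_]+)_(\d{8})\.csv$}) {
    my ($prefix, $family, $date) = ($1, $2, $3);
    my $company = $company_for{$prefix} || 'unknown';
    print "family=$family date=$date company=$company\n";
}

Moving such mappings out of the scripts and into configuration files was one of the goals of the refactoring.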

Validation and Troubleshooting

Although our primary goal was to load data into the database, we had other goals as well. We would prefer to validate the data before we load it! We also wanted to be able to troubleshoot any problems that came our way during the dataflow process. Initially, we had the following to assist us in troubleshooting:

  • input files (not consistent or standard)

  • logs (possibly with not enough information)

  • database results (by this time, it's usually too late to trace what happened)

Problems

  • Our current process involved many parsers and loaders, as well as programs that did both.

  • Our current process was not easy to understand and potentially risky to modify.

  • Our current process included too much "hard-coded" and "secret" knowledge.

  • Our current process had no validation steps.

Desired Situation (Goals)

In trying to refactor the initial process, we decided to tackle the problem of many different file formats first. The reasoning was that, if we could munge the data into a more consistent and standardized data format at the beginning of the process, we could make things easier for downstream processing. Troubleshooting and validation, for example, should be easier when applied to a standard data format. Loading should also be simplified.

We wanted to have a more generic parser that could read any input file format and produce a standard output format that could then be passed to validation and loading utilities. We were willing to put extra data into the intermediate files (i.e. data that will not be loaded into the database) if it improved tracking and troubleshooting.

In thinking about the problem, we determined the following.

The input is composed of:

  • existing data files supplied to us

  • existing configuration files we have created

  • data already in the database

The intermediate data format should be:

  • understandable by humans

  • easily parsed by programs

  • consistent

  • standardized - suitable for passing "down the pipe"

  • general

The process should be:

  • Simpler - parsers parse, loaders load

  • Easier to understand - Special knowledge should be moved out of the programs and into configuration files.

  • Validatable - we wanted to improve our ability to validate the input data and the output at each step in the process

  • Easier to document

  • Less risky to modify

  • Scalable

Choosing a Data-interchange Format

Choice #1 - Continue along well-trodden paths

We could continue along the same lines we'd been following in the past, to wit:

  • Parsers read flat input files and produce flat output files, formatted as columnar data, one record per line, fields separated by a "well-known" separator.

  • We could try to reduce the number of parsers and loaders and improve their efficiency.

  • Loaders are told (or just "know") the separator, are told which fields contain which data items, and gobble their way through.

This approach, however, does not do a good job of fitting the requirements stated above. The typical ("foo"-separated fields in a flat file format) approach has several flaws:

  • The format is understandable by humans, but not easily comprehended; a reader must know (or guess) what each column represents.

  • The format isn't self-documenting.

  • The format is parsable by programs (but not easily; someone needs to tell the program what the fields are).

  • The format is not entirely consistent, as some of the fields could change - you can never assume that you "know" what field 5 is.

  • The format is not well standardized (see above).

  • The format doesn't scale well if you need to add (or remove) a field.
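
The last two points are easy to demonstrate. A loader that addresses fields by position breaks, silently, the moment a vendor adds or removes a column:

# Positional field access: brittle by design.
my $line   = '097632|20020501|FruitCo|CA|...|FC|apples|3.95|pound';
my @fields = split /\|/, $line;
my $price  = $fields[7];   # "3.95" today; if the vendor inserts a
                           # column before Price, wrong data tomorrow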

Choice #2 - Take a different approach

We decided to use a parseable, hierarchical, tag/value format, optimized for data interchange and data streaming. Several of these exist. I was already familiar with Boulder Data Interchange Format (BoulderIO, http://brie.cshl.org/software/boulder/). Boulder is described as

... an easily parseable hierarchical tag/value format suitable for applications that need to pipe the output of one program into the input of another.

Boulder, however, does not meet all of the requirements we had laid out for our new process. It's popular in certain fields, but it hasn't yet reached the level of a standard. In addition, Boulder uses simple tag=value pairs which aren't flexible enough for handling multi-line data.
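
For comparison, Boulder data looks roughly like this (our sketch of the general style: tag=value lines, with a record separator on a line by itself):

Item=apples
Price=3.95
Unit=pound
=
Item=pears
Price=.95
Unit=each
=

A value that spans several lines has no natural home in this scheme.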

Another choice was XML (eXtensible Markup Language). Although generally thought of as a document formatting language (for web-based documents), XML is actually a flexible and extensible data representation standard. XML can also be described as an "easily parsable hierarchical tag/value format suitable for applications...".

Why Choose XML?

XML is an industry standard, designed by a recognized Standards Committee. A large number of XML libraries exist and more are being written every day. There are many books (most of them not worth the shelf space :-) available for XML. Excel understands XML; many databases such as Oracle and Sybase can import XML directly (or will be able to in upcoming releases). We might even be able to convince some of our external vendors to send us XML-formatted input data, some day! And... XML is buzzword-compliant (which means getting management buy-in will be simpler).

Also, XML has some capabilities that Boulder doesn't have. For example, XML wraps values in opening and closing tags, making it easy to include multi-line data (a program can readily determine where the data and tags begin and end.) XML also allows for "attributes" to tags; attributes provide a way of modifying tags (similar to the way adjectives modify nouns in English).
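
For example (the tag and attribute names below are ours, invented for illustration):

<note author="vlb" priority="low">
This value spans several lines.
The closing tag, not a line ending,
marks where the data stops.
</note>

Here, author and priority are attributes that modify the note tag.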

Does XML meet our stated criteria?

Again, if we accept the assertions that our intermediate format should be

  • understandable by humans

  • easily parsed by programs

  • consistent

  • standardized

  • general

  • scalable

XML is

  • relatively easy to understand by humans (easier than flat-file records)

      • especially if we pick the tag names reasonably

      • for example, use English words or perhaps database table field IDs

  • easily parsed by programs (and we can get extra leverage by using the Perl modules and other code that is already available)

  • self-documenting (to a certain extent)

  • consistent

  • standardized (both in terms of final data format and in terms of process)

  • general

  • scalable - to add a field, just... add a field.

  • largely order-independent ("fields" can be re-ordered, often without loss of meaning)

  • buzzword-compliant (easier to sell to management :-)

So yes, XML meets the requirements we had set out for the project.

Could We Have Chosen Something Else?

Could we have chosen something else, such as a database table-driven solution? Yes. We didn't; but we could have. There is nothing magical about the "XML" format except that it lends itself well to packaging data and there is a lot of support software and documentation readily available.

Introduction to XML

Given that we did choose XML (or something that looks a lot like it...) what is "XML"?

Depending on your point of view, XML is either the latest flash-in-the-pan, buzzword-compliant, strongly-hyped language... or the greatest programming breakthrough since Object-Oriented Programming... or a useful data and document preparation tool.

Specifically, XML is a document formatting and processing standard proposed by the World Wide Web Consortium (W3C). It is similar to, but more flexible (and powerful) than, HTML. Both HTML and XML are descended from an older standard, SGML (Standard Generalized Markup Language).

Comparing XML to Other Markup Languages

There have been many document (and data) formatting and markup standards; troff, TeX, and HTML, to name three. In most of these "languages", the set of formatting and markup commands is fixed: there is a well-known, documented set of commands, and while new commands can sometimes be added (by the user, or more often by the vendor), the types of formatting you can perform are essentially static. Books have been written cataloguing and describing the permissible set of commands and their meanings.

For example, anyone with a knowledge of HTML can say what the following means, and what it will look like as displayed in a browser.

<HTML>
  <HEAD>
    <TITLE>Intro</TITLE>
  </HEAD>
  <BODY>
    <H1>An Introduction to HTML</H1>
    HTML provides
    <UL>
      <P><LI>tags
      <P><LI>formatting
      <P><LI>markup for web pages
    </UL>
  </BODY>
</HTML>

XML is different. In XML you create your own tags. Moreover, with XML, you define what each tag means. For example, in the following markup

   <MEMO>
     <ADDRESS>
       <SUBJECT>Let's use XML</SUBJECT>
       <DATE>Friday, February 22, 2001</DATE>
       <FROM>Vicki Brown</FROM>
       <TO>Rich Morin</TO>
       <TO>Quinn Weaver</TO>
       <CC>David Fetter</CC>
       <CC>Matthew D. P. K. S-L</CC>
     </ADDRESS>
     <CONTENT>
        <EMPHASIS>XML Is Cool</EMPHASIS>
        <LP>Let's use XML for everything. XML is really cool.</LP>         
     </CONTENT>
   </MEMO>

you can probably take a good guess at what the tags mean... but can you? Does <EMPHASIS> imply italic? bold? highlighting in red? Can you guess the position of the text in the output, or the font changes, if any? In XML, you define the tags and you decide what they mean. In particular, you decide how a document will be formatted, based on the tags you use.

Ostensible Markup Language

Ostensible Markup Language, or OML, is a term coined by Rich Morin to describe the way we use XML. OML looks enough like XML to be used with parsers such as Perl's XML::Simple module, but it may not have formal Document Type Definitions (DTDs), style sheets, or other niceties. (Then again, it might.)

Strictly speaking, XML documents tend to be composed of three parts

  1. The actual XML data, with tags

  2. A stylesheet, dictating how elements are to be formatted or displayed (for example, in a word processor or web browser).

  3. A Document Type Definition (DTD), specifying rules for how XML elements, attributes, and other data are defined and related.
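
For a sense of what a DTD looks like, here is a minimal, hypothetical one for the memo example shown earlier:

<!ELEMENT MEMO     (ADDRESS, CONTENT)>
<!ELEMENT ADDRESS  (SUBJECT, DATE, FROM, TO+, CC*)>
<!ELEMENT CONTENT  (EMPHASIS | LP)*>
<!ELEMENT SUBJECT  (#PCDATA)>
<!ELEMENT DATE     (#PCDATA)>
<!ELEMENT FROM     (#PCDATA)>
<!ELEMENT TO       (#PCDATA)>
<!ELEMENT CC       (#PCDATA)>
<!ELEMENT EMPHASIS (#PCDATA)>
<!ELEMENT LP       (#PCDATA)>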

Because we are using XML format (OML) as a data stream formatting language, we don't need a stylesheet; we won't need to format and display text in a browser. We will probably have a DTD, but it may not be as formal as those required for documents that are largely written by people (rather than by programs) and displayed in browsers.

Nonetheless, the format looks, tastes, and feels quite a lot like XML; it can be parsed by XML parsers; but it may not be precisely "regulation" XML. It is, however, ostensibly XML, or OML.

Using XML Throughout the Process

Once we decided to use XML for our intermediate data format, we made the decision to use XML throughout the dataflow path, from parser to loader (and someday, we hope, to database directly).

To drive the parser, we have a configuration file, also in XML format:

<config>
  <!-- Config file generated: Tues May 28 15:43:04 PDT 2002 -->
  <!-- By: vlb  cwd: /home/vlb/projex/Perl/SampleCode -->
  <!-- Command: write_config.pl -S ',' -f ./data/sample0515.dat -->
  <family seq="1" family_id="Sample">
    <info>
      <description>Sample Contact Data</description>
      <source name="Sample">
        <url></url>
      </source>
      <path>./data/sample0515.dat</path>
      <filename>sample0515.dat</filename>
      <headers>
        Last Name, First Name, Address Line 1, Address Line 2, City,
        State, Zip, Phone, Fax, Mobile, Email, Company
      </headers>
    </info>
    <parser>
      <format>
        <comment_char></comment_char>
        <field_sep>,</field_sep>
        <record_sep></record_sep>
        <skip type="lines">2</skip>
        <read type="until">EOF</read>
      </format>
      <field pos="1"  header="Last Name"  parse="1" tag="last" />
      <field pos="2"  header="First Name" parse="1" tag="first" />
      ...
      <field pos="11" header="Email"      parse="1" tag="email" />
      <field pos="12" header="Company"    parse="0" tag="" />
    </parser>
    <loader>
      <table name="contacts">
        <load type="numeric" record_tag="record_id">f_record_id</load>
        <load type="string"  global_tag="family_id">f_family_id</load>
        <load type="string"  field_tag="first">f_first_name</load>
        ...
        <load type="numeric" field_tag="zip">f_zip_code</load>
      </table>
    </loader>
  </family>
</config>

Each <family> section in the configuration file represents the configuration for one type of input data (a "family").

Parser

The parser accepts a date and a family as arguments. It then reads the configuration file to determine how to parse the input data for that family. Given the date, along with information found in the config file, the parser locates the proper input data file and begins parsing, extracting only those fields specified in the config file as desirable. A decision to extract additional fields requires only a small change to the config file (and re-running the parser).
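
A sketch of the front end of such a parser, using XML::Simple to read the configuration shown above (the config filename, option choices, and variable names are ours; the real script has more error checking):

#!/usr/bin/perl -w
use strict;
use XML::Simple;

my ($family_id, $date) = @ARGV;   # e.g. "Sample" "20020515"

# Read the config; ForceArray keeps <family> and <field> as lists
# even when there is only one of them.
my $config = XMLin('config.xml',
                   ForceArray => [ 'family', 'field' ],
                   KeyAttr    => []);

# Find the <family> section we were asked to parse.
my ($family) = grep { $_->{family_id} eq $family_id }
               @{ $config->{family} }
    or die "no config for family $family_id\n";

# Only the fields marked parse="1" will be extracted.
my @wanted = grep { $_->{parse} } @{ $family->{parser}{field} };

# (The real parser also uses $date, plus the <path> and <filename>
# entries from the config, to locate the input data file.)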

Because XML tags, and their usage, are defined by the author (programmer), we don't have to use any tags we don't want to use at any given point in the dataflow. This makes XML a convenient format for storing data; we can keep a copy of the original configuration information in the data stream, and include that in the output from the parser. The remainder of the output is the parsed information in... XML format.

Each program in the chain, from parser, to validator, to loader, reads in XML and emits XML. Internally, each program converts the XML to a Perl data structure and works with that. Throughout the dataflow, each program produces a self-documenting output file containing information from all previous steps in the process.
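
With XML::Simple, that read/convert/emit pattern is pleasantly small; a sketch (the tag names here are placeholders):

use strict;
use XML::Simple;

# Slurp the previous stage's XML into a Perl data structure...
my $data = XMLin($ARGV[0], ForceArray => [ 'record' ], KeyAttr => []);

# ...work with ordinary hashes and arrays...
for my $record (@{ $data->{record} }) {
    $record->{validated} = 1;    # e.g. mark each record as checked
}

# ...and write it back out as XML for the next stage.
print XMLout($data, RootName => 'results');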

Dataflow

The process dataflow can be diagramed as:

Vendor-supplied sample data files, one per family
     |
     |  "head" + some judicious editing  
     V    
  ASCII data format file for one or more families
     |
     | write_config.pl 
     V
    XML framework for config file, one or more families
     |
     | careful editing by someone who knows things
     V
    XML config file for one or more families (final) 
     |
     | Parser.pl
     |       ^
     |       |__________Vendor-supplied data file 
     V
    XML output - parsed results for one input file
     |
     | Validator(s) 
     |           - validate parsed data
     V
    XML output - validated results for one input file
     |
     | Loader(s)
     |           - load validated data into database
     V
  [ Database ]
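
In day-to-day use, the steps chain together straightforwardly. The invocations below are hypothetical (only write_config.pl's options are taken from the config file comment above), but they show the flavor:

  % write_config.pl -S ',' -f ./data/sample0515.dat > config.xml
    (careful hand-editing of config.xml)
  % Parser.pl -family Sample -date 20020515 > parsed.xml
  % validate.pl parsed.xml > validated.xml
  % load.pl validated.xml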

References

To learn more about XML and OML, try:

Learning XML, by Erik T. Ray; O'Reilly & Associates, 2001.

"Like digital Tupperware, [XML] is configurable to fit your data perfectly, while remaining a universal and flexible format that can be shared by many applications. Following the explosive popularity of HTML, XML will go further to break down barriers to global communication and data sharing."

XML Pocket Reference, by Robert Eckstein; O'Reilly & Associates, 1999.
The book provides an introduction to XML terminology and syntax, as well as a quick reference to attributes, entities, datatypes, etc.

Ostensible Mark-up Language by Rich Morin (Silicon Carny, SunWorld Magazine, February 2001); available online at http://cfcl.com/~rdm/carny/2001.02.html

XML::Simple — An easy Perl API to read/write XML (esp config files). The XML::Simple module provides a simple API layer on top of the XML::Parser module. Two functions are exported: XMLin() and XMLout(). From www.cpan.org or theoryx5.uwinnipeg.ca/CPAN/data/XML-Simple/Simple.html

Appendix - Notes on beginning XML

Syntax

The following forms are either equivalent or can be readily derived from each other (transformed):
<foo val="abc" />
<foo val="abc"></>
<foo val="abc"></foo>
<def tag="foo" val="abc" />
<foo>abc</foo>
<foo>abc</>
"arg" and "val" aren't keywords, per se. They can actually be any string. Double quotes are only required around the value if it has embedded blanks.

Reading XML

Given:
<a b=c>d</>

a tells what kind of a tag this is (an "a" tag)

d is the bracketed value (which could contain nested tags)

b is a parameter

b=c is an argument (parameter) and its value. These can be anything, but it's recommended that they be "annotative" - the markup is easier to understand if they are related to the tag.
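
To see this from a parser's point of view, here is (roughly) what XML::Simple makes of a strict-XML version of the example; note that real XML requires the quotes around "c":

use strict;
use XML::Simple;
use Data::Dumper;

# Attributes and text content land in a single hash.
my $tree = XMLin('<a b="c">d</a>');
print Dumper($tree);

# Prints (roughly):
#   $VAR1 = {
#     'b'       => 'c',
#     'content' => 'd'
#   };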

Creating XML

You get to define

  • what your tags are

  • how they will be nested

  • how they will be ordered

  • whether they are required or optional

  • what they mean