Refactoring an Inefficient Dataflow Process
Vicki Brown <email@example.com>
New data comes from a wide variety of vendors, in a wide variety of formats. Files also (often) contain more data than we actually use. Somehow, we need to extract the data that we want and load it into the database.
Unfortunately, we rarely have control over the incoming data format. Input files are usually "flat" data files with an (optional) human-readable header, e.g.
Last Name,First Name,Address Line 1,Address Line 2,City,State,Zip,...
Jones,Fred,Box 18,512 Maine St.,Hamburg,PA,16000,...
Baker,Betty,1733 State Dr.,,Philmore,CT,06516,...
...

# Fruit Barrel Data file #1
# 20020515
# separator |
# fields
# ID|Date|Company|State|...|Tag|Item|Price|Unit
097632|20020501|FruitCo|CA|...|FC|apples|3.95|pound
064513|20020502|FruitCo|CA|...|FC|pears|.95|each
...

The "human-readable" header may be anywhere from zero to several lines long. It may begin with a "comment character"... or not. The names given to the data fields by the vendor do not necessarily correspond to our database tables; we may not want to load all of the data as provided. It is not necessarily simple to determine what data should be loaded, given only the header in the file.
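The header-sniffing this requires can be sketched as follows (in Python, for illustration only; the real parsers were Perl scripts, and the function name and the "# separator" header convention below follow a trimmed version of the FruitCo sample):

```python
import io

def read_flat_file(stream, comment_char="#", default_sep=","):
    """Skip a human-readable header, honoring a '# separator X' hint
    if one is present, and yield each data row as a list of fields."""
    sep = default_sep
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(comment_char):
            # Header lines may carry metadata, e.g. "# separator |"
            parts = line.lstrip(comment_char).split()
            if len(parts) == 2 and parts[0] == "separator":
                sep = parts[1]
            continue
        yield line.split(sep)

sample = """\
# Fruit Barrel Data file #1
# 20020515
# separator |
# fields
# ID|Date|Company|State|Tag|Item|Price|Unit
097632|20020501|FruitCo|CA|FC|apples|3.95|pound
064513|20020502|FruitCo|CA|FC|pears|.95|each
"""

rows = list(read_flat_file(io.StringIO(sample), default_sep="|"))
```

Even this toy version shows the pain point: every vendor quirk (comment character, separator hint, header length) needs its own special case.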
Our dataflow process, from vendor to database, was neither easy to understand nor easy to modify. Modifications were risky and often meant cloning an existing program instead of changing it.
Much special knowledge was encoded inside the various scripts, including the path to the data, the name of the data file, which "family" (data type) is represented, which fields contain what data, and so on. Often the name of the file carries special knowledge as well, e.g. date and family information embedded in the filename.
Sometimes the filename contains information that can allow us to derive other information (such as the company name) if we know the right mapping, e.g.

file                                           Company
=============================================  ==============
/ag4/data/incoming/MyCo_Contacts_20020515.csv  MyCo Inc.
/ag4/data/incoming/VBC_Contacts_20020515.csv   Very Big Corp.
Although our primary goal was to load data into the database, we had other goals as well. We would prefer to validate the data before we load it! We also wanted to be able to troubleshoot any problems that came our way during the dataflow process. Initially, we had the following to assist us in troubleshooting:
In trying to refactor the initial process, we decided to tackle the problem of many different file formats first. The reasoning was that, if we could munge the data into a more consistent and standardized data format at the beginning of the process, we could make things easier for downstream processing. Troubleshooting and validation, for example, should be easier when applied to a standard data format. Loading should also be simplified.
We wanted to have a more generic parser that could read any input file format and produce a standard output format that could then be passed to validation and loading utilities. We were willing to put extra data into the intermediate files (i.e. data that will not be loaded into the database) if it improved tracking and troubleshooting.
In thinking about the problem, we determined the following.
The input is composed of:
The intermediate data format should be:
The process should be:
We could continue along the same lines we'd been following in the past, to wit:
This approach, however, does not do a good job of fitting the requirements stated above. The typical ("foo"-separated fields in a flat file format) approach has several flaws:
We decided to use a parseable, hierarchical, tag/value format, optimized for data interchange and data streaming. Several of these exist. I was already familiar with Boulder Data Interchange Format (BoulderIO, http://brie.cshl.org/software/boulder/). Boulder is described as
... an easily parseable hierarchical tag/value format suitable for applications that need to pipe the output of one program into the input of another.
Boulder, however, does not meet all of the requirements we had laid out for our new process. It's popular in certain fields, but it hasn't yet reached the level of a standard. In addition, Boulder uses simple tag=value pairs which aren't flexible enough for handling multi-line data.
Another choice was XML (eXtensible Markup Language). Although generally thought of as a document formatting language (for web-based documents), XML is actually a flexible and extensible data representation standard. XML can also be described as an "easily parseable hierarchical tag/value format suitable for applications...".
XML is an industry standard, designed by a recognized Standards Committee. A large number of XML libraries exist and more are being written every day. There are many books (most of them not worth the shelf space :-) available for XML. Excel understands XML; many databases such as Oracle and Sybase can import XML directly (or will be able to in upcoming releases). We might even be able to convince some of our external vendors to send us XML-formatted input data, some day! And... XML is buzzword-compliant (which means getting management buy-in will be simpler).
Also, XML has some capabilities that Boulder doesn't have. For example, XML wraps values in opening and closing tags, making it easy to include multi-line data (a program can readily determine where the data and tags begin and end.) XML also allows for "attributes" to tags; attributes provide a way of modifying tags (similar to the way adjectives modify nouns in English).
Does XML meet our stated criteria?
Again, if we accept the assertions that our intermediate format should be
So yes, XML meets the requirements we had set out for the project.
Could We Have Chosen Something Else?
Could we have chosen something else, such as a database table-driven solution? Yes. We didn't; but we could have. There is nothing magical about the "XML" format except that it lends itself well to packaging data and there is a lot of support software and documentation readily available.
Given that we did choose XML (or something that looks a lot like it...) what is "XML"?
Depending on your point of view, XML is either the latest flash-in-the-pan buzzword-compatible strongly-hyped language... or it's the greatest programming breakthrough since Object-Oriented Programming... or it's a useful data and document preparation tool.
Specifically, XML is a document formatting and processing standard proposed by the World Wide Web Consortium (W3C). It is similar to, but more flexible (and powerful) than, HTML. Both HTML and XML derive from an older standard, SGML (Standard Generalized Markup Language): HTML is an application of SGML, while XML is a simplified subset of it.
There have been many document (and data) formatting and markup standards; troff, TeX, and HTML, to name three. In most of these "languages", the set of formatting and markup commands is fixed: there is a well-known, documented set of commands, and while new commands can sometimes be added (occasionally by the user, more often by the vendor), the types of formatting you can perform are essentially static. Books have been written cataloguing the permissible commands and describing their meanings.
For example, anyone with a knowledge of HTML can say what the following means, and what it will look like as displayed in a browser.
<HTML>
<HEAD>
<TITLE>Intro</TITLE>
</HEAD>
<BODY>
<H1>An Introduction to HTML</H1>
HTML provides
<UL>
<P><LI>tags
<P><LI>formatting
<P><LI>markup for web pages
</UL>
</BODY>
</HTML>
XML is different. In XML you create your own tags. Moreover, with XML, you define what each tag means. For example, in the following markup
<MEMO>
  <ADDRESS>
    <SUBJECT>Let's use XML</SUBJECT>
    <DATE>Friday, February 22, 2001</DATE>
    <FROM>Vicki Brown</FROM>
    <TO>Rich Morin</TO>
    <TO>Quinn Weaver</TO>
    <CC>David Fetter</CC>
    <CC>Matthew D. P. K. S-L</CC>
  </ADDRESS>
  <CONTENT>
    <EMPHASIS>XML Is Cool</EMPHASIS>
    <LP>Let's use XML for everything. XML is really cool.</LP>
  </CONTENT>
</MEMO>
you can probably take a good guess at what the tags mean... but can you? Does <EMPHASIS> imply italic? bold? highlighting in red? Can you guess the position of the text in the output, or the font changes, if any? In XML, you define the tags and you decide what they mean. In particular, you decide how a document will be formatted, based on the tags you use.
Ostensible Markup Language, or OML, is a term coined by Rich Morin to describe the way we use XML. OML looks enough like XML to be used with parsers such as Perl's XML::Simple module, but it may not have formal Document Type Definitions (DTDs), style sheets, or other niceties. (Then again, it might.)
Strictly speaking, XML documents tend to be composed of three parts: the tagged document content itself, a Document Type Definition (DTD) describing the allowable tags and structure, and a stylesheet describing how the tagged content should be presented.
Because we are using XML format (OML) as a data stream formatting language, we don't need a stylesheet; we won't need to format and display text in a browser. We will probably have a DTD, but it may not be as formal as those required for documents that are largely written by people (rather than by programs) and displayed in browsers.
Nonetheless, the format looks, tastes, and feels quite a lot like XML; it can be parsed by XML parsers; but it may not be precisely "regulation" XML. It is, however, ostensibly XML, or OML.
Using XML Throughout the Process
Once we decided to use XML for our intermediate data format, we made the decision to use XML throughout the dataflow path, from parser to loader (and someday, we hope, to database directly).
To drive the parser, we have a configuration file, in XML format
Each <record> section in the configuration file represents the configuration for a type of input data (a "family").
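The configuration file itself is not reproduced here, but a hypothetical sketch of one <record> section might look like the following (every element and attribute name below is invented for illustration; the real config file differs):

```xml
<config>
  <record family="fruit">
    <path>/ag4/data/incoming</path>
    <filepattern>FruitBarrel_%DATE%.dat</filepattern>
    <separator>|</separator>
    <field name="ID"      load="yes" column="vendor_id" />
    <field name="Date"    load="yes" column="trade_date" />
    <field name="Item"    load="yes" column="item" />
    <field name="Tag"     load="no" />
  </record>
</config>
```

The point is that everything the old scripts hard-coded (path, filename, separator, field mapping) now lives in one declarative place.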
The parser accepts a date and a family as arguments. It then reads the configuration file to determine how to parse the input data for that family. Given the date, along with information found in the config file, the parser locates the proper input data file and begins parsing, extracting only those fields specified in the config file as desirable. A decision to extract additional fields requires only a small change to the config file (and re-running the parser).
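That flow can be sketched as follows (in Python, for illustration; the real parser is a Perl program, and the config structure and field names here are hypothetical stand-ins for what it would read from the XML config file):

```python
import xml.etree.ElementTree as ET

# Hypothetical per-family configuration, as the parser might see it
# after reading the XML config file.
CONFIG = {
    "fruit": {
        "separator": "|",
        "fields": ["ID", "Date", "Company", "State",
                   "Tag", "Item", "Price", "Unit"],
        "wanted": ["ID", "Date", "Item", "Price"],  # fields to extract
    }
}

def parse_family(family, lines):
    """Extract only the wanted fields from each data line and
    return the results as an XML element tree."""
    cfg = CONFIG[family]
    root = ET.Element("parsed", {"family": family})
    for line in lines:
        values = dict(zip(cfg["fields"],
                          line.strip().split(cfg["separator"])))
        row = ET.SubElement(root, "row")
        for name in cfg["wanted"]:
            ET.SubElement(row, name).text = values[name]
    return root

tree = parse_family(
    "fruit", ["097632|20020501|FruitCo|CA|FC|apples|3.95|pound"])
```

Adding a field to the output means editing the "wanted" list in the config, not the program.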
Because XML tags, and their usage, are defined by the author (programmer), we don't have to use any tags we don't want to use at any given point in the dataflow. This makes XML a convenient format for storing data; we can keep a copy of the original configuration information in the data stream, and include that in the output from the parser. The remainder of the output is the parsed information in... XML format.
Each program in the chain, from parser to validator to loader, reads in XML and emits XML. Internally, each program converts the XML to a Perl data structure and works with that. Throughout the dataflow, each program produces a self-documenting output file containing information from all previous steps in the process.
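In Perl this read-modify-emit shape is what XML::Simple's XMLin() and XMLout() provide; sketched in Python with the standard library (the element names are illustrative, not the actual schema):

```python
import xml.etree.ElementTree as ET

doc = "<row><Item>apples</Item><Price>3.95</Price></row>"

# Read XML into a native data structure (the XMLin step)...
root = ET.fromstring(doc)
record = {child.tag: child.text for child in root}

# ...work with the data structure internally...
record["Status"] = "validated"

# ...then emit XML for the next program in the chain (the XMLout step).
out = ET.Element("row")
for tag, text in record.items():
    ET.SubElement(out, tag).text = text
xml_out = ET.tostring(out, encoding="unicode")
```

Because each stage passes along everything it received plus what it added, any stage's output can be inspected on its own when troubleshooting.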
The process dataflow can be diagramed as:
Vendor-supplied sample data files, one per family
  |
  | "head" + some judicious editing
  V
ASCII data format file for one or more families
  |
  | write_config.pl
  V
XML framework for config file, one or more families
  |
  | careful editing by someone who knows things
  V
XML config file for one or more families (final)
  |
  | Parser.pl
  |    ^
  |    |__________ Vendor-supplied data file
  V
XML output - parsed results for one input file
  |
  | Validator(s)
  |   - validate parsed data
  V
XML output - validated results for one input file
  |
  | Loader(s)
  |   - load validated data into database
  V
[ Database ]
To learn more about XML and OML, try:
Learning XML by Erik T. Ray, publ. by O'Reilly & Assoc. 2001.
"Like digital Tupperware, [XML] is configurable to fit your data perfectly, while remaining a universal and flexible format that can be shared by many applications. Following the explosive popularity of HTML, XML will go further to break down barriers to global communication and data sharing."
XML Pocket Reference by Robert Eckstein, publ. by O'Reilly & Assoc., 1999
Ostensible Mark-up Language by Rich Morin (Silicon Carny, SunWorld Magazine, February 2001); available online at http://cfcl.com/~rdm/carny/2001.02.html
XML::Simple, an easy Perl API to read and write XML (especially config files). The XML::Simple module provides a simple API layer on top of the XML::Parser module. Two functions are exported: XMLin() and XMLout(). Available from www.cpan.org or theoryx5.uwinnipeg.ca/CPAN/data/XML-Simple/Simple.html
Syntax

The following are either equivalent or can be readily derived from each other (transformed):

<foo val="abc" />
<foo val="abc"></>
<foo val="abc"></foo>
<def tag="foo" val="abc" />
<foo>abc</foo>
<foo>abc</>

"tag" and "val" aren't keywords, per se; they can actually be any string. Double quotes are only required around the value if it has embedded blanks.
Given a tag of the general form <a b="c">d</a>:

a tells what kind of a tag this is (an "a" tag)
b is a parameter (attribute) name
b=c represents a parameter/value pair; these can be anything, but it's recommended that they be "annotative" (it's easier to understand if they are related to the tag)
d is the bracketed value (which could contain nested tags)
You get to define