Tag definitions and descriptions for Parser Configuration files

Parser configuration files are written in a tagged data format we call OML, or "Ostensible Markup Language". OML can be considered to be functionally equivalent to XML in many ways. Note, however, that OML is largely a data packaging and transposition format. It doesn't make use of DTDs and it occasionally plays "fast&loose" with the strict syntax of XML.

The OML format was chosen primarily as a convenient, standardized tagged-data format, useful for moving raw data from a vendor-supplied "flat file" to the database. Any resemblance to other uses of XML is purely coincidental.

This document describes the allowable set of elements (tags) and their attributes, which are permitted or expected in Parser configuration files.

Tags are described in order of normal use. Note that tags are allowable only within specified surrounding tags. Tags are described in blocks, each named for its outer tag.

Config File Layout
    <config>
        <family>

            <info>
                <description> ... </description>
                <source>
                    <url> ... </url>
                </source>
                <path> ... </path>
                <filename> ... </filename>
                <headers> ... </headers>
            </info>

            <parser>
                <format>
                    <comment_char> ... </comment_char>
                    <field_sep> ... </field_sep>
                    <record_sep> ... </record_sep>
                    <skip> ... </skip>
                    <read> ... </read>
                </format>

                <field  ...  />
                ...
            </parser>

            <loader>
                <table>
                    <load> ... </load>
                </table>
            </loader>

        </family>
    </config>


Comments
Tag Description
<!-- ... --> <!-- ... -->

Any text placed between <!-- and --> is considered to be a comment and is ignored by the XML Parser. Comments can be used anywhere within a configuration file.

Important - Note that all downstream programs following the Parser will strip out comments added by previous programs (or people). Only the Parser leaves comments in the original config file intact. For lasting comments, see the next two tags.

<comment> <comment> ... </comment>

To insert a permanent comment (one that will not be removed by XML::Parser), wrap the text in <comment> ... </comment> tags.
<disabled> <disabled> ... </disabled>

Any otherwise "undefined" tags (and their contained text) will be ignored by the parser/loader chain of programs.

To temporarily disable a section of a config file, wrap the section in <disabled> ... </disabled> tags. Caution - Many portions of a config file are required and cannot be disabled. For example, although it is allowable to disable a given table load, you must not disable the entire loader section. (To prevent loading, do not run the loader program).
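
The difference between transient comments, permanent <comment> tags, and <disabled> sections can be sketched in a few lines. This is an illustration only: it uses Python's standard xml.etree.ElementTree rather than the Perl XML::Parser used by the actual tool chain, and the sample fragment and the summarize helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical config fragment, for illustration only.
SAMPLE = """<config>
  <!-- transient note: dropped when the file is re-parsed -->
  <family seq="1" family_id="contacts">
    <comment>Permanent note: survives as an ordinary element</comment>
    <disabled>
      <info><path>./old-data</path></info>
    </disabled>
    <info><path>./data</path></info>
  </family>
</config>"""

def summarize(text):
    """Return (permanent comments, active paths) after a default parse."""
    # The default tree builder discards <!-- ... --> comments.
    root = ET.fromstring(text)
    family = root.find("family")
    comments = [c.text for c in family.iter("comment")]
    # Match only <info> elements that are direct children of <family>,
    # so anything wrapped in <disabled> is ignored.
    paths = [p.text for p in family.findall("info/path")]
    return comments, paths

comments, paths = summarize(SAMPLE)
```

After the parse, the <!-- ... --> text is gone, the <comment> text is still available, and the path inside <disabled> is never consulted.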


Config File Tag Descriptions

<config.../config> Block

Tag Description
<config> <config> ... </config>

Define a configuration file. The outermost set of tags should be <config> tags. There can be only one set of <config> tags in a configuration file.
<family> <family seq=n family_id=string> </family>

Define a record for a family configuration. One or more sets of <family> tags sit within one pair of <config> tags.

Each configuration is wrapped in <family> tags. There may be one or more sets of <family> tags in a given configuration file. However, only <family> tags and <!-- comments --> may be placed between <config> tags. The attributes uniquely identify each family record.

Note: The family tags are removed by the parser.

Attributes:

seq=#NUM# Specifies the sequence number of the family record in the configuration file
family_id=#STRING# Specifies the string that corresponds to this family name.
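
As a rough sketch of the rules above (one outer <config>, only <family> elements directly inside it, each identified by seq and family_id), the following Python fragment lists the families in a file. It assumes attribute values are quoted, which strict XML parsers require, and the list_families helper and sample data are hypothetical rather than part of the Parser.

```python
import xml.etree.ElementTree as ET

def list_families(text):
    """Return [(seq, family_id), ...] for each <family> in a config."""
    root = ET.fromstring(text)
    if root.tag != "config":
        raise ValueError("outermost element must be <config>")
    families = []
    for child in root:
        # Comments are already dropped by the parser, so any other
        # child element is a layout error.
        if child.tag != "family":
            raise ValueError(f"unexpected <{child.tag}> directly inside <config>")
        families.append((int(child.get("seq")), child.get("family_id")))
    return families

# Hypothetical sample with two family records.
SAMPLE = """<config>
  <!-- two families in one file -->
  <family seq="1" family_id="contacts"></family>
  <family seq="2" family_id="vendors"></family>
</config>"""

families = list_families(SAMPLE)
```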


<info.../info> Block

Tag

Description

<info> <info> ... </info>

General information for this configuration. The <info> section comes first inside a record. The <info> section is informational but not, in general, used by the Parser or other downstream programs.
<description> <description> #STRING# </description>

Specifies descriptive text for this family; part of the <info> section.
<source> <source name="#STRING#"> ... </source>

Specifies the data source (i.e., the vendor).

Attributes

name="#STRING#" Specifies the name of the data source.
  <url> <url> #STRING# </url>

Optional part of the <source> section. Specifies a URL from which the data can be downloaded.
    TBD <...> ... <...>

Additional subsections of the <source> section may be defined in the future.
<path> <path>#PATH#</path>

Specifies the path to where the data files can be found on the filesystem. #PATH# must be an existing directory. Part of the <info> section.
<filename> <filename> #STRING#FORMAT#STRING# </filename>

Specifies a filename format. In many cases, a filename is formed from one or more strings plus a date string. The date string format should be specified mnemonically, using the @MACRO@ convention within the string; the Parser looks up the mnemonic and converts the date accordingly. Part of the <info> section.
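
The @MACRO@ expansion might work roughly as sketched below. The mnemonic names (MMDD, YYYYMMDD), the lookup table, and the expand_filename helper are assumptions made for illustration; this document does not list the mnemonics the Parser actually recognizes.

```python
import re
from datetime import date

# Hypothetical mnemonic table; the real mnemonics are not listed here.
DATE_FORMATS = {
    "MMDD": "%m%d",
    "YYYYMMDD": "%Y%m%d",
}

def expand_filename(template, when):
    """Replace each @MNEMONIC@ in template with the formatted date."""
    def substitute(match):
        mnemonic = match.group(1)
        if mnemonic not in DATE_FORMATS:
            raise ValueError(f"unknown date mnemonic: {mnemonic}")
        return when.strftime(DATE_FORMATS[mnemonic])
    return re.sub(r"@([A-Z]+)@", substitute, template)

# An arbitrary example date (15 May); yields "sample0515.dat".
name = expand_filename("sample@MMDD@.dat", date(2003, 5, 15))
```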
<headers> <headers> #STRING# </headers>

Specifies the field headers for an entire file; part of the <info> section. Header data is taken from an ASCII description file. In the actual data file, the headers would be in one long row; within <headers> tags, they may be broken across lines. The headers in this section can be used to validate the header attributes of <field> tags. See below under <parser>.
Example

<info>
  <description>
   Sample Contact Data
  </description>
  <source name="ContactsRUs" >
    <url>ftp://... </url>
  </source>
  <path>./data</path>
  <filename>sample0515.dat</filename>
  <headers>
      Last Name, First Name, Address Line 1, Address Line 2,
      City, State, Zip, Phone, Fax, Mobile, Email, Company
  </headers>
</info>
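
One way the <headers> text could be used to validate <field> header attributes is sketched below. It assumes the headers are comma-separated, as in the example above; the helper names and the deliberately wrong "Town" entry are hypothetical.

```python
def split_headers(text):
    """Join the wrapped header lines and split them on commas."""
    joined = " ".join(text.split())
    return [h.strip() for h in joined.split(",") if h.strip()]

def mismatched_headers(headers, fields):
    """Return (pos, expected, found) for each field whose header disagrees."""
    problems = []
    for pos, found in fields:
        # pos counts from 1, as in the <field> pos attribute.
        expected = headers[pos - 1] if 1 <= pos <= len(headers) else None
        if expected != found:
            problems.append((pos, expected, found))
    return problems

HEADERS_TEXT = """
      Last Name, First Name, Address Line 1, Address Line 2,
      City, State, Zip, Phone, Fax, Mobile, Email, Company
"""

headers = split_headers(HEADERS_TEXT)
# (pos, header) pairs as they might appear in <field> tags; pos 5 is wrong on purpose.
fields = [(1, "Last Name"), (2, "First Name"), (5, "Town")]
problems = mismatched_headers(headers, fields)
```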


<parser.../parser> Block

Tag

Description

<parser> <parser> ... </parser>

Wraps the parser section of the configuration info. The <parser> section contains <format> and <field> tags.
<format> <format> ... </format>

Wraps the block that specifies data formatting information. The <format> section is placed within <parser> tags.
<comment_char> <comment_char>#CHAR#</comment_char>

Specifies a comment character in the data file (optional). Leave empty if not applicable. Part of the <format> section.
<field_sep> <field_sep>#CHAR#</field_sep>

Specifies the field separator for the data file. Part of the <format> section. Caution: Be sure to escape, with a \, any characters that are Perl metacharacters (e.g. |, *, +, /). Tabs should be specified as \t. Examples:
    <field_sep>\t</field_sep>
    <field_sep>,</field_sep>
    <field_sep>\|</field_sep>
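
The reason for the escaping rule is that the separator is treated as a regular expression. The sketch below shows the effect using Python's re module, which agrees with Perl for these simple patterns; the split_record helper and the sample lines are illustrative and not taken from the Parser.

```python
import re

def split_record(line, field_sep):
    """Split one record, treating field_sep as a regular expression."""
    return re.split(field_sep, line)

pipe_fields = split_record("Baker|Betty|Philmore", r"\|")
tab_fields = split_record("Baker\tBetty\tPhilmore", r"\t")

# Unescaped, "|" is an alternation of two empty patterns. It matches the
# empty string, so the record is not split into its two fields.
unescaped = split_record("Baker|Betty", "|")
```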
<record_sep> <record_sep>#CHAR#</record_sep>

Specifies the record separator for the data file. Part of the <format> section. The default is newline (\n); record_sep does not need to be specified (can be left empty) unless it is something other than newline.
<skip> <skip type="#STRING#">#VALUE#</skip>

Specifies whether to skip lines at the beginning of the file, before beginning to parse records. Part of the <format> section.

Attributes

type="#STRING#" Valid types are: "lines", "until"

If type="lines", #VALUE# must be the number of leading lines to skip in the data file before processing begins. If #VALUE# is 0, no lines are skipped; that is, use
     <skip type="lines">0</skip>

If type="until", the value of #VALUE# must be a string that matches the last line before the data starts, e.g. "START-OF-DATA".

<read> <read type="#STRING#">#VALUE#</read>

Specifies when to stop reading data records. Part of the <format> section.

Attributes

type="#STRING#" Valid types are: "lines", "until"

If type="lines", #VALUE# must be an integer number of data lines to read from the data file after processing begins.

If type="until", #VALUE# must be a string that matches the first line after the data ends, e.g. "END-OF-DATA". As a special case, the string 'EOF' stands for the end of the file, so all remaining lines are read; that is, use

     <read type="until">EOF</read>
Example

 <format>
    <comment_char></comment_char>
    <field_sep>,</field_sep>
    <record_sep>\n</record_sep>
    <skip type="lines">4</skip>
    <read type="until">,,,,,,,,,,,,,,,,,,,</read>
 </format>
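
The <skip> and <read> rules can be summarized in a short sketch. It assumes that "matches" means the whole line equals the given string, which this document does not confirm; the select_records helper and the sample lines are hypothetical.

```python
def select_records(lines, skip_type, skip_value, read_type, read_value):
    """Return the data lines left after applying <skip> and <read>."""
    # <skip>: find the index of the first data line.
    if skip_type == "lines":
        start = int(skip_value)
    elif skip_type == "until":
        # The matching line is the last line before the data starts.
        start = lines.index(skip_value) + 1
    else:
        raise ValueError(f"bad skip type: {skip_type}")

    data = lines[start:]

    # <read>: find where the data ends.
    if read_type == "lines":
        return data[: int(read_value)]
    if read_type == "until":
        if read_value == "EOF":
            return data
        # The matching line is the first line after the data ends.
        end = data.index(read_value) if read_value in data else len(data)
        return data[:end]
    raise ValueError(f"bad read type: {read_type}")

LINES = ["Report", "START-OF-DATA", "a,b", "c,d", "END-OF-DATA", "footer"]

between_markers = select_records(LINES, "until", "START-OF-DATA", "until", "END-OF-DATA")
first_record = select_records(LINES, "lines", "2", "lines", "1")
everything = select_records(LINES, "lines", "0", "until", "EOF")
```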


<field.../field> Block

Tag

Description

<field> <field pos="#NUM#"    header="#STRING#"

       parse="#BOOL#" tag="#STRING#" />

Specification for each data field in the data file. <field> tags are part of the <parser> section. Be sure to list all fields that should be contained in a data file (not just the ones to be extracted). This information is used to cross-check whether incoming data records have the expected number of fields.

Attributes

pos="#NUM#" Position of field, counting from left to right, count begins at 1 (Human count not Perl style)
header="#STRING#" Header comment for this field
parse="#BOOL#" 1 if this field is to be extracted by the parser

0 if this field is not to be extracted

tag="#STRING#" Specifies a tag name for the parser to use when writing the data in this field to the output stream. Define a tag if and only if parse=1. Important: Although the choice of tag name is not highly standardized or constrained, if the data in this field will be loaded into a single table field in the database, use the same name for clarity (e.g. price). Otherwise, try to choose a mnemonic name. See also: <id_type>, <field_tag>, field_tag.
Example

 <field pos="1"   header="Last Name"
     parse="1" tag="last"     />
 <field pos="2"   header="First Name"
     parse="1" tag="first"     />
 <field pos="3"   header="Address Line 1"
     parse="1" tag="address"     />
 <field pos="4"   header="Address Line 2"
     parse="0" tag=""     />
 <field pos="5"   header="City"
     parse="1" tag="city"     />
 ...
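
How the <field> list drives both the field-count cross-check and the extraction of parse="1" fields can be sketched as follows. The tuple layout, the extract helper, and the sample record are assumptions for illustration, not the Parser's actual data structures.

```python
# (pos, header, parse, tag), mirroring the <field> attributes above.
FIELDS = [
    (1, "Last Name", 1, "last"),
    (2, "First Name", 1, "first"),
    (3, "Address Line 1", 1, "address"),
    (4, "Address Line 2", 0, ""),
    (5, "City", 1, "city"),
]

def extract(values, fields):
    """Cross-check the field count, then keep only the parse=1 fields."""
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    # pos counts from 1, so subtract 1 to index the list of values.
    return {tag: values[pos - 1]
            for pos, _header, parse, tag in fields
            if parse == 1}

record = extract(["Baker", "Betty", "1733 State Dr.", "", "Philmore"], FIELDS)
```

A record with too few or too many values is rejected before any field is extracted, which is why every field must be listed even when parse="0".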



<loader.../loader> Block

Tag

Description

<loader> <loader> ... </loader>

Configuration information for the loader. The loader section should follow the data section. Nested tags include one or more <table.../table> sections.
<table> <table name="#STRING#"> ... </table>

Specifies a table to be loaded; part of the <loader> section. Nested tags include one or more <load... sections.

Attributes:

name

Specifies the name of the table to be loaded, e.g. name="contacts".

Each <table...> section must contain at least one <load...> section as described below.
    <load> <load type="numeric|string" xxx_tag|const="#VALUE#"> #STRING# </load>

Specifies what data to load into the table (and which database field to load it into). Part of the <table> section.

The #STRING# must be replaced by the name of an appropriate (existing) field in the existing database. For example, record_id, from_dt, to_dt, etc.

Each <load> tag requires two attributes; the first is always a type which is either numeric or string. The second attribute tells the loader where to get the data value to load, either from a tag elsewhere in the datastream, or a constant value. Replace #VALUE# with the name of the tag (or the constant), e.g. date, Family, price, or whatever is appropriate.

Attributes:

type

Type is either string or numeric. Strings will be enclosed in quotes when loaded into the database. Numeric values will not.

Note that dates, which look like numbers, are actually strings.

The second attribute must be chosen from the list below. Each <load> entry must contain exactly one of global_tag, record_tag, field_tag, or const.

global_tag="#VALUE#"

A global tag is usually computer generated and is part of the <config> section. Global tags are global to a given family or data file. Examples of global tags include todays_date (the current date, as passed to the Parser) and family_id (originally encoded in the config file).

record_tag="#VALUE#"

A record tag is usually computer generated and is part of the <record> section. Record tags apply to an individual data record. An example of a record tag is record_id (determined by one of the programs in the stream and added to the config file).

field_tag="#VALUE#"

A field tag is usually chosen by the config file creator; field tags map to an appropriate tag attribute in the field section of the original <config> section. Examples of field tags might include price, shares, company_name, and so forth. (Note that field tags are also mapped into record tags by the parser; however, we refer to them here as field tags to distinguish them from programmatically-generated tags; field tags derive directly from the "tag" attribute of the "field" section for extracted data fields.)

const="#VALUE#"

The value is a constant (that is, a literal rather than a tag or variable). An example of a const is any literal string such as '20021231' or 'Monday' that cannot otherwise be extracted from the datastream.

Example

  <loader>
    ...
    <table name="t_price">
      <load type="numeric" record_tag="record_id">f_record_id</load>
      <load type="string" global_tag="family_id">f_family_id</load>
      <load type="string" field_tag="first_name">f_first_name</load>
      <load type="string" const="3.1415">pi</load>
    </table>
  </loader>
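
The way <load> entries resolve their values and apply the string/numeric quoting rule can be sketched as below. Whether the real loader emits SQL text like this is an assumption; the build_insert and render helpers, the tuple layout, and the sample tag values are hypothetical. Production code would normally use parameterized queries instead of building SQL strings.

```python
def render(value, load_type):
    """Quote strings; leave numeric values bare."""
    if load_type == "numeric":
        return str(value)
    if load_type == "string":
        return "'" + str(value).replace("'", "''") + "'"
    raise ValueError(f"bad load type: {load_type}")

def build_insert(table, loads, tags):
    """loads: (type, kind, source, column) tuples, one per <load> entry."""
    columns, rendered = [], []
    for load_type, kind, source, column in loads:
        # A const carries its own value; the other kinds name a tag to look up.
        value = source if kind == "const" else tags[kind][source]
        columns.append(column)
        rendered.append(render(value, load_type))
    return (f"INSERT INTO {table} ({', '.join(columns)}) "
            f"VALUES ({', '.join(rendered)})")

LOADS = [
    ("numeric", "record_tag", "record_id", "f_record_id"),
    ("string", "global_tag", "family_id", "f_family_id"),
    ("string", "field_tag", "first_name", "f_first_name"),
    ("string", "const", "3.1415", "pi"),
]
# Hypothetical tag values for one record.
TAGS = {
    "record_tag": {"record_id": "1234567"},
    "global_tag": {"family_id": "contacts"},
    "field_tag": {"first_name": "Betty"},
}

statement = build_insert("t_price", LOADS, TAGS)
```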



Descriptions of Tags added by Downstream Programs

    <parsed>
      <todays_date>...</todays_date>
      <path_to_file>...</path_to_file>
      <family_id>...</family_id>
      
      <config>
          ... <!-- original config file information here -->
      </config>
      <data> ...
        <record seq="1" record_id="1234567" status="id-confirmed">
        ...
        </record>
      </data> ...
      <stats> ...
      </stats> ...
    </parsed>
read-only.
<parsed.../parsed> Block

Tag

Description

<parsed> <parsed> ... </parsed>

Define a parsed configuration. The outermost set of tags should be <parsed> tags. There can be only one set of <parsed> tags in a parsed configuration file. Added by the Parser.
    <todays_date> <todays_date> ... </todays_date>

Specifies the date for the current file. Added by the Parser, based on data provided by the user (the -d flag to Parser.pl).
<family_id>

Specifies the Universe ID for this data stream. Added by the Parser, based on data provided by the user (the -U flag to AsgpParser.pl). The value of this tag is identical to the "family_id" attribute of the "<family..." tag in the original configuration file; the "<family.../family>" tags are removed by the parser.
<path_to_file>

Specifies the full pathname of the file that was parsed. Added by the Parser.
  <config> <config> ... </config>

Wraps the original config block from the parsed configuration file. (These tags are not added by the Parser; they were already present but have been moved inward one level.)
  <data> <data> ... </data>

The <data> tags delimit a set of parsed data records. Added by the Parser.
      <record> <record seq=n record_id=nnnnnn status="..." cleared="...">
...
</record>

Defines a record for a parsed file. One record corresponds to a row (input line) from the original parsed data file. Part of the <data> section. Added by the Parser.

Attributes:

seq=n Specifies the sequence number of the record in the original data file.

record_id=nnnnnn Specifies the Contact ID that corresponds to this data record.

status="..." Specifies a status value for the record. (N.B. - The status is added by the ID Mapper program; it will be modified by additional downstream programs, such as the Validator(s) or Loader).

cleared="..." Specifies that a record with a suspect or bad status value has been cleared for further processing. Use any reasonable value for the cleared parameter (initials or date recommended).

Each record section contains a set of tagged data that has been extracted from the original flat data file. Fields are named as specified by the tag attribute of the corresponding <field> entry in the config file. Example:
  <record seq="4" record_id="12345678" >
    <last>Baker</last>
    <first>Betty</first>
    <address>1733 State Dr.</address>
    <city>Philmore</city>
    <state>CTX</state>
    <zip>06516</zip>
    <email>betty@myco.com</email>
  </record>
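
A downstream program might read such a record roughly as follows. This is a sketch using Python's standard xml.etree.ElementTree on a shortened copy of the example; the record_to_dict helper is hypothetical and not one of the programs described here.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example record above.
RECORD = """<record seq="4" record_id="12345678" >
  <last>Baker</last>
  <first>Betty</first>
  <city>Philmore</city>
</record>"""

def record_to_dict(text):
    """Return (attributes, fields) for one <record> element."""
    element = ET.fromstring(text)
    attributes = dict(element.attrib)
    # Each child element is one extracted field, named by its tag.
    fields = {child.tag: (child.text or "") for child in element}
    return attributes, fields

attributes, fields = record_to_dict(RECORD)
```

Attributes that have not been added yet, such as status or cleared, are simply absent from the attribute dictionary.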

  <stats> <stats> ... </stats>

Defines a summary status and statistics section. This is a human readable section (created by the Parser) that supplies various status and statistics for a run. Information is added by the Parser, the Validator, and any other downstream programs that care to add information.