Detailed Description of Generic Parser

This step parses the input string according to the defined parsing rules. It tries to find a pattern that matches the input text. If such a pattern is found, its definition is used to recognize the individual parts (components) and store their values in the output.

The input text is first split into tokens using the defined tokenizer. Tokens are then matched against defined patterns and their components. For parsing purposes, pattern definitions and the input string are both pre-compiled using an identical tokenizer and therefore correct (identical) construction of all components is guaranteed.
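The tokenization step can be pictured with a minimal sketch. It assumes that a token is either a run of letters and digits or one single other character (space, hyphen, apostrophe); the real tokenizer is configurable and may split differently, and the name `tokenize` is hypothetical.

```python
import re

# Assumption: a token is a run of letters/digits, or exactly one other
# character. The product's actual tokenizer may use different rules.
TOKEN_RE = re.compile(r"[^\W_]+|[\W_]")


def tokenize(text):
    """Split text into word-like runs and single separator characters."""
    return TOKEN_RE.findall(text)


# The same function would be applied to the pattern's literal text and to
# the input string, so both sides are split in an identical way.
print(tokenize("Anna-Maria o'Donald"))
# ['Anna', '-', 'Maria', ' ', 'o', "'", 'Donald']
```

Because pattern text and input go through the same function, a literal such as "ulice" in a pattern is compared token by token with the input rather than character by character.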

Patterns consist of components. Generally there are two types of components:

Basic components

The predefined basic components are:

Component name

Description

LETTER

A letter (no parameters available)

WORD

A (single) word. Available parameters:

  • int minLength - minimum number of required characters (default value: 2)
  • int maxLength - maximum number of allowed characters (default value: integer max)
  • string chars - defines which characters a word can be composed of; by default it behaves as [:letter:]

MULTIWORD

A multiple word string - a string consisting of more than one word. Available parameters:

  • int minLength - minimum number of required characters (default value: 2)
  • int maxLength - maximum number of allowed characters (default value: integer max)
  • string chars - see WORD
  • string wordSeparators - defines the characters that are accepted as word delimiters (separators). A space does not have to be listed; it is always treated as a delimiter. The default value is an empty string. For special delimiter characters it is recommended to use their predefined entities: &quot;, &apos;, etc.

MULTIWORD accepts a varying number of input tokens, depending on the verifier and on the pattern in which it is defined:

  • If a verifier is specified, then MULTIWORD accepts as many words as possible until:
    • the whole read string is present in multi-word lookup, or
    • each read chunk is present in single-word lookup
  • If a verifier is not specified, the number of words read from the input depends on the pattern in which MULTIWORD is used. MULTIWORD then contains as many words as the rest of the pattern allows.

See the section "MULTIWORD component - detailed information" for more information about this topic.

The main advantage of the MULTIWORD component over a set of WORD components is its flexibility. A set of WORD components matches only an exact number of input tokens, whereas the MULTIWORD component can read a varying number of input words depending on the given verifier or pattern.

Known limitations:

  • separators must not occur consecutively; otherwise the MULTIWORD component does not accept the string (even if it is present in the lookup file)
  • only single-character separators are supported

INTERLACED_WORD

A word in which the individual characters are separated by a space character. Available parameters:

  • string chars - see WORD

ANWORD

An alphanumeric word (no parameters available).

DIGIT

A single digit (no parameters available).

NUMBER

A natural number. Available parameters:

  • boolean acceptSeparators - specifies whether the delimiter specified in separatorChar is acceptable between digits within a number. Default value: false
  • char separatorChar - value which is acceptable as a delimiter of single digits within a number. Default value: ' '
  • int minLength - minimum number of required characters (default value: 2)
  • int maxLength - maximum number of allowed characters (default value: integer max)

INTEGER

An integer. Available parameters are the same as for NUMBER, with the following exceptions:

  • acceptSeparators - default value: true
  • sign operators ('+' and '-') are also accepted

REAL_NUMBER

Real number. Available parameters:

  • char floatingPoint - a character representing a decimal separator. Default value: '.'

ROMAN_NUMBER

Roman number (no parameters available).

REGEXP

Regular expression. Available parameters:

  • string pattern - a regular expression in the syntax of the Java class java.util.regex.Pattern.

A component defined using a string.

No parameters available.

*

Component accepting any text.

NOTE: By default, the NUMBER and WORD components require at least 2 digits (NUMBER) or 2 characters (WORD). A single digit or a single character is matched by DIGIT or LETTER, respectively.
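The default length rule from the note can be sketched as follows. This is only an illustration under the stated defaults (minLength of 2); the function name `classify` is hypothetical, and the real step decides this while matching a pattern rather than by classifying tokens up front.

```python
def classify(token, min_length=2):
    """Name the basic component a purely numeric or alphabetic token
    falls under when the default minLength of 2 is in effect."""
    if token.isdigit():
        return "NUMBER" if len(token) >= min_length else "DIGIT"
    if token.isalpha():
        return "WORD" if len(token) >= min_length else "LETTER"
    return None  # mixed or other content is outside this sketch


print(classify("7"))    # DIGIT
print(classify("42"))   # NUMBER
print(classify("a"))    # LETTER
print(classify("ab"))   # WORD
```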

MULTIWORD component - detailed information

This component reads individual tokens from the input (that is, from the output of the tokenizer, which preprocesses data for the parser). When reading, single-character tokens that match one of the characters defined in the wordSeparators parameter are considered to be part of a word.

The acquired words are joined using the ' ' (space) character. The component first reads the maximum possible number of words and then parses the rest of the string against the rest of the pattern (consecutive parsing). If the consecutive parsing is not successful, the component detaches the last word it read and performs the consecutive parsing again. In the end, either the input string is not parsed or, if the process is successful, the component stores the result in the specified output column.

To decide whether to accept the input, the MULTIWORD component can be used either with the standard single-word verifier or with the MultiwordVerifier verifier, which tries to look up the actual input in the multi-word dictionary (if defined; it is optional). If that lookup is not successful, the step attempts to look up the individual words (i.e. sequences of characters delimited by spaces) in the single-word dictionary. If either lookup succeeds, the token sequence is accepted.

NOTE: The MULTIWORD component is the only component that uses the multi-word lookup identifier 'Multi file name' in the related verifier.

If dictionary-based verification is not required for the MULTIWORD component, the first such component accepts the maximum number of tokens, while still leaving enough input for the remaining components to satisfy the pattern. For example, two consecutive MULTIWORD components without verification share an input of N tokens in the following way: the first component reads N-1 tokens and the second reads only one token.
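The N-1 and 1 split can be reproduced with a small backtracking sketch. It assumes the input has already been merged into words and that every component needs at least one word; `split_words` is a hypothetical helper, not part of the product.

```python
def split_words(words, n_components):
    """Distribute words over n unverified MULTIWORD components.

    Each component first takes as many words as possible and gives one
    back at a time until the remaining components can also be satisfied.
    Returns a list of word lists, or None if no split exists.
    """
    if n_components == 0:
        return [] if not words else None
    # Try the longest prefix first, then detach the last word and retry.
    for take in range(len(words), 0, -1):
        rest = split_words(words[take:], n_components - 1)
        if rest is not None:
            return [words[:take]] + rest
    return None


print(split_words(["a", "b", "c", "d"], 2))
# [['a', 'b', 'c'], ['d']]
```

With a single word and two components the function returns None, which corresponds to the pattern not matching.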

Example:

Input string: Anna-Maria o'Donald - the string is split by the tokenizer of the parser in the following way:

Anna, -, Maria, ' ', o, ', Donald

With the definition wordSeparators="-'", the MULTIWORD component merges the tokenizer's output into these words:

Anna-Maria, o'Donald

If using a multi-word dictionary, the following combination is looked up:

Anna-Maria o'Donald

In case of failure, the single-word dictionary is used and this combination is looked up:

Anna-Maria,o'Donald

In case of no match, the MULTIWORD component detaches the last word (o'Donald). If the consecutive parsing of the rest of the pattern succeeds, the lookup is performed again, this time with the remaining value Anna-Maria, first in the multi-word dictionary and subsequently in the single-word dictionary.
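The whole Anna-Maria o'Donald walk-through can be sketched as follows. The sketch makes several assumptions: dictionaries are plain Python sets, the rest of the pattern always matches (so consecutive parsing is not modelled), and the helper names `merge_words`, `accepts` and `read_multiword` are hypothetical.

```python
def merge_words(tokens, word_separators):
    """Glue tokens into words; separator characters join their
    neighbours, a space ends the current word."""
    seps = set(word_separators)
    words, current = [], ""
    for tok in tokens:
        if tok == " ":
            if current:
                words.append(current)
            current = ""
        elif tok.isalnum() or tok in seps:
            current += tok
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    if current:
        words.append(current)
    return words


def accepts(words, multi_dict, single_dict):
    """Multi-word lookup first, then every single word on its own."""
    if " ".join(words) in multi_dict:
        return True
    return all(word in single_dict for word in words)


def read_multiword(words, multi_dict, single_dict):
    """Return the longest accepted prefix, detaching the last word
    after each failed lookup; None if nothing is accepted."""
    for end in range(len(words), 0, -1):
        if accepts(words[:end], multi_dict, single_dict):
            return words[:end]
    return None


tokens = ["Anna", "-", "Maria", " ", "o", "'", "Donald"]
words = merge_words(tokens, "-'")
print(words)  # ['Anna-Maria', "o'Donald"]

# Only Anna-Maria is known, so the last word is detached.
print(read_multiword(words, set(), {"Anna-Maria"}))  # ['Anna-Maria']
```

If the multi-word dictionary contained the full phrase "Anna-Maria o'Donald", the first lookup would succeed and both words would be kept.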

Pattern and component definitions

There are two ways to define a component.

Using a component in a pattern:

Within the parsing pattern definition, the component name is enclosed in braces: '{' and '}'. Any other string outside the braces is treated as literal text; during parsing, such text must be part of the parsed input string for the pattern to match. The component usage specification is as follows:

{component_name[:name_par=value_par,name_par=value_par] [| list_of_components_defining_the_structure]}

Component customization

In order to locally modify behavior of the component, you can:

Change component parameters

Setting of component behavior and properties can be performed using the component's parameters (see the table). Parameters are specified in the following way: parameter_name=parameter_value. The specification requires that the component name and parameter definition be separated by a colon (':').

The parameter value can be quoted using either single or double quotes (' or "). When a quote is intended to be part of the value, it must be doubled (i.e. "" or ''). It is also possible to use no quotation at all, in which case the value extends up to the first "," (comma), "|" (pipe) or "}" (right brace). In this case, the escape sequences for the special characters \t\n\r\f cannot be used (they are only available in quoted values). It is also possible to use a special definition with the @ character. When the quoted value is prefixed by @, every character within the quotes has its explicit value (a value that has no special meaning); e.g. @"\t" means that \t is interpreted as the literal string \t and not as a tab.
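The quoting rules can be summarized in a short sketch that decodes one parameter value. It is an interpretation of the text above, not the product's parser: the name `decode_value` is hypothetical, and it is an assumption that doubled quotes are also un-doubled inside an @-prefixed value and that a doubled backslash yields a single backslash in a quoted value.

```python
# Escape sequences available in quoted (non-@) values.
ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "f": "\f", "\\": "\\"}


def decode_value(raw):
    """Decode one parameter value as written in a pattern definition."""
    verbatim = raw.startswith("@")
    quoted = raw[1:] if verbatim else raw
    # Unquoted values are taken as they are; no escape processing.
    if len(quoted) < 2 or quoted[0] not in "'\"" or quoted[-1] != quoted[0]:
        return raw
    quote = quoted[0]
    # A doubled quote inside the value stands for one quote character.
    body = quoted[1:-1].replace(quote * 2, quote)
    if verbatim:
        return body  # @"...": backslashes have no special meaning
    out, i = [], 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body) and body[i + 1] in ESCAPES:
            out.append(ESCAPES[body[i + 1]])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


print(repr(decode_value('"\\t"')))    # '\t'  (a tab)
print(repr(decode_value('@"\\t"')))   # '\\t' (backslash followed by t)
print(repr(decode_value("'a''b'")))   # "a'b"
```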

Pay special attention when writing patterns for the regular expression component (REGEXP), where special characters such as '{' or '\' are often used. For example, when a backslash is needed in a quoted parameter value, it must be doubled.

Examples:

For some parameters a special acronym can be used. An acronym definition must follow the parameter's name (before the separating colon). Currently only one acronym is available: "!" (the exclamation mark). When it is used, the presence of the specified component in the dictionary is required; otherwise the component is not considered valid (it does not match the pattern).

Redefinition of existing components

For a named component, the definition attribute is used. The attribute contains a string of components that represents the actual structure of the component. Redefining a component means changing its parsing pattern structure while its other functional capabilities remain unchanged (verification against dictionaries, output column, etc.). To redefine a component, a '|' character is used as a delimiter between the component name and its new definition.

Pattern processing

To allow for extended prioritization of parsing rules, they are divided into parsing groups and each of them is parsed in isolation from others. For a complete description see the Pattern Group configuration element.

Scoring

Scoring functionality of the Generic Parser step is performed via two types of scorers. The first is a main scorer applied to the overall step. The second is applied separately to each component participating in the parsing process. Every component scores its activity and thus it is possible to obtain detailed information about the process. Component scorers are not required.

<step id='alg' className='cz.adastra.cif.tasks.parse.GenericParserAlgorithm'>
        <properties>
                <in>input</in>
                <parserConfig>
                        <patternGroups>
                                <patternGroup>
                                        <patterns>
                                                <!-- use of custom component -->
                                                <pattern name="rule#1" definition="ulice {CP}" priority="0" />
                                                <!-- use of other named component -->
                                                <pattern name="rule#2" definition="Match {MATCHED_WORD}" priority="0" />
                                                <!-- component override -->
                                                <pattern name="rule#3" definition="{WORD | {REGEXP:pattern='[A-Z][a-z]+'}}" priority="0" />
                                                <!-- component override - the regular expression is not effective here, only the verifier if specified -->
                                                <!-- input text '123 456' is accepted by the parsing sequence: {NUMBER} {NUMBER} -->
                                                <pattern name="rule#4" definition="{MATCHED_WORD | {NUMBER} {NUMBER}}" priority="0" />
                                                <!-- input text 'abc + def' is accepted (again by the parsing sequence) -->
                                                <!-- whatever verifier is still effective -->
                                                <pattern name="rule#5" definition="{MATCHED_WORD | {WORD} + {WORD}}" priority="0" />
                                        </patterns>
                                </patternGroup>
                        </patternGroups>
                        <components>
                                <component name="CP" definition="{WORD}" storeInto="parsed">
                                        <verifier>
                                                <fileName>data/ext/street.sl.cif</fileName>
                                                <type>stringLookup</type>
                                        </verifier>
                                </component>
                                <component name="MATCHED_WORD" definition="{REGEXP:pattern='[a-zA-Z0-9]+'}" storeInto="eWord">
                                </component>
                        </components>
                </parserConfig>
                <scorer explanationColumn='expl'>
                        <scoringEntries>
                                <scoringEntry key='GP_NULL' score='300' explain='true' />
                                <scoringEntry key='GP_NO_PATTERN' score='300' explain='true' />
                                <scoringEntry key='GP_MORE_PATTERNS' score='300' explain='true' />
                        </scoringEntries>
                </scorer>
        </properties>
</step>

iWay Software