Element Configurable Tokenizer Config

This element is used by other steps to split an input text string into components (tokens) according to defined rules. Each token type is specified by two sets of characters:

tokenStartCharacters and tokenCharacters

During tokenization, the input string is analyzed one character at a time. When a character belonging to tokenStartCharacters is found, it is treated as the beginning of a new token of the corresponding type. Subsequent characters that belong to tokenCharacters are appended to this token, and the token ends at the first character that does not belong to tokenCharacters.
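The character-by-character scan described above can be sketched in Python. This is only an illustration, not the product's implementation: the names TokenType, tokenize, WORD and NUMBER are hypothetical, character sets are modelled as plain predicates, and two behaviours are assumptions (the first matching type wins, and characters that start no token are skipped).

```python
from typing import Callable, List, NamedTuple


class TokenType(NamedTuple):
    # Hypothetical stand-in for one token type: two character-set predicates.
    is_start: Callable[[str], bool]  # models tokenStartCharacters
    is_body: Callable[[str], bool]   # models tokenCharacters


def tokenize(text: str, types: List[TokenType]) -> List[str]:
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Assumption: the first type whose start set contains the character wins.
        current = next((t for t in types if t.is_start(text[i])), None)
        if current is None:
            i += 1  # Assumption: characters that start no token are skipped.
            continue
        j = i + 1
        # Extend the token while the following characters belong to the body set.
        while j < len(text) and current.is_body(text[j]):
            j += 1
        tokens.append(text[i:j])
        i = j
    return tokens


# Illustrative types: words made of letters and numbers made of digits.
WORD = TokenType(str.isalpha, str.isalpha)
NUMBER = TokenType(str.isdigit, str.isdigit)

print(tokenize("abc123 de4", [WORD, NUMBER]))  # ['abc', '123', 'de', '4']
```

Modelling each set as a predicate keeps the sketch independent of the character set syntax, which is described separately.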

The name of the tag that holds the following configuration depends on the step that uses the ConfigurableTokenizer (for the tag names, refer to the descriptions of the individual steps).

Character sets can be defined, for example, as ranges (0-9) or as named classes ([:lowercase:]), as in the following configuration (for a detailed description, see character set):

<tokenizerConfig whiteSpaceDefinition="[:white:]">
        <types>
                <type tokenStartCharacters="0-9" tokenCharacters="0-9"  />
                <type tokenStartCharacters="A-Z" tokenCharacters="a-z"  />
                <type tokenStartCharacters="[:lowercase:]" tokenCharacters="[:lowercase:]"  />
        </types>
</tokenizerConfig>

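The configuration used in the example considers the following as individual tokens:

  • a sequence of digits (0-9)
  • an uppercase letter (A-Z) followed by any number of lowercase letters (a-z)
  • a sequence of lowercase letters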

An example of input and output:

input                   output
aaa1234bbb              aaa, 1234, bbb
aaaBbbCcc Ddd 123eee    aaa, Bbb, Ccc, Ddd, 123, eee
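The input and output pairs above can be reproduced with a short Python approximation of the example configuration. This is a sketch under assumptions, not how the tokenizer is implemented: each type is rewritten as the regular expression "[start][body]*", the alternatives are tried in configuration order, [:lowercase:] is approximated by a-z (sufficient for this ASCII input), and characters that match no type, such as spaces, are skipped.

```python
import re

# One alternative per <type>, in configuration order: [start][body]*
# Assumption: [:lowercase:] is approximated by the ASCII range a-z.
TOKEN_PATTERN = re.compile(r"[0-9][0-9]*|[A-Z][a-z]*|[a-z][a-z]*")


def tokenize(text):
    # findall scans left to right and skips characters that match no alternative.
    return TOKEN_PATTERN.findall(text)


print(tokenize("aaa1234bbb"))            # ['aaa', '1234', 'bbb']
print(tokenize("aaaBbbCcc Ddd 123eee"))  # ['aaa', 'Bbb', 'Ccc', 'Ddd', '123', 'eee']
```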

Name: Types
Type: List of Configurable Tokenizer Config $Token Type
Required: No
Description: A tag grouping the individual configurations for each token type. If this tag is not defined, or the tokenizer configuration is missing entirely, the default configuration is used. In that case, only words and numbers are taken into account; that is, the following types are defined:

  • tokenStartCharacters=[:letter:] tokenCharacters=[:letter:]
  • tokenStartCharacters=[:digit:] tokenCharacters=[:digit:]

Name: White Space Definition
Type: String
Required: No
Description: Defines which characters are treated as delimiters. Only tokenizers derived from ConfigurableTokenizer (e.g. WhiteSpaceTokenizer) use this property. Setting this property on the ConfigurableTokenizer itself is incorrect.


iWay Software