Element Configurable Tokenizer Config

This element is used by other steps to split an input text string into components (tokens) according to defined rules. Each token type is specified by two sets of characters:

tokenStartCharacters and tokenCharacters

During tokenization, the input string is analyzed one character at a time. When a character that belongs to tokenStartCharacters is found, it is treated as the beginning of a new token of the corresponding type. The characters that follow are added to this token for as long as they belong to tokenCharacters.
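The scanning process described above can be sketched in Python. This is only an illustration, not the product's implementation: the names `TokenType`, `tokenize` and `EXAMPLE_TYPES` are hypothetical, the character sets are plain ASCII sets, and two behaviors are assumptions (the first matching type wins, and characters that start no token are skipped). The three types mirror the example configuration shown later in this section.

```python
import string
from typing import List, NamedTuple, Set


class TokenType(NamedTuple):
    start_chars: Set[str]  # plays the role of tokenStartCharacters
    token_chars: Set[str]  # plays the role of tokenCharacters


def tokenize(text: str, types: List[TokenType]) -> List[str]:
    """Split text into tokens by scanning it one character at a time."""
    tokens = []
    i = 0
    while i < len(text):
        # Assumption: the first type whose start set contains the character wins.
        matched = next((t for t in types if text[i] in t.start_chars), None)
        if matched is None:
            # Assumption: a character that starts no token is skipped.
            i += 1
            continue
        # Extend the token while the following characters belong to token_chars.
        j = i + 1
        while j < len(text) and text[j] in matched.token_chars:
            j += 1
        tokens.append(text[i:j])
        i = j
    return tokens


DIGITS = set(string.digits)
UPPER = set(string.ascii_uppercase)
LOWER = set(string.ascii_lowercase)

# Hypothetical equivalent of the three <type> entries in the example configuration.
EXAMPLE_TYPES = [
    TokenType(DIGITS, DIGITS),
    TokenType(UPPER, LOWER),
    TokenType(LOWER, LOWER),
]

print(tokenize("aaa1234bbb", EXAMPLE_TYPES))
# ['aaa', '1234', 'bbb']
print(tokenize("aaaBbbCcc Ddd 123eee", EXAMPLE_TYPES))
# ['aaa', 'Bbb', 'Ccc', 'Ddd', '123', 'eee']
```

Under these assumptions the sketch reproduces the input and output pairs listed in the example of this section.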

The name of the tag that holds this configuration depends on the step that uses the ConfigurableTokenizer; refer to the description of the relevant step for the tag name.

A set of characters can be defined in the following ways (for a detailed description, see character set):

enumeration, e.g. abcdef
interval, e.g. a-f or a-cd-f, etc. If the character - is intended to be part of the set, it must be placed at the very end of the definition.
reference to a predefined constant set of characters using a bracket expression ([: and :]). If the characters "[:" or ":]" are intended to be part of the set, they must be defined separately, so that they are not interpreted as the beginning or end of a bracket expression.
Predefined character classes are also available. An example of a definition:
```
[:uppercase:]
```
The listed constants can also be used in an "exclude" form such as [:lowercase:-abc:], where the characters "abc" are exceptions removed from the defined set. For example, [:uppercase:-A:] matches all capital letters except 'A'.
merging - all of the possibilities listed above can be combined to create a single set, e.g. a definition of a token that begins with a letter and continues with letters, digits or a dash:
```
<type tokenCharacters="[:letter:][:digit:]-" tokenStartCharacters="[:letter:]" />
```
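The definition forms listed above (enumeration, interval, bracket expression, exclude form, merging) can be illustrated by a small parser that expands a definition string into a set of characters. This is a sketch under stated assumptions, not the product's parser: the function `expand` and the `CLASSES` table are hypothetical, the classes are limited to ASCII, and the exception list in the exclude form is read as a plain enumeration.

```python
import string

# Hypothetical, ASCII-only stand-ins for the predefined character classes.
CLASSES = {
    "letter": set(string.ascii_letters),
    "digit": set(string.digits),
    "uppercase": set(string.ascii_uppercase),
    "lowercase": set(string.ascii_lowercase),
    "white": set(" \t\r\n"),
}


def expand(definition: str) -> set:
    """Expand a character set definition into a Python set of characters."""
    result = set()
    i, n = 0, len(definition)
    while i < n:
        if definition.startswith("[:", i):
            # Bracket expression: "[:name:]" or the exclude form "[:name:-abc:]".
            end = definition.find(":]", i + 2)
            if end == -1:
                raise ValueError("unterminated bracket expression")
            name, _, excluded = definition[i + 2:end].partition(":-")
            if name not in CLASSES:
                raise ValueError("unknown character class: " + name)
            # Assumption: the exceptions are a plain enumeration of characters.
            result |= CLASSES[name] - set(excluded)
            i = end + 2
        elif i + 2 < n and definition[i + 1] == "-":
            # Interval such as "a-f"; a trailing "-" never reaches this branch.
            low, high = definition[i], definition[i + 2]
            if ord(low) > ord(high):
                raise ValueError("invalid interval: " + low + "-" + high)
            result |= {chr(c) for c in range(ord(low), ord(high) + 1)}
            i += 3
        else:
            # Enumeration: a single literal character (including a final "-").
            result.add(definition[i])
            i += 1
    return result


print(sorted(expand("a-cd-f")))            # ['a', 'b', 'c', 'd', 'e', 'f']
print("A" in expand("[:uppercase:-A:]"))   # False
print("-" in expand("[:letter:][:digit:]-"))  # True
```

Because a dash is only read as an interval operator when a character follows it, a dash at the very end of the definition is taken literally, which matches the rule stated in the list.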

An example of a complete configuration:

<tokenizerConfig whiteSpaceDefinition="[:white:]">
    <types>
        <type tokenStartCharacters="0-9" tokenCharacters="0-9" />
        <type tokenStartCharacters="A-Z" tokenCharacters="a-z" />
        <type tokenStartCharacters="[:lowercase:]" tokenCharacters="[:lowercase:]" />
    </types>
</tokenizerConfig>

The configuration used in the example considers the following as individual tokens:

words beginning with a capital letter (a capital letter followed by lowercase letters)
numbers (sequences of digits)
words of lowercase letters that do not directly follow a capital letter

An example of input and output:

input	output
aaa1234bbb	aaa 1234 bbb
aaaBbbCcc Ddd 123eee	aaa Bbb Ccc Ddd 123 eee

Name	Type	Required	Description
Types	List of Configurable Tokenizer Config $Token Type	No	A tag that groups the individual configurations for each token type. If this tag is not defined, or the tokenizer configuration is missing entirely, the default configuration is used. In that case only words and numbers are taken into account, that is, the following two types are defined: (tokenStartCharacters=[:letter:], tokenCharacters=[:letter:]) and (tokenStartCharacters=[:digit:], tokenCharacters=[:digit:]).
White Space Definition	String	No	Defines which characters are treated as delimiters. This property is used only by tokenizers derived from ConfigurableTokenizer (e.g. WhiteSpaceTokenizer). Using it with ConfigurableTokenizer itself is incorrect.
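As a rough illustration of the default configuration (words and numbers only), the two default types can be approximated by a single regular expression. This is a sketch, not the product's behavior: the function name `default_tokens` is hypothetical, [:letter:] and [:digit:] are assumed to cover only ASCII letters and digits, and characters that start no token are assumed to be dropped.

```python
import re

# Assumption: [:letter:] ~ A-Z and a-z, [:digit:] ~ 0-9 (ASCII only).
# Each alternative is "one start character followed by more of the same set".
DEFAULT_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+")


def default_tokens(text: str) -> list:
    """Return the word and number tokens found in text, in order."""
    return DEFAULT_PATTERN.findall(text)


print(default_tokens("Main St 42b"))
# ['Main', 'St', '42', 'b']
```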

iWay Software