Element Configurable Tokenizer Config

Other steps use this element to split an input text string into components (tokens) according to defined rules. Each token type is specified by two sets of characters: tokenStartCharacters and tokenCharacters.

During tokenization, the input string is analyzed one character at a time. When a character belonging to tokenStartCharacters is found, it is treated as the beginning of a new token of the corresponding type. The characters that follow and belong to tokenCharacters are then appended to this token; the token ends at the first character that does not belong to tokenCharacters.
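The scanning procedure described above can be sketched in Python as follows. This is only an illustration of the described behavior, not the product's implementation: the names TokenType, is_start, is_body and tokenize are hypothetical, the character sets are modeled as plain predicate functions, and two details the text does not state are assumed here (characters that start no token are skipped, and when several token types could start at a character, the first one listed wins).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class TokenType:
    """Hypothetical stand-in for one token definition."""
    name: str
    is_start: Callable[[str], bool]  # models tokenStartCharacters
    is_body: Callable[[str], bool]   # models tokenCharacters


def tokenize(text: str, token_types: Sequence[TokenType]) -> List[str]:
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Find the first token type whose start set contains this character.
        token_type = next((t for t in token_types if t.is_start(text[i])), None)
        if token_type is None:
            # Assumption: characters that start no token are skipped.
            i += 1
            continue
        # Extend the token while characters belong to the body set.
        j = i + 1
        while j < len(text) and token_type.is_body(text[j]):
            j += 1
        tokens.append(text[i:j])
        i = j
    return tokens


# Assumed example rules: a word starts with any letter and continues with
# lowercase letters; a number starts with and continues with digits.
WORD = TokenType("word", str.isalpha, str.islower)
NUMBER = TokenType("number", str.isdigit, str.isdigit)

print(tokenize("aaa1234bbb", [WORD, NUMBER]))
print(tokenize("aaaBbbCcc Ddd 123eee", [WORD, NUMBER]))
```

With these assumed rules, an uppercase letter ends the current word (it is not in the body set) and immediately starts a new one (it is in the start set), which is why a string such as aaaBbb yields two tokens.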

The name of the tag that holds this configuration depends on the step that uses the ConfigurableTokenizer (for the tag names, refer to the descriptions of those steps).

Each set of characters can be defined using the following (for a detailed description, see character set):

The configuration used in the example considers the following as individual tokens:

An example of input and output:

input                   output

aaa1234bbb              aaa
                        1234
                        bbb

aaaBbbCcc Ddd 123eee    aaa
                        Bbb
                        Ccc
                        Ddd
                        123
                        eee
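For comparison, the same input and output pairs can be reproduced with a single regular expression. This is a sketch under an assumption: the pattern below is inferred from the example results (a token is either one ASCII letter followed by lowercase letters, or a run of digits) and is not taken from the actual configuration. The names TOKEN_PATTERN and tokenize_with_regex are hypothetical.

```python
import re

# Assumed rules, inferred from the example output only:
#   [A-Za-z][a-z]*  one letter, then any number of lowercase letters
#   [0-9]+          one or more digits
TOKEN_PATTERN = re.compile(r"[A-Za-z][a-z]*|[0-9]+")


def tokenize_with_regex(text):
    # findall returns every non-overlapping match from left to right;
    # characters that match neither alternative (such as spaces) are skipped.
    return TOKEN_PATTERN.findall(text)


print(tokenize_with_regex("aaa1234bbb"))
print(tokenize_with_regex("aaaBbbCcc Ddd 123eee"))
```

Each alternative in the pattern plays the role of one token type: its first character class corresponds to tokenStartCharacters and the repeated class to tokenCharacters.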


iWay Software