Detailed Description of Tokenizer

This step splits the content of the input columns into tokens using the defined tokenizer and writes the result to the specified output columns. If no tokenizer is defined, a default one is used. The default tokenizer recognizes only tokens made up of numbers and text characters; all other characters are treated as delimiters. The example below spells out the default tokenizer settings explicitly (i.e., the output is the same whether the tokenizer is configured or not).

The tokens written to each output column are combined into a single string, with individual tokens delimited by the character defined in the separator property. This property can be defined on both the global and the local level. If no separator is defined on the global level, a default value of ' ' (a single space) is used. On the local level, the separator can be defined in the properties of an individual column; if it is not defined there, the global value is used.



Example:
<step id='alg' className='cz.adastra.cif.tasks.clean.TokenizerAlgorithm'>
        <properties>
                <tokenizerConfig whiteSpaceDefinition="[:white:]">
                        <types>
                                <type tokenCharacters="[:letter:]" tokenStartCharacters="[:letter:]" />
                                <type tokenCharacters="[:digit:]" tokenStartCharacters="[:digit:]" />
                        </types>
                </tokenizerConfig>
                <separator>,</separator>
                <columns>
                        <column src='text1' target='text1_tokens' separator=','/>
                        <column src='text1' target='text2'/>
                        <column src='text2' target='text2_tokens' separator='%'/>
                </columns>
        </properties>
</step>
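
For comparison, the following is a minimal sketch of the same step relying on the defaults described above. It assumes that the tokenizerConfig element and the global separator element can simply be omitted. The column names are reused from the example, and the behavior noted in the comments follows from the description rather than from a tested run.

```xml
<step id='alg' className='cz.adastra.cif.tasks.clean.TokenizerAlgorithm'>
        <properties>
                <!-- no tokenizerConfig: the default tokenizer is assumed to apply -->
                <!-- no global separator: the default value ' ' is assumed to apply -->
                <columns>
                        <!-- local separator overrides the global value -->
                        <column src='text1' target='text1_tokens' separator=','/>
                        <!-- no local separator: falls back to the global value -->
                        <column src='text2' target='text2_tokens'/>
                </columns>
        </properties>
</step>
```

Under these assumptions, the tokens in text1_tokens would be delimited with ',' and the tokens in text2_tokens with ' '.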

iWay Software