This step splits the content of the input columns into the specified output columns using the defined tokenizer. If no tokenizer is defined, a default one is used. The default tokenizer recognizes only tokens composed of digits and letters; all other characters are treated as delimiters. The example configuration below spells out the default tokenizer settings explicitly, so the output is the same whether or not the tokenizer is configured.
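The tokens found in each input column are combined into a single string in the output column, with individual tokens delimited by the character defined in the separator property. This property can be defined at both the global and the local level. If no separator is defined at the global level, the default value ' ' (a single space) is used. At the local level, the separator can be defined on an individual column; if it is not, the global value is used. In the example below, text1_tokens and text2 are delimited with ',' (the first through its local separator, the second through the global one), while text2_tokens is delimited with '%'.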
<step id='alg' className='cz.adastra.cif.tasks.clean.TokenizerAlgorithm'>
  <properties>
    <tokenizerConfig whiteSpaceDefinition="[:white:]">
      <types>
        <type tokenCharacters="[:letter:]" tokenStartCharacters="[:letter:]" />
        <type tokenCharacters="[:digit:]" tokenStartCharacters="[:digit:]" />
      </types>
    </tokenizerConfig>
    <separator>,</separator>
    <columns>
      <column src='text1' target='text1_tokens' separator=','/>
      <column src='text1' target='text2'/>
      <column src='text2' target='text2_tokens' separator='%'/>
    </columns>
  </properties>
</step>
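For comparison, the following is a minimal sketch of a step that relies entirely on the defaults described above. It assumes that both the tokenizerConfig and the separator elements can simply be omitted, as the text states; it has not been checked against a running installation. The step id and column names are reused from the example above for illustration only.

```xml
<!-- Sketch only: no tokenizerConfig, so the default tokenizer
     (letter and digit tokens, everything else a delimiter) is assumed. -->
<step id='alg' className='cz.adastra.cif.tasks.clean.TokenizerAlgorithm'>
  <properties>
    <!-- No global separator and no local separator attribute:
         tokens should be joined with the default ' ' (single space). -->
    <columns>
      <column src='text1' target='text1_tokens'/>
    </columns>
  </properties>
</step>
```

Under these assumptions, the tokens written to text1_tokens would be delimited by a single space rather than by ',' or '%'.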