This step splits the content of the input columns into the specified output columns using the defined tokenizer. If no tokenizer is defined, a default one is used. The default tokenizer recognizes only tokens made up of digits and letters; all other characters are treated as delimiters. The following example spells out the default tokenizer settings explicitly, so its output is the same whether or not the tokenizer is configured:
The tokens written to each output column are combined into a single string, with individual tokens delimited by the character defined in the separator property. This property can be defined on both the global and the local level. If no separator is defined on the global level, a default value of ' ' (a single space) is used. On the local level, the property is defined within the individual column properties; a column that does not define its own separator uses the global value.
<step id='alg' className='cz.adastra.cif.tasks.clean.TokenizerAlgorithm'>
  <properties>
    <tokenizerConfig whiteSpaceDefinition="[:white:]">
      <types>
        <type tokenCharacters="[:letter:]" tokenStartCharacters="[:letter:]" />
        <type tokenCharacters="[:digit:]" tokenStartCharacters="[:digit:]" />
      </types>
    </tokenizerConfig>
    <separator>,</separator>
    <columns>
      <column src='text1' target='text1_tokens' separator=','/>
      <column src='text1' target='text2'/>
      <column src='text2' target='text2_tokens' separator='%'/>
    </columns>
  </properties>
</step>
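The behavior described above can be approximated outside the product with a short Python sketch. It is not the TokenizerAlgorithm implementation. It assumes that a token is a maximal run of letters or a maximal run of digits, that every other character is a delimiter, and that a column-level separator takes precedence over the global one. The names `tokenize`, `resolve_separator`, and `tokenize_column` are hypothetical and exist only for this illustration.

```python
import re

# Assumption: a token is a maximal run of letters OR a maximal run of digits,
# mirroring the two <type> entries in the configuration above.
# Anything else (whitespace, punctuation) acts as a delimiter.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+")

# Documented fallback when no global separator is defined.
DEFAULT_SEPARATOR = " "


def tokenize(text):
    """Return the letter runs and digit runs found in text, in order."""
    return TOKEN_PATTERN.findall(text)


def resolve_separator(column_separator=None, global_separator=None):
    """Pick the column-level separator, else the global one, else the default."""
    if column_separator is not None:
        return column_separator
    if global_separator is not None:
        return global_separator
    return DEFAULT_SEPARATOR


def tokenize_column(text, column_separator=None, global_separator=None):
    """Join the tokens of text into one string using the resolved separator."""
    separator = resolve_separator(column_separator, global_separator)
    return separator.join(tokenize(text))
```

Under these assumptions, `tokenize_column("Main St. 42b", global_separator=",")` yields `Main,St,42,b`, while passing `column_separator="%"` as well yields `Main%St%42%b`, which corresponds to the way the `text2_tokens` column overrides the global separator in the configuration above.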