Detailed Description of Text File Reader

The data source is a standard text file in which each record is stored as a row and columns are separated by a delimiter character.

This step supports the encodings supported by Java, including the Unicode formats UTF-8, UTF-16, UTF-16BE, and UTF-16LE. The input is processed by UnicodeAwareReader, which correctly handles files that begin with a Byte Order Mark (BOM), the signature some Unicode formats use for identification, and also reads non-Unicode formats correctly.

The file encoding detection procedure is as follows: UnicodeAwareReader tries to read the BOM from the file and compares the result with the encoding entered in the step configuration (the encoding parameter). If the two match and the encoding is supported by the Java version in use, the BOM is skipped and the file is read in that encoding. If the configured encoding differs from the one indicated by the BOM, a format discrepancy warning is displayed and the file is processed in the configured encoding; the BOM bytes are then treated as ordinary content.

If no BOM can be detected in the file, the encoding set in the step configuration is assumed to be valid (i.e., either a Unicode format without a BOM signature or a non-Unicode format) and the file is read in that encoding.
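The BOM check described above can be illustrated with a minimal sketch. This is not the UnicodeAwareReader source; the class and method names (BomSketch, detectBom) are hypothetical, and only the standard BOM byte sequences for the three supported Unicode families are assumed.

```java
import java.util.Optional;

// Hypothetical illustration only; not the actual UnicodeAwareReader code.
final class BomSketch {

    /** Returns the encoding indicated by a leading BOM, or empty if none is found. */
    static Optional<String> detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        // Note: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE.
        // UTF-32 is not supported by the step, so this sketch ignores that case.
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty();
    }

    /** Compares the first bytes of data with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};
        System.out.println(detectBom(utf8)); // prints Optional[UTF-8]
    }
}
```

An empty result corresponds to the "no BOM" case, in which the configured encoding is trusted as-is.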

If the configuration specifies the generic UTF-16 encoding and the exact sub-format (UTF-16BE or UTF-16LE) can be determined from the BOM, the sub-format from the BOM is used. If the file contains no BOM signature, processing ends with an error because it is not clear which sub-format should be used.
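The decision rules above can be summarized in a short sketch. It is an illustration under stated assumptions, not the product's implementation: the names EncodingResolverSketch, Result, and resolve are hypothetical, the BOM-derived encoding is passed in as an already detected value, and the handling of a generic UTF-16 configuration combined with a non-UTF-16 BOM (an error here) is an assumption, since the description covers only the missing-BOM case.

```java
import java.util.Optional;

// Hypothetical illustration of the described rules; not the product's code.
final class EncodingResolverSketch {

    /** Outcome of the decision: the encoding to read with, plus two flags. */
    static final class Result {
        final String encoding;   // encoding actually used for reading
        final boolean skipBom;   // true if the BOM bytes are skipped
        final boolean warning;   // true if a format discrepancy warning is raised

        Result(String encoding, boolean skipBom, boolean warning) {
            this.encoding = encoding;
            this.skipBom = skipBom;
            this.warning = warning;
        }
    }

    static Result resolve(String configured, Optional<String> bomEncoding) {
        if (configured.equalsIgnoreCase("UTF-16")) {
            // Generic UTF-16: the BOM has to supply the exact sub-format.
            if (bomEncoding.isPresent() && bomEncoding.get().startsWith("UTF-16")) {
                return new Result(bomEncoding.get(), true, false);
            }
            // Assumption: a non-UTF-16 BOM is treated like a missing BOM.
            throw new IllegalStateException("UTF-16 configured but sub-format is unknown");
        }
        if (!bomEncoding.isPresent()) {
            // No BOM: trust the configuration.
            return new Result(configured, false, false);
        }
        if (bomEncoding.get().equalsIgnoreCase(configured)) {
            // BOM and configuration agree: skip the BOM.
            return new Result(configured, true, false);
        }
        // Mismatch: warn, use the configured encoding, keep BOM bytes as content.
        return new Result(configured, false, true);
    }

    public static void main(String[] args) {
        Result r = resolve("windows-1250", Optional.of("UTF-8"));
        System.out.println(r.encoding + " warning=" + r.warning); // prints windows-1250 warning=true
    }
}
```

The sketch omits the check that the resulting encoding is supported by the running Java version; `java.nio.charset.Charset.isSupported` could serve that purpose.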

The following table shows possible file encodings and configuration values for Unicode files:

| Real File Format | Configuration Value | BOM Signature | Result (Used) Encoding |
|---|---|---|---|
| UTF-8 | UTF-8 | yes | UTF-8 |
| UTF-8 | UTF-8 | no | UTF-8 |
| UTF-8 | non/bad-unicode | yes | non/bad-unicode + warning |
| UTF-8 | non/bad-unicode | no | non/bad-unicode |
| UTF-16BE/LE | UTF-16BE/LE | yes | UTF-16BE/LE |
| UTF-16BE/LE | UTF-16BE/LE | no | UTF-16BE/LE |
| UTF-16BE/LE | UTF-16 | yes | UTF-16BE/LE |
| UTF-16BE/LE | UTF-16 | no | error |
| UTF-16BE/LE | non/bad-unicode | yes | non/bad-unicode + warning |
| UTF-16BE/LE | non/bad-unicode | no | non/bad-unicode |
| non-unicode | UTF-8 | - | UTF-8 |
| non-unicode | UTF-16 | - | error |
| non-unicode | UTF-16BE/LE | - | UTF-16BE/LE |
| non-unicode | other-format | - | other-format |

UTF-32 (LE/BE) formats are not supported. Non-Unicode formats are left untouched by this reader and are read using the encoding entered in the configuration.

Input data records are processed using parameters specified in the element dataFormatParameters. Additional details are available in DataFormatParameters. If a column does not define (override) its own format settings, a global formatting setting is used by default.
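The fallback from column-level to global format settings can be pictured as a simple map merge. This is a sketch only: the class and method names are hypothetical, and representing the settings as string maps is an assumption made for illustration. The sample keys and values are taken from the example configuration of this step.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; settings are modeled as plain string maps.
final class FormatFallbackSketch {

    /** Column overrides win; every setting not overridden comes from the global block. */
    static Map<String, String> effectiveFormat(Map<String, String> global,
                                               Map<String, String> columnOverrides) {
        Map<String, String> merged = new HashMap<>(global);
        merged.putAll(columnOverrides);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> global = new HashMap<>();
        global.put("trueValue", "true");
        global.put("falseValue", "false");
        global.put("dayFormat", "yyyy-MM-dd");

        Map<String, String> column = new HashMap<>();
        column.put("trueValue", "A");
        column.put("falseValue", "N");

        Map<String, String> effective = effectiveFormat(global, column);
        System.out.println(effective.get("trueValue")); // prints A
        System.out.println(effective.get("dayFormat")); // prints yyyy-MM-dd
    }
}
```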

This step may produce the following errors: SHORT_LINE, INVALID_DATE, UNPARSABLE_FIELD, LONG_LINE, EXTRA_DATA, PROCESSING_ERROR.

When a SHORT_LINE error occurs, the input value is considered to be null for further parsing.
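One possible reading of this rule is that fields missing from a short line are passed on as null. The sketch below illustrates that reading only; the names are hypothetical, string qualifiers are ignored, and whether the step itself pads missing fields this way is an assumption.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration of treating missing fields on a short line as null.
final class ShortLineSketch {

    /** Splits a line and pads it with nulls up to the expected column count. */
    static String[] padToWidth(String line, String separator, int expectedColumns) {
        // Limit -1 keeps trailing empty fields; the separator is matched literally.
        String[] fields = line.split(Pattern.quote(separator), -1);
        // Arrays.copyOf fills the added positions with null.
        // Longer lines are returned unchanged; LONG_LINE handling is out of scope here.
        return Arrays.copyOf(fields, Math.max(fields.length, expectedColumns));
    }

    public static void main(String[] args) {
        String[] row = padToWidth("1;abc", ";", 5);
        System.out.println(Arrays.toString(row)); // prints [1, abc, null, null, null]
    }
}
```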

Error management is configured by the element errorHandlingStrategy. The error handling strategy can disable processing of incorrect entries, which can instead be sent to the "rejected" output endpoint. For a more detailed description of error handling, see Error Handling Strategy.

When creating a reject file, the following rules are observed:



Example:
<step id='input' className='cz.adastra.cif.tasks.io.text.read.TextFileReader'>
        <properties>
                <fileName>data/tmp/in/some_data.csv</fileName>
                <encoding>windows-1250</encoding>
                <fieldSeparator>;</fieldSeparator>
                <lineSeparator>\r\n</lineSeparator>
                <numberOfLinesInHeader>1</numberOfLinesInHeader>
                <stringQualifier>"</stringQualifier>
                <stringQualifierEscape>"</stringQualifierEscape>
                <dataFormatParameters>
                        <dateFormatLocale>en_US</dateFormatLocale>
                        <dateTimeFormat>yyyy-MM-dd HH:mm:ss</dateTimeFormat>
                        <dayFormat>yyyy-MM-dd</dayFormat>
                        <decimalSeparator>.</decimalSeparator>
                        <falseValue>false</falseValue>
                        <trueValue>true</trueValue>
                </dataFormatParameters>
                <errorHandlingStrategy>
                        <errorInstructions>
                                <errorInstruction putToLog="true" errorType="PROCESSING_ERROR"
                                                  dataStrategy="STOP" putToReject="false"/>
                                <errorInstruction putToLog="true" errorType="INVALID_DATE"
                                                  dataStrategy="READ_POSSIBLE" putToReject="false"/>
                                <errorInstruction putToLog="true" errorType="SHORT_LINE"
                                                  dataStrategy="READ_POSSIBLE" putToReject="false"/>
                                <errorInstruction putToLog="true" errorType="UNPARSABLE_FIELD"
                                                  dataStrategy="NULL_VALUE" putToReject="false"/>
                        </errorInstructions>
                </errorHandlingStrategy>
                <columns>
                        <column name='id' type='integer'/>
                        <column name='text' type='string'/>
                        <column name='dob' type='day'/>
                        <column name='modified' type='datetime'/>
                        <column name='priznak' type='boolean'>
                                <dataFormatParameters falseValue='N' trueValue='A'/>
                        </column>
                </columns>
                <shadowColumns>
                        <column name='data_old' type='string'/>
                        <column name='data_new' type='string' />
                </shadowColumns>
        </properties>
</step>

iWay Software