Detailed Description of Text File Reader

The data source is a standard text file, where records are saved as rows and columns are separated by delimiter characters.

This step supports the encodings supported by Java, including the Unicode formats UTF-8, UTF-16, UTF-16BE, and UTF-16LE. The input is processed by UnicodeAwareReader, which correctly handles files that begin with a Byte Order Mark (BOM), the signature some Unicode formats use for identification, while still allowing correct processing of non-Unicode formats.

The file encoding detection procedure is as follows: The UnicodeAwareReader tries to read the BOM from the file and compares the result with the encoding entered in the step configuration (the encoding parameter). If the two match and the encoding is supported by the version of Java in use, the BOM is skipped and the file is read in that encoding. If the configured encoding differs from the one indicated by the BOM, a format discrepancy warning is displayed and processing proceeds in the configured encoding (the BOM bytes are then treated as ordinary content).

If no BOM can be detected in the file, the encoding given in the step configuration is assumed to be valid (i.e., either a Unicode format without a BOM signature or a non-Unicode format), and the file is read in the configured encoding.

If the configuration specifies UTF-16 as the general encoding type and the exact sub-format (i.e., UTF-16BE or UTF-16LE) can be determined from the BOM, the type from the BOM is used. However, if the file contains no BOM signature, processing ends with an error because it is not clear which byte order should be used.
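
The detection rules above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the actual UnicodeAwareReader source: the class BomEncodingResolver, its Resolution type, and the method names are hypothetical, and the check that the encoding is supported by the running Java version is omitted.

```java
/** Hypothetical helper illustrating the rules above; not the real UnicodeAwareReader. */
final class BomEncodingResolver {

    /** Outcome: the encoding to read with, BOM bytes to skip, and whether to warn. */
    static final class Resolution {
        final String encoding;
        final int bytesToSkip;
        final boolean mismatchWarning;

        Resolution(String encoding, int bytesToSkip, boolean mismatchWarning) {
            this.encoding = encoding;
            this.bytesToSkip = bytesToSkip;
            this.mismatchWarning = mismatchWarning;
        }
    }

    /** Returns the encoding named by a leading BOM, or null if no known BOM is present. */
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // UTF-32 signatures are deliberately not recognized
    }

    /** Applies the documented rules to the configured encoding and the first bytes of the file. */
    static Resolution resolve(String configured, byte[] head) {
        String bom = detectBom(head);
        if (bom == null) {
            // No BOM: trust the configuration, except for generic UTF-16 (byte order unknown).
            if (configured.equalsIgnoreCase("UTF-16")) {
                throw new IllegalStateException("UTF-16 configured but the file has no BOM");
            }
            return new Resolution(configured, 0, false);
        }
        int bomLength = bom.equals("UTF-8") ? 3 : 2;
        if (configured.equalsIgnoreCase(bom)) {
            return new Resolution(bom, bomLength, false); // match: skip the BOM
        }
        if (configured.equalsIgnoreCase("UTF-16") && bom.startsWith("UTF-16")) {
            return new Resolution(bom, bomLength, false); // BOM supplies the sub-format
        }
        // Mismatch: warn and read everything, BOM bytes included, as configured.
        return new Resolution(configured, 0, true);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

In this sketch, a caller would pass the configured encoding and the first few bytes of the file to resolve, skip bytesToSkip bytes, and then decode the rest using the returned encoding name.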

The following table shows possible file encodings and configuration values for Unicode files:

Real File Format	Configuration Value	BOM Signature	Result (Used) Encoding
UTF-8	UTF-8	yes	UTF-8
UTF-8	UTF-8	no	UTF-8
UTF-8	non/bad-unicode	yes	non/bad-unicode + warning
UTF-8	non/bad-unicode	no	non/bad-unicode

UTF-16BE/LE	UTF-16BE/LE	yes	UTF-16BE/LE
UTF-16BE/LE	UTF-16BE/LE	no	UTF-16BE/LE
UTF-16BE/LE	UTF-16	yes	UTF-16BE/LE
UTF-16BE/LE	UTF-16	no	error
UTF-16BE/LE	non/bad-unicode	yes	non/bad-unicode + warning
UTF-16BE/LE	non/bad-unicode	no	non/bad-unicode

non-unicode	UTF-8	-	UTF-8
non-unicode	UTF-16	-	error
non-unicode	UTF-16BE/LE	-	UTF-16BE/LE
non-unicode	other-format	-	other-format

UTF-32 formats (LE/BE) are not supported. Non-Unicode formats are left untouched by this reader and are read using the encoding entered in the configuration.

Input data records are processed using the parameters specified in the dataFormatParameters element. Additional details are available in DataFormatParameters. If a column does not define (override) its own format settings, the global format settings are used by default.

This step may produce the following errors: SHORT_LINE, INVALID_DATE, UNPARSABLE_FIELD, LONG_LINE, EXTRA_DATA, PROCESSING_ERROR.

When a SHORT_LINE error occurs, the input value is considered to be null for further parsing.
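
One way to picture the SHORT_LINE behavior is the sketch below. It assumes that the values treated as null are the fields missing at the end of a row, and it ignores string qualifiers for brevity. The class ShortLinePadding and the method splitAndPad are hypothetical names, not part of the product.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical illustration: fields missing from a short line become null. */
final class ShortLinePadding {

    /**
     * Splits a raw line on the separator and pads it with nulls up to the
     * expected column count. Lines that are long enough are returned unchanged.
     */
    static String[] splitAndPad(String line, String separator, int expectedColumns) {
        // Limit -1 keeps trailing empty fields; Pattern.quote treats the separator literally.
        String[] fields = line.split(Pattern.quote(separator), -1);
        if (fields.length >= expectedColumns) {
            return fields;
        }
        // Arrays.copyOf fills the added positions with null.
        return Arrays.copyOf(fields, expectedColumns);
    }
}
```

With four expected columns, the line 1;abc would yield the fields 1 and abc followed by two null values, which later parsing would then see as null inputs.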

Error management is configured by the errorHandlingStrategy element. The error handling strategy allows processing of incorrect entries to be disabled; such entries can be sent to the "rejected" output endpoint. For a more detailed description of error handling, see Error Handling Strategy.

When creating a reject file, the following rules are observed:

The initial name for the reject file is rejected.txt.
The encoding defined for the input data file is used as the encoding of the rejected file.
The line separator defined for the input data file is used as the line separator.
Every input row is written to the reject file at most once, even if several fields in the same row have errors whose instructions require writing to the reject file.
Empty reject files are not created. A reject file is created when an instruction requires writing to this file.
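
The reject file rules above could be implemented along the following lines. This is a minimal sketch, not the product's code: the class LazyRejectWriter is hypothetical, and it assumes rows are processed sequentially, so remembering the last rejected row number is enough to avoid duplicates.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of a reject writer that follows the rules listed above. */
final class LazyRejectWriter implements AutoCloseable {
    private final Path path;            // e.g. a file named rejected.txt
    private final Charset charset;      // same encoding as the input file
    private final String lineSeparator; // same line separator as the input file
    private Writer out;                 // stays null until the first reject
    private long lastRejectedRow = -1;

    LazyRejectWriter(Path path, Charset charset, String lineSeparator) {
        this.path = path;
        this.charset = charset;
        this.lineSeparator = lineSeparator;
    }

    /** Writes the row unless it was already rejected; returns true if it was written. */
    boolean reject(long rowNumber, String rawLine) throws IOException {
        if (rowNumber == lastRejectedRow) {
            return false; // a second error in the same row: do not write it again
        }
        if (out == null) {
            out = Files.newBufferedWriter(path, charset); // file is created only now
        }
        out.write(rawLine);
        out.write(lineSeparator);
        lastRejectedRow = rowNumber;
        return true;
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }
}
```

Because the underlying writer is opened only on the first call to reject, a run with no rejected rows leaves no file behind.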

Example

<step id='input' className='cz.adastra.cif.tasks.io.text.read.TextFileReader'>
        <properties>
                <fileName>data/tmp/in/some_data.csv</fileName>
                <encoding>windows-1250</encoding>
                <fieldSeparator>;</fieldSeparator>
                <lineSeparator>\r\n</lineSeparator>
                <numberOfLinesInHeader>1</numberOfLinesInHeader>
                <stringQualifier>"</stringQualifier>
                <stringQualifierEscape>"</stringQualifierEscape>
                <dataFormatParameters>
                        <dateFormatLocale>en_US</dateFormatLocale>
                        <dateTimeFormat>yyyy-MM-dd HH:mm:ss</dateTimeFormat>
                        <dayFormat>yyyy-MM-dd</dayFormat>
                        <decimalSeparator>.</decimalSeparator>
                        <falseValue>false</falseValue>
                        <trueValue>true</trueValue>
                </dataFormatParameters>
                <errorHandlingStrategy>
                        <errorInstructions>
                                <errorInstruction putToLog="true" errorType="PROCESSING_ERROR"
                                                  dataStrategy="STOP" putToReject="false"/>
                                <errorInstruction putToLog="true" errorType="INVALID_DATE"
                                                  dataStrategy="READ_POSSIBLE" putToReject="false"/>
                                <errorInstruction putToLog="true" errorType="SHORT_LINE"
                                                  dataStrategy="READ_POSSIBLE" putToReject="false"/>
                                <errorInstruction putToLog="true" errorType="UNPARSABLE_FIELD"
                                                  dataStrategy="NULL_VALUE" putToReject="false"/>
                        </errorInstructions>
                </errorHandlingStrategy>
                <columns>
                        <column name='id' type='integer'/>
                        <column name='text' type='string'/>
                        <column name='dob' type='day'/>
                        <column name='modified' type='datetime'/>
                        <column name='priznak' type='boolean'>
                                <dataFormatParameters falseValue='N' trueValue='A'/>
                        </column>
                </columns>
                <shadowColumns>
                        <column name='data_old' type='string'/>
                        <column name='data_new' type='string' />
                </shadowColumns>
        </properties>
</step>

iWay Software