Detailed Description of Join

This step joins two separate input data flows into a single output data flow based on corresponding input data flow keys. The data entry points (endpoints) are named respectively in_a and in_b. All data input flows (in_a and in_b) must contain columns with key values (primary keys), which are defined as column properties keyA and keyB. It is the keys that govern how the data records are paired up across the data input flows. All basic data types can be used as a key. The join operation can be defined as a standard SQL join operation with join types such as: inner, outer, left, and right join. The join type is specified by the property joinType, which must contain one of the supported join types.

If the data input flow contains multiple records with the same key values, the data output flow will contain a Cartesian Product of data input records.

Records having a null key value are not joined with any other (null keyed) records. Those records are processed as unpaired, depending on joinType.

The content of the output data is defined by expressions which operate with the conjuncted input data. In expressions, references to the particular columns must be made by dot notation (in_a.*, in_b.*). The step first groups input records by key values and computes the Cartesian Product for all records for each group. Then, the expression sub-properties defined in the columnDefinition properties are evaluated to determine the result records of the Cartesian Product. Results of these operations are stored in the column defined by the sub-property name of the property columnDefinition. In case of the join operations left, right or outer, for which a record from an input flow might be empty, input column values of such empty records are transformed to null values before expression evaluation.

NOTE: Only data that is defined in the sub-properties expressions of the columnDefinition properties are written to the output. If the property columnDefinitions is empty or missing, then the output remains empty. (Though the previous versions of the step generated a default set of columnDefinitions, the current version requires explicit definition of all of the columnDefinitions.)


Top of page

Example: Example

There are two input files, the structure of the first one is "integer id; string name" and the structure of the second one is "integer id;string lastName". The output columns are "integer ida,integer idb, string name, string lastName, string fullName".

        <!-- illustration of the connection configuration for steps readDataA and readDataB, which read input data-->
        <connection className='cz.adastra.cif.model.elements.connections.StandardFlowConnection'>
                <source step='readDataA' endpoint='out'/>
                <target step='merger' endpoint='in_a'/>
        </connection>
        <connection className='cz.adastra.cif.model.elements.connections.StandardFlowConnection'>
                <source step='readDataB' endpoint='out'/>
                <target step='merger' endpoint='in_b'/>
        </connection>
        <step id='merger' className='cz.adastra.cif.tasks.merge.Join'>
                <properties>
                        <columnDefinitions>
                                <columnDefinition name="ida" expression="nvl(in_a.id,0)" />
                                <columnDefinition name="idb" expression="nvl(in_b.id,0)" />
                                <columnDefinition name="name" expression="in_a.name" type="string" />
                                <columnDefinition name="lastName" expression="in_b.lastName" type="string" />
                                <columnDefinition name="fullName" expression="in_a.name+' '+in_b.lastName" type="string" />
                        </columnDefinitions>
                        <joinType>inner</joinType>
                    <keyA>id<keyA/>
                    <keyB>id<keyB/>
                </properties>
        </step>

iWay Software