This step obtains a representative sample from the input data. It first computes the size of each group of records that share the same key, and then clusters the groups by size. From each cluster of same-sized groups it selects a number of groups determined by the given percentage. The sample groups are picked uniformly from across the cluster, so that the sample follows the distribution of the whole input data source. If the percentage is too low to cover even one whole group, only a number of records from the beginning of the chosen group is selected. Several grouping rules can be defined for the records; in the example below, each rule has its own condition, key components, and percentage. Same-sized groups within a cluster can optionally be sorted.
<step id='alg' className='cz.adastra.cif.tasks.experimental.datasampler.DataSamplerAlgorithm'>
  <properties>
    <groups>
      <dataSamplerGroup name='group_a' when='system="alfa"' percentage="30">
        <keyComponents>
          <keyComponent expression="key1"></keyComponent>
          <keyComponent expression="key2"></keyComponent>
        </keyComponents>
      </dataSamplerGroup>
      <dataSamplerGroup name='group_b' when='system="beta"' percentage="15">
        <keyComponents>
          <keyComponent expression="key1"></keyComponent>
        </keyComponents>
        <sorting>
          <orderBy expression="key1"/>
          <orderBy expression="key2"/>
        </sorting>
      </dataSamplerGroup>
    </groups>
  </properties>
</step>
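The selection logic described above can be illustrated with a short Python sketch. This is not the step's actual implementation: the function name `sample_groups`, its parameters, the rounding (floor of the group count), the evenly spaced pick within a cluster, and the way the partial-group record count is derived are all assumptions made for illustration. The `when` condition is left out; the sketch handles a single grouping rule.

```python
from collections import defaultdict
from math import ceil


def sample_groups(records, key_fields, percentage, sort_key=None):
    """Hypothetical sketch: sample `records` (a list of dicts) by group size."""
    # 1. Group records that share the same key values.
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in key_fields)].append(rec)

    # 2. Cluster the groups by their size.
    clusters = defaultdict(list)
    for members in groups.values():
        clusters[len(members)].append(members)

    sample = []
    for size in sorted(clusters):
        cluster = clusters[size]
        if sort_key is not None:
            # Optional ordering of same-sized groups (assumed: by first record).
            cluster.sort(key=lambda group: sort_key(group[0]))

        # 3. Number of whole groups covered by the percentage (assumed: floor).
        wanted = int(len(cluster) * percentage // 100)
        if wanted >= 1:
            # Pick evenly spaced groups so the sample spans the whole cluster.
            step = len(cluster) / wanted
            for i in range(wanted):
                sample.extend(cluster[int(i * step)])
        else:
            # 4. Percentage covers less than one group: take only the first
            #    records of one group (assumed: the first group in the cluster).
            n_records = max(1, ceil(len(cluster) * size * percentage / 100))
            sample.extend(cluster[0][:n_records])
    return sample
```

Under these assumptions, a cluster of ten two-record groups sampled at 30 percent yields three whole groups (the first, fourth, and seventh), while a cluster of two five-record groups yields only the first three records of one group.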