Detailed Description of Statistics

This step can perform multiple analytical operations on multiple columns of input data in a single pass. Any data type supported by DQC can be used, as long as it is compatible with the statistic applied. The table below shows which statistics are available for each data type.

Statistic                             FLOAT, INTEGER, LONG  DAY, DATETIME  BOOLEAN  STRING
Record Count                          yes                   yes            yes      yes
Null Count                            yes                   yes            yes      yes
Not Null Count                        yes                   yes            yes      yes
Distinct Value Count                  yes                   yes            yes      yes
Unique Value Count                    yes                   yes            yes      yes
Sum                                   yes                   -              yes      -
Average                               yes                   yes            yes      -
Median                                yes                   yes            yes      yes
Standard Deviation                    yes                   yes            -        -
Variance                              yes                   yes            -        -
First X values                        yes                   yes            yes      yes
Last X values                         yes                   yes            yes      yes
Minimum length of sequence            -                     -              -        yes
Minimum length of non-empty sequence  -                     -              -        yes
Average length of sequence            -                     -              -        yes
Median length of sequence             -                     -              -        yes
Maximum length of sequence            -                     -              -        yes
Quantile Value                        yes                   yes            yes      yes
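
As an illustration of how the support table above might guide a configuration, the following is a minimal sketch of a statistic definition for a date column. The column name date_value is hypothetical, and the type identifiers are the ones used in the example at the end of this section; the pairing of each identifier with a table row (for instance avg with Average) is inferred from the names and is an assumption.

```xml
<!-- Hypothetical fragment: date_value is an assumed DAY or DATETIME input column. -->
<statistic>
    <expression>date_value</expression>
    <columnStatistics>
        <!-- Sum is left out: the table marks it as unsupported for DAY and DATETIME. -->
        <columnStatistic name="Record Count" type="count" />
        <columnStatistic name="Null Count" type="count_nulls" />
        <columnStatistic name="Distinct Count" type="distinct" />
        <columnStatistic name="Average" type="avg" />
        <columnStatistic name="Median" type="median" />
        <columnStatistic name="First 3" type="first_x" count="3" />
    </columnStatistics>
</statistic>
```

The fragment is meant to sit inside a statistics element of the step's properties, alongside the definitions for other columns.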

Step outputs:

For each computed statistic, DQC returns a single row, unless the statistic takes the count parameter, in which case the output depends on the value supplied in count.
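
The distinction drawn above can be sketched with two statistic entries, one without and one with the count attribute. The type identifiers are reused from the example in this section. The exact number of rows returned for a statistic that carries count is not stated here, so the comments do not assume one.

```xml
<columnStatistics>
    <!-- No count attribute: a single output row is returned for this statistic. -->
    <columnStatistic name="Median" type="median" />
    <!-- count attribute present: the single-row rule does not apply to this statistic. -->
    <columnStatistic name="Last 5" type="last_x" count="5" />
</columnStatistics>
```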



Example:
<step id='alg' className='cz.adastra.cif.tasks.analysis.statistics.StatisticsAlgorithm'>
    <properties>
        <statName>stat_name</statName>
        <statDistinction>stat_distinction</statDistinction>
        <defaultLocale>cs_CZ</defaultLocale>
        <statistics>
            <!-- Statistics computed on the numeric column -->
            <statistic>
                <expression>numeric_value</expression>
                <columnStatistics>
                    <columnStatistic name="Record Number(Count)" type="count" />
                    <columnStatistic name="Sum" type="sum" />
                    <columnStatistic name="Median" type="median" />
                    <columnStatistic name="Average" type="avg" />
                    <columnStatistic name="Variance" type="var" />
                    <columnStatistic name="Standard Deviation" type="std" />
                    <columnStatistic name="Distinct Count" type="distinct" />
                    <columnStatistic name="Unique Count" type="unique" />
                    <columnStatistic name="Null Count" type="count_nulls" />
                    <columnStatistic name="Not-Null Count" type="count_not_nulls" />
                    <!-- These two statistics take the count parameter -->
                    <columnStatistic name="First 3" type="first_x" count="3" />
                    <columnStatistic name="Last 5" type="last_x" count="5" />
                </columnStatistics>
            </statistic>
            <!-- Statistics computed on the string column; note the per-statistic locale attribute -->
            <statistic locale="en_US">
                <expression>string_value</expression>
                <columnStatistics>
                    <columnStatistic name="Median Seq Length" type="median_length"/>
                    <columnStatistic name="Average Seq Length" type="avg_length"/>
                    <columnStatistic name="20 40 60 80 Percentile" type="percentiles" count="4"/>
                </columnStatistics>
            </statistic>
        </statistics>
    </properties>
</step>

iWay Software