Detailed Description of Profiling

The profiling step performs statistical analysis of data. For each profiled data column, it computes statistics such as minimum, maximum, average, and standard deviation.

The profiling step can perform multiple analytical operations on multiple columns of input data in a single pass.
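As an illustration of the single-pass idea, the following Python sketch accumulates several statistics for several columns while reading the rows exactly once. It is not the DQC implementation: the function `profile_rows`, the dictionary layout, and the `age` column are hypothetical names chosen for this example (`products_count` is borrowed from the configuration example in this section).

```python
def profile_rows(rows, columns):
    """Accumulate basic statistics for each column in one pass over rows.

    Hypothetical helper for illustration only; not part of DQC.
    """
    stats = {
        col: {"count": 0, "nulls": 0, "not_nulls": 0,
              "min": None, "max": None, "sum": 0}
        for col in columns
    }
    for row in rows:  # the input is iterated exactly once
        for col in columns:
            s = stats[col]
            value = row.get(col)
            s["count"] += 1
            if value is None:
                s["nulls"] += 1
                continue
            s["not_nulls"] += 1
            s["sum"] += value
            s["min"] = value if s["min"] is None else min(s["min"], value)
            s["max"] = value if s["max"] is None else max(s["max"], value)
    return stats


# Sample rows; "age" is an assumed column name.
rows = [
    {"age": 31, "products_count": 2},
    {"age": None, "products_count": 0},
    {"age": 45, "products_count": 5},
]
result = profile_rows(rows, ["age", "products_count"])
```

Here every statistic is kept as a running accumulator, so adding another column or another statistic does not require rereading the input.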

All data types supported by DQC can be profiled, provided the data type supports the requested operation. The following table shows which operations apply to which data types.

Operation                INTEGER, LONG   DAY, DATETIME   BOOLEAN   STRING

Data Count               yes             yes             yes       yes
Null Count               yes             yes             yes       yes
Not Null Count           yes             yes             yes       yes
Different Value Count    yes             yes             yes       yes
Unique Value Count       yes             yes             yes       yes

Sum                      yes             -               yes       -
Variance [1]             yes             yes             -         -
Standard Deviation [2]   yes             yes             -         -

Average                  yes             yes             yes       -
Median                   yes             yes             yes       yes
Quantile                 yes             yes             yes       yes

Maximum                  yes             yes             yes       yes
Minimum                  yes             yes             yes       yes

First X Values           yes             yes             yes       yes
Last X Values            yes             yes             yes       yes

[1] Results for DAY/DATETIME values are in squared days.
[2] Results for DAY/DATETIME values are in days.


Example:
<?xml version='1.0' encoding='UTF-8'?>
<step className="cz.adastra.cif.tasks.profiling.ProfilingAlgorithm" id="pa">
  <properties outputFile="output.profile" defaultLocale="en_US">
    <inputs>
      <profilingInput name="party">
        <dataToProfile>
          <profiledData expression="name">
            <frequencyAnalysis calculate="true" mask="true" />
            <standardStats extremeCount="5" quantilesCount="10" calculateAggegated="true" calculate="true"/>
          </profiledData>
        </dataToProfile>
        <bussinesRules>
          <rule name="prods_ok" expression="products_count > 0" />
        </bussinesRules>
        <pkAnalysis>
          <item name="party_pk">
            <components>
              <item expression="party_id" />
            </components>
          </item>
        </pkAnalysis>
        <fkAnalysis>
          <item name="party_prod" parentInputName="product">
            <components>
              <item localColumn="prod_id" parentColumn="product_id" />
            </components>
          </item>
        </fkAnalysis>
        <rollUps>
          <item name="by_dept" expression="dept_id" />
        </rollUps>
      </profilingInput>
      <profilingInput name="product">
        <dataToProfile>
          <profiledData expression="branch">
            <frequencyAnalysis calculate="true" mask="false" />
            <standardStats calculateAggegated="false" calculate="false"/>
          </profiledData>
        </dataToProfile>
      </profilingInput>
    </inputs>
  </properties>
</step>
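A configuration like the one above can be inspected with a short script before it is run. The following Python sketch uses the standard library's `xml.etree.ElementTree` to list each profiled expression and whether standard statistics and frequency analysis are switched on. The function `summarize` is hypothetical, the embedded configuration is an abbreviated copy of the example, and reading `calculate="true"` as "enabled" is an assumption based on the attribute name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example configuration (element and
# attribute names are reproduced exactly as they appear there).
CONFIG = """
<step className="cz.adastra.cif.tasks.profiling.ProfilingAlgorithm" id="pa">
  <properties outputFile="output.profile" defaultLocale="en_US">
    <inputs>
      <profilingInput name="party">
        <dataToProfile>
          <profiledData expression="name">
            <frequencyAnalysis calculate="true" mask="true" />
            <standardStats extremeCount="5" quantilesCount="10" calculateAggegated="true" calculate="true"/>
          </profiledData>
        </dataToProfile>
      </profilingInput>
      <profilingInput name="product">
        <dataToProfile>
          <profiledData expression="branch">
            <frequencyAnalysis calculate="true" mask="false" />
            <standardStats calculateAggegated="false" calculate="false"/>
          </profiledData>
        </dataToProfile>
      </profilingInput>
    </inputs>
  </properties>
</step>
"""


def summarize(config_xml):
    """Return one record per profiledData element (hypothetical helper)."""
    root = ET.fromstring(config_xml.strip())
    summary = []
    for inp in root.iter("profilingInput"):
        for data in inp.iter("profiledData"):
            stats = data.find("standardStats")
            freq = data.find("frequencyAnalysis")
            summary.append({
                "input": inp.get("name"),
                "expression": data.get("expression"),
                # Assumption: calculate="true" means the analysis is enabled.
                "standard_stats": stats is not None and stats.get("calculate") == "true",
                "frequency_analysis": freq is not None and freq.get("calculate") == "true",
            })
    return summary


summary = summarize(CONFIG)
for record in summary:
    print(record)
```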

iWay Software