Column Analysis

In this section:

The Column analyses tab presents statistical analyses and pattern information about the columns that have been profiled. Each column in the input data is listed as a row in the table, which presents information such as data type, value counts, and minimum/maximum values.


Top of page

x
Basic Analyses

The Basic tab provides simple statistics about the data that has been profiled and shows a chart of duplicate and distinct data as a percentage of the whole, as shown in the following image.



x
Interpreting Counts

The Counts table lists the following values:

To understand the meaning of these values, consider the data in the following table.

Record No.

Value

1

John Smith

2

John Smith

3

Rebecca Davis

4

Paul Adams

5

The Counts table for this data is represented in the following table.

Type

Count

Record

Description

Null

1

Record 5

The last record is empty.

Non-null

4

Records 1-4

The first 4 records contain data.

Duplicate

1

Record 2

There is one duplicate of the John Smith record.

Distinct

3

Records 1, 3, 4

There are 3 non-null values that are different from each other.

Non-unique

1

Record 1

John Smith has a duplicate record. Therefore, it is not unique.

Unique

2

Records 3 and 4

Rebecca Davis and Paul Adams appear only once in the list; they have no duplicates.



x
Frequency Analysis

The Frequency tab displays the Frequency Analysis. It shows the number of times that each value in the data occurs. This statistic is shown as both an absolute count and as a percentage of the whole, as shown in the following image.


Top of page

x
Domain Analysis

The Domains tab displays the Domain Analysis. This analysis determines the type of data likely to be found in each column, for example, whether the data is text, a number, or a date. The probable types are listed, along with exceptions. An example of an exception is a text string that occurs in a list of dates.

The following image shows a sample of Domain Analysis.


Top of page

x
Mask Analysis

The Mask tab displays the Mask Analysis. It shows the syntactic patterns of the data, that is, the structure of the data rather than the content of the data. Codes (masks) are used to describe these patterns.

For example, the code W is used by default to represent a word, while the code L is used to represent a letter. You can define the number of letters required to make a word in the Profiling step properties.

This type of analysis can be useful when, for example, you are looking at a column of names, in which one or two words are common, but single letters and numbers are not. Finding unexpected patterns in the data can provide information about the overall level of quality of the data.

The following image shows a sample of Mask Analysis.


Top of page

x
Quantiles

The Quantiles tab shows Quantile Samples (based on n values). These are the data values that occur at designated intervals in the ordered data set. The first value in the list is at 0%, and the last value is at 100%. The median value is at the 50% marker.

The following image is an example of Quantile Samples.


Top of page

x
Group Frequency Analysis

The Groups tab displays the Group Frequency Analysis. This presents a different analysis of the data in the Frequency tab. Group Frequency Analysis shows the number of times that each non-null frequency count is repeated. If all values are unique, the group size is 1, as there are no duplicate values. Each time a value is repeated, it forms a new group.

The following image shows a sample of Group Frequency Analysis.


iWay Software