The output file can have numerous columns during the processing of data in iWay DQC. It is a best practice to group columns according to their content.
The following guidelines apply:
The structure of the attribute name is
prefix_attribute meaning_suffix
where:
Is optional.
Prefixes and suffixes are described in the following tables.
Prefixes
Attribute prefixes and suffixes that are in bold in the tables are frequently used. Rarely used prefixes and suffixes are included in the tables to avoid their possible misuse.
The order of the prefixes in the table defines the recommended order in all the input steps (for example, Text File Reader, Integration Input, and Alter Format).
Attribute Prefix | Description | Additional Information |
---|---|---|
src_xxx | Source input values | Without any transformation on it |
dec_xxx | Decoded source input values | Pre-cleansed data with a single form for null values (for example, NULL, N/A, and N/K are transformed into null) |
meta_xxx | Source input metadata | |
pur_xxx | Operational columns (pre-cleansed values) | Very often used during cleansing of attributes |
cyr_xxx | Operational columns (attribute analysis of Cyrillic characters) | Special attributes for different characters |
lat_xxx | Operational columns (attribute analysis of Latin characters) | |
pat_xxx | Attribute structure description (patterns) | |
adr_xxx | Operational columns (address etalon data in general); formerly, operational columns for Czech environment (those are now cpo_xxx) | Used for any environment |
cpo_xxx | Operational columns (pre-cleansed address data), where cpo represents Czech Post Office etalon | For Czech environment only |
uir_xxx | Operational columns (address etalon data) | |
std_xxx | Attribute standardized values | Only structure valid values |
cln_xxx | Attribute cleansed/normalized values | Value compared against etalon |
out_xxx | Both standardized/cleansed and non-cleansed values | Given by business rules (will be std, cln, src, or other) |
score_xxx | Attribute score (the higher the number, the worse the data; 0 means perfect data) | Attribute/instance data quality description |
score_instance | Instance score (the sum of attribute scores per single record) | |
exp_xxx | Quality explanation; cleansing codes for each attribute | |
cleansing_code | Instance-level cleansing code (list of error messages); aggregated attribute explanations | |
matching_xxx | Attribute matching values | Contains std or cln values (if available), or pur or src data (depending on the business need), all without accents and in uppercase |
matching_key | Matching key (obsolete) | |
uni_can_id | Candidate group ID | For match and merge process only |
uni_can_id_old | Candidate group ID (old, that is, ID assigned within the last unification process) | |
uni_cli_id | Client group ID | |
uni_cli_id_old | Client group ID (old, that is, ID assigned within the last unification process) | |
ins_uni_role | Instance unification role (for example, Master or Slave) | |
ins_msr_role | Merge surviving instance role | |
uni_rule | Name of the applied unification rule | |
grp_can_role | Group unification role (A, C, M, U) for candidate group | |
grp_cli_role | Group unification role (A, C, M, U) for client group | |
pri_xxx | Operational columns (primary unification) | Hierarchical match and merge attributes |
sec_xxx | Operational columns (secondary unification) | |
len_xxx | Operational columns (attribute length analysis; formerly known as length_xxx) | Attributes for analytical purposes only (mainly used for so-called ABCDX profiling). Can be placed between meta_xxx and pur_xxx |
char_xxx | Operational columns (attribute char analysis) | |
word_xxx | Operational columns (attribute word analysis) | |
qma_xxx | Operational columns (attribute quality mark - ABCDX) | |
qme_xxx | Operational columns (instance quality mark - ABCDX) | |
qex_xxx | Operational columns (quality explanation column for the whole instance) | |
tmp_xxx | Operational columns (temporary columns) | Can be placed anywhere; typically used in cleansing processes after pur_xxx values |
aux_xxx | Operational columns (auxiliary columns) | |
cnt_xxx | Operational columns (counters) | |
rpl_can_xxx | Replacement candidates (incorrect data) | Rarely used attributes |
cor_xxx | Operational columns (auxiliary pre-cleansed values) | |
bin_xxx | Operational columns (dust bin for waste text) | |
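The recommended column order can be checked outside iWay DQC with a small script. The following is a minimal sketch in Python, not part of the product: it sorts column names by the position of their prefix in the table above. The list holds only a subset of the prefixes, the sample column names are illustrative, and placing unknown prefixes last is an assumption of this sketch.

```python
# Subset of the prefix table above, in table order.
PREFIX_ORDER = ["src", "dec", "meta", "pur", "pat", "std", "cln", "out", "score", "exp"]


def prefix_rank(column: str) -> int:
    """Return the position of the column's prefix in PREFIX_ORDER."""
    prefix = column.split("_", 1)[0]
    # Assumption: columns with an unlisted prefix are placed last.
    return PREFIX_ORDER.index(prefix) if prefix in PREFIX_ORDER else len(PREFIX_ORDER)


def order_columns(columns: list[str]) -> list[str]:
    """Sort columns by prefix order; sorted() is stable, so ties keep their input order."""
    return sorted(columns, key=prefix_rank)


columns = ["std_last_name", "src_last_name", "score_last_name", "pur_last_name", "dec_last_name"]
print(order_columns(columns))
```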
Suffixes
Attribute Suffix | Description | Additional Information |
---|---|---|
xxx_rpl | Data prepared for replacement | |
xxx_pat | Data prepared for parsing | Usually data after replacement |
xxx_id | Attribute IDs | |
xxx_orig | Original values found during parsing (for example, pur_first_name_orig) | For example, used by generic parser step |
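As an illustration of how the prefix and suffix tables combine, the following Python sketch splits a column name into its prefix, attribute meaning, and suffix. It is a hypothetical helper, not an iWay DQC function, and it recognizes only the single-token prefixes and the suffixes listed in its two sets (a subset of the tables above).

```python
# Subsets of the prefix and suffix tables; extend them as needed.
KNOWN_PREFIXES = {"src", "dec", "meta", "pur", "std", "cln", "out", "score", "exp", "tmp"}
KNOWN_SUFFIXES = {"rpl", "pat", "id", "orig"}


def split_attribute_name(name: str) -> tuple[str, str, str]:
    """Split a column name into (prefix, meaning, suffix); a missing part is ''."""
    parts = name.split("_")
    # A prefix is recognized only if something follows it.
    prefix = parts[0] if len(parts) > 1 and parts[0] in KNOWN_PREFIXES else ""
    if prefix:
        parts = parts[1:]
    # A suffix is recognized only if a meaning remains in front of it.
    suffix = parts[-1] if len(parts) > 1 and parts[-1] in KNOWN_SUFFIXES else ""
    if suffix:
        parts = parts[:-1]
    return prefix, "_".join(parts), suffix


print(split_attribute_name("pur_first_name_orig"))
print(split_attribute_name("src_last_name"))
```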
Using Prefixes
Attributes with the prefix src_xxx (source values) or dec_xxx (decoded source values) are read-only. The dec_xxx value is set only once, at the beginning of processing.
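The relationship between src_xxx and dec_xxx can be sketched as follows. This Python example is an illustration under stated assumptions, not the way iWay DQC implements decoding: the null markers are the ones named in the prefix table, while trimming whitespace and ignoring case are assumptions of the sketch. The source values are copied and never modified.

```python
from typing import Optional

# Null markers named in the dec_xxx row of the prefix table.
NULL_MARKERS = {"NULL", "N/A", "N/K"}


def decode(value: Optional[str]) -> Optional[str]:
    """Map the listed null markers (and None) to None; return other values unchanged."""
    # Assumption: markers are matched after trimming and case-insensitively.
    if value is None or value.strip().upper() in NULL_MARKERS:
        return None
    return value


def add_decoded_columns(record: dict) -> dict:
    """Return a new record with one dec_xxx column for every src_xxx column."""
    decoded = dict(record)  # src_xxx values are copied as they are
    for name, value in record.items():
        if name.startswith("src_"):
            decoded["dec_" + name[len("src_"):]] = decode(value)
    return decoded


record = {"src_last_name": "N/A", "src_city": "Prague"}
print(add_decoded_columns(record))
```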
Use columns with the prefix std_xxx or cln_xxx for the standardized or cleansed values only. To store all the values in one column (both cleansed and non-cleansed values), use the out_xxx column prefix.
How should you choose between std_xxx and cln_xxx? Typically, store each value in the column that matches the transformation used (standardization or cleansing). If keeping the two apart would cause problems, drop the distinction and use std_xxx for both standardized and cleansed values.
If the distinction is required or intended, use cln_xxx for values cleansed against the dictionary. In that case, std_xxx stores only standardized values.
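One possible way to fill an out_xxx column is to take the first available value in a preferred order of prefixes. The Python sketch below assumes the order cln, std, src; this order is only an example of a business rule, and the function and sample column names are hypothetical.

```python
def choose_out(record: dict, attribute: str, preference=("cln", "std", "src")):
    """Return the first non-null value among the prefixed columns for the attribute."""
    for prefix in preference:
        value = record.get(f"{prefix}_{attribute}")
        if value is not None:
            return value
    return None  # no candidate column holds a value


record = {"src_city": "prague", "std_city": "Prague", "cln_city": None}
record["out_city"] = choose_out(record, "city")
print(record["out_city"])
```

Passing a different `preference` tuple is enough to express another rule, for example one that never falls back to source values.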
For an iWay DQC project, use the common interface between source systems and iWay DQC. The best practice is to use the canonical interface.
When you use the canonical interface, two possible situations can exist:
Remapping may involve the addition of the correct prefix to the canonical attribute name, or changing the attribute name to comply with the naming conventions for common attributes used in iWay DQC Plans. For example, assume that the project-specific name for the attribute that stores last name is C27LN. It is a better practice to map C27LN to src_last_name, instead of using src_c27ln throughout the configuration.
These mappings are defined in the Alter Format and the various Reader/Writer steps.
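In iWay DQC, this remapping is configured in the steps named above. Purely to illustrate the idea, the Python sketch below renames project-specific columns to convention-compliant names. C27LN comes from the example above; C27FN and the function name are hypothetical.

```python
# Project-specific name -> name that follows the naming conventions.
COLUMN_MAP = {
    "C27LN": "src_last_name",
    "C27FN": "src_first_name",  # hypothetical second attribute
}


def remap_columns(record: dict, column_map: dict) -> dict:
    """Rename mapped columns; columns without a mapping keep their name."""
    return {column_map.get(name, name): value for name, value in record.items()}


print(remap_columns({"C27LN": "Novak", "C27FN": "Jan"}, COLUMN_MAP))
```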
It is a best practice to use the following structure for the column name:
prefix + attribute_description
The proper names for other processing will be derived from this structure as required.
Examples:
The following are optional:
If the meaning of the attribute is the same during cleansing, do not change the name of the column. You can change only the prefix.
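A minimal Python sketch of this rule follows. It derives the name of a column for a later processing stage by swapping only the prefix and keeping the attribute description. The helper is hypothetical and assumes a single-token prefix followed by an underscore.

```python
def with_prefix(column: str, new_prefix: str) -> str:
    """Replace the prefix of a column name, keeping the attribute description."""
    _, _, description = column.partition("_")
    if not description:
        raise ValueError(f"column name has no prefix: {column!r}")
    return f"{new_prefix}_{description}"


print(with_prefix("src_last_name", "std"))
```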