This step anonymizes the input data. The content of the output differs from the input, but all essential characteristics of the data are retained. The step supports the following data types:
Data Type |
Description |
---|---|
Date (of birth) |
Date is mostly used in conjunction with the birth number (RC), so the transformation preserves the RC eras (see below). If the date is outside the allowed range, the transformation does not take place: the input value is sent to the output unchanged and the AA_BD_NOT_CHANGED flag is set. Otherwise, the step shifts the date within its original RC era. The older the date (that is, the further it is from birthDayMax), the wider the range from which the transformed date is chosen. The minimum range is 1 year (i.e., +/- 6 months from the input date), and it widens for dates further in the past relative to birthDayMax. A difference of at least one year is required between birthDayMin and birthDayMax to ensure the minimum range. For the step to be able to generate dates that are random enough and within the right RC era, the following restrictions apply to birthDayMax and birthDayMin:
|
Birth Number (RC) |
The transformation of RC follows these rules:
|
First Name |
The transformation occurs for known names. It preserves:
The quality of the transformation (preserving sex, etc.) depends on the dictionary used. When creating the dictionary, the names must be divided into 9 groups (masculine, feminine, and neutral, each for first name, last name, and both), and within each group suitable replacements with a similar frequency of occurrence must be found. The transformation of a name (first or last) consists of replacing it with another name from the nameLookupFileName dictionary. Both single-word and multiple-word names are processed. A token is considered a name if it consists only of letters and/or the apostrophe character ('). Other (non-letter) tokens are considered delimiters and are not transformed by the step. The transformed names are sent to the output separated by the original delimiters. If the name (or any one of the names, if the input contains more than one token) is not found in the dictionary, the input as a whole is considered invalid, the appropriate flag is set (AA_FN_NOT_CHANGED or AA_LN_NOT_CHANGED), and the input is copied to the output unmodified. The dictionary is an Indexed table: the input name is the key and the value is the replacement for the original name. Both must be converted to uppercase before being inserted into the dictionary (the dictionary must be created with the doUppercase parameter set to on). |
Last Name |
The same rules and restrictions apply as with first names. |
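The name rules above can be illustrated with a short Python sketch. It is not the step's implementation: the function name `anonymize_name`, the plain `dict` standing in for the Indexed table, and the sample entries are assumptions made purely for illustration. The sketch splits the input into letter/apostrophe tokens, looks each one up in uppercase, and either replaces every token while keeping the original delimiters or returns the input unchanged together with the flag.

```python
import re

# A name token consists only of letters and/or apostrophes; everything
# else (spaces, hyphens, digits, ...) is treated as a delimiter.
NAME_TOKEN = re.compile(r"(?:[^\W\d_]|')+")


def anonymize_name(value, lookup, not_changed_flag):
    """Return (output, flag). `lookup` maps UPPERCASE names to UPPERCASE
    replacements (hypothetical stand-in for the name lookup dictionary)."""
    if not value:
        # Empty input gives empty output and no flag.
        return value, None

    tokens = NAME_TOKEN.findall(value)
    # If any single token is unknown, the whole input is left unmodified.
    if any(token.upper() not in lookup for token in tokens):
        return value, not_changed_flag

    # Replace each name token in place; delimiters are copied as they are.
    # Assumption: input with no name tokens at all passes through unflagged.
    replaced = NAME_TOKEN.sub(lambda m: lookup[m.group().upper()], value)
    return replaced, None


# Hypothetical dictionary content, keys and values already uppercased.
sample_lookup = {"ANN": "EVE", "JOHN": "PETER", "O'NEIL": "D'ARCY"}

print(anonymize_name("Ann-John", sample_lookup, "AA_FN_NOT_CHANGED"))
print(anonymize_name("John Xavier", sample_lookup, "AA_FN_NOT_CHANGED"))
```

Because the lookup is a fixed mapping, the same input name always yields the same replacement. The replacement comes back in uppercase here only because the sample dictionary stores uppercase values.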
This step ensures that identical input values always produce identical output values. If an input value is empty, the output is also empty (no flag is set).
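One way to get this kind of repeatability for dates is to derive the "random" choice from the input value itself. The Python sketch below is an illustration under several assumptions, not the step's algorithm: the function name `shift_birth_date`, the SHA-256 based seed, and the `widen` factor that controls how quickly the range grows are all hypothetical, and RC eras are ignored entirely. It only mirrors the documented behavior that out-of-range dates pass through with AA_BD_NOT_CHANGED, that the range is at least +/- 6 months, and that the range widens for dates further from birthDayMax.

```python
import hashlib
import random
from datetime import date, timedelta


def shift_birth_date(value, day_min, day_max, widen=0.05):
    """Return (output_date, flag) for one input date.

    `widen` is a made-up factor: the half-width of the range grows by
    this fraction of the distance (in days) from day_max.
    """
    if (day_max - day_min).days < 365:
        raise ValueError("birthDayMin and birthDayMax must be at least one year apart")
    if value is None:
        # Empty input gives empty output and no flag.
        return None, None
    if not (day_min <= value <= day_max):
        # Outside the allowed range: copy the input and set the flag.
        return value, "AA_BD_NOT_CHANGED"

    # Minimum half-width of about 6 months, wider for older dates.
    half_width = 183 + int(widen * (day_max - value).days)
    # Assumption: the window is simply clipped to the allowed range.
    low = max(day_min, value - timedelta(days=half_width))
    high = min(day_max, value + timedelta(days=half_width))

    # Seed the generator from the input so equal inputs give equal outputs.
    digest = hashlib.sha256(value.isoformat().encode("ascii")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return low + timedelta(days=rng.randint(0, (high - low).days)), None


first = shift_birth_date(date(1975, 6, 15), date(1901, 1, 1), date(2008, 4, 23))
second = shift_birth_date(date(1975, 6, 15), date(1901, 1, 1), date(2008, 4, 23))
print(first == second)  # same input, same output
```

Seeding from a hash of the input avoids storing a mapping table, at the cost that anyone who knows the scheme can recompute the shift. A real deployment would presumably mix in a secret, which this sketch leaves out.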
<step className="cz.adastra.cif.tasks.experimental.anonymizer.AnonymizerAlgorithm" id="some_id">
  <properties rc="rcVaue" changeNames="true" nameLookupFileName="name-lookup.cif" birthDayMin="1901-01-01" birthDayMax="2008-04-23" firstName="fName" birthDate="bDate" lastName="lName">
    <scorer explanationColumn="expl">
      <scoringEntries>
        <scoringEntry key="AA_FN_NOT_CHANGED" explainAs="AA_FN_NOT_CHANGED" score="10" explain="true"/>
        <scoringEntry key="AA_LN_NOT_CHANGED" explainAs="AA_LN_NOT_CHANGED" score="20" explain="true"/>
        <scoringEntry key="AA_RC_NOT_CHANGED" explainAs="AA_RC_NOT_CHANGED" score="30" explain="true"/>
        <scoringEntry key="AA_RC_DATE_MISMATCH" explainAs="AA_RC_DATE_MISMATCH" score="40" explain="true"/>
        <scoringEntry key="AA_BD_NOT_CHANGED" explainAs="AA_BD_NOT_CHANGED" score="50" explain="true"/>
        <scoringEntry key="AA_RC_CENTURY_GUESS" explainAs="AA_RC_CENTURY_GUESS" score="60" explain="true"/>
      </scoringEntries>
    </scorer>
  </properties>
</step>
iWay Software |