Detailed Description of Anonymizer

This step anonymizes the input data. The content of the output differs from the input, but all essential characteristics of the data are retained. The step supports the following data types:

Data Type

Description

Date (of birth)

The date is mostly used in conjunction with the birth number (RC) and therefore preserves the RC eras (see below). If the date is outside the allowed range, no transformation takes place: the input value is sent to the output unchanged and the AA_BD_NOT_CHANGED flag is set.

This step shifts the date within the original RC era. The older the date (that is, the further it is from birthDayMax), the wider the range from which the transformed date is chosen. The minimum range is 1 year (i.e., +/- 6 months from the input date). The range gets wider for dates further in the past (relative to birthDayMax). A difference of at least one year between birthDayMin and birthDayMax is required to ensure the minimum range.
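The windowing described above can be sketched as follows. This is only an illustration under stated assumptions, not the step's actual algorithm: the function names, the `growth` factor that controls how fast the range widens, and the choice to slide the window back inside the era (rather than truncate it) are all hypothetical.

```python
from datetime import date, timedelta

# Era boundaries taken from the text: a new era starts on each of these dates.
ERA_STARTS = (date(1954, 1, 1), date(2004, 4, 1))
MIN_HALF_WIDTH_DAYS = 183  # roughly 6 months, giving a 1-year minimum range


def era_bounds(d, birth_day_min, birth_day_max):
    """Return the first and last day of the era that contains d."""
    lo, hi = birth_day_min, birth_day_max
    for start in ERA_STARTS:
        if d >= start:
            lo = max(lo, start)
        else:
            hi = min(hi, start - timedelta(days=1))
    return lo, hi


def shift_window(d, birth_day_min, birth_day_max, growth=0.05):
    """Return (start, end) of the range a replacement date may be drawn from.

    Returns None when d is outside the allowed range (the step would then
    leave the value unchanged). The growth factor is an assumed value.
    """
    if not birth_day_min <= d <= birth_day_max:
        return None
    lo, hi = era_bounds(d, birth_day_min, birth_day_max)
    age_days = (birth_day_max - d).days
    half = timedelta(days=max(MIN_HALF_WIDTH_DAYS, int(age_days * growth)))
    start, end = d - half, d + half
    # Slide the window back inside the era instead of crossing a boundary.
    if start < lo:
        start, end = lo, min(hi, lo + 2 * half)
    elif end > hi:
        start, end = max(lo, hi - 2 * half), hi
    return start, end
```

For example, `shift_window(date(1980, 5, 17), date(1901, 1, 1), date(2008, 4, 23))` yields a range that stays between 1.1.1954 and 31.3.2004, while an older date in the same era gets a wider range.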

For the step to be able to generate dates that are random enough and within the right RC era, the following restrictions apply to birthDayMax and birthDayMin:

  • minimum 1 year difference between birthDayMin and birthDayMax
  • birthDayMin must not be within 1 year before 1.1.1954 or within 1 year before 1.4.2004
  • birthDayMax must not be within 1 year after 1.1.1954 or within 1 year after 1.4.2004
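The three restrictions above can be checked before the step is configured. The following is a minimal sketch, assuming that "within 1 year" excludes the boundary date itself; the function names are hypothetical and not part of the step.

```python
from datetime import date

# Era boundaries named in the restrictions above.
ERA_STARTS = (date(1954, 1, 1), date(2004, 4, 1))


def add_years(d, years):
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def range_problems(birth_day_min, birth_day_max):
    """Return a list of violated restrictions (empty when the range is acceptable)."""
    problems = []
    if birth_day_max < add_years(birth_day_min, 1):
        problems.append("birthDayMin and birthDayMax are less than 1 year apart")
    for start in ERA_STARTS:
        # Strict comparisons: the boundary date itself is assumed to be allowed.
        if add_years(start, -1) < birth_day_min < start:
            problems.append(f"birthDayMin is within 1 year before {start}")
        if start < birth_day_max < add_years(start, 1):
            problems.append(f"birthDayMax is within 1 year after {start}")
    return problems
```

With the values from the configuration example on this page (birthDayMin 1901-01-01, birthDayMax 2008-04-23), `range_problems` returns an empty list.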

Birth Number (RC)

The transformation of RC follows these rules:

  • The transformation occurs only if the RC consists of 6, 9 or 10 characters and its date part is valid and not artificial. Otherwise the input is copied to the output and the AA_RC_NOT_CHANGED flag is set.
  • if a date is on the input and
    • the input date is outside the allowed range: the input date is copied to the output and the input RC is copied to the output. The AA_BD_NOT_CHANGED and AA_RC_NOT_CHANGED flags are set.
    • the input date matches the date from the RC: both dates are transformed to the same date
    • the input date doesn't match the date from the RC: both dates are transformed independently and the AA_RC_MISMATCH flag is set
  • if there is no date on the input and
    • the date from RC is outside the allowed range: the input value is copied to the output and the AA_RC_NOT_CHANGED flag is set.
    • the date from RC is within the allowed range: the date is transformed and the AA_RC_CENTURY_GUESSED flag is set. The missing year from the RC is guessed as the last possible year before birthDayMax.
  • preserves the era of the input RC, i.e., the output RC is from the same era as the input RC. 3 eras are recognized (as intervals): <birthDayMin,31.12.1953>, <1.1.1954,31.3.2004>, <1.4.2004,birthDayMax>
  • preserves the sex from RC
  • preserves the extended form of the month
  • preserves the length of the RC - the trailer in a 10-digit RC is modified (see below), the trailer in a 9-digit RC is sent to the output unmodified
  • preserves the CRC characteristics for 10-digit RC
    • if the input RC has a valid CRC - the output RC also has a valid CRC (the last digit of the trailer is modified accordingly)
    • if the input RC has an invalid CRC - the output RC also has an invalid CRC, which differs from the correct one by the same value as on the input (the last one or two digits of the trailer are modified accordingly)
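The CRC rule for 10-digit RCs can be illustrated with a short sketch. It assumes that a valid CRC means the whole 10-digit number is divisible by 11, and that "differs by the same value" means the remainder modulo 11 is carried over from input to output; special cases are ignored. The function names are hypothetical.

```python
def remainder_mod_11(rc):
    """Remainder of a 10-digit RC modulo 11 (0 is assumed to mean a valid CRC)."""
    return int(rc) % 11


def with_remainder(candidate, target):
    """Adjust the last one or two digits of candidate to reach the target remainder."""
    if len(candidate) != 10 or not candidate.isdigit():
        raise ValueError("expected a 10-digit RC")
    if not 0 <= target < 11:
        raise ValueError("target must be between 0 and 10")
    head = candidate[:9]
    for digit in "0123456789":
        if int(head + digit) % 11 == target:
            return head + digit
    # One remainder cannot be reached with the last digit alone,
    # so the last two digits are varied instead.
    head = candidate[:8]
    for pair in range(100):
        attempt = f"{head}{pair:02d}"
        if int(attempt) % 11 == target:
            return attempt
    raise AssertionError("unreachable: 100 consecutive numbers cover every remainder")
```

A transformation built this way would read `remainder_mod_11` from the input RC, generate the new date part and trailer, and then pass the result through `with_remainder` so that a valid input stays valid and an invalid input stays invalid by the same amount.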

First Name

The transformation occurs for known names. It preserves:

  • roughly the number and the positions of capital letters
  • sex (based on the name)

The quality of the transformation (preserving sex, etc.) depends on the dictionary used. When creating the dictionary, the names must be divided into 9 groups (masculine, feminine and neutral for first name, last name and both), and within these groups suitable replacements with a similar frequency of occurrence must be found.

The transformation of a name (first and last) consists of replacing it with another name using the nameLookupFileName dictionary. Single-word and multiple-word names are processed. A token is considered a name if it consists only of letters and/or the apostrophe character ('). Other (non-letter) tokens are considered delimiters and are not transformed by the step. The transformed names are sent to the output separated by the original delimiters. If the name (or one of the names, if the input contains more than one token) is not found in the dictionary, the input as a whole is considered invalid, the appropriate flag is set (AA_FN_NOT_CHANGED or AA_LN_NOT_CHANGED) and the input is copied to the output unmodified.

The dictionary is an Indexed table. The input name is used as the key and the value is the replacement for the original name. Both values must be transformed to uppercase before being inserted into the dictionary (the dictionary must be created with the doUppercase parameter set to on).
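The token handling described above can be sketched as follows. This is an illustration only: the in-memory `lookup` dict stands in for the Indexed table, and the case-restoring rule in `copy_case` is an assumption about what "roughly" preserving capital letters might mean.

```python
import re

# A name token is a run of letters and/or apostrophes; everything else is a delimiter.
NAME_TOKEN = re.compile(r"((?:[^\W\d_]|')+)")


def copy_case(source, replacement):
    """Roughly carry the capitalization of source over to replacement."""
    if source.isupper():
        return replacement.upper()
    if source[:1].isupper():
        return replacement.capitalize()
    return replacement.lower()


def anonymize_name(text, lookup):
    """Return (output, changed). lookup maps UPPERCASE names to UPPERCASE replacements.

    If any token is missing from the lookup, the whole input is returned
    unmodified with changed=False (the step would set a *_NOT_CHANGED flag).
    """
    parts = NAME_TOKEN.split(text)  # even indexes: delimiters, odd indexes: names
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            out.append(part)
            continue
        replacement = lookup.get(part.upper())
        if replacement is None:
            return text, False
        out.append(copy_case(part, replacement))
    return "".join(out), True
```

With a hypothetical lookup of `{"JAN": "PETR", "PAVEL": "KAREL"}`, the input `Jan-Pavel` becomes `Petr-Karel`, while `Jan Xaver` is returned unchanged because `XAVER` is not in the dictionary.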

Last name

The same rules and restrictions apply as with first names.

This step ensures that for all pairs of identical input values, identical output values will be generated. If any input value is empty, then the output is also empty (no flag is set).
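One way to obtain identical outputs for identical inputs is to derive every "random" choice from a keyed hash of the input value. The sketch below shows that idea only; the source does not state how the step achieves this property, and the `secret` key and function name are hypothetical.

```python
import hashlib
import hmac


def stable_choice(value, candidates, secret):
    """Pick a candidate deterministically from the input value.

    An empty input yields an empty output, mirroring the rule above.
    """
    if value == "":
        return ""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]
```

Calling `stable_choice` twice with the same value, candidate list and key always returns the same candidate, so repeated occurrences of one input value map to one output value.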



Example
<step className="cz.adastra.cif.tasks.experimental.anonymizer.AnonymizerAlgorithm" id="some_id">
  <properties rc="rcVaue" 
              changeNames="true" 
              nameLookupFileName="name-lookup.cif" 
              birthDayMin="1901-01-01" 
              birthDayMax="2008-04-23"
              firstName="fName" 
              birthDate="bDate" 
              lastName="lName">
    <scorer explanationColumn="expl">
      <scoringEntries>
        <scoringEntry key="AA_FN_NOT_CHANGED" explainAs="AA_FN_NOT_CHANGED" score="10" explain="true"/>
        <scoringEntry key="AA_LN_NOT_CHANGED" explainAs="AA_LN_NOT_CHANGED" score="20" explain="true"/>
        <scoringEntry key="AA_RC_NOT_CHANGED" explainAs="AA_RC_NOT_CHANGED" score="30" explain="true"/>
        <scoringEntry key="AA_RC_DATE_MISMATCH" explainAs="AA_RC_DATE_MISMATCH" score="40" explain="true"/>
        <scoringEntry key="AA_BD_NOT_CHANGED" explainAs="AA_BD_NOT_CHANGED" score="50" explain="true"/>
        <scoringEntry key="AA_RC_CENTURY_GUESS" explainAs="AA_RC_CENTURY_GUESS" score="60" explain="true"/>
      </scoringEntries>
    </scorer>
  </properties>
</step>

iWay Software