Based on a specified first name and last name, this step determines the gender value and verifies it against the input value, if one is provided. The final result can be one of the following:
Determination of the gender value is dictionary-based. The dictionaries contain known first and last names together with the ratio in which each name occurs among men and women. The final decision about the gender value is based on a threshold, which is 51 percent by default (i.e., only first names and last names that each reach a ratio of at least 51 percent are considered to confirm the gender value unambiguously). The default value can be changed using the properties nameSurenessLevel and surnameSurenessLevel (a percentage; the step accepts values between 51 and 100).
Note: statistical data on first names and last names for the Czech Republic were used as the source for the dictionaries.
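The threshold rule can be sketched in Python as follows. This is a minimal illustration, not the step's implementation: the `MALE_SHARE` dictionary, its layout (a name mapped to the percentage of its bearers who are men), the sample values, the result labels, and the function `classify` are all hypothetical. The real lookup files have their own format.

```python
# Hypothetical in-memory dictionary: name -> percentage of bearers who are men.
# The sample values are invented for illustration only.
MALE_SHARE = {"JAN": 100.0, "JANA": 0.0, "NIKOLA": 40.0, "RENE": 50.0}


def classify(name, male_share, sureness_level=51.0):
    """Classify one name component as 'M', 'F', 'NOT_CONCLUSIVE' or 'NOT_FOUND'."""
    # The step accepts thresholds between 51 and 100 percent.
    if not 51 <= sureness_level <= 100:
        raise ValueError("sureness level must be between 51 and 100")
    if name not in male_share:
        return "NOT_FOUND"
    male = male_share[name]
    female = 100.0 - male
    if male >= sureness_level:
        return "M"
    if female >= sureness_level:
        return "F"
    # Neither share reaches the threshold (e.g. a 50/50 name).
    return "NOT_CONCLUSIVE"


print(classify("NIKOLA", MALE_SHARE))        # 60 % female clears the default threshold
print(classify("NIKOLA", MALE_SHARE, 99.0))  # the same name is inconclusive at 99 %
```

Raising the threshold trades coverage for certainty: fewer names are decided, but each decision rests on a stronger ratio.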
The original gender value is replaced only if the first name and the last name both yield the same derived gender value (both satisfy the defined threshold). In all other cases the original (input) value is stored in the output. If the input value is incorrect (and/or not found in the dictionary), it is treated as empty and the step sets the scoring flag GNDR_GENDER_MISMATCH.
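A minimal Python sketch of this replacement rule is shown below. The function `resolve_gender` and its parameters are hypothetical names introduced for illustration. The sketch assumes that each name component has already been reduced to the male definition, the female definition, or some other value meaning "not decided".

```python
def resolve_gender(in_gender, name_gender, surname_gender, male="M", female="F"):
    """Return (output_gender, flags) following the replacement rule."""
    flags = []
    value = (in_gender or "").strip()
    allowed = {male.upper(), female.upper()}
    # An input value that matches neither definition is treated as empty and flagged.
    # The comparison ignores case.
    if value and value.upper() not in allowed:
        flags.append("GNDR_GENDER_MISMATCH")
        value = ""
    # Replace only when both components are conclusive and agree.
    if name_gender in (male, female) and name_gender == surname_gender:
        return name_gender, flags
    # Otherwise keep the (possibly emptied) input value.
    return value, flags


print(resolve_gender("F", "M", "M"))   # both components agree, input is replaced
print(resolve_gender("F", "M", None))  # only one component decided, input is kept
```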
This step also supports single-component verification, using the last name only. In this case it only confirms the gender value according to the last name, or it sets a scoring flag and fills estimatedGender, a string indicating the estimated result. The original value is nevertheless retained and stored in the output inGender (a replacement in this situation can be forced using the property overwriteIfGuess).
The same situation occurs (with the same result) when the determination of the gender value is not definite but one of the components suggests the probable gender value (i.e., only one of the components satisfies the threshold in the appropriate dictionary).
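The effect of overwriteIfGuess on the output columns can be sketched as follows. The function `apply_guess` is hypothetical, and the sketch assumes that estimatedGender carries the guess whenever exactly one component is conclusive. The text above does not spell out every case, so treat this as an illustration only.

```python
GENDERS = ("M", "F")


def apply_guess(in_gender, name_gender, surname_gender, overwrite_if_guess=False):
    """Return (inGender, estimatedGender) for the single-component case."""
    name_ok = name_gender in GENDERS
    surname_ok = surname_gender in GENDERS
    if name_ok == surname_ok:
        # Both or neither component is conclusive: outside the scope of this sketch,
        # so the input is returned unchanged with no estimate.
        return in_gender, None
    estimated = name_gender if name_ok else surname_gender
    # The original value is kept unless a replacement is forced.
    out_gender = estimated if overwrite_if_guess else in_gender
    return out_gender, estimated


print(apply_guess("F", None, "M"))                           # input kept, guess reported
print(apply_guess("F", None, "M", overwrite_if_guess=True))  # input replaced by the guess
```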
Because the step uses dictionaries with exact forms of first names and last names, the input first name and last name need to be cleansed and identified beforehand (e.g., using the Guess Name Surname step).
Comparison of the determined and input gender values is case-insensitive.
The following table shows the scoring flags set by the step for combinations of different input values. If no input gender value is specified, the step has no data to compare the result against; instead, the values from the "F" and "M" rows and columns (blue) are used.
If an input gender value is provided, the "Input gender value opposite" and "Input gender value equals" rows and columns (green) are used. The remaining table values (black) do not depend on the input gender value, i.e., they are the same in both cases. The scoring flags shown correspond to overwriteIfGuess=false. If this property is set to true, the scoring flag CHANGED applies instead of CHANGE_SUGGESTION.
| Last name (rows) / First name (columns) | Not provided | Not found | Not conclusive | F (female) | M (male) | Input gender value opposite | Input gender value equals |
|---|---|---|---|---|---|---|---|
| Not provided | UNDECIDABLE | UNDECIDABLE NAME_UNKNOWN | UNDECIDABLE | CHANGE_SUGGESTION | CHANGE_SUGGESTION | CHANGE_SUGGESTION | CONFIRMED |
| Not found | UNDECIDABLE SURNAME_UNKNOWN | UNDECIDABLE NAME_UNKNOWN SURNAME_UNKNOWN | UNDECIDABLE SURNAME_UNKNOWN | CHANGE_SUGGESTION SURNAME_UNKNOWN | CHANGE_SUGGESTION SURNAME_UNKNOWN | CHANGE_SUGGESTION SURNAME_UNKNOWN | CONFIRMED SURNAME_UNKNOWN |
| Not conclusive | UNDECIDABLE | UNDECIDABLE NAME_UNKNOWN | UNDECIDABLE | CHANGE_SUGGESTION | CHANGE_SUGGESTION | CHANGE_SUGGESTION | CONFIRMED |
| F (female) | CHANGE_SUGGESTION | CHANGE_SUGGESTION NAME_UNKNOWN | CHANGE_SUGGESTION | CHANGED | MISMATCH | | |
| M (male) | CHANGE_SUGGESTION | CHANGE_SUGGESTION NAME_UNKNOWN | CHANGE_SUGGESTION | MISMATCH | CHANGED | | |
| Input gender value opposite | CHANGE_SUGGESTION | CHANGE_SUGGESTION NAME_UNKNOWN | CHANGE_SUGGESTION | | | CHANGED | MISMATCH |
| Input gender value equals | CONFIRMED | CONFIRMED NAME_UNKNOWN | CONFIRMED | | | MISMATCH | CONFIRMED |
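The table can be condensed into a few rules, sketched here in Python. The function `scoring_flags` and the state labels (`NOT_PROVIDED`, `NOT_FOUND`, `NOT_CONCLUSIVE`) are hypothetical names chosen for this illustration. The sketch assumes the input gender has already been validated, so it is either empty, "F" or "M".

```python
CONCLUSIVE = ("F", "M")


def scoring_flags(first_name, last_name, in_gender="", overwrite_if_guess=False):
    """Return the flags for one row/column combination of the table.

    first_name and last_name are each one of:
    'NOT_PROVIDED', 'NOT_FOUND', 'NOT_CONCLUSIVE', 'F', 'M'.
    """
    in_gender = in_gender.upper()  # the comparison ignores case
    guesses = [g for g in (first_name, last_name) if g in CONCLUSIVE]
    if not guesses:
        main = "UNDECIDABLE"        # neither component decides
    elif len(guesses) == 2 and guesses[0] != guesses[1]:
        main = "MISMATCH"           # the components contradict each other
    elif guesses[0] == in_gender:
        main = "CONFIRMED"          # the result equals the input value
    elif len(guesses) == 2 or overwrite_if_guess:
        main = "CHANGED"            # both agree, or a guess is forced
    else:
        main = "CHANGE_SUGGESTION"  # a single component suggests a value
    flags = [main]
    if first_name == "NOT_FOUND":
        flags.append("NAME_UNKNOWN")
    if last_name == "NOT_FOUND":
        flags.append("SURNAME_UNKNOWN")
    return flags


print(scoring_flags("F", "NOT_FOUND", in_gender="F"))
print(scoring_flags("M", "NOT_PROVIDED", in_gender="F", overwrite_if_guess=True))
```

Expressing the table as rules rather than as a literal lookup keeps the overwriteIfGuess variant in one place, at the cost of hiding the cell-by-cell layout.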
<step id='alg' className='cz.adastra.cif.tasks.clean.UpdateGenderAlgorithm'>
  <properties>
    <inGender>ingender</inGender>
    <inSurname>insurname</inSurname>
    <inName>inname</inName>
    <estimatedGender>estgender</estimatedGender>
    <firstNameRatioLookupFileName>ratio_names.cif</firstNameRatioLookupFileName>
    <surnameRatioLookupFileName>ratio_surnames.cif</surnameRatioLookupFileName>
    <maleDefinition>M</maleDefinition><!-- opt : M -->
    <femaleDefinition>F</femaleDefinition><!-- opt : F -->
    <nameSurenessLevel>99</nameSurenessLevel><!-- opt : 51 -->
    <surnameSurenessLevel>99</surnameSurenessLevel><!-- opt : 51 -->
    <overwriteIfGuess>false</overwriteIfGuess>
    <scorer explanationColumn='expl'>
      <scoringEntries>
        <scoringEntry key='UG_CONFIRMED' score='0' explain='true' />
        <scoringEntry key='UG_CHANGED' score='500' explain='true' />
        <scoringEntry key='UG_MISMATCH' score='300' explain='true' />
        <scoringEntry key='UG_UNDECIDABLE' score='700' explain='true' />
        <scoringEntry key='UG_CHANGE_SUGGESTION' score='500' explain='true' />
        <scoringEntry key='UG_GENDER_MISMATCH' score='200' explain='true' />
        <scoringEntry key='UG_NAME_UNKNOWN' score='100' explain='true' />
        <scoringEntry key='UG_SURNAME_UNKNOWN' score='100' explain='true' />
      </scoringEntries>
    </scorer>
  </properties>
</step>
iWay Software