Detailed Description of Unification

List of comparison functions

hamming

Returns the Hamming distance between two strings, i.e. the number of positions at which the corresponding characters differ. The function has a relative variant in which the result is divided by the length of the longer string.
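The computation can be sketched in Python as follows. The names hamming and hamming_relative are illustrative only. The treatment of strings of unequal length is an assumption, since the description above does not specify it: here each surplus character of the longer string counts as one mismatch.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two strings differ."""
    # Compare the characters position by position over the common length.
    mismatches = sum(x != y for x, y in zip(a, b))
    # Assumption: every surplus character of the longer string is a mismatch.
    return mismatches + abs(len(a) - len(b))


def hamming_relative(a: str, b: str) -> float:
    """Relative variant: distance divided by the length of the longer string."""
    longer = max(len(a), len(b))
    # Assumption: two empty strings are treated as identical (distance 0).
    return hamming(a, b) / longer if longer else 0.0


print(hamming("karolin", "kathrin"))           # 3
print(hamming_relative("karolin", "kathrin"))  # 3 / 7
```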

levenshtein

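Returns the Levenshtein distance between two strings, i.e. the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The function has a relative variant in which the result is divided by the length of the longer string.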
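As an illustration, the standard dynamic-programming formulation of the Levenshtein distance can be sketched in Python. The function names are illustrative, and the result for two empty strings in the relative variant is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_relative(a: str, b: str) -> float:
    """Relative variant: distance divided by the length of the longer string."""
    longer = max(len(a), len(b))
    # Assumption: two empty strings are treated as identical (distance 0).
    return levenshtein(a, b) / longer if longer else 0.0


print(levenshtein("edit", "edti"))           # 2
print(levenshtein_relative("edit", "edti"))  # 0.5
```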

editDistance

Returns the edit distance between two strings. The Levenshtein distance and the edit distance differ in how they treat two swapped adjacent characters: levenshtein counts the swap as two changes, whereas editDistance counts it as one change.

For example, editDistance("edit", "edti") = 1, whereas levenshtein("edit", "edti") = 2. The function has a relative variant in which the result is divided by the length of the longer string.
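A distance that counts a swap of two adjacent characters as a single change can be sketched in Python as below. This sketch uses the restricted variant (often called optimal string alignment), which is an assumption: the description does not say which transposition-aware variant editDistance implements. The function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance extended with adjacent-character swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # A swap of two adjacent characters counts as one change.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("edit", "edti"))  # 1
```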

symmetricDifference

The compared strings are split into two sets of words (the space character is used as the separator). The function returns the number of words contained in only one of the sets, i.e. the cardinality of the set (A \ B) U (B \ A). The function has a relative variant in which the result is divided by the cardinality of the union of the sets.

Example: for the strings "JOHN SMITH" and "SMITH JOHN GEORGE JOHN MARTIN" the sets are:

A = { JOHN, SMITH }
B = { JOHN, SMITH, GEORGE, MARTIN }

result = |(A \ B) U (B \ A)| = |{ GEORGE, MARTIN }| = 2
relative result = 2 / |A U B| = 2 / |{ JOHN, SMITH, GEORGE, MARTIN }| = 2 / 4 = 0.5
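The set computation in this example can be reproduced with Python's built-in set type. This is only a sketch: the function names are illustrative, runs of spaces are assumed to act as a single separator, and the relative result for two empty strings is assumed to be 0.

```python
def symmetric_difference(a: str, b: str) -> int:
    """Number of words that occur in exactly one of the two strings."""
    # Assumption: split() treats runs of whitespace as one separator.
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a ^ set_b)  # ^ is (A \ B) U (B \ A)


def symmetric_difference_relative(a: str, b: str) -> float:
    """Relative variant: result divided by the cardinality of the union."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    # Assumption: two empty strings give a relative result of 0.
    return len(set_a ^ set_b) / len(union) if union else 0.0


first, second = "JOHN SMITH", "SMITH JOHN GEORGE JOHN MARTIN"
print(symmetric_difference(first, second))           # 2
print(symmetric_difference_relative(first, second))  # 0.5
```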

symmetricDifferenceExt

The same as the symmetricDifference function, except that when the two sets have no word in common (A & B is empty), the function returns a "very big number" (VBN = 1000000). The function has a relative variant.

Example: symmetricDifferenceExt("JOHN SMITH", "GEORGE MARTIN") = VBN, whereas symmetricDifference("JOHN SMITH", "GEORGE MARTIN") = 4, because each of the four words occurs in only one of the sets.
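The extended behaviour can be sketched in Python by adding one check for an empty intersection. The names VBN and symmetric_difference_ext are illustrative. Only the absolute variant is shown, because the description does not say how the relative variant treats the VBN case.

```python
VBN = 1_000_000  # the "very big number" from the description


def symmetric_difference_ext(a: str, b: str) -> int:
    """Symmetric difference that returns VBN when no word is shared."""
    # Assumption: split() treats runs of whitespace as one separator.
    set_a, set_b = set(a.split()), set(b.split())
    if not (set_a & set_b):
        # The two sets have no word in common.
        return VBN
    return len(set_a ^ set_b)


print(symmetric_difference_ext("JOHN SMITH", "GEORGE MARTIN"))  # 1000000
print(symmetric_difference_ext("JOHN SMITH", "JOHN BROWN"))     # 2
```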

symmetricDifferenceMultiSet

Similar to the symmetricDifference function, but repeated words within a string are treated as distinct elements. The function has a relative variant in which the result is divided by the cardinality of the union of the sets (again counting repeated words separately).

Example: for the strings "JOHN SMITH" and "SMITH JOHN GEORGE JOHN MARTIN" the sets are:

A = { JOHN, SMITH }
B = { JOHN, JOHN(second), SMITH, GEORGE, MARTIN }

result = |(A \ B) U (B \ A)| = |{ GEORGE, MARTIN, JOHN(second) }| = 3
relative result = 3 / |A U B| = 3 / |{ JOHN, JOHN(second), SMITH, GEORGE, MARTIN }| = 3 / 5 = 0.6
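One way to model the multiset behaviour in Python is with collections.Counter, which keeps a count per word. This is a sketch under that modelling assumption, with illustrative function names. The union takes the larger count of each word, which matches the five-element union in the example.

```python
from collections import Counter


def symmetric_difference_multiset(a: str, b: str) -> int:
    """Symmetric difference in which repeated words count separately."""
    count_a, count_b = Counter(a.split()), Counter(b.split())
    # Counter subtraction keeps only positive counts, i.e. A \ B and B \ A.
    only_a = count_a - count_b
    only_b = count_b - count_a
    return sum(only_a.values()) + sum(only_b.values())


def symmetric_difference_multiset_relative(a: str, b: str) -> float:
    """Relative variant: result divided by the size of the multiset union."""
    count_a, count_b = Counter(a.split()), Counter(b.split())
    # The | operator keeps the larger count of each word.
    union_size = sum((count_a | count_b).values())
    # Assumption: two empty strings give a relative result of 0.
    if union_size == 0:
        return 0.0
    return symmetric_difference_multiset(a, b) / union_size


first, second = "JOHN SMITH", "SMITH JOHN GEORGE JOHN MARTIN"
print(symmetric_difference_multiset(first, second))           # 3
print(symmetric_difference_multiset_relative(first, second))  # 0.6
```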

symmetricDifferenceMultiSetExt

The same as the symmetricDifferenceMultiSet function, except that when the two sets have no word in common (A & B is empty), the function returns a "very big number" (VBN = 1000000). The function has a relative variant.

notSubset

The compared strings are split into two sets of words (the space character is used as the separator). The function returns 0 if one of the sets is a subset of the other; otherwise it returns 1.

Example: for the strings "JOHN SMITH" and "JOHN BROWN" the function returns 1 (neither set is a subset of the other). For the strings "JOHN SMITH" and "JOHN JOHN" it returns 0, because { JOHN } is a subset of { JOHN, SMITH }.
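The subset test can be sketched with Python sets, where the <= operator checks for a subset. The function name is illustrative, and runs of spaces are assumed to act as a single separator.

```python
def not_subset(a: str, b: str) -> int:
    """Return 0 if either word set is a subset of the other, else 1."""
    set_a, set_b = set(a.split()), set(b.split())
    # <= on sets is the subset test; check both directions.
    return 0 if set_a <= set_b or set_b <= set_a else 1


print(not_subset("JOHN SMITH", "JOHN BROWN"))  # 1
print(not_subset("JOHN SMITH", "JOHN JOHN"))   # 0
```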

notSubsetMultiSet

Similar to the notSubset function, but repeated words within a string are treated as distinct elements.

Example: the function returns 1 for the strings "JOHN SMITH" and "JOHN JOHN", because the second string contains JOHN twice and the first contains it only once.
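A multiset version of the subset test can be sketched with collections.Counter: one multiset is contained in another when no word occurs more often in it than in the other. The function and helper names are illustrative, and modelling the multisets as Counter objects is an assumption.

```python
from collections import Counter


def _is_sub_multiset(small: Counter, large: Counter) -> bool:
    """True if every word occurs in small at most as often as in large."""
    return all(count <= large[word] for word, count in small.items())


def not_subset_multiset(a: str, b: str) -> int:
    """Return 0 if either word multiset is contained in the other, else 1."""
    count_a, count_b = Counter(a.split()), Counter(b.split())
    if _is_sub_multiset(count_a, count_b) or _is_sub_multiset(count_b, count_a):
        return 0
    return 1


print(not_subset_multiset("JOHN SMITH", "JOHN JOHN"))        # 1
print(not_subset_multiset("JOHN SMITH", "JOHN SMITH JOHN"))  # 0
```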

numberDistance

Returns the absolute difference between two integers, i.e. |a - b|. There is no relative variant for this function.

anyIsTrue

Returns 1 if at least one of the two compared boolean values is true. There is no relative variant for this function.

masterIsTrue

Returns 1 if the boolean value for the center record is true. The value for the slave record is ignored. There is no relative variant for this function.

slaveIsTrue

Returns 1 if the boolean value for the slave record is true. The value for the center record is ignored. There is no relative variant for this function.

bothAreNull

Returns 1 if both values are null. There is no relative variant for this function.


iWay Software