Normalization

Normalization puts names and addresses into a consistent form before our matching algorithms compare them, which improves the accuracy and relevance of the matches.
The steps involved in the normalization process are outlined below.
Case
Procedure
Stripping Punctuation
Remove all punctuation marks from names and addresses except periods, commas, and hyphens, which are kept.
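As an illustration only, the punctuation step could be sketched in Python as follows. The function name `strip_punctuation` and the use of Unicode general categories to decide what counts as punctuation are assumptions made for this example, not a description of the product's implementation.

```python
import unicodedata

# Assumption: periods, commas, and hyphens are the only punctuation kept.
KEPT_PUNCTUATION = {".", ",", "-"}

def strip_punctuation(text: str) -> str:
    """Drop every Unicode punctuation character except the kept ones.

    Characters in the symbol categories (for example "$" or "+") are not
    punctuation in Unicode terms and are left untouched by this sketch.
    """
    return "".join(
        ch for ch in text
        if ch in KEPT_PUNCTUATION
        or not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("O'Brien, J. (Jr.) #12-B"))  # OBrien, J. Jr. 12-B
```

The apostrophe, parentheses, and number sign are removed, while the comma, periods, and hyphen survive.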
Lowercasing
Convert all characters to lowercase for consistent comparison.
White Space Removal
Collapse each run of consecutive white space characters into a single space for uniformity.
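The lowercasing and white space steps above can be sketched together in a few lines of Python. The function name `lowercase_and_collapse` is hypothetical, and trimming leading and trailing white space is an assumption added for the example; the text above only describes condensing repeated white space.

```python
import re

def lowercase_and_collapse(text: str) -> str:
    """Lowercase the text, then reduce each whitespace run to one space."""
    lowered = text.lower()
    # \s+ matches any run of spaces, tabs, or newlines.
    collapsed = re.sub(r"\s+", " ", lowered)
    # Assumption: leading and trailing spaces are also dropped.
    return collapsed.strip()

print(lowercase_and_collapse("  123  Main\tStreet \n"))  # 123 main street
```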
Stop Word Removal
Remove stop words, which contribute little to the meaning of a name or address in the language involved. Stop words are divided into two categories.
  • Stop Prefixes
A stop prefix is a specific beginning that is removed from names during indexing and querying.
  • Stop Patterns
A stop pattern is a regular expression used to exclude specific name elements during indexing and querying. Please refer to Appendix 1 for a description of the stop prefixes and stop patterns, categorized by language.
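To make the two categories concrete, here is a minimal Python sketch. The prefix and pattern values are invented for illustration and are not taken from the appendix, the function name `remove_stop_words` is hypothetical, and the sketch assumes the input has already been lowercased.

```python
import re

# Hypothetical values for illustration; the real lists are per language.
STOP_PREFIXES = ("mr ", "mrs ", "dr ")
STOP_PATTERNS = [re.compile(r"\b(?:jr|sr|phd)\b")]

def remove_stop_words(name: str) -> str:
    """Remove one leading stop prefix, then every stop-pattern match."""
    for prefix in STOP_PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break  # strip at most one prefix
    for pattern in STOP_PATTERNS:
        name = pattern.sub(" ", name)
    # Tidy up any spaces left behind by the removals.
    return " ".join(name.split())

print(remove_stop_words("dr john smith jr"))  # john smith
```

A prefix is matched only at the start of the name, whereas a pattern can match anywhere, which is why the two categories are handled separately.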
Transliteration and Translation
  • Add vocalizations or vowels in the correct location for languages such as Arabic and Thai, whose written forms may omit vowels.
  • Insert spaces (segmentation) into text in languages such as Chinese, Japanese, Korean, and Thai that don't use spaces, separating given name and surname tokens.
Diacritical Marks Removal
Eliminate diacritical marks from names and addresses to establish a consistent representation.
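One common way to remove diacritical marks, shown here as a sketch rather than as the product's actual method, is to decompose each character with Unicode normalization and drop the combining marks. The function name `remove_diacritics` is hypothetical.

```python
import unicodedata

def remove_diacritics(text: str) -> str:
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Recompose so any remaining sequences are in their usual form.
    return unicodedata.normalize("NFC", stripped)

print(remove_diacritics("José Müller"))  # Jose Muller
```

Letters that have no canonical decomposition, such as "ł" or "ø", pass through unchanged under this approach and would need an explicit mapping table if they must also be folded.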