Normalization

Normalization standardizes names and addresses before our matching algorithms compare them, so that superficial differences in formatting do not reduce the accuracy or relevance of the results.

Below, we outline the steps involved in this normalization process.

Case	Procedure
Stripping Punctuation	Remove all punctuation marks from names and addresses except periods, commas, and hyphens.
Lowercasing	Convert all characters to lowercase for consistent comparison.
White Space Removal	Condense consecutive white space characters into a single space for uniformity.
Stop Word Removal	Remove stop words that do not contribute significantly to the meaning of a name or address in the language involved. Stop words are divided into two categories. Stop Prefixes: a stop prefix removes a specific beginning from a name during indexing and querying. Stop Patterns: a stop pattern is a regular expression used to exclude specific name elements during indexing and querying. Please refer to Appendix 1 for a description of the stop prefixes and stop patterns categorized by language.
Transliteration and Translation	Add vocalizations (vowels) in the correct positions for languages such as Arabic and Thai, which don't have vowels. For languages that don't use spaces, such as Chinese, Japanese, Korean, and Thai, insert spaces (segmentation) to separate given-name and surname tokens.
Diacritical Marks Removal	Eliminate diacritical marks from names and addresses to establish a consistent representation.
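
The steps in the table can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names, the order in which the steps run, and the sample stop prefixes and stop pattern are all assumptions made for illustration (the actual lists are language-specific and described in Appendix 1). Transliteration and segmentation are omitted because they require language-specific resources.

```python
import re
import unicodedata

# Hypothetical sample values; the real stop prefixes and stop patterns
# are language-specific and are not reproduced here.
STOP_PREFIXES = ("dr. ", "mr. ", "mrs. ")
STOP_PATTERNS = [re.compile(r"\b(?:inc|ltd|llc)\b\.?")]

KEPT_PUNCTUATION = {".", ",", "-"}


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation except periods, commas, and hyphens."""
    return "".join(
        ch for ch in text
        if ch in KEPT_PUNCTUATION or not unicodedata.category(ch).startswith("P")
    )


def collapse_whitespace(text: str) -> str:
    """Condense runs of white space to one space and trim both ends."""
    return " ".join(text.split())


def remove_stop_words(text: str) -> str:
    """Remove at most one leading stop prefix, then every stop-pattern match."""
    for prefix in STOP_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    for pattern in STOP_PATTERNS:
        text = pattern.sub(" ", text)
    return text


def remove_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text: str) -> str:
    """Apply the steps in table order (the order itself is an assumption)."""
    text = strip_punctuation(text)
    text = text.lower()
    text = collapse_whitespace(text)
    text = remove_stop_words(text)
    text = collapse_whitespace(text)  # tidy gaps left by removed stop words
    return remove_diacritics(text)


if __name__ == "__main__":
    print(normalize("  Dr. José   O'Brien-Smith!! "))
```

With these sample values, the input "  Dr. José   O'Brien-Smith!! " normalizes to "jose obrien-smith": the apostrophe and exclamation marks are stripped, the hyphen is kept, the hypothetical "dr. " prefix is removed, and the accent on "é" is dropped.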

This article applies to: