Fuzzy Searching Using PostgreSQL

There are many use cases for person searches, using identifiers like names and addresses, against very large databases containing detailed, sensitive and private information about individuals and entities. These range from customer relationship management initiatives, where the goal is to understand consumer behavior and develop marketing strategies, to identity management and fraud detection in financial services.

A key challenge in uniquely identifying a person or entity from non-deterministic elements like names and addresses is the use of nicknames, spelling variations and mistakes, and abbreviations. We built a solution around a customized multi-tier approach to fuzzy matching, implemented using the various text search capabilities of the PostgreSQL database.

At the basic level, we can start with fuzzy matching algorithms like Soundex and Levenshtein distance. Applied to name-based search queries, both are immediate improvements over an exact match. Soundex works well for matching names that are spelled differently but share the same phonetics, while the Levenshtein distance absorbs spelling mistakes in the search term. However, we do have to work out a strategy to tune the Levenshtein distance dynamically for each search: a threshold that is too large retrieves many false positives, while one that is too small may return no results at all. Another drawback of the Levenshtein approach is that queries can slow down substantially on very large tables, on the order of 100 million rows or more.
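
As a rough illustration, both algorithms are available out of the box through PostgreSQL's fuzzystrmatch extension. The sketch below assumes a hypothetical persons table with a last_name column; the search terms and the distance threshold are placeholders.

    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

    -- Phonetic match: retrieves names that sound like 'Smith' (Smyth, Smithe, ...).
    SELECT last_name
    FROM persons
    WHERE soundex(last_name) = soundex('Smith');

    -- Edit-distance match: tolerates up to two character edits in the search term.
    SELECT last_name, levenshtein(lower(last_name), lower('Jonsen')) AS distance
    FROM persons
    WHERE levenshtein(lower(last_name), lower('Jonsen')) <= 2
    ORDER BY distance;

The threshold of 2 in the second query is exactly the value that has to be tuned per search in practice.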

For matching addresses, some elements like the city, state and zip code are fairly deterministic. That’s the easy part! The challenge lies in matching the free-text address lines containing apartment numbers, street names and so on. The address text the user searches for can vary substantially from what is stored in the database. For example, the words “Street” and “Drive” can be written in many ways (St, Str, Dr, Drv, Dr. and so on), and the address line “Apartment 115A, Park Road” can just as well be written as “Park Rd, Apt 115A”, among many other variants. The key in fuzzy searches is to represent the text as a vector of words that have been normalized using text-processing filters.
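
To make the normalization idea concrete, PostgreSQL's synonym dictionary template can fold common address abbreviations into canonical forms before the words are collected into a tsvector. The sketch below is illustrative only: the dictionary name, configuration name and synonym file (address_syn.syn, placed in the server's tsearch_data directory) are assumptions, not a description of our production setup.

    -- Assumed contents of $SHAREDIR/tsearch_data/address_syn.syn:
    --   rd road
    --   dr drive
    --   apt apartment
    CREATE TEXT SEARCH DICTIONARY address_syn (
        TEMPLATE = synonym,
        SYNONYMS = address_syn
    );

    CREATE TEXT SEARCH CONFIGURATION address_cfg (COPY = simple);

    ALTER TEXT SEARCH CONFIGURATION address_cfg
        ALTER MAPPING FOR asciiword WITH address_syn, simple;

    -- Both spellings now reduce to the same normalized word vector.
    SELECT to_tsvector('address_cfg', 'Apartment 115A, Park Road')
           @@ plainto_tsquery('address_cfg', 'Park Rd, Apt 115A');  -- returns true

Storing the normalized tsvector in a column and indexing it with GIN keeps such lookups fast even on very large tables.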

PostgreSQL provides a good platform to start with for text search, such as the fuzzystrmatch module, which offers vanilla implementations of various text matching algorithms out of the box. PostgreSQL also implements the concept of Dictionaries, which can be customized. With the underlying database doing the basic work, we were able to focus on the complex part: finding the right person in a haystack of more than 500 million records.
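
As one rough sketch of how these pieces can be tiered, a coarse phonetic prefilter can be made index-assisted, with the more expensive edit-distance check applied only to the surviving candidates. The persons table and the search term below are again hypothetical, and this is one possible arrangement rather than a description of our exact pipeline.

    -- Expression index on the phonetic code (soundex() is marked IMMUTABLE, so it can be indexed).
    CREATE INDEX persons_last_name_soundex_idx ON persons (soundex(last_name));

    -- Tier 1: the index narrows the table to a small phonetic candidate set.
    -- Tier 2: Levenshtein distance re-ranks only those candidates.
    SELECT last_name,
           levenshtein(lower(last_name), lower('Jonsen')) AS distance
    FROM persons
    WHERE soundex(last_name) = soundex('Jonsen')
      AND levenshtein(lower(last_name), lower('Jonsen')) <= 2
    ORDER BY distance;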

If searching for textual information in very large databases is part of the problem you are trying to solve, talk to us.