It is a common task for a developer – Given a set of inputs, come up with some logic that goes through the data tables and finds matching records. For someone working at the IRS, the requirement might be to find tax returns with a high income level and many deductions [to find people to audit]. For an HR application, a query might be needed to find employees based on name and date of birth. Simple, right? All that’s needed is a simple condition in the
whereclause and you’re done, right?
Well, it depends. Some searches need to be more flexible than others. Assuming there’s a UI for the search, what if the user misspells the name? What if the date of birth is one day off? What if, on an address lookup, the street names have a slightly different wording [like 'Street' instead of 'St']?
When I approach writing a search query, unless the input criteria are clear-cut or flexibility isn’t wanted by the users, I try to build flexibility into the search query. For date of birth, this might entail checking not just the exact date, but one day off, and weight it somewhat but less heavily than an exact match. For text-based matches, one of the key functions I use is difference.
‘Difference’ is an interesting function. It takes in 2 text inputs, converts each into a numerical representation of how the word ‘sounds’, does a comparison between the two numbers, and outputs how close the two words are phonetically. A value of 4 indicates the words sound the same. 0 means the words sound completely different. 3 is a close match. It is based on the soundex algorithm, which is further discussed here.
Here are some examples:
SELECT difference('Weather','Whether') -- 4.
SELECT difference('Red Mailbox','Red Robin') -- 4. Notice that it only looks
-- at the first word
SELECT difference('Mailbox','Robin') -- 1.
SELECT difference('Sizzle','Grizzly') -- 2.
SELECT difference('Sizzle','Fizzle') -- 3.
SELECT difference('Fizzle','Flop') -- 1.
SELECT difference('Andrew','Bobby') -- 0.
SELECT difference('Billy','Bobby') -- 3.
Is soundex the magic bullet for text search and matching logic? Not completely. It has its limitations, in that it only looks at one word, and it only goes as far as 4 phonetic parts. It should be combined with other mechanisms, including logic that uses substrings of the word as well as like clauses for partial matches. But I consider difference to be a major component to the text search or match queries that I write.
In the next post, I’ll provide an example of how I used difference and other SQL logic for a stored procedure used to find existing students in MyBPS that match a certain list of input criteria. This logic is used by the application to help avoid duplicate student records from being entered into the system.