
Simple Search Engine for Languages with Accents

Duy H. Thai, 23 Dec 2009
A simple search engine that performs relevant searches in English as well as in languages with accents, such as Vietnamese (e.g. Việt Nam) and French (e.g. résumé).

In this article, we create a simple search engine that performs relevant searches in English as well as in languages with accents, such as Vietnamese (e.g. Việt Nam) and French (e.g. résumé).

Identify whether the data string contains word(s) in the search string

"program".Contains("a") is true, but imagine Google using the Contains method: what would happen if we did a Google search for "a"? Every page containing that letter would match. Our extension method matches whole words instead: "program".ContainsWord("pro") is false, "a program".ContainsWord("program") is true, and "a program".ContainsWord("a b") is true because one of the search words ("a") is a word in the data string.

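// Requires "using System;" and "using System.Linq;" (for Contains on string[]).
// Extension methods must be declared inside a static class.
public static bool ContainsWord(this string dataString, string searchString)
{
    // Split both strings into words on spaces, ignoring repeated spaces.
    string[] dataWords = dataString
        .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    string[] searchWords = searchString
        .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

    // A single matching word is enough for the data string to be a hit.
    foreach (string searchWord in searchWords)
    {
        if (dataWords.Contains(searchWord))
            return true;
    }
    return false;
}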

Extension Methods: You might have noticed the keyword "this" in the first parameter. Declaring the parameter as "this string dataString" extends the string class, so we can call either ContainsWord(dataString, searchString) or dataString.ContainsWord(searchString). (See the C# documentation on extension methods for more.)

dataString and searchString are each split into a string array (dataWords and searchWords); each element of an array is one word of the corresponding string. If dataWords contains any of the words in searchWords, the method returns true.
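To try the method in isolation, here is a minimal, compilable sketch of it wrapped in a static class. The class name StringSearchExtensions is a placeholder chosen for this example and is not from the article; the foreach loop is also condensed into a LINQ Any call, which should behave the same way.

```csharp
using System;
using System.Linq;

// Placeholder class name; extension methods only need to live in some static class.
public static class StringSearchExtensions
{
    public static bool ContainsWord(this string dataString, string searchString)
    {
        char[] separators = { ' ' };
        string[] dataWords = dataString
            .Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] searchWords = searchString
            .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        // True as soon as any search word equals a whole data word.
        return searchWords.Any(searchWord => dataWords.Contains(searchWord));
    }
}
```

With this class in scope, "program".ContainsWord("pro") evaluates to false, while "a program".ContainsWord("program") and "a program".ContainsWord("a b") both evaluate to true. Note that the comparison is still case- and accent-sensitive at this point.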

[Screenshot: Simple Search]

A search for "some resume" returns "This is a resume" because that string contains the word "resume". But as a Vietnamese speaker, I know people usually prefer to type their searches without accents, so a search for "some resume" should also return "Résumé", and a search for "cong bang" should return "công bằng". That's what we'll do next.

Accents in Search String? Optional!  

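// Requires "using System.Text;" for StringBuilder and NormalizationForm.
private static string NormalizeString(string comparedString)
{
    StringBuilder stringBuilder = new StringBuilder();
    foreach (char c in comparedString.Trim().ToLower().ToCharArray())
    {
        // FormD splits an accented character into its base letter followed by
        // combining marks; keeping only the first char drops the marks.
        string normalizedChar = c.ToString()
            .Normalize(NormalizationForm.FormD).Substring(0, 1);
        stringBuilder.Append(normalizedChar);
    }
    return stringBuilder.ToString();
}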

Trim() removes all leading and trailing white-space characters, ToLower() converts the string to lowercase, and ToCharArray() returns the string as a character array so we can normalize it char by char.

The Normalize method decomposes a character like "ằ" into "a" + combining breve + combining grave accent; by taking the first character with the Substring method, we get the base character without accents: "a". The stringBuilder joins the normalized characters together.

So NormalizeString("Công Bằng") returns "cong bang", NormalizeString("Résumé") returns "resume"...
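One limitation is worth knowing: the Vietnamese letter "đ" has no canonical decomposition, so FormD leaves it unchanged and a search for "duong" would not match "đường". The sketch below is an alternative, not the article's method: it normalizes the whole string once, skips characters in the NonSpacingMark category, and maps "đ" to "d" explicitly. The names AccentFolding and FoldAccents are made up for this example, and ToLowerInvariant is used instead of ToLower as an assumption that culture-independent lowercasing is wanted.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper class; not part of the original article.
public static class AccentFolding
{
    public static string FoldAccents(string text)
    {
        // Decompose the whole string at once: "ằ" becomes "a" + breve + grave.
        string decomposed = text.Trim()
            .ToLowerInvariant()
            .Normalize(NormalizationForm.FormD);

        var builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Skip combining marks (accents); keep base letters and everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // U+0111 (đ) does not decompose, so it is mapped by hand.
            builder.Append(c == '\u0111' ? 'd' : c);
        }
        return builder.ToString();
    }
}
```

Under these assumptions, FoldAccents("Công Bằng") gives "cong bang" and FoldAccents("Đường") gives "duong". Other letters that lack a decomposition (for example "ø" or "ł") would need similar explicit mappings.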

Then we use NormalizeString in ContainsWord to normalize both dataString and searchString. Just normalize the strings before splitting them:

string[] dataWords = NormalizeString(dataString)
    .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
string[] searchWords = NormalizeString(searchString)
    .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries); 
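Putting the two pieces together, a self-contained sketch might look like the following. The SimpleSearch class and the Search helper are hypothetical additions for illustration (the article does not show how the results list is filtered); NormalizeString and ContainsWord follow the article's approach.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical wrapper class; the name is not from the article.
public static class SimpleSearch
{
    // Per-character normalization, as described in the article.
    private static string NormalizeString(string comparedString)
    {
        var stringBuilder = new StringBuilder();
        foreach (char c in comparedString.Trim().ToLower())
        {
            stringBuilder.Append(
                c.ToString().Normalize(NormalizationForm.FormD).Substring(0, 1));
        }
        return stringBuilder.ToString();
    }

    public static bool ContainsWord(this string dataString, string searchString)
    {
        char[] separators = { ' ' };
        string[] dataWords = NormalizeString(dataString)
            .Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] searchWords = NormalizeString(searchString)
            .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        return searchWords.Any(searchWord => dataWords.Contains(searchWord));
    }

    // Assumed helper: keeps the items that match at least one search word.
    public static List<string> Search(IEnumerable<string> items, string searchString)
    {
        return items.Where(item => item.ContainsWord(searchString)).ToList();
    }
}
```

Given the items "This is a resume", "Résumé", "công bằng" and "program", calling SimpleSearch.Search(items, "some resume") should return the first two, and searching for "cong bang" should return "công bằng".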

That's enough for our simple search engine. Run the search again and you'll get:

[Screenshot: Simple Search for Languages with Accents]

This is a very simple search engine. I'm sure you can make lots of improvements to it, and I hope you let me know if you do.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Duy H. Thai
Software Developer
Vietnam

Last Updated 23 Dec 2009
Article Copyright 2009 by Duy H. Thai