
Simple Search Engine for Languages with Accents

23 Dec 2009 | CPOL | 2 min read

In this article, we create a simple search engine that performs relevant searches for English as well as for languages with accents, such as Vietnamese (e.g. Việt Nam) and French (e.g. résumé).

Identify Whether the Data String Contains Word(s) in the Search String

"program".Contains("a") is true, but imagine Google used the Contains method :-) What would happen if we did a Google search for "a"? Our extension method matches whole words instead: "program".ContainsWord("pro") is false, "a program".ContainsWord("program") is true, and "a program".ContainsWord("a b") is true (matching any one word of the search string is enough).

C#
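// Requires: using System; using System.Linq; (Contains on arrays comes from LINQ)
// Must be declared inside a static class, like every extension method.
public static bool ContainsWord(this string dataString, string searchString)
{
    // Split both strings into words on spaces, dropping empty entries.
    string[] dataWords = dataString
        .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    string[] searchWords = searchString
        .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

    // One whole-word match is enough to return true.
    foreach (string searchWord in searchWords)
    {
        if (dataWords.Contains(searchWord))
            return true;
    }
    return false;
}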

Extension Methods: You might have noticed the keyword "this" on the first parameter. "this string dataString" makes ContainsWord an extension method of the string class, so we can call it either as an ordinary static method, ContainsWord(dataString, searchString), or as if it were an instance method, dataString.ContainsWord(searchString). Extension methods must be declared in a static class.

dataString and searchString are split into two string arrays (dataWords and searchWords); each element of an array is one word of the corresponding string. If dataWords contains any of the words in searchWords, the method returns true.
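The behavior described above can be checked with a small console program. This is only a sketch: the class name StringSearchExtensions is hypothetical (the article does not name the containing class), and the loop is condensed with LINQ's Any, which should behave the same as the foreach version.

```csharp
using System;
using System.Linq;

// Hypothetical container class; the article does not name it.
public static class StringSearchExtensions
{
    public static bool ContainsWord(this string dataString, string searchString)
    {
        string[] dataWords = dataString
            .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
        string[] searchWords = searchString
            .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
        // True if at least one search word equals one data word.
        return searchWords.Any(word => dataWords.Contains(word));
    }
}

public static class Program
{
    public static void Main()
    {
        // Static call and extension call are the same method.
        Console.WriteLine(StringSearchExtensions.ContainsWord("a program", "program")); // True
        Console.WriteLine("a program".ContainsWord("program")); // True: whole-word match

        Console.WriteLine("program".ContainsWord("pro"));   // False: "pro" is only a substring
        Console.WriteLine("a program".ContainsWord("a b")); // True: "a" matches, "b" is not required
        // At this stage the comparison is still case- and accent-sensitive.
        Console.WriteLine("A Program".ContainsWord("program")); // False
    }
}
```

The last line shows why a normalization step is still needed: without it, "Program" and "program" count as different words.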

Simple Search

A search for "some resume" returns "This is a resume" because that string contains the word "resume". But as a Vietnamese speaker, I know people usually prefer to type their searches without accents, so a search for "some resume" should also return "Résumé", and a search for "cong bang" should return "công bằng". That's what we're going to do next.

Accents in Search String? Optional!  

C#
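// Requires: using System.Text; (StringBuilder, NormalizationForm)
private static string NormalizeString(string comparedString)
{
    StringBuilder stringBuilder = new StringBuilder();
    foreach (char c in comparedString.Trim().ToLower().ToCharArray())
    {
        // FormD splits an accented letter into its base letter followed by
        // combining marks; keep only the first char, the base letter.
        string normalizedChar = c.ToString()
            .Normalize(NormalizationForm.FormD).Substring(0, 1);
        stringBuilder.Append(normalizedChar);
    }
    return stringBuilder.ToString();
}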

Trim() removes all leading and trailing white-space characters, ToLower() converts the string to lowercase, and ToCharArray() returns the string's characters as an array so we can normalize char by char.

The Normalize method with NormalizationForm.FormD decomposes a character like "ằ" into "a" + breve + grave accent. By taking the first character with the Substring method, we get the normalized character without accents: "a". stringBuilder joins the normalized characters together.

So NormalizeString("Công Bằng") returns "cong bang", NormalizeString("Résumé") returns "resume"...
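Two inputs are not covered by the char-by-char approach: letters with no Unicode decomposition, such as the Vietnamese "đ", come back unchanged, and text that is already stored in decomposed form keeps its combining marks. The following is an alternative sketch, not the article's method: it decomposes the whole string once and skips every combining mark. The names AccentFolding and RemoveDiacritics are hypothetical, and the manual "đ" to "d" mapping is an assumption about what a Vietnamese search would want.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical class and method names, used only for this illustration.
public static class AccentFolding
{
    public static string RemoveDiacritics(string text)
    {
        // Decompose the whole string once instead of one char at a time.
        string decomposed = text.Trim().ToLowerInvariant()
            .Normalize(NormalizationForm.FormD);
        StringBuilder builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Skip combining marks such as the breve and grave of "ằ".
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;
            // Assumption: fold "đ" to "d" by hand, since it has no decomposition.
            builder.Append(c == 'đ' ? 'd' : c);
        }
        return builder.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(AccentFolding.RemoveDiacritics("Công Bằng")); // cong bang
        Console.WriteLine(AccentFolding.RemoveDiacritics("Đà Nẵng"));   // da nang
        // Already-decomposed input: "e" followed by a combining acute accent.
        Console.WriteLine(AccentFolding.RemoveDiacritics("re\u0301sume\u0301")); // resume
    }
}
```

ToLowerInvariant is used here instead of ToLower so that the result does not depend on the current culture; whether that matters depends on where the code runs.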

Then we use NormalizeString in ContainsWord to normalize both dataString and searchString. Just normalize the strings before splitting them:

C#
string[] dataWords = NormalizeString(dataString)
    .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
string[] searchWords = NormalizeString(searchString)
    .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries); 

That's enough for our simple search engine. Run the search again and you'll get:

[Screenshot: Simple Search for Languages with Accents]
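For readers without the screenshot, the two pieces can be put together roughly as follows. This is a sketch under assumptions: the class name SearchExtensions, the Search helper, and the sample data are all invented for illustration, while NormalizeString and ContainsWord follow the code shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical container class for the article's two methods.
public static class SearchExtensions
{
    // Same behavior as the article's NormalizeString.
    private static string NormalizeString(string comparedString)
    {
        StringBuilder stringBuilder = new StringBuilder();
        foreach (char c in comparedString.Trim().ToLower())
        {
            stringBuilder.Append(
                c.ToString().Normalize(NormalizationForm.FormD).Substring(0, 1));
        }
        return stringBuilder.ToString();
    }

    // ContainsWord with both strings normalized before splitting.
    public static bool ContainsWord(this string dataString, string searchString)
    {
        string[] dataWords = NormalizeString(dataString)
            .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
        string[] searchWords = NormalizeString(searchString)
            .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
        return searchWords.Any(word => dataWords.Contains(word));
    }

    // Hypothetical helper: keep only the entries that match the query.
    public static List<string> Search(IEnumerable<string> data, string query)
    {
        return data.Where(item => item.ContainsWord(query)).ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample data invented for this illustration.
        var data = new List<string>
        {
            "This is a resume", "Résumé", "công bằng", "a program"
        };

        // Matches: "This is a resume" and "Résumé"
        Console.WriteLine(string.Join(" | ", SearchExtensions.Search(data, "some resume")));
        // Matches: "công bằng"
        Console.WriteLine(string.Join(" | ", SearchExtensions.Search(data, "cong bang")));
    }
}
```

Normalizing every data string on each query is fine for a handful of entries; for larger data sets, storing a pre-normalized copy alongside each entry would avoid the repeated work.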

This is a very simple search engine. I'm sure you can make lots of improvements to it, and I hope you let me know if you do.

This article was originally posted at http://x189.blogspot.com/feeds/posts/default

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
Vietnam
ASP.NET MVC enthusiast.
Developer @ X189 blog.
