In this article, we'll create a simple search engine that returns relevant results for English as well as for languages with accents, such as Vietnamese (e.g. Việt Nam) and French (e.g. résumé).
Identify Whether the Data String Contains Word(s) in the Search String
"program".Contains("a") is true; but imagine if Google used the Contains method :-) What would happen if we did a Google search for "a"? Our extension method will do something like this: "program".ContainsWord("pro") is false, "a program".ContainsWord("program") is true, "a program".ContainsWord("a b") is true...
// Requires "using System;" and "using System.Linq;" and must live in a static class.
public static bool ContainsWord(this string dataString, string searchString)
{
    // Split both strings into words, ignoring repeated spaces.
    string[] dataWords = dataString
        .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    string[] searchWords = searchString
        .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

    // A single matching whole word is enough.
    foreach (string searchWord in searchWords)
    {
        if (dataWords.Contains(searchWord))
            return true;
    }
    return false;
}
Extension Methods: You might have noticed the keyword "this" in the first parameter. "this string dataString" extends the string class, so we can call either ContainsWord(dataString, searchString) or dataString.ContainsWord(searchString). More on Extension Methods.
dataString and searchString are split into two string arrays (dataWords and searchWords); each element of an array is one word of the corresponding string. If dataWords contains any of the words in searchWords, the method returns true.
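To see how the method behaves over a collection, here is a minimal sketch that filters a list of strings with ContainsWord. The WordSearch class, the Search helper, and the sample data are made up for illustration; only the ContainsWord logic comes from the article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordSearch
{
    // Same logic as the article's ContainsWord: true if any search word
    // appears as a whole word in the data string (case-sensitive at this stage).
    public static bool ContainsWord(this string dataString, string searchString)
    {
        char[] separators = { ' ' };
        string[] dataWords = dataString.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] searchWords = searchString.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        return searchWords.Any(searchWord => dataWords.Contains(searchWord));
    }

    // Hypothetical helper: keeps only the items that match the query.
    public static List<string> Search(IEnumerable<string> items, string query)
    {
        return items.Where(item => item.ContainsWord(query)).ToList();
    }
}

public static class WordSearchDemo
{
    public static void Main()
    {
        var items = new[] { "This is a resume", "a program", "programming" };
        foreach (string hit in WordSearch.Search(items, "some resume"))
            Console.WriteLine(hit); // prints: This is a resume
    }
}
```

Because matching is per whole word, "programming" is not returned for a query like "program", which is the behavior the Contains method cannot give us.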
A search for "some resume" returns "This is a resume" because "This is a resume" contains the word "resume". But as a Vietnamese speaker, I know people usually prefer to type their searches without accents, so a search for "some resume" should also return "Résumé", and a search for "cong bang" should return "công bằng". That's what we're going to do next.
Accents in Search String? Optional!
// Requires "using System.Text;" for StringBuilder and NormalizationForm.
private static string NormalizeString(string comparedString)
{
    StringBuilder stringBuilder = new StringBuilder();
    foreach (char c in comparedString.Trim().ToLower().ToCharArray())
    {
        // Decompose the character into base letter + accents, then keep the base letter.
        string normalizedChar = c.ToString()
            .Normalize(NormalizationForm.FormD).Substring(0, 1);
        stringBuilder.Append(normalizedChar);
    }
    return stringBuilder.ToString();
}
Trim() removes all leading and trailing white-space characters, ToLower() converts the string to lowercase, and ToCharArray() returns a character array of the string so we can normalize it char by char.
The Normalize method turns a character like "ằ" into "a" + breve + grave accent; by retrieving the first character with the Substring method, we get the normalized character (without accents): "a". stringBuilder joins the normalized characters together.
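The decomposition step is easier to trust once you see the code points. The following small sketch prints them before and after normalization; the DecompositionDemo class and its CodePoints helper are hypothetical names used only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class DecompositionDemo
{
    // Formats each UTF-16 code unit of the string as U+XXXX.
    public static string CodePoints(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        string composed = "\u1EB1"; // "ằ" as a single precomposed character
        string decomposed = composed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(CodePoints(composed));       // U+1EB1
        Console.WriteLine(CodePoints(decomposed));     // U+0061 U+0306 U+0300
        Console.WriteLine(decomposed.Substring(0, 1)); // a
    }
}
```

U+0306 is the combining breve and U+0300 the combining grave accent, so taking the first character leaves the plain "a".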
So NormalizeString("Công Bằng") returns "cong bang", NormalizeString("Résumé") returns "resume"...
Then we need to use NormalizeString in ContainsWord to normalize both the dataString and the searchString. Just normalize the strings before splitting them:
string[] dataWords = NormalizeString(dataString)
    .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
string[] searchWords = NormalizeString(searchString)
    .Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
That's enough for our simple search engine. Run the search again and "some resume" now matches "Résumé" as well as "This is a resume".
This is a very simple search engine. I'm sure you can make lots of improvements to it, and I hope you let me know if you do.
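One possible improvement, sketched here as an alternative rather than as the article's method: decompose the whole string once and drop every combining mark, instead of keeping the first character of each decomposed char. This also cleans input that arrives already decomposed (for example "e" followed by a separate combining acute accent), which the char-by-char version leaves untouched. The AccentStripper class and RemoveAccents name are hypothetical, and the mapping of "đ" to "d" is my own assumption, added because "đ" has no canonical decomposition.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentStripper
{
    // Hypothetical alternative to NormalizeString: normalize once, then skip combining marks.
    public static string RemoveAccents(string text)
    {
        string decomposed = text.Trim().ToLowerInvariant().Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Combining accents are classified as non-spacing marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;
            // Assumption: treat Vietnamese "đ" (U+0111) as "d" for search purposes.
            builder.Append(c == '\u0111' ? 'd' : c);
        }
        return builder.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveAccents("Công Bằng")); // cong bang
        Console.WriteLine(RemoveAccents("Résumé"));    // resume
        Console.WriteLine(RemoveAccents("Đà Nẵng"));   // da nang
    }
}
```

To try it, call RemoveAccents in place of NormalizeString inside ContainsWord; the splitting and matching code stays the same.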
CodeProject