
Arabic Soundex

3 Dec 2012 · CPOL · 4 min read
An article about an Arabic version of the Soundex algorithm.

ArSoundex

Introduction

Soundex is a phonetic algorithm for indexing names by sound. Many applications use algorithms like this to power features such as Google's spelling corrections and Microsoft Word's AutoCorrect.

[Figure: Google spelling correction (ArabicSoundex/google.gif)]

[Figure: MS Word AutoCorrect (ArabicSoundex/msword.gif)]

Background

Most phonetic algorithms were developed for English, so applying their rules to words in other languages may not give meaningful results. Only a few programs support this feature for Arabic (Google's spelling corrections is one of them). You can find many examples of English Soundex here on CodeProject or elsewhere on the Internet, but resources for using Soundex with Arabic are rare. After a long search I found some academic research on the subject, and I think this is the first article to illustrate an Arabic Soundex. The Soundex algorithm works by grouping similar-sounding letters according to their phonetic features, as follows:

[Figure: table of English Soundex letter groups (ArabicSoundex/english.gif)]

To encode a word, the algorithm keeps the first letter, then replaces each consonant after it with the digit from the previous table. Vowels and the characters (h, w, y) are ignored, because they cause confusion or ambiguity when they accompany other characters.

Adjacent identical digits are then collapsed into a single digit of that value. This is the simplest description of Soundex; other versions of the algorithm improve the encoding (for example, some versions replace the “x” character with “ecs” before encoding). So, what we will do is:

  • Hold the first letter.
  • Replace the characters (a, e, i, o, u, h, w, y) with the value 0.
  • Replace the characters (b, f, p, v) with the value 1.
  • Replace the characters (c, g, j, k, q, s, x, z) with the value 2.
  • Replace the characters (d, t) with the value 3.
  • Replace the character (l) with the value 4.
  • Replace the characters (m, n) with the value 5.
  • Replace the character (r) with the value 6.
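
The steps listed above can be sketched in C# as follows. This is a minimal illustration, not a full American Soundex: it treats h, w and y exactly like vowels, as the list does. The class name `SimpleSoundex`, the `length` parameter and the zero padding are assumptions made for this example (the padding mirrors what the Arabic function later in this article does).

```csharp
using System;
using System.Text;

public static class SimpleSoundex
{
    // Digit for each letter a..z, following the list above (0 = ignored).
    // Letters:                    abcdefghijklmnopqrstuvwxyz
    private const string Codes = "01230120022455012623010202";

    // Hypothetical helper: encodes an English word into a code of the given length.
    public static string Encode(string word, int length = 4)
    {
        if (string.IsNullOrEmpty(word))
            return "";

        string upper = word.ToUpperInvariant();
        var buffer = new StringBuilder();
        buffer.Append(upper[0]); // hold the first letter

        int prevCode = 0;
        for (int i = 1; i < upper.Length && buffer.Length < length; i++)
        {
            char c = upper[i];
            // Anything outside A..Z is ignored, like a vowel
            int currCode = (c >= 'A' && c <= 'Z') ? Codes[c - 'A'] - '0' : 0;

            // Skip ignored letters and collapse adjacent identical digits
            if (currCode != 0 && currCode != prevCode)
                buffer.Append(currCode);
            prevCode = currCode;
        }

        // Pad short codes with zeros so every code has the same length
        while (buffer.Length < length)
            buffer.Append('0');
        return buffer.ToString();
    }
}
```

With this sketch, `SimpleSoundex.Encode("Robert")` and `SimpleSoundex.Encode("Rupert")` both give `R163`, which is why a search for one name can find the other.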

After that, we save the code for each word, and when the user enters a word to search for, we look up the words that have the same sound code.
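
One way to store the codes and look them up is a dictionary keyed by sound code. The sketch below is only an illustration: `PhoneticIndex`, `Add` and `Find` are hypothetical names, and the encoder is passed in as a delegate so that any Soundex variant (English or Arabic) can be plugged in.

```csharp
using System;
using System.Collections.Generic;

public sealed class PhoneticIndex
{
    private readonly Func<string, string> _encode;
    private readonly Dictionary<string, List<string>> _wordsByCode =
        new Dictionary<string, List<string>>();

    // encode: any function that maps a word to its sound code
    public PhoneticIndex(Func<string, string> encode)
    {
        if (encode == null)
            throw new ArgumentNullException(nameof(encode));
        _encode = encode;
    }

    // Save the word under its sound code
    public void Add(string word)
    {
        string code = _encode(word);
        List<string> words;
        if (!_wordsByCode.TryGetValue(code, out words))
        {
            words = new List<string>();
            _wordsByCode[code] = words;
        }
        words.Add(word);
    }

    // Return every stored word whose code equals the code of the query
    public IReadOnlyList<string> Find(string query)
    {
        List<string> words;
        if (_wordsByCode.TryGetValue(_encode(query), out words))
            return words;
        return new List<string>();
    }
}
```

The lookup costs one encoding plus one dictionary access, so it stays fast even for a large word list.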

For the Arabic language, I searched for a long time for an Arabic version of Soundex until I found a research paper by a team of five professors at the Illinois Institute of Technology. That paper is the basis of this Arabic Soundex.

What they do is:

  • Hold the first letter.
  • Replace the characters (ا, أ, إ, آ, ح, ع, غ, ش, و, ي) with the value 0.
  • Replace the characters (ف, ب) with the value 1.
  • Replace the characters (خ, ج, ز, س, ص, ظ, ق, ك) with the value 2.
  • Replace the characters (ت, ث, د, ذ, ض, ط) with the value 3.
  • Replace the character (ل) with the value 4.
  • Replace the characters (م, ن) with the value 5.
  • Replace the character (ر) with the value 6.
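
The grouping in this list can be written as a lookup table. The sketch below is an illustration of the list only, not the paper's own code: `PaperArabicSoundex`, the `length` parameter and the zero padding are assumptions, and any character that the list does not mention is treated as ignored (value 0).

```csharp
using System.Collections.Generic;
using System.Text;

public static class PaperArabicSoundex
{
    // The position in this array is the value assigned to the letters in that string
    private static readonly string[] Groups =
    {
        "اأإآحعغشوي", // 0 (ignored)
        "فب",         // 1
        "خجزسصظقك",   // 2
        "تثدذضط",     // 3
        "ل",          // 4
        "من",         // 5
        "ر"           // 6
    };

    private static readonly Dictionary<char, int> CodeOf = BuildTable();

    private static Dictionary<char, int> BuildTable()
    {
        var table = new Dictionary<char, int>();
        for (int code = 0; code < Groups.Length; code++)
            foreach (char letter in Groups[code])
                table[letter] = code;
        return table;
    }

    // Hypothetical helper: first letter kept, the rest replaced by group values
    public static string Encode(string word, int length = 4)
    {
        if (string.IsNullOrEmpty(word))
            return "";

        var buffer = new StringBuilder();
        buffer.Append(word[0]); // hold the first letter

        int prevCode = 0;
        for (int i = 1; i < word.Length && buffer.Length < length; i++)
        {
            int currCode;
            if (!CodeOf.TryGetValue(word[i], out currCode))
                currCode = 0; // assumption: unlisted characters are ignored

            if (currCode != 0 && currCode != prevCode)
                buffer.Append(currCode);
            prevCode = currCode;
        }

        while (buffer.Length < length)
            buffer.Append('0');
        return buffer.ToString();
    }
}
```

Under these assumptions, كتب and كتاب receive the same code, because the only difference between them is an ignored ا.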

But this strategy still does not give results as good as the English one, so I made some improvements to the research and got better results. My changes were:

  • Remove the (ا, أ, آ, إ) characters from the beginning of the word if present, because I noticed they add more confusion.
  • Ignore the first character: in English it is important to keep the first character, but I noticed that in Arabic many words sound the same yet start with different characters, so I replace the first letter with a fixed placeholder.
  • Update the sound categories by adding or removing some characters (in the code below, for example, خ moves from group 2 to group 0 and ه is added to group 0).

Using the code

C#
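// Requires: using System.Text; (for StringBuilder)
public static string ArComputeintial(string word, int length)
{
    // Value to return
    string value = "";

    // Nothing to encode for a null or empty word
    if (string.IsNullOrEmpty(word))
        return value;

    // Remove a leading alef variant, if present
    switch (word[0])
    {
        case 'ا':
        case 'أ':
        case 'إ':
        case 'آ':
            word = word.Substring(1);
            break;
    }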

    // Size of the word to process
    int size = word.Length;
    // Make sure the word is at least two characters in length
    if (size > 1)
    {

        // Convert the word to a character array for faster processing
        char[] chars = word.ToCharArray();
        // Buffer to build up with character codes
        StringBuilder buffer = new StringBuilder();
        // The current and previous character codes
        int prevCode = 0;
        int currCode = 0;
        // Ignore the first character and replace it with a fixed placeholder
        buffer.Append('x');
       
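        // Loop through the remaining characters and convert them to the proper character code,
        // stopping as soon as the buffer has reached the requested length
        for (int i = 1; i < size && buffer.Length < length; i++)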
        {
            switch (chars[i])
            {
                case 'ا':
                case 'أ':
                case 'إ':
                case 'آ':
                case 'ح':
                case 'خ':
                case 'ه':
                case 'ع':
                case 'غ':
                case 'ش':
                case 'و':
                case 'ي':
                    currCode = 0;
                    break;
                case 'ف':
                case 'ب':
                    currCode = 1;
                    break;
                
                case 'ج':
                case 'ز':
                case 'س':
                case 'ص':
                case 'ظ':
                case 'ق':
                case 'ك':
                    currCode = 2;
                    break;
                case 'ت':
                case 'ث':
                case 'د':
                case 'ذ':
                case 'ض':
                case 'ط':
                    currCode = 3;
                    break;
                case 'ل':
                    currCode = 4;
                    break;
                case 'م':
                case 'ن':
                    currCode = 5;
                    break;
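                case 'ر':
                    currCode = 6;
                    break;
                default:
                    // Unmapped characters (e.g. ة, ى, ء, diacritics) are ignored like vowels;
                    // without this case, currCode would keep the previous letter's code
                    currCode = 0;
                    break;
            }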

            // Check to see if the current code is the same as the last one
            if (currCode != prevCode)
            {
                // Check to see if the current code is 0 (a vowel); do not process vowels
                if (currCode != 0)
                    buffer.Append(currCode);
            }
            // Set the new previous character code
            prevCode = currCode;
            // If the buffer size meets the length limit, then exit the loop
            if (buffer.Length == length)
                break;
        }
        // Pad the buffer, if required
        size = buffer.Length;
        if (size < length)
            buffer.Append('0', (length - size));
        // Set the value to return
        value = buffer.ToString();
    }
    // Return the value
    return value;
}

Points of Interest

This is a simple version of the algorithm. Given the complex nature of the Arabic language, additional research could improve it.

I believe that my algorithm needs more testing to make sure it works correctly. This is my first article on The Code Project. I hope it gives you some new ideas, and sorry for my language errors.

More ... 

A study of text similarities for the Arabic language, written by Moath Ibrahim Al-hadlaq at Al-Imam Muhammad Ibn Saud Islamic University in the Kingdom of Saudi Arabia, contains very good information and methods for Arabic phonetic algorithms. It also includes an improved version of my Arabic Soundex algorithm.

The research is available to download from here:

ftp://ftp3.ie.freebsd.org/pub/sourceforge/t/project/te/textsimilaritie/Text_Similarities.pdf 

I have also attached a copy to the downloads.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Germany
010011000110100101101011011001010010000001000011011011110110010001101001011011100110011100100001
