
Arabic Soundex

Tammam Koujan, 3 Dec 2012 CPOL
An article about an Arabic version of the Soundex algorithm.

ArSoundex

Introduction

Soundex is a phonetic algorithm for indexing names by sound. Many applications use algorithms like it to power features such as Google's spelling corrections and Microsoft Word's AutoCorrect.

[Figure: Google spelling correction (ArabicSoundex/google.gif)]

[Figure: MS Word AutoCorrect (ArabicSoundex/msword.gif)]

Background

Most phonetic algorithms were developed for English; consequently, applying their rules to words in other languages might not give meaningful results. Only a few programs, such as Google's spelling corrections, support this feature for Arabic. You can find many examples of English Soundex here on CodeProject or elsewhere on the Internet, but resources for using Soundex with Arabic are rare. After a long search, I found some academic research on the topic, and I believe this is the first article to illustrate an Arabic Soundex.

The Soundex algorithm works by grouping similar-sounding letters according to their phonetic features, as follows:

[Figure: table of English Soundex letter groups (ArabicSoundex/english.gif)]

To encode a word, the algorithm keeps the first letter of the word, then replaces each consonant after it with the digit from the previous table. Vowels and the characters (h, w, y) are ignored because they cause confusion or ambiguity when they accompany other characters.

Adjacent identical digits are then collapsed into a single digit of that value. This is the simplest description of Soundex; other versions of the algorithm improve the encoding (for example, some versions replace the “x” character with “ecs” before encoding). So, what we will do is:

  • Hold the first letter.
  • Replace the characters (a, e, i, o, u, h, w, y) with the value 0.
  • Replace the characters (b, f, p, v) with the value 1.
  • Replace the characters (c, g, j, k, q, s, x, z) with the value 2.
  • Replace the characters (d, t) with the value 3.
  • Replace the character (l) with the value 4.
  • Replace the characters (m, n) with the value 5.
  • Replace the character (r) with the value 6.
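
The steps above can be sketched roughly as follows. This is only an illustration of the simplified description given here, not a reference implementation: the class name `EnglishSoundexSketch`, the method names, and the default code length of 4 are assumptions, and group 5 is taken to be (m, n) as in standard Soundex.

```csharp
using System;
using System.Text;

// Hypothetical helper, named here for illustration only.
public static class EnglishSoundexSketch
{
    // Digit for each letter group; 0 marks letters that are dropped
    // (vowels, h, w, y, and anything else not listed).
    private static int CodeOf(char c)
    {
        switch (char.ToLowerInvariant(c))
        {
            case 'b': case 'f': case 'p': case 'v': return 1;
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return 2;
            case 'd': case 't': return 3;
            case 'l': return 4;
            case 'm': case 'n': return 5;
            case 'r': return 6;
            default: return 0;
        }
    }

    public static string Encode(string word, int length = 4)
    {
        if (string.IsNullOrEmpty(word)) return "";

        var buffer = new StringBuilder();
        buffer.Append(char.ToUpperInvariant(word[0])); // hold the first letter
        int prevCode = CodeOf(word[0]);

        for (int i = 1; i < word.Length && buffer.Length < length; i++)
        {
            int currCode = CodeOf(word[i]);
            // Skip dropped letters and collapse adjacent identical digits.
            if (currCode != 0 && currCode != prevCode)
                buffer.Append(currCode);
            prevCode = currCode;
        }

        // Pad short codes with zeros up to the requested length.
        return buffer.ToString().PadRight(length, '0');
    }
}
```

With this sketch, `EnglishSoundexSketch.Encode("Robert")` and `EnglishSoundexSketch.Encode("Rupert")` both give `R163`, which is the kind of match the algorithm is meant to produce.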

After that, we save the code for each word, and when the user enters a word to search for, we look up the words that have the same sound code.
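
One way to store and look up the codes is a dictionary keyed by sound code. The sketch below is an assumption about how such an index could look, not part of the article's download: `SoundIndex`, `Add`, and `Lookup` are hypothetical names, and the encoder is passed in as a delegate so that any Soundex variant can be plugged in.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index that groups words by their sound code.
public class SoundIndex
{
    private readonly Func<string, string> _encode;
    private readonly Dictionary<string, List<string>> _byCode =
        new Dictionary<string, List<string>>();

    public SoundIndex(Func<string, string> encode)
    {
        _encode = encode;
    }

    // Save the word under its sound code.
    public void Add(string word)
    {
        string code = _encode(word);
        if (!_byCode.TryGetValue(code, out List<string> words))
        {
            words = new List<string>();
            _byCode[code] = words;
        }
        words.Add(word);
    }

    // Return every stored word whose code matches the query's code.
    public IReadOnlyList<string> Lookup(string query)
    {
        return _byCode.TryGetValue(_encode(query), out List<string> words)
            ? words
            : (IReadOnlyList<string>)Array.Empty<string>();
    }
}
```

The index is filled once from the word list, and each search then costs a single encoding plus one dictionary lookup.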

For Arabic, I searched extensively for an Arabic version of Soundex until I found a research paper by a team of five professors at the Illinois Institute of Technology. That paper was the basis for this Arabic Soundex.

What they do is:

  • Hold the first letter.
  • Replace the characters (ا, أ, إ, آ, ح, ع, غ, ش, و, ي) with the value 0.
  • Replace the characters (ف, ب) with the value 1.
  • Replace the characters (خ, ج, ز, س, ص, ظ, ق, ك) with the value 2.
  • Replace the characters (ت, ث, د, ذ, ض, ط) with the value 3.
  • Replace the character (ل) with the value 4.
  • Replace the characters (م, ن) with the value 5.
  • Replace the character (ر) with the value 6.
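
The grouping listed above can be sketched as a lookup table. This is a rough illustration of the list as written here, not the researchers' own code: the name `ResearchArabicSoundexSketch`, the default length of 4, and the choice to treat any unlisted character as 0 are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, named here for illustration only.
public static class ResearchArabicSoundexSketch
{
    private static readonly Dictionary<char, int> Codes = BuildCodes();

    private static Dictionary<char, int> BuildCodes()
    {
        // The array index is the code assigned to every letter in the string.
        var groups = new[]
        {
            "اأإآحعغشوي",
            "فب",
            "خجزسصظقك",
            "تثدذضط",
            "ل",
            "من",
            "ر"
        };

        var map = new Dictionary<char, int>();
        for (int code = 0; code < groups.Length; code++)
            foreach (char c in groups[code])
                map[c] = code;
        return map;
    }

    // Assumption: characters missing from the table are treated as 0 (dropped).
    private static int CodeOf(char c) =>
        Codes.TryGetValue(c, out int code) ? code : 0;

    public static string Encode(string word, int length = 4)
    {
        if (string.IsNullOrEmpty(word)) return "";

        var buffer = new StringBuilder();
        buffer.Append(word[0]); // hold the first letter
        int prevCode = CodeOf(word[0]);

        for (int i = 1; i < word.Length && buffer.Length < length; i++)
        {
            int currCode = CodeOf(word[i]);
            // Skip code 0 and collapse adjacent identical digits.
            if (currCode != 0 && currCode != prevCode)
                buffer.Append(currCode);
            prevCode = currCode;
        }

        return buffer.ToString().PadRight(length, '0');
    }
}
```

Under these assumptions, سالم and سليم receive the same code, because the letters ا and ي both map to 0 and are dropped.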

However, this strategy still does not give results as good as the English version, so I made some improvements to the research approach and got better results. My changes were:

  • Remove a leading ( ا, أ, آ, إ ) character from the word if present, because I noticed these characters added confusion.
  • Ignore the first character: in English it is important to keep the first character, but in Arabic many words sound the same yet start with different characters, so the first letter is not kept.
  • Update the sound categories by adding or removing some characters (for example, in the code below, خ moves from group 2 to group 0 and ه is added to group 0).

Using the code

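// Requires: using System.Text; (for StringBuilder)
public static string ArComputeintial(string word, int length)
{
    // Value to return
    string value = "";

    // Nothing to encode for a null or empty word (word[0] would throw)
    if (string.IsNullOrEmpty(word))
        return value;

    // Drop a leading alef variant before encoding
    switch (word[0])
    {
        case 'ا':
        case 'أ':
        case 'إ':
        case 'آ':
            word = word.Substring(1);
            break;
    }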

    // Size of the word to process
    int size = word.Length;
    // Make sure the word is at least two characters in length
    if (size > 1)
    {

        // Convert the word to character array for faster processing
        char[] chars = word.ToCharArray();
        // Buffer to build up with character codes
        StringBuilder buffer = new StringBuilder();
        buffer.Length = 0;
        // The current and previous character codes
        int prevCode = 0;
        int currCode = 0;
        // Ignore first character and replace it with fixed value

        buffer.Append('x');
       
        // Loop through all the characters and convert them to the proper character code
        for (int i = 1; i < size; i++)
        {
            switch (chars[i])
            {
                case 'ا':
                case 'أ':
                case 'إ':
                case 'آ':
                case 'ح':
                case 'خ':
                case 'ه':
                case 'ع':
                case 'غ':
                case 'ش':
                case 'و':
                case 'ي':
                    currCode = 0;
                    break;
                case 'ف':
                case 'ب':
                    currCode = 1;
                    break;
                
                case 'ج':
                case 'ز':
                case 'س':
                case 'ص':
                case 'ظ':
                case 'ق':
                case 'ك':
                    currCode = 2;
                    break;
                case 'ت':
                case 'ث':
                case 'د':
                case 'ذ':
                case 'ض':
                case 'ط':
                    currCode = 3;
                    break;
                case 'ل':
                    currCode = 4;
                    break;
                case 'م':
                case 'ن':
                    currCode = 5;
                    break;
                case 'ر':
                    currCode = 6;
                    break;
            }

            // Check to see if the current code is the same as the last one
            if (currCode != prevCode)
            {
                // Check to see if the current code is 0 (a vowel); do not process vowels
                if (currCode != 0)
                    buffer.Append(currCode);
            }
            // Set the new previous character code
            prevCode = currCode;
            // If the buffer size meets the length limit, then exit the loop
            if (buffer.Length == length)
                break;
        }
        // Pad the buffer, if required
        size = buffer.Length;
        if (size < length)
            buffer.Append('0', (length - size));
        // Set the value to return
        value = buffer.ToString();
    }
    // Return the value
    return value;
}
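
To show how the result might be used, here is a short, self-contained sketch that compares two words by their codes. It is a condensed approximation of the listing above, not a replacement for it: `ArabicSoundexUsageSketch`, `Encode`, and `SoundAlike` are hypothetical names, the default length of 4 is an assumption, and characters outside the listed groups are treated as code 0, which may differ slightly from the original method for such characters.

```csharp
using System;
using System.Text;

// Hypothetical helper, named here for illustration only.
public static class ArabicSoundexUsageSketch
{
    // Letter groups as used in the listing above; the array index is the code.
    private static readonly string[] Groups =
    {
        "اأإآحخهعغشوي", "فب", "جزسصظقك", "تثدذضط", "ل", "من", "ر"
    };

    private static int CodeOf(char c)
    {
        for (int code = 0; code < Groups.Length; code++)
            if (Groups[code].IndexOf(c) >= 0) return code;
        return 0; // assumption: unlisted characters are skipped like vowels
    }

    public static string Encode(string word, int length)
    {
        if (string.IsNullOrEmpty(word)) return "";

        // Remove a leading alef variant, as in the article's first change.
        if ("اأإآ".IndexOf(word[0]) >= 0) word = word.Substring(1);
        if (word.Length < 2) return "";

        // The first remaining letter is replaced with a fixed 'x'.
        var buffer = new StringBuilder("x");
        int prevCode = 0;
        for (int i = 1; i < word.Length && buffer.Length < length; i++)
        {
            int currCode = CodeOf(word[i]);
            if (currCode != 0 && currCode != prevCode)
                buffer.Append(currCode);
            prevCode = currCode;
        }
        return buffer.ToString().PadRight(length, '0');
    }

    // Two words "sound alike" when both produce the same non-empty code.
    public static bool SoundAlike(string a, string b, int length = 4)
    {
        string codeA = Encode(a, length);
        return codeA.Length > 0 && codeA == Encode(b, length);
    }
}
```

For example, `SoundAlike("قلم", "كلم")` is true in this sketch because the first letter is not kept, and `SoundAlike("أحمد", "احمد")` is true because the leading alef variant is removed from both words.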

Points of Interest

This is a simple version of the algorithm. Given the complex nature of the Arabic language, additional research could improve it.

I believe my algorithm needs more testing to confirm that it works correctly. This is my first article on The Code Project. I hope it gives you some new ideas, and I apologize for any language errors.

More ... 

A research paper on text similarity for the Arabic language, written by Moath Ibrahim Al-hadlaq at Al-Imam Muhammad Ibn Saud Islamic University in the Kingdom of Saudi Arabia, contains very good information and methods for Arabic phonetic algorithms. It also includes an improved version of my Arabic Soundex algorithm.

The research is available to download from here:

ftp://ftp3.ie.freebsd.org/pub/sourceforge/t/project/te/textsimilaritie/Text_Similarities.pdf

I have also attached a copy to the downloads.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Tammam Koujan
Software Developer (Senior)
Syrian Arab Republic
Syrian developer living and working in Dubai, UAE, focusing on .NET technologies and data-aware applications.

Article Copyright 2008 by Tammam Koujan