Licence CPOL
First Posted 11 Jun 2008

Arabic Soundex

By Tammam Koujan | 11 Jun 2008 | Article
An article about an Arabic version of the Soundex algorithm.

ArSoundex

Introduction

Soundex is a phonetic algorithm for indexing names by sound. Many applications use algorithms like this to power features such as Google's spelling corrections and Microsoft Word's AutoCorrect.

[Figure: google.gif, Google spelling correction]

[Figure: MSword.gif, Microsoft Word AutoCorrect]

Background

Most phonetic algorithms were developed for the English language; consequently, applying their rules to words in other languages might not give meaningful results. Only a few programs support this feature for Arabic (Google's spelling corrections is one). You can find many examples of English Soundex here on CodeProject or elsewhere on the Internet, but resources for Arabic are rare. After a long search, I found some academic research on the subject, and I believe this is the first article to illustrate an Arabic Soundex.

The Soundex algorithm works by grouping similar-sounding letters according to their phonetic features, as follows:

[Figure: english.gif, table of English letter groups and their Soundex codes]

To encode a word, the algorithm keeps the first letter, then replaces each following consonant with its digit from the table above. Vowels and the letters h, w, and y are ignored, because they cause ambiguity when combined with other letters.

Adjacent identical digits are then collapsed into a single digit. This is the simplest description of Soundex; other versions refine the encoding (for example, some replace the letter "x" with "ecs" before encoding). So, what we will do is:

  • Hold the first letter.
  • Replace the characters (a, e, i, o, u, h, w, y) with the value 0.
  • Replace the characters (b, f, p, v) with the value 1.
  • Replace the characters (c, g, j, k, q, s, x, z) with the value 2.
  • Replace the characters (d, t) with the value 3.
  • Replace the character (l) with the value 4.
  • Replace the characters (m, n) with the value 5.
  • Replace the character (r) with the value 6.
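
The steps above can be sketched in C# roughly as follows. This is a minimal illustration of the simplified description in this article, not the full official Soundex rules; the class name `EnglishSoundexSketch` and its members are hypothetical, and it assumes the standard grouping in which m and n share the value 5.

```csharp
using System;
using System.Text;

public static class EnglishSoundexSketch
{
    // Hypothetical helper: maps a letter to the digit group from the list above.
    private static int CodeOf(char c)
    {
        switch (char.ToLowerInvariant(c))
        {
            case 'b': case 'f': case 'p': case 'v': return 1;
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return 2;
            case 'd': case 't': return 3;
            case 'l': return 4;
            case 'm': case 'n': return 5;
            case 'r': return 6;
            default: return 0; // vowels, h, w, y and any other character
        }
    }

    public static string Encode(string word, int length)
    {
        if (string.IsNullOrEmpty(word))
            return "";

        var buffer = new StringBuilder();
        // Hold the first letter
        buffer.Append(char.ToUpperInvariant(word[0]));
        int prevCode = CodeOf(word[0]);

        for (int i = 1; i < word.Length && buffer.Length < length; i++)
        {
            int currCode = CodeOf(word[i]);
            // Skip zeros and collapse adjacent identical digits
            if (currCode != 0 && currCode != prevCode)
                buffer.Append(currCode);
            prevCode = currCode;
        }

        // Pad with zeros up to the requested length
        while (buffer.Length < length)
            buffer.Append('0');
        return buffer.ToString();
    }
}
```

With this sketch, `Encode("Robert", 4)` and `Encode("Rupert", 4)` both give `R163`.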

After that, we save the code for each word, and when the user enters a word to search for, we look up the words that have the same sound code.
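
One possible way to store and look up words by sound code is a dictionary keyed by the code. The sketch below is only an assumption about how such an index could look: `SoundIndex`, `Add`, and `Find` are hypothetical names, and the encoder is passed in as a delegate so that any Soundex variant can be plugged in.

```csharp
using System;
using System.Collections.Generic;

public class SoundIndex
{
    private readonly Func<string, string> encode;
    private readonly Dictionary<string, List<string>> wordsByCode =
        new Dictionary<string, List<string>>();

    // encode: any function that turns a word into its sound code
    public SoundIndex(Func<string, string> encode)
    {
        this.encode = encode;
    }

    // Save the word under its sound code
    public void Add(string word)
    {
        string code = encode(word);
        List<string> bucket;
        if (!wordsByCode.TryGetValue(code, out bucket))
        {
            bucket = new List<string>();
            wordsByCode[code] = bucket;
        }
        bucket.Add(word);
    }

    // Return every stored word whose code matches the query's code
    public IReadOnlyList<string> Find(string query)
    {
        List<string> bucket;
        return wordsByCode.TryGetValue(encode(query), out bucket)
            ? bucket
            : new List<string>();
    }
}
```

Keeping the encoder separate from the index means the same lookup structure could serve both the English and the Arabic encoding.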

For Arabic, I searched at length for an Arabic version of Soundex until I found a research paper by a team of five professors at the Illinois Institute of Technology. That paper is the basis for this Arabic Soundex.

What they do is:

  • Hold the first letter.
  • Replace the characters (ا, أ, إ, آ, ح, ع, غ, ش, و, ي) with the value 0.
  • Replace the characters (ف, ب) with the value 1.
  • Replace the characters (خ, ج, ز, س, ص, ظ, ق, ك) with the value 2.
  • Replace the characters (ت, ث, د, ذ, ض, ط) with the value 3.
  • Replace the character (ل) with the value 4.
  • Replace the characters (م, ن) with the value 5.
  • Replace the character (ر) with the value 6.
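
The grouping listed above can also be written down as data. The sketch below builds a lookup table from exactly those seven groups; the names `PaperArabicGroups`, `BuildTable`, and `CodeOf` are hypothetical, and returning -1 for unlisted characters is an assumption made here, not something the list specifies.

```csharp
using System.Collections.Generic;

public static class PaperArabicGroups
{
    // Index in this array = sound code; contents = the letters listed for that code
    private static readonly char[][] Groups =
    {
        // value 0
        new[] { 'ا', 'أ', 'إ', 'آ', 'ح', 'ع', 'غ', 'ش', 'و', 'ي' },
        // value 1
        new[] { 'ف', 'ب' },
        // value 2
        new[] { 'خ', 'ج', 'ز', 'س', 'ص', 'ظ', 'ق', 'ك' },
        // value 3
        new[] { 'ت', 'ث', 'د', 'ذ', 'ض', 'ط' },
        // value 4
        new[] { 'ل' },
        // value 5
        new[] { 'م', 'ن' },
        // value 6
        new[] { 'ر' }
    };

    // Build a letter -> code dictionary from the groups
    public static Dictionary<char, int> BuildTable()
    {
        var table = new Dictionary<char, int>();
        for (int code = 0; code < Groups.Length; code++)
            foreach (char letter in Groups[code])
                table[letter] = code;
        return table;
    }

    // Assumption: characters the list does not mention get -1
    public static int CodeOf(Dictionary<char, int> table, char letter)
    {
        int code;
        return table.TryGetValue(letter, out code) ? code : -1;
    }
}
```

A table like this makes it easy to move a letter between groups when experimenting with different categories.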

This strategy still does not give results as good as the English one, so I made some improvements to the research and got better results. My changes were:

  • Remove the characters (ا, أ, آ, إ) from the beginning of the word if present, because I noticed that they add confusion.
  • Ignore the first character: in English it is important to keep the first character, but in Arabic many words sound the same yet begin with different characters, so the first letter is replaced with a fixed value.
  • Update the sound categories by removing or adding some characters (in the code below, خ moves to the value 0 group and ه is added to it).

Using the code

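// Requires: using System.Text; (for StringBuilder)
public static string ArComputeintial(string word, int length)
{
    // Value to return
    string value = "";

    // Nothing to encode for a null or empty word
    if (string.IsNullOrEmpty(word))
        return value;

    // Remove a leading alef variant (ا, أ, إ, آ) if present
    switch (word[0])
    {
        case 'ا':
        case 'أ':
        case 'إ':
        case 'آ':
            word = word.Substring(1);
            break;
    }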

    // Size of the word to process
    int size = word.Length;
    // Make sure the word is at least two characters in length
    if (size > 1)
    {

        // Convert the word to character array for faster processing
        char[] chars = word.ToCharArray();
        // Buffer to build up with character codes
        StringBuilder buffer = new StringBuilder();
        // The current and previous character codes
        int prevCode = 0;
        int currCode = 0;
        // Ignore first character and replace it with fixed value

        buffer.Append('x');
       
        // Loop through all the characters and convert them to the proper character code
        for (int i = 1; i < size; i++)
        {
            switch (chars[i])
            {
                case 'ا':
                case 'أ':
                case 'إ':
                case 'آ':
                case 'ح':
                case 'خ':
                case 'ه':
                case 'ع':
                case 'غ':
                case 'ش':
                case 'و':
                case 'ي':
                    currCode = 0;
                    break;
                case 'ف':
                case 'ب':
                    currCode = 1;
                    break;
                
                case 'ج':
                case 'ز':
                case 'س':
                case 'ص':
                case 'ظ':
                case 'ق':
                case 'ك':
                    currCode = 2;
                    break;
                case 'ت':
                case 'ث':
                case 'د':
                case 'ذ':
                case 'ض':
                case 'ط':
                    currCode = 3;
                    break;
                case 'ل':
                    currCode = 4;
                    break;
                case 'م':
                case 'ن':
                    currCode = 5;
                    break;
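                case 'ر':
                    currCode = 6;
                    break;
                // Any other character (for example ة or ء) leaves currCode unchanged,
                // so it compares equal to prevCode below and is skipped
            }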

            // Check to see if the current code is the same as the last one
            if (currCode != prevCode)
            {
                // Check to see if the current code is 0 (a vowel); do not process vowels
                if (currCode != 0)
                    buffer.Append(currCode);
            }
            // Set the new previous character code
            prevCode = currCode;
            // If the buffer size meets the length limit, then exit the loop
            if (buffer.Length == length)
                break;
        }
        // Pad the buffer, if required
        size = buffer.Length;
        if (size < length)
            buffer.Append('0', (length - size));
        // Set the value to return
        value = buffer.ToString();
    }
    // Return the value
    return value;
}
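
As a usage illustration, two words can be compared by encoding both and checking that the codes match. The helper below is a hypothetical sketch: `SoundexCompare` and `SoundsAlike` are not part of the article's download, and the encoder is passed in as a delegate with the same shape as `ArComputeintial` (word and length in, code out).

```csharp
using System;

public static class SoundexCompare
{
    // encode: a function with the same signature as ArComputeintial
    public static bool SoundsAlike(
        Func<string, int, string> encode, string first, string second, int length)
    {
        string firstCode = encode(first, length);
        string secondCode = encode(second, length);
        // An empty code means the word could not be encoded, so never match on it
        return firstCode.Length > 0 && firstCode == secondCode;
    }
}
```

For example, passing `ArComputeintial` as the encoder with the words محمد and مهمد and a length of 4 should produce the code `x530` for both, so they are reported as sounding alike.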

Points of Interest

This is a simple version of the algorithm. Given the complex nature of the Arabic language, additional research could improve it.

I believe the algorithm needs more testing to make sure it works correctly. This is my first article on The Code Project. I hope it gives you some new ideas, and I apologize for any language errors.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Tammam Koujan

Software Developer (Senior)

Syrian Arab Republic

Last Updated 11 Jun 2008
Article Copyright 2008 by Tammam Koujan