Licence GPL3
First Posted 14 Aug 2009

The Pauper Man Dictionary

By ignotus confutatis | 14 Aug 2009 | Article
Create your own Pocket PC English dictionary by downloading data from web pages

Introduction

The Pauper Man Dictionary is a Windows Mobile 2003 phone application that provides an English-English dictionary. The idea came from combining the articles "Google Suggest like Dictionary" and "Dictionary for Google Suggest like Dictionary", and from needing an English dictionary on my old Windows Mobile cell phone.

Background

The Visual Studio 2008 solution includes two projects.

The first one is a WinForms application that downloads the data from The Online Plain Text English Dictionary and uses it to create the SQLite.Net database file. That dictionary is based on "The Project Gutenberg Etext of Webster's Unabridged Dictionary", which in turn is based on the 1913 US Webster's Unabridged Dictionary.

The second project is the PPC implementation for use on a Windows Mobile 2003 cell phone, using the same SQLite database file.

Using the Code

The first problem to solve is to “read” each HTML page and split its text into entries that can be stored in the DB file. For better performance, I use a BackgroundWorker control so that the download and the word processing run on another thread. It is also necessary to remove all HTML tags from the page. I found a good example here.

    using System.Text.RegularExpressions;

    class HTMLremover
    {
        /// <summary>
        /// Remove HTML from string with Regex.
        /// </summary>
        public static string StripTagsRegex(string source)
        {
            return Regex.Replace(source, "<.*?>", string.Empty);
        }

        /// <summary>
        /// Compiled regular expression for performance.
        /// </summary>
        static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

        /// <summary>
        /// Remove HTML from string with compiled Regex.
        /// </summary>
        public static string StripTagsRegexCompiled(string source)
        {
            return _htmlRegex.Replace(source, string.Empty);
        }

        /// <summary>
        /// Remove HTML tags from string using char array.
        /// </summary>
        public static string StripTagsCharArray(string source)
        {
            char[] array = new char[source.Length];
            int arrayIndex = 0;
            bool inside = false;

            for (int i = 0; i < source.Length; i++)
            {
                char let = source[i];
                if (let == '<')
                {
                    inside = true;
                    continue;
                }
                if (let == '>')
                {
                    inside = false;
                    continue;
                }
                if (!inside)
                {
                    array[arrayIndex] = let;
                    arrayIndex++;
                }
            }
            return new string(array, 0, arrayIndex);
        }
    }
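
As a quick check of what the regex variant does, here is a minimal sketch that applies the same pattern to one line. The class name StripDemo and the sample HTML line are made up for illustration; only the pattern comes from the class above.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripDemo
{
    // Same pattern as StripTagsRegex above: "<", then as few characters
    // as possible, then ">".
    public static string StripTags(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    static void Main()
    {
        // Hypothetical sample line, not copied from the real pages.
        string html = "<P><B>Abandon</B> (<I>v. t.</I>) To cast or drive out.</P>";

        Console.WriteLine(StripTags(html));
        // prints: Abandon (v. t.) To cast or drive out.
    }
}
```

Two limits of this pattern are worth keeping in mind: it leaves HTML entities such as &amp; untouched, and because "." does not match a line break by default, a tag that spans two lines is not removed.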

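After analyzing the text of the pages, I found that the characters ‘(’ and ‘)’ are the key to splitting each entry: the word comes before the opening parenthesis, the type of word sits between the parentheses, and the meaning follows the closing parenthesis.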
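
The parenthesis rule can be sketched in isolation as follows. This is not the code used in the project (that one walks the line character by character); it is an equivalent idea written with IndexOf, and the names EntryParser and Split, the trimming, and the sample line are assumptions made for the example.

```csharp
using System;

static class EntryParser
{
    // Splits "word (type) meaning" into { word, type, meaning }.
    // Returns null when the line has no "(" followed by ")".
    public static string[] Split(string line)
    {
        int open = line.IndexOf('(');
        if (open < 0) return null;

        int close = line.IndexOf(')', open);
        if (close < 0) return null;

        string word = line.Substring(0, open).Trim();
        // The parentheses are kept in the type field.
        string type = line.Substring(open, close - open + 1);
        string meaning = line.Substring(close + 1).Trim();

        return new[] { word, type, meaning };
    }

    static void Main()
    {
        // Hypothetical entry line, already stripped of HTML tags.
        string[] fields = Split("Abandon (v. t.) To cast or drive out.");

        Console.WriteLine(fields[0]); // prints: Abandon
        Console.WriteLine(fields[1]); // prints: (v. t.)
        Console.WriteLine(fields[2]); // prints: To cast or drive out.
    }
}
```

Only the first pair of parentheses is used, so parentheses that appear later inside the meaning stay in the meaning.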

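This is the portion of the code where the main work takes place. It relies on the System.ComponentModel, System.Data.SQLite, System.IO, System.Linq, System.Net and System.Windows.Forms namespaces:

/// <summary>
/// Method to download data and insert to DB
/// </summary>
/// <param name="worker">Background worker used to report progress</param>
/// <param name="e">Arguments of the DoWork event</param>
/// <returns>true once all pages have been processed</returns>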
private bool DowloadData(BackgroundWorker worker, DoWorkEventArgs e)
{
    string[] dataReturn = new string[2];

    int wordCount = 0;

    for (int asciiCode = 97; asciiCode <= 122; asciiCode++) //Processing from 'a' to 'z'
    {
        char page = (char)asciiCode;
        string connString = "Data Source = dict.db";
        SQLiteConnection sqConnection = new SQLiteConnection(connString);
        sqConnection.Open();

        dataReturn[0] = wordCount.ToString();
        dataReturn[1] = page.ToString();

        worker.ReportProgress(0, dataReturn);

        SQLiteTransaction sqTrans =
            sqConnection.BeginTransaction(System.Data.IsolationLevel.ReadCommitted);

        SQLiteCommand sqCommand = new SQLiteCommand();

        sqCommand.Transaction = sqTrans;
        sqCommand.Connection = sqConnection;

        sqCommand.Parameters.Add(new SQLiteParameter());
        sqCommand.Parameters.Add(new SQLiteParameter());
        sqCommand.Parameters.Add(new SQLiteParameter());

        WebRequest request = WebRequest.Create(
            "http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_" +
            page.ToString() + ".html");
        request.Credentials = CredentialCache.DefaultCredentials;
        if (request.Proxy != null) //Proxy can be null when none is configured
        {
            request.Proxy.Credentials = CredentialCache.DefaultNetworkCredentials;
        }

        //Dispose the response and its reader once the page has been read
        string responseData;
        using (WebResponse response = request.GetResponse())
        using (StreamReader responseReader =
            new StreamReader(response.GetResponseStream()))
        {
            responseData = responseReader.ReadToEnd();
        }

        //Remove tags
        string textInPage = HTMLremover.StripTagsRegex(responseData);
        StreamWriter tempOutput = new StreamWriter("temp.txt");
        tempOutput.Write(textInPage);
        tempOutput.Close();
        int letterSize = textInPage.Length;  //Used to calculate % of the letter
        StreamReader text = new StreamReader("temp.txt");

        //Add data to DB
        string line;
        int textProcessed = 0;

        try
        {
            while ((line = text.ReadLine()) != null)
            {
                textProcessed += line.Length;
                int percentage = (int)(textProcessed * 100 / letterSize);

                if (line != string.Empty && line.Contains('('))
                {
                    string[] field = new string[3];
                    field[0] = string.Empty;
                    field[1] = string.Empty;
                    field[2] = string.Empty;

                    char[] letters = line.ToCharArray();

                    int fieldNumber = 0;

                    foreach (char character in letters)
                    {
                        if (fieldNumber == 0 && character == '(')
                        {
                            fieldNumber++;
                        }

                        field[fieldNumber] += character.ToString();

                        if (fieldNumber == 1 && character == ')')
                        {
                            fieldNumber++;
                        }
                    }

                    if (field[0].Length < 30)
                    {
                        dataReturn[0] = wordCount.ToString();
                        dataReturn[1] = page.ToString();
                        worker.ReportProgress(percentage, dataReturn);

                        wordCount++;

                        sqCommand.Parameters[0].Value = field[0];
                        sqCommand.Parameters[1].Value = field[1];
                        sqCommand.Parameters[2].Value = field[2];

                        sqCommand.CommandText =
                                @"INSERT INTO [dict] ([word], [type], [mean]) " +
                                "VALUES (?, ?, ?)";
                        sqCommand.ExecuteNonQuery();
                    }
                }
            }

            dataReturn[0] = wordCount.ToString();
            dataReturn[1] = page.ToString();
            worker.ReportProgress(100, dataReturn);
            sqTrans.Commit();
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.ToString());
        }
        finally
        {
            sqConnection.Close();
        }

        text.Close();
    }
    File.Delete("temp.txt");
    return true;
}

Points of Interest

This code is useful if you want to see how to open and “read” a web page from your own code, or how to use the BackgroundWorker control to report additional information and not only the percentage of progress. It also shows how to remove the HTML tags from downloaded pages. It uses SQLite.Net for the DB work on both platforms, Windows 7 (that’s what I'm using) and Windows Mobile. At this point, it is worth mentioning that I ran into a speed problem with the “SELECT” statement in SQLite.Net.
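
The BackgroundWorker point can be sketched on its own. The second argument of ReportProgress carries any object to the ProgressChanged handler, where it arrives as e.UserState; the download method above passes a string[] holding the word count and the current letter. The class name ProgressDemo, the Describe helper and the fake word counts below are made up for the example.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

static class ProgressDemo
{
    // Turns a progress event into a status line. UserState is expected to
    // be a string[] { wordCount, letter }, as in the download method.
    public static string Describe(ProgressChangedEventArgs e)
    {
        string[] data = (string[])e.UserState;
        return string.Format("letter {0}: {1}% done, {2} words so far",
                             data[1], e.ProgressPercentage, data[0]);
    }

    static void Main()
    {
        ManualResetEvent done = new ManualResetEvent(false);
        BackgroundWorker worker = new BackgroundWorker();
        worker.WorkerReportsProgress = true; // required by ReportProgress

        worker.DoWork += (sender, e) =>
        {
            BackgroundWorker bw = (BackgroundWorker)sender;
            string[] letters = { "a", "b", "c" };
            for (int i = 0; i < letters.Length; i++)
            {
                int wordCount = (i + 1) * 10; // fake count for the demo
                bw.ReportProgress((i + 1) * 100 / letters.Length,
                                  new[] { wordCount.ToString(), letters[i] });
            }
        };

        // In a WinForms application this handler runs on the UI thread.
        // In a console program it runs on thread-pool threads, so the
        // order of the printed lines is not guaranteed.
        worker.ProgressChanged += (sender, e) => Console.WriteLine(Describe(e));
        worker.RunWorkerCompleted += (sender, e) => done.Set();

        worker.RunWorkerAsync();
        done.WaitOne();
    }
}
```

Packing the extra values into an array keeps the handler signature unchanged. A small class with named fields would work the same way and avoids converting the count to and from a string.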

Finally, this code shows a simple way to build your own dictionary for an old cell phone. If you want a better free one (without the source code), you can visit MDict.

History

  • 14th August, 2009: Initial post

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3).

About the Author

ignotus confutatis

Software Developer

Mexico

Civil Engineer and C# Developer

Article Copyright 2009 by ignotus confutatis