Introduction
The Pauper Man Dictionary is an English-English dictionary application for Windows Mobile 2003 phones. The idea came from combining two articles, "Google Suggest like Dictionary" and "Dictionary for Google Suggest like Dictionary", and from my need for an English dictionary on my old Windows Mobile cell phone.
Background
The Visual Studio 2008 solution includes two projects.
The first one is a WinForms application that downloads the data from The Online Plain Text English Dictionary and uses it to create the SQLite.Net database file. That dictionary is based on "The Project Gutenberg Etext of Webster's Unabridged Dictionary", which in turn is based on the 1913 US Webster's Unabridged Dictionary.
The second project is the Pocket PC (PPC) implementation, which runs on a Windows Mobile 2003 cell phone and uses the same SQLite database file.
Using the Code
The first problem to solve is reading the HTML page and splitting it into individual words so that they can be stored in the DB file. To keep the application responsive, I use a BackgroundWorker control, which runs the download and the word processing on a separate thread. It is also necessary to remove all HTML tags from the page. I found a good example of how to do this, shown below.
using System.Text.RegularExpressions;

class HTMLremover
{
    // Removes tags with a regex that is parsed on every call.
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same pattern, compiled once and reused.
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    // Removes tags without a regex: copies only the characters
    // that are outside a '<' ... '>' pair.
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
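As a quick check of what the tag removal produces, here is a minimal sketch that applies the same pattern as StripTagsRegex to one entry. The sample line and the names StripTagsDemo and Strip are assumptions made up for illustration; the real pages may be laid out differently.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripTagsDemo
{
    // Same pattern as StripTagsRegex: lazily match from '<' to the nearest '>'.
    public static string Strip(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    static void Main()
    {
        // Hypothetical entry, invented for this example.
        string html = "<P><B>Abacus</B> (<I>n.</I>) A calculating table or frame.</P>";
        Console.WriteLine(Strip(html));
        // prints: Abacus (n.) A calculating table or frame.
    }
}
```

Note that by default '.' does not match a newline, so a tag broken across two lines would survive the regex versions unless RegexOptions.Singleline is used. The character-array version does not have that limitation. None of the three methods decodes HTML entities such as &amp;.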
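After analyzing the text of the pages, I found that the characters '(' and ')' are the key to splitting each entry: the word comes before the opening parenthesis, the word type sits between the parentheses, and the meaning follows the closing parenthesis.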
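The splitting rule can be sketched on its own as follows. This is only an illustration under assumptions: the sample entry and the names EntrySplitDemo and SplitEntry are hypothetical, and the sketch uses IndexOf and Substring instead of a character loop, although it is meant to yield the same three fields.

```csharp
using System;

static class EntrySplitDemo
{
    // Splits "Word (type) meaning" into three fields:
    // [0] text before the first '(', [1] "(...)" including both parentheses,
    // [2] everything after the first ')' that follows the '('.
    public static string[] SplitEntry(string line)
    {
        string[] field = { string.Empty, string.Empty, string.Empty };
        int open = line.IndexOf('(');
        if (open < 0)
        {
            field[0] = line; // no parenthesis: the whole line stays in field 0
            return field;
        }
        field[0] = line.Substring(0, open);
        int close = line.IndexOf(')', open);
        if (close < 0)
        {
            field[1] = line.Substring(open); // unclosed parenthesis
            return field;
        }
        field[1] = line.Substring(open, close - open + 1);
        field[2] = line.Substring(close + 1);
        return field;
    }

    static void Main()
    {
        // Hypothetical entry, invented for this example.
        string[] f = SplitEntry("Abacus (n.) A calculating table or frame.");
        Console.WriteLine("[" + f[0] + "]"); // prints: [Abacus ]
        Console.WriteLine("[" + f[1] + "]"); // prints: [(n.)]
        Console.WriteLine("[" + f[2] + "]"); // prints: [ A calculating table or frame.]
    }
}
```

The fields keep their surrounding spaces and the type keeps its parentheses. Calling Trim() on each field before storing it is one possible refinement if exact-match lookups on the word are needed.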
This is the portion of the code where the main work takes place:
private bool DowloadData(BackgroundWorker worker, DoWorkEventArgs e)
{
    // dataReturn[0] = number of words stored so far,
    // dataReturn[1] = letter (page) being processed.
    string[] dataReturn = new string[2];
    int wordCount = 0;
    // One page per letter: ASCII 97 ('a') to 122 ('z').
    for (int asciiCode = 97; asciiCode <= 122; asciiCode++)
    {
        char page = (char)asciiCode;
        string connString = "Data Source = dict.db";
        SQLiteConnection sqConnection = new SQLiteConnection(connString);
        sqConnection.Open();
        dataReturn[0] = wordCount.ToString();
        dataReturn[1] = page.ToString();
        worker.ReportProgress(0, dataReturn);
        // One transaction per page keeps the inserts fast.
        SQLiteTransaction sqTrans =
            sqConnection.BeginTransaction(System.Data.IsolationLevel.ReadCommitted);
        SQLiteCommand sqCommand = new SQLiteCommand();
        sqCommand.Transaction = sqTrans;
        sqCommand.Connection = sqConnection;
        // Three positional parameters: word, type, meaning.
        sqCommand.Parameters.Add(new SQLiteParameter());
        sqCommand.Parameters.Add(new SQLiteParameter());
        sqCommand.Parameters.Add(new SQLiteParameter());
        // Download the page for the current letter.
        WebRequest request = WebRequest.Create
            ("http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_" +
            page.ToString() + ".html");
        request.Credentials = CredentialCache.DefaultCredentials;
        request.Proxy.Credentials = CredentialCache.DefaultNetworkCredentials;
        WebResponse response = request.GetResponse();
        StreamReader responseReader =
            new StreamReader(response.GetResponseStream());
        string responseData = responseReader.ReadToEnd();
        // Release the network resources once the page is in memory.
        responseReader.Close();
        response.Close();
        // Strip the HTML tags and save the text so it can be read line by line.
        string textInPage = HTMLremover.StripTagsRegex(responseData);
        StreamWriter tempOutput = new StreamWriter("temp.txt");
        tempOutput.Write(textInPage);
        tempOutput.Close();
        int letterSize = textInPage.Length;
        StreamReader text = new StreamReader("temp.txt");
        string line;
        int textProcessed = 0;
        try
        {
            while ((line = text.ReadLine()) != null)
            {
                textProcessed += line.Length;
                int percentage = (int)(textProcessed * 100 / letterSize);
                // Only lines that contain '(' are dictionary entries.
                if (line != string.Empty && line.Contains('('))
                {
                    // field[0] = word, field[1] = "(type)", field[2] = meaning.
                    string[] field = new string[3];
                    field[0] = string.Empty;
                    field[1] = string.Empty;
                    field[2] = string.Empty;
                    char[] letters = line.ToCharArray();
                    int fieldNumber = 0;
                    foreach (char character in letters)
                    {
                        // The first '(' starts the type field.
                        if (fieldNumber == 0 && character == '(')
                        {
                            fieldNumber++;
                        }
                        field[fieldNumber] += character.ToString();
                        // The first ')' ends the type field; the rest is the meaning.
                        if (fieldNumber == 1 && character == ')')
                        {
                            fieldNumber++;
                        }
                    }
                    // Skip lines whose "word" is too long to be a real entry.
                    if (field[0].Length < 30)
                    {
                        dataReturn[0] = wordCount.ToString();
                        dataReturn[1] = page.ToString();
                        worker.ReportProgress(percentage, dataReturn);
                        wordCount++;
                        sqCommand.Parameters[0].Value = field[0];
                        sqCommand.Parameters[1].Value = field[1];
                        sqCommand.Parameters[2].Value = field[2];
                        sqCommand.CommandText =
                            @"INSERT INTO [dict] ([word], [type], [mean]) " +
                            "VALUES (?, ?, ?)";
                        sqCommand.ExecuteNonQuery();
                    }
                }
            }
            dataReturn[0] = wordCount.ToString();
            dataReturn[1] = page.ToString();
            worker.ReportProgress(100, dataReturn);
            sqTrans.Commit();
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.ToString());
        }
        finally
        {
            // Always release the temp file and the connection.
            text.Close();
            sqConnection.Close();
        }
    }
    File.Delete("temp.txt");
    return true;
}
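The method above passes a string array as the second argument of worker.ReportProgress, which is how more than a percentage reaches the form. The receiving side is not shown in this article, so the following is only a minimal console sketch of the idea. The names ProgressStateDemo and RunOnce and the sample values are assumptions, and a real WinForms form would update its controls in the ProgressChanged handler instead of building a string.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

static class ProgressStateDemo
{
    // Runs a worker that reports one progress update carrying a string[]
    // (word count, current letter) and returns the text built by the handler.
    public static string RunOnce()
    {
        string received = null;
        ManualResetEvent progressSeen = new ManualResetEvent(false);

        BackgroundWorker worker = new BackgroundWorker();
        worker.WorkerReportsProgress = true; // required before calling ReportProgress

        worker.DoWork += delegate(object sender, DoWorkEventArgs e)
        {
            string[] dataReturn = { "120", "a" }; // hypothetical values
            ((BackgroundWorker)sender).ReportProgress(50, dataReturn);
        };

        worker.ProgressChanged += delegate(object sender, ProgressChangedEventArgs e)
        {
            // The second argument of ReportProgress arrives as e.UserState.
            string[] state = (string[])e.UserState;
            received = e.ProgressPercentage + "% of letter '" + state[1] +
                "', " + state[0] + " words";
            progressSeen.Set();
        };

        worker.RunWorkerAsync();
        // Blocking like this is only for the console demo; never block a UI thread.
        progressSeen.WaitOne();
        return received;
    }

    static void Main()
    {
        Console.WriteLine(RunOnce());
        // prints: 50% of letter 'a', 120 words
    }
}
```

In a WinForms application the ProgressChanged handler runs on the UI thread, so it can safely set a label or a progress bar from the values in e.UserState.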
Points of Interest
This code is useful if you want to see how to open and read a web page from your own code, or how to use a BackgroundWorker control to report additional information and not only the percentage of progress. It also shows how to remove the HTML tags from downloaded web pages. It uses SQLite.Net for the DB work on both platforms, Windows 7 (which is what I am using) and Windows Mobile. At this point, it is worth mentioning that there is a speed problem with SELECT statements in SQLite.Net.
Finally, this code shows a simple way to build your own dictionary for your old cell phone. If you want a better free one (without the source code), you can visit MDict.
History
- 14th August, 2009: Initial post