How To Convert an HTML Table into an XML File
This article demonstrates how to efficiently convert an HTML table into an XML file
Introduction
We may find useful information on the web and want to use it as XML. Unfortunately, most HTML pages are not well-formed XML.
This article shows how to convert an HTML table into an XML file that contains only the data you need.
Background
XML and HTML are both markup languages, but most HTML pages do not follow XML's stricter rules (for example, every element must be closed and every attribute value must be quoted). So we need something that converts a loosely written HTML page into well-formed XHTML.
HTML Tidy does this job for you. It is a small, free tool originally developed by Dave Raggett. You can find more information at SourceForge.
Both a command-line executable and a DLL are available for download.
Example
I was building an XML file of the most commonly used English words. I found a page that lists them and downloaded it.
The content of the HTM file is a mess. Just take a peek at it:
<html><head><title>Word frequency list</title></head><body><br>
<strong>Words listed alphabetically: the first 2000 most frequent
words from the Brown Corpus (1,015,945 words)</strong> <hr color="#ff0000">
<table><tbody><tr><td> </td><td>Word</td><td>Instances</td><td>% Frequency</td></tr>
<tr><td>1.</td><td><a href="http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?
SearchStr=a" target="Vocabulary">a</a></td><td align="center" bgcolor="#ffffcc">
23363</td><td align="center" bgcolor="aqua">2.2996</td></tr>
<tr><td>2.</td><td><a href="http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?
SearchStr=ability" target="Vocabulary">ability</a></td><td align="center"
bgcolor="#ffffcc">74</td><td align="center" bgcolor="aqua">0.0073</td></tr>
<tr><td>3.</td><td><a href="http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?
SearchStr=able" target="Vocabulary">able</a></td><td align="center"
bgcolor="#ffffcc">216</td><td align="center" bgcolor="aqua">0.0213</td></tr>
Then I ran the following under the command line:
tidy -asxhtml -numeric < words2000abc.htm > word2000.xml
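This generates a well-formed XHTML file named word2000.xml. The -asxhtml option converts the markup to XHTML, and -numeric writes entities as numeric references rather than named ones. Because XHTML is HTML reformulated as XML, any XML parser can read this file directly. Still, it would be nice to trim it a little.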
The content of the XHTML file looks much better now:
<tbody>
<tr>
<td>1.</td>
<td><a href=
"http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?SearchStr=a"
target="Vocabulary">a</a></td>
<td align="center" bgcolor="#FFFFCC">23363</td>
<td align="center" bgcolor="aqua">2.2996</td>
</tr>
<tr>
<td>2.</td>
<td><a href=
"http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?SearchStr=ability"
target="Vocabulary">ability</a></td>
<td align="center" bgcolor="#FFFFCC">74</td>
<td align="center" bgcolor="aqua">0.0073</td>
</tr>
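Because the tidy output is well-formed, it can be loaded with the standard .NET XML classes. The following is a minimal sketch, not part of the original program: the class name XhtmlLoader, the method LoadXhtml, and the inline sample are hypothetical. It assumes the file starts with an XHTML DOCTYPE, which a default XmlDocument.Load may try to resolve over the network, so the sketch tells the reader to ignore the DTD. This relies on the -numeric option, since named entities such as &nbsp; cannot be resolved without the DTD.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XhtmlLoader
{
    // Parses XHTML text into an XmlDocument without resolving the DTD
    // named in the DOCTYPE declaration.
    public static XmlDocument LoadXhtml(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip the DOCTYPE instead of parsing it
            XmlResolver = null                    // never resolve external resources
        };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            var doc = new XmlDocument();
            doc.Load(reader);
            return doc;
        }
    }

    public static void Main()
    {
        // Hypothetical sample standing in for the tidy output.
        string xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table>" +
            "<tr><td>1.</td><td>a</td><td>23363</td><td>2.2996</td></tr>" +
            "</table></body></html>";

        XmlDocument doc = LoadXhtml(new StringReader(xhtml));
        // GetElementsByTagName matches on the qualified name, so "tr" is found
        // even though the elements live in the XHTML namespace.
        Console.WriteLine("Rows found: " + doc.GetElementsByTagName("tr").Count);
    }
}
```

For a file on disk, a StreamReader opened on word2000.xml can be passed in place of the StringReader.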
Furthermore, we don't need every column in the table. From the page I downloaded, I only need the word column and the frequency column.
Now let's use the XML support available to C# in the .NET Framework to solve this problem.
First, we need a class that represents one entry in the output XML.
[Serializable]
public class WordEntry
{
    public string word;
    public double probability;
}
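Remember to declare the class and its fields as public. XmlSerializer only serializes public types and public members, and it throws an exception when given a non-public class. The [Serializable] attribute is harmless here but optional: XmlSerializer does not require it (it is used by the binary and SOAP formatters).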
Then, we have:
List<WordEntry> listofEntry = new List<WordEntry>();
A List<WordEntry> can be serialized directly by XmlSerializer.
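As a minimal sketch of what serializing the list directly looks like, the example below writes a small list to a string instead of a file. The class SerializeDemo, the method ToXml, and the two sample entries are hypothetical and only serve the illustration. Note that a StringWriter makes the XML declaration report utf-16, whereas a file written through a StreamWriter reports utf-8.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class SerializeDemo
{
    // Serializes the list to an XML string; the root element is named
    // ArrayOfWordEntry by default.
    public static string ToXml(List<WordEntry> entries)
    {
        var serializer = new XmlSerializer(typeof(List<WordEntry>));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, entries);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Sample values taken from the first two rows of the table.
        var entries = new List<WordEntry>
        {
            new WordEntry { word = "a", probability = 2.2996 },
            new WordEntry { word = "ability", probability = 0.0073 }
        };
        Console.WriteLine(ToXml(entries));
    }
}
```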
Then let's read the XHTML file from word2000.xml that we generated via tidy.exe:
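// Requires: using System; using System.Collections.Generic;
//           using System.Globalization; using System.Xml;
private static void ReadFromXMLFile(ref List<WordEntry> listOfEntry)
{
    try
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(@"C:\Code\c#\ConsoleTestTemp\ConsoleTestTemp\word2000.xml");
        foreach (XmlNode node in doc.GetElementsByTagName("tr"))
        {
            WordEntry newEntry = new WordEntry();
            bool hasFrequency = false;
            // The counter is the zero-based column index within the current row.
            int counter = 0;
            foreach (XmlNode childNode in node.ChildNodes)
            {
                // We need the second column (the word) and the fourth (the frequency).
                if (counter == 1)
                    newEntry.word = childNode.InnerText.Trim().ToLower();
                // TryParse skips the header row, whose fourth cell is not a number;
                // the invariant culture makes "2.2996" parse the same on every locale.
                if (counter == 3)
                    hasFrequency = double.TryParse(childNode.InnerText.Trim(),
                        NumberStyles.Float, CultureInfo.InvariantCulture,
                        out newEntry.probability);
                counter++;
            }
            if (hasFrequency)
                listOfEntry.Add(newEntry);
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("Error: " + e.Message + "\r\n" + e.StackTrace);
    }
}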
Done.
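The column-picking logic can also be separated from the file access so that it is easy to try on a small in-memory sample. The following is a sketch under stated assumptions rather than the article's original code: RowParser, ParseRows, and the inline table are hypothetical names and data. It assumes the word is in the second td cell and the frequency in the fourth, skips any row whose fourth cell is not a number (such as the header row), and parses with the invariant culture.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class RowParser
{
    public static List<WordEntry> ParseRows(XmlDocument doc)
    {
        var entries = new List<WordEntry>();
        foreach (XmlNode row in doc.GetElementsByTagName("tr"))
        {
            // Keep only the <td> element children of this row.
            var cells = new List<XmlNode>();
            foreach (XmlNode child in row.ChildNodes)
            {
                if (child.NodeType == XmlNodeType.Element && child.LocalName == "td")
                    cells.Add(child);
            }
            if (cells.Count < 4)
                continue; // not a data row

            double probability;
            bool isNumber = double.TryParse(
                cells[3].InnerText.Trim(),
                NumberStyles.Float,
                CultureInfo.InvariantCulture,
                out probability);
            if (!isNumber)
                continue; // header row: fourth cell is text, not a number

            entries.Add(new WordEntry
            {
                word = cells[1].InnerText.Trim().ToLowerInvariant(),
                probability = probability
            });
        }
        return entries;
    }

    public static void Main()
    {
        // Hypothetical sample shaped like the table in word2000.xml.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<table>" +
            "<tr><td> </td><td>Word</td><td>Instances</td><td>% Frequency</td></tr>" +
            "<tr><td>1.</td><td><a href=\"#\">a</a></td><td>23363</td><td>2.2996</td></tr>" +
            "<tr><td>2.</td><td><a href=\"#\">Ability</a></td><td>74</td><td>0.0073</td></tr>" +
            "</table>");

        foreach (WordEntry entry in ParseRows(doc))
            Console.WriteLine(entry.word + " " +
                entry.probability.ToString(CultureInfo.InvariantCulture));
    }
}
```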
Then let's write to our own XML file:
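// Requires: using System.Collections.Generic; using System.IO;
//           using System.Xml.Serialization;
List<WordEntry> listofEntry = new List<WordEntry>();
ReadFromXMLFile(ref listofEntry);
XmlSerializer s = new XmlSerializer(typeof(List<WordEntry>));
// The using block closes the file even if Serialize throws.
using (TextWriter w = new StreamWriter(
    @"C:\Code\c#\ConsoleTestTemp\ConsoleTestTemp\vocabulary.xml"))
{
    s.Serialize(w, listofEntry);
}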
Done. You now have a well-formed XML file:
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfWordEntry xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<WordEntry>
<word>a</word>
<probability>2.2996</probability>
</WordEntry>
<WordEntry>
<word>ability</word>
<probability>0.0073</probability>
</WordEntry>
<WordEntry>
<word>able</word>
<probability>0.0213</probability>
</WordEntry>
<WordEntry>
<word>about</word>
<probability>0.1787</probability>
</WordEntry>
<WordEntry>
<word>above</word>
<probability>0.0291</probability>
</WordEntry>
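A file in this shape can be read back into a List<WordEntry> with the same serializer type. The sketch below is an illustration only: VocabularyReader, FromXml, and the shortened inline XML are hypothetical, and the example reads from a string so that it does not depend on any file path.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class VocabularyReader
{
    // Deserializes XML with an ArrayOfWordEntry root back into a list.
    public static List<WordEntry> FromXml(TextReader input)
    {
        var serializer = new XmlSerializer(typeof(List<WordEntry>));
        return (List<WordEntry>)serializer.Deserialize(input);
    }

    public static void Main()
    {
        // Shortened stand-in for the contents of vocabulary.xml.
        string xml =
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
            "<ArrayOfWordEntry>" +
            "<WordEntry><word>a</word><probability>2.2996</probability></WordEntry>" +
            "<WordEntry><word>ability</word><probability>0.0073</probability></WordEntry>" +
            "</ArrayOfWordEntry>";

        List<WordEntry> entries = FromXml(new StringReader(xml));
        foreach (WordEntry entry in entries)
            Console.WriteLine(entry.word);
    }
}
```

To read the real file, a StreamReader opened on vocabulary.xml can be passed instead of the StringReader.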
Conclusion
So, a few dozen lines of code can be a big help, can't they? I hope this is useful to you.
History
- 23rd August, 2008: Initial post