How To Convert an HTML Table into an XML File

23 Aug 2007, CPOL, 2 min read
This article demonstrates how to efficiently convert an HTML table into an XML file

Introduction

We often find useful tabular data on the internet and want to reuse it as XML. Unfortunately, most HTML pages are not well-formed XML, so they cannot be fed to an XML parser directly.

This article tells you how to efficiently convert an HTML table into an XML file.

Background

XML and HTML are both markup languages, but most HTML pages do not follow XML's stricter rules (for example, every element must be closed and attribute values must be quoted). We therefore need a tool that converts loosely written HTML into well-formed XHTML.

HTML Tidy is a free, lightweight tool that does exactly this. It was originally developed by Dave Raggett. You can find more information at SourceForge.

You can download the command line *.exe file or DLL here.

Example

I was building an XML file of the most commonly used English words. I found a suitable page here and downloaded it.

The content of the HTM file is a mess. Just take a peek at it:

HTML
<html><head><title>Word frequency list</title></head><body><br>
<strong>Words listed alphabetically: the first 2000 most frequent
 words from the Brown Corpus (1,015,945 words)</strong> <hr color="#ff0000">
<table><tbody><tr><td> </td><td>Word</td><td>Instances</td><td>% Frequency</td></tr>
<tr><td>1.</td><td><a href="http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?
SearchStr=a" target="Vocabulary">a</a></td><td align="center" bgcolor="#ffffcc">
23363</td><td align="center" bgcolor="aqua">2.2996</td></tr>
<tr><td>2.</td><td><a href="http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?
SearchStr=ability" target="Vocabulary">ability</a></td><td align="center" 
bgcolor="#ffffcc">74</td><td align="center" bgcolor="aqua">0.0073</td></tr>
<tr><td>3.</td><td><a href="http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?
SearchStr=able" target="Vocabulary">able</a></td><td align="center" 
bgcolor="#ffffcc">216</td><td align="center" bgcolor="aqua">0.0213</td></tr>

Then I ran the following under the command line:

tidy -asxhtml -numeric < words2000abc.htm > word2000.xml

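This generates a well-formed XHTML file named word2000.xml. The -asxhtml option makes Tidy emit XHTML, and -numeric makes it write numeric character references instead of named entities. Because XHTML is an application of XML, the file can be read directly by XML tools. Still, it would be nice to trim it a little.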
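To illustrate why the conversion matters, here is a minimal sketch that loads XHTML with `XmlDocument` and counts the table rows. The `SampleXhtml` string is a hypothetical, heavily trimmed stand-in for Tidy's output, and the class and method names are made up for this example.

```csharp
using System;
using System.Xml;

public static class XhtmlLoadDemo
{
    // Hypothetical, trimmed stand-in for the file produced by Tidy.
    public const string SampleXhtml =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
        "<head><title>Word frequency list</title></head>" +
        "<body><table><tbody>" +
        "<tr><td>1.</td><td>a</td><td>23363</td><td>2.2996</td></tr>" +
        "<tr><td>2.</td><td>ability</td><td>74</td><td>0.0073</td></tr>" +
        "</tbody></table></body></html>";

    // Parses the markup as XML and returns the number of <tr> elements.
    // LoadXml throws XmlException when the markup is not well-formed.
    public static int CountRows(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);
        return doc.GetElementsByTagName("tr").Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountRows(SampleXhtml));
    }
}
```

Passing raw HTML with an unclosed tag such as `<br>` to the same method should raise an `XmlException`, which is the problem Tidy solves.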

The content of the XHTML file looks much better now:

HTML
<tbody>

<tr>
<td>1.</td>
<td><a href=
"http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?SearchStr=a"
target="Vocabulary">a</a></td>
<td align="center" bgcolor="#FFFFCC">23363</td>
<td align="center" bgcolor="aqua">2.2996</td>
</tr>

<tr>
<td>2.</td>
<td><a href=
"http://www.edict.com.hk/scripts/cgi-bin/lexicon.exe?SearchStr=ability"
target="Vocabulary">ability</a></td>
<td align="center" bgcolor="#FFFFCC">74</td>
<td align="center" bgcolor="aqua">0.0073</td>
</tr>

Furthermore, we don't need every column in the table. From the page I downloaded, I only use the word column and the frequency column.

Now let's use C#'s strong XML functionality to solve this problem.

Firstly, we need a class definition that denotes the XML entry we need.

C#
[Serializable]
public class WordEntry
{
    public string word;
    public double probability;
}

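Remember to declare the class as "public" and leave it a public parameterless constructor, otherwise XmlSerializer throws an exception when it is created. The [Serializable] attribute is harmless here, but XmlSerializer itself does not require it.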
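As a rough check of what the serializer expects from a type, the sketch below tries to construct an `XmlSerializer` for two classes. `PlainEntry`, `HiddenEntry` and `CanSerialize` are hypothetical names used only for this illustration. The expectation is that the public class is accepted even without `[Serializable]`, while the non-public one is rejected.

```csharp
using System;
using System.Xml.Serialization;

// Public, with an implicit public parameterless constructor and no [Serializable].
public class PlainEntry
{
    public string word;
    public double probability;
}

// Hypothetical counter-example: the type is not public.
internal class HiddenEntry
{
    public string word;
}

public static class SerializerRequirementsDemo
{
    // Returns true when an XmlSerializer can be constructed for the type.
    public static bool CanSerialize(Type type)
    {
        try
        {
            new XmlSerializer(type);
            return true;
        }
        catch (InvalidOperationException)
        {
            // Thrown when the type cannot be reflected for XML serialization.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CanSerialize(typeof(PlainEntry)));
        Console.WriteLine(CanSerialize(typeof(HiddenEntry)));
    }
}
```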

Then, we have:

C#
List<WordEntry> listofEntry = new List<WordEntry>();

XmlSerializer can serialize a List<WordEntry> directly, so no wrapper class is needed.
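A minimal sketch of that idea, serializing a small list to a string instead of a file so it is easy to inspect. The `ToXml` helper and the sample values are assumptions for this example; `WordEntry` mirrors the class defined above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class ListSerializationDemo
{
    // Serializes the list into an XML string using XmlSerializer.
    public static string ToXml(List<WordEntry> entries)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(List<WordEntry>));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, entries);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        List<WordEntry> entries = new List<WordEntry>
        {
            new WordEntry { word = "a", probability = 2.2996 },
            new WordEntry { word = "ability", probability = 0.0073 }
        };
        Console.WriteLine(ToXml(entries));
    }
}
```

The serializer names the root element after the collection type, which is where the `ArrayOfWordEntry` element in the final file comes from.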

Then let's read the XHTML file from word2000.xml that we generated via tidy.exe:

C#
private static void ReadFromXMLFile(ref List<WordEntry> listOfEntry)
{
    try
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(@"C:\Code\c#\ConsoleTestTemp\ConsoleTestTemp\word2000.xml");

        foreach (XmlNode node in doc.GetElementsByTagName("tr"))
        {
            WordEntry newEntry = new WordEntry();
            int counter = 0;

The counter tracks the zero-based column index within the current row.

C#
            foreach (XmlNode childNode in node.ChildNodes)
            {
                if (counter == 1)

We need the second and fourth columns, which are indexes 1 and 3.

C#
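                    newEntry.word = childNode.InnerText.ToLower();
                // Parse with the invariant culture so "2.2996" reads the same on every locale.
                // A non-numeric cell (such as a header row) still throws FormatException here.
                if (counter == 3)
                    newEntry.probability = Convert.ToDouble(childNode.InnerText,
                        System.Globalization.CultureInfo.InvariantCulture);
                counter++;
            }
            listOfEntry.Add(newEntry);
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("Error: " + e.Message + "\r\n" + e.StackTrace);
    }
}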

Done.
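The same columns could also be picked out with XPath instead of a counter. This is an alternative sketch, not the approach used in this article. It assumes the document still carries the XHTML default namespace that Tidy writes, and `ReadEntries` and the `x` prefix are names chosen for this example. Rows whose fourth cell is not numeric, such as a header row, are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class XPathReadDemo
{
    public static List<WordEntry> ReadEntries(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XPath needs an explicit prefix for elements in a default namespace.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        List<WordEntry> entries = new List<WordEntry>();
        foreach (XmlNode row in doc.SelectNodes("//x:tr", ns))
        {
            // XPath positions are 1-based: td[2] is the word, td[4] the frequency.
            XmlNode wordCell = row.SelectSingleNode("x:td[2]", ns);
            XmlNode freqCell = row.SelectSingleNode("x:td[4]", ns);
            if (wordCell == null || freqCell == null)
                continue;

            double probability;
            if (!double.TryParse(freqCell.InnerText.Trim(), NumberStyles.Float,
                    CultureInfo.InvariantCulture, out probability))
                continue; // not a data row

            entries.Add(new WordEntry
            {
                word = wordCell.InnerText.Trim().ToLowerInvariant(),
                probability = probability
            });
        }
        return entries;
    }

    public static void Main()
    {
        // Hypothetical sample with one header row and one data row.
        string sample =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table><tbody>" +
            "<tr><td> </td><td>Word</td><td>Instances</td><td>% Frequency</td></tr>" +
            "<tr><td>1.</td><td>Ability</td><td>74</td><td>0.0073</td></tr>" +
            "</tbody></table></body></html>";
        foreach (WordEntry e in ReadEntries(sample))
            Console.WriteLine(e.word + " " + e.probability.ToString(CultureInfo.InvariantCulture));
    }
}
```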

Then let's write to our own XML file:

C#
List<WordEntry> listofEntry = new List<WordEntry>();
ReadFromXMLFile(ref listofEntry);

XmlSerializer s = new XmlSerializer(typeof(List<WordEntry>));
// The using block closes the file even if Serialize throws.
using (TextWriter w = new StreamWriter(
    @"C:\Code\c#\ConsoleTestTemp\ConsoleTestTemp\vocabulary.xml"))
{
    s.Serialize(w, listofEntry);
}

Done. You now have a well-formed XML file that contains only the two columns we need:

XML
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfWordEntry xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <WordEntry>
    <word>a</word>
    <probability>2.2996</probability>
  </WordEntry>

  <WordEntry>
    <word>ability</word>
    <probability>0.0073</probability>
  </WordEntry> 

  <WordEntry>
    <word>able</word>
    <probability>0.0213</probability>
  </WordEntry>

  <WordEntry>
    <word>about</word>
    <probability>0.1787</probability>
  </WordEntry>

  <WordEntry>
    <word>above</word>
    <probability>0.0291</probability>
  </WordEntry>

  <!-- remaining entries omitted -->
</ArrayOfWordEntry>
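To consume a file like this later, the same serializer type can read it back into a list. The sketch below reads from a string for brevity. `FromXml` and the inline sample are assumptions for this example, and `WordEntry` mirrors the class used throughout the article.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class DeserializationDemo
{
    // Reads an ArrayOfWordEntry document back into a list.
    public static List<WordEntry> FromXml(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(List<WordEntry>));
        using (StringReader reader = new StringReader(xml))
        {
            return (List<WordEntry>)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        // Hypothetical two-entry sample in the same shape as the generated file.
        string xml =
            "<ArrayOfWordEntry>" +
            "<WordEntry><word>a</word><probability>2.2996</probability></WordEntry>" +
            "<WordEntry><word>ability</word><probability>0.0073</probability></WordEntry>" +
            "</ArrayOfWordEntry>";
        foreach (WordEntry entry in FromXml(xml))
            Console.WriteLine(entry.word);
    }
}
```

For the real file, a `StreamReader` opened on vocabulary.xml could be passed to `Deserialize` in place of the `StringReader`.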

Conclusion

So, a few dozen lines of code can be a big help, can't they? I hope this is useful to you.

History

  • 23rd August, 2007: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer Horizon Ideas
United States
My name is Jia Chen and I want to tell you about my childhood dream: being a problem-solver. My mom told me it was silly because it wasn't really a profession. Over the last decade, I have been a software engineer, a product manager, a repeat student, a management consultant and an entrepreneur. These roles appear far from my childhood dream, but I still think I am living it, because its essence is to find problems and solve them. Sometimes I may not solve new problems, but I always want to solve old problems in a new way.
