Posted 23 Aug 2007

How To Convert an HTML Table into an XML File

This article demonstrates how to efficiently convert an HTML table into an XML file


We sometimes find useful information on the internet and want to use it as an XML file. Unfortunately, most HTML pages are not well-formed XML.

This article tells you how to efficiently convert an HTML table into an XML file.


XML and HTML are both markup languages, but most HTML pages do not follow XML's rules. So we need something that converts a loosely written HTML page into a standard XHTML page.
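To make the difference concrete, here is a minimal sketch that checks whether a piece of markup parses as XML. The class and method names (WellFormedCheck, IsWellFormedXml) are made up for illustration and are not part of the article's code.

```csharp
using System;
using System.Xml;

public static class WellFormedCheck
{
    // Returns true when the text parses as XML, false when the parser rejects it.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            // LoadXml throws XmlException on unclosed or mismatched tags.
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Typical HTML such as an unclosed <br> or <hr color="#ff0000"> is rejected, while the XHTML forms <br /> and <hr color="#ff0000" /> are accepted.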

HTML Tidy is a useful tool that does the whole job for you. It was developed by Dave Raggett, and it is lightweight and free. You can find more information at SourceForge.

You can download the command line *.exe file or DLL here.


I was building an XML file of the most commonly used English words. I found a page here. Then I downloaded the page.

The content of the HTM file is a mess. Just take a peek at it:

<html><head><title>Word frequency list</title></head><body><br>
<strong>Words listed alphabetically: the first 2000 most frequent
 words from the Brown Corpus (1,015,945 words)</strong> <hr color="#ff0000">
<table><tbody><tr><td> </td><td>Word</td><td>Instances</td><td>% Frequency</td></tr>
<tr><td>1.</td><td><a href="
SearchStr=a" target="Vocabulary">a</a></td><td align="center" bgcolor="#ffffcc">
23363</td><td align="center" bgcolor="aqua">2.2996</td></tr>
<tr><td>2.</td><td><a href="
SearchStr=ability" target="Vocabulary">ability</a></td><td align="center" 

bgcolor="#ffffcc">74</td><td align="center" bgcolor="aqua">0.0073</td></tr>
<tr><td>3.</td><td><a href="
SearchStr=able" target="Vocabulary">able</a></td><td align="center" 

bgcolor="#ffffcc">216</td><td align="center" bgcolor="aqua">0.0213</td></tr>

Then I ran the following under the command line:

tidy -asxhtml -numeric < words2000abc.htm > word2000.xml

This generates a standard XHTML file named word2000.xml. Because XHTML is itself well-formed XML, you can use this file directly. But it would be nice to trim it a little.
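If you would rather launch Tidy from C# than from the command line, something like the following sketch may work. It assumes that tidy is on the PATH and that your Tidy build accepts the -o option for naming the output file, which replaces the shell redirection. TidyRunner and its methods are hypothetical names.

```csharp
using System.Diagnostics;

public static class TidyRunner
{
    // Builds the start info for: tidy -asxhtml -numeric -o <output> <input>
    public static ProcessStartInfo BuildStartInfo(string inputPath, string outputPath)
    {
        ProcessStartInfo info = new ProcessStartInfo();
        info.FileName = "tidy"; // assumption: tidy is on the PATH
        info.Arguments = "-asxhtml -numeric -o \"" + outputPath + "\" \"" + inputPath + "\"";
        info.UseShellExecute = false;
        return info;
    }

    // Runs Tidy and returns its exit code (Tidy reports 0 = clean, 1 = warnings, 2 = errors).
    public static int Run(string inputPath, string outputPath)
    {
        using (Process process = Process.Start(BuildStartInfo(inputPath, outputPath)))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

Calling TidyRunner.Run("words2000abc.htm", "word2000.xml") would then play the role of the command shown above.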

The content of the XHTML file looks much better now:


<td><a href=

<td align="center" bgcolor="#FFFFCC">23363</td>
<td align="center" bgcolor="aqua">2.2996</td>

<td><a href=

<td align="center" bgcolor="#FFFFCC">74</td>
<td align="center" bgcolor="aqua">0.0073</td>

Furthermore, we don't need every column in the table. From the page I downloaded, I only use the word column and the frequency column.

Now let's use C#'s strong XML functionality to solve this problem.

Firstly, we need a class definition that denotes the XML entry we need.

[Serializable]
public class WordEntry
{
    public string word;
    public double probability;
}

Remember to declare the class as "public", otherwise the XmlSerializer will throw an error. Marking it [Serializable] does no harm, although XmlSerializer itself does not require that attribute.

Then, we have:

List<WordEntry> listofEntry = new List<WordEntry>();

A List<WordEntry> can be serialized directly.
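As a quick check of that claim, the sketch below serializes a small list to a string instead of a file. ListSerializationDemo and SerializeToString are illustrative names, and WordEntry is repeated only so the snippet stands on its own.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class ListSerializationDemo
{
    // Serializes the list into an XML string; the root element is ArrayOfWordEntry.
    public static string SerializeToString(List<WordEntry> entries)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(List<WordEntry>));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, entries);
            return writer.ToString();
        }
    }
}
```

Each list item becomes a WordEntry element with word and probability child elements.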

Then let's read the XHTML file from word2000.xml that we generated via tidy.exe:

private static void ReadFromXMLFile(ref List<WordEntry> listOfEntry)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("word2000.xml");
    foreach (XmlNode node in doc.GetElementsByTagName("tr"))
    {
        WordEntry newEntry = new WordEntry();
        int counter = 0;
        try
        {

The counter here denotes the zero-based column number.

            foreach (XmlNode childNode in node.ChildNodes)
            {

We need the second and fourth columns.

                if (counter == 1)
                    newEntry.word = childNode.InnerText.Trim();
                if (counter == 3)
                    newEntry.probability = Convert.ToDouble(childNode.InnerText);
                counter++;
            }
            listOfEntry.Add(newEntry);
        }
        // Rows without a numeric fourth cell (such as the header row) end up here and are skipped.
        catch (Exception e) { Console.WriteLine("Error: " + e.Message + "\r\n" + e.StackTrace); }
    }
}
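The same two columns can also be picked out with XPath instead of a column counter. The following is only a sketch of that alternative, not the article's method. It assumes Tidy's output declares the XHTML namespace http://www.w3.org/1999/xhtml on the root element, and XPathReader and ReadEntries are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class XPathReader
{
    // Reads the second and fourth cell of every table row that has at least four cells.
    public static List<WordEntry> ReadEntries(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XPath needs a prefix for the default XHTML namespace; "x" is arbitrary.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        List<WordEntry> entries = new List<WordEntry>();
        foreach (XmlNode row in doc.SelectNodes("//x:tr[x:td[4]]", ns))
        {
            string word = row.SelectSingleNode("x:td[2]", ns).InnerText.Trim();
            string frequency = row.SelectSingleNode("x:td[4]", ns).InnerText.Trim();

            // Skip rows whose fourth cell is not a number, such as the header row.
            double probability;
            if (double.TryParse(frequency, NumberStyles.Float, CultureInfo.InvariantCulture, out probability))
            {
                WordEntry entry = new WordEntry();
                entry.word = word;
                entry.probability = probability;
                entries.Add(entry);
            }
        }
        return entries;
    }
}
```

Using TryParse with the invariant culture avoids both the exception on the header row and surprises on machines where the decimal separator is a comma.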


Then let's write to our own XML file:

List<WordEntry> listofEntry = new List<WordEntry>();
ReadFromXMLFile(ref listofEntry);

XmlSerializer s = new XmlSerializer(typeof(List<WordEntry>));
// "words.xml" is a placeholder output path; substitute your own.
using (TextWriter w = new StreamWriter("words.xml"))
{
    s.Serialize(w, listofEntry);
}

Done, now you have a standard XML file.

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfWordEntry xmlns:xsi= 
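Reading the generated file back is the mirror image of writing it. The sketch below assumes the file was produced by XmlSerializer from a List<WordEntry> as described above. WordListLoader and Load are illustrative names, and WordEntry is repeated so the snippet is self-contained.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class WordEntry
{
    public string word;
    public double probability;
}

public static class WordListLoader
{
    // Deserializes an ArrayOfWordEntry document back into a list.
    public static List<WordEntry> Load(TextReader reader)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(List<WordEntry>));
        return (List<WordEntry>)serializer.Deserialize(reader);
    }
}
```

To load from disk, pass a StreamReader opened on the file you wrote in the previous step.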







So a few dozen lines of code can be a big help, can't they? I hope this is useful to you.


  • 23rd August, 2007: Initial post


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Web Developer, Horizon Ideas
United States
My name is Jia Chen and I want to tell you about my childhood dream: being a problem-solver. My mom told me it was silly because it wasn't really a profession. Over the last decade, I have been a software engineer, a product manager, a repeat student, a management consultant and an entrepreneur. These appear far from my childhood dream, but I still think I am living it, because the essence of it is to find problems and solve them. Sometimes I may not solve new problems, but I always want to solve old problems in a new way.


Comments and Discussions

General: My vote of 2
Jordan Wilde, 25-Sep-13 13:25

Question: hello [modified]
wordplay, 6-Oct-07 0:24

Answer: Re: hello
Jia.C, 6-Oct-07 6:17

General: Re: hello [modified]
wordplay, 6-Oct-07 22:47

General: Re: hello
Jia.C, 7-Oct-07 21:37

Question: Nice article...but how about XPath?
yangdingning, 29-Aug-07 19:13

Question: MAKEUP ?
NinjaCross, 23-Aug-07 11:54
"XML and HTML are both makeup languages"

... MAKEUP ???? :laugh:
Maybe you meant "markup" ;)

Anyway, interesting approach :)

Answer: Re: MAKEUP ?
Jia.C, 23-Aug-07 11:55

Last Updated 23 Aug 2007
Article Copyright 2007 by Jia.Chen