
Converting an HTML file to an XHTML file

Jinjun Xie, 19 Mar 2007

Introduction

The articles on our web site are mainly written in HTML 4.0. However, many of them don't conform to the W3C standard and contain a lot of bad tags, so I wanted to convert these files to XHTML in order to make them conform.

Sometimes I also want to extract information from web pages. If a page is in XHTML format, this is much easier, because I can load it into an XML document object and parse it with standard XML tools.

Background

There are several tools that can convert HTML to XHTML. Dreamweaver can convert a file through the File --> Convert menu, but it has some drawbacks: it's not free, and it is sometimes unable to fix certain errors. Alternatively, you can use a well-known free tool called "HTML Tidy". However, HTML Tidy can only process some languages directly, which is why Step I below converts the input to UTF-8. This article is based on HTML Tidy.

Since XHTML 2.0 is not compatible with HTML or XHTML 1.0, it is not widely used. For example, the default schema for a .NET web application is XHTML 1.0. In this article, XHTML refers to the XHTML 1.0 Transitional format.

Step I. Convert HTML file to UTF-8 format

In order to process all languages, we first have to convert the file to UTF-8 format. (Note: If the source file is already in UTF-8 format, you can skip this step.)

We can use the FileStream and BinaryReader classes to read the HTML file as a byte array, and then decode it into a String.

Here we assume that the HTML file is encoded with the default encoding of the operating system.

/// <summary>
/// Reads the entire content of a file as a byte array
/// </summary>
/// <param name="strFilePath">source file path</param>
/// <returns>the file content as a byte array on success</returns>
public static byte[] ReadFileAsBytes(String strFilePath)
{
    //FileShare.ReadWrite lets us read a file that another process has open
    using (System.IO.FileStream fs = new System.IO.FileStream(strFilePath, 
        System.IO.FileMode.Open, System.IO.FileAccess.Read, 
        System.IO.FileShare.ReadWrite))
    using (System.IO.BinaryReader br = new System.IO.BinaryReader(fs))
    {
        //ReadBytes keeps reading until the requested count or the end of 
        //the stream is reached; a single Read call may return fewer bytes
        return br.ReadBytes((int)fs.Length);
    }
}
/// <summary>
/// Converts a byte array to a string using the default encoding
/// </summary>
/// <param name="bData">the content of the array</param>
/// <returns>converted string</returns>
public static String BytesToString(byte[] bData)
{
    //Code page 0 means the default encoding of the operating system
    return System.Text.Encoding.GetEncoding(0).GetString(bData);
}
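
The helpers above assume that the file uses the default encoding of the operating system. When the source encoding is known, it can be named explicitly instead. The following is only a minimal sketch: the class name EncodingSketch and the helper ConvertToUtf8 are hypothetical and not part of the original code, and ISO-8859-1 is used purely as an example encoding.

```csharp
using System;
using System.Text;

public static class EncodingSketch
{
    // Hypothetical helper: re-encodes raw bytes from a named source
    // encoding into UTF-8 bytes.
    public static byte[] ConvertToUtf8(byte[] baSource, String strSourceEncoding)
    {
        Encoding encSource = Encoding.GetEncoding(strSourceEncoding);
        return Encoding.Convert(encSource, Encoding.UTF8, baSource);
    }

    public static void Main()
    {
        // "café" in ISO-8859-1: the byte 0xE9 is the letter é.
        byte[] baLatin1 = new byte[] { 0x63, 0x61, 0x66, 0xE9 };
        byte[] baUtf8 = ConvertToUtf8(baLatin1, "iso-8859-1");

        // In UTF-8 the letter é becomes the two bytes C3 A9.
        Console.WriteLine(BitConverter.ToString(baUtf8)); // 63-61-66-C3-A9
    }
}
```

The resulting byte array can be written to the input file for Tidy as it is, without going through a String first.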

Step II. Convert file to XHTML

We use HTML Tidy to convert HTML files to XHTML files. Tidy has lots of parameters. If you want to know the details, you can read the manual.

To convert a UTF-8 HTML file to an XHTML file, you can invoke it like this:

tidy.exe -raw -utf8 -asxhtml -i -f logfilename -o outputfilename inputfilename

Using the System.Diagnostics.Process class, you start a Tidy process, specify the names of the input and output files, and then read the entire output file as the converted XHTML. If the output file does not exist, there may be a severe error in the input file, in which case you may have to check it manually.

/// <summary>
/// This method converts HTML content to XHTML
/// </summary>
/// <param name="strOriginalContent">input HTML content</param>
/// <param name="strOutputPath">working path for tidy.exe and the temporary 
/// files, ending with a directory separator; if this parameter is null, 
/// the temp path of the system is used</param>
/// <returns>converted XHTML content</returns>
public static String HTML2XHTML(String strOriginalContent,String strOutputPath)
{
    String strTempPath = strOutputPath != null ? strOutputPath : 
        System.IO.Path.GetTempPath();

    String strFileName = String.Format("{0}tidy.exe",strTempPath);
    //check whether the tidy executable exists; if not, extract it from 
    //the embedded resources
    if (!System.IO.File.Exists(strFileName))
    {
        ChinaCars.Util.SysUtil.WriteFile(strFileName,
            ChinaCars.Util.App_GlobalResources.Resource.tidy);
    }

    //Create process
    System.Diagnostics.ProcessStartInfo psiInfo = 
        new System.Diagnostics.ProcessStartInfo();
    psiInfo.FileName = strFileName;
    psiInfo.CreateNoWindow = true;
    psiInfo.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
    psiInfo.WorkingDirectory = strTempPath;

    String strMainFileName = System.Guid.NewGuid().ToString("N");
    //Specify the in/out/error file name,which is located in the temporary 
    //path
    String strInFileName = String.Format("{0}{1}.in", 
        strTempPath,strMainFileName);
    String strOutFileName = String.Format("{0}{1}.out", 
        strTempPath,strMainFileName);
    String strErrorFileName = String.Format("{0}{1}.log", 
        strTempPath,strMainFileName);
    System.IO.File.Delete(strInFileName);
    //UTF-8 version: a .NET String is already Unicode, so it can be encoded 
    //directly as UTF-8 without going through the default encoding, which 
    //would lose characters the default code page cannot represent
    byte[] baUTF8Data = Encoding.UTF8.GetBytes(strOriginalContent);
    ChinaCars.Util.SysUtil.WriteFile(strInFileName, baUTF8Data);

    //UTF8 Version
    psiInfo.Arguments = String.Format(
        " -raw -utf8 -asxhtml -i -f {0}.log -o {0}.out {0}.in", 
        strMainFileName);
    System.IO.File.Delete(strOutFileName);
    System.Diagnostics.Process proc = 
        System.Diagnostics.Process.Start(psiInfo);
    proc.WaitForExit();
    System.IO.File.Delete(strInFileName);
    System.IO.File.Delete(strErrorFileName);

    byte[] baResult = ChinaCars.Util.SysUtil.ReadFileAsBytes(strOutFileName);
    //Tidy wrote the output as UTF-8, so decode it as UTF-8
    String strContent = Encoding.UTF8.GetString(baResult);
    //We need a DOCTYPE header for XHTML processing
    strContent = 
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " + 
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">" + 
        strContent;
    System.IO.File.Delete(strOutFileName);
    return strContent;
}
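
The method above does not inspect how Tidy finished before it reads the output file. As far as I know, Tidy documents its exit status as 0 for success, 1 for warnings and 2 for errors, and it normally writes no output file in the error case. The sketch below shows one way to use that; the class TidySketch and both helper names are hypothetical and only illustrate the idea.

```csharp
using System;

public static class TidySketch
{
    // Builds the same argument string that HTML2XHTML passes to tidy.exe
    // for a given base file name (without extension).
    public static String BuildArguments(String strMainFileName)
    {
        return String.Format(
            "-raw -utf8 -asxhtml -i -f {0}.log -o {0}.out {0}.in",
            strMainFileName);
    }

    // Assumption: exit status 0 (success) and 1 (warnings only) mean that
    // an output file was written; 2 means errors and no output file.
    public static bool OutputExpected(int nExitCode)
    {
        return nExitCode == 0 || nExitCode == 1;
    }

    public static void Main()
    {
        Console.WriteLine(BuildArguments("abc"));
        Console.WriteLine(OutputExpected(1));
        Console.WriteLine(OutputExpected(2));
    }
}
```

In HTML2XHTML, a check such as OutputExpected(proc.ExitCode) could be placed right after proc.WaitForExit(), so that the .log file is kept for inspection instead of being deleted when the conversion failed.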

Step III. Developing your own XHTML resolver

Now you can use the System.Xml.XmlDocument class to load the XHTML document, but you may find that loading takes a very long time, and sometimes LoadXml even fails. Why?

The DOCTYPE declaration in the XHTML tells the .NET XML parser to load the corresponding DTD files from the World Wide Web Consortium (W3C) web site, which may take several round trips or more. Fortunately, the .NET Framework allows us to resolve these external resources ourselves. By overriding ResolveUri and GetEntity of XmlResolver, we can serve the files from local resources and reduce the XHTML loading time. The code is shown below:

public class XHTMLResolver:XmlResolver
{
    override public ICredentials Credentials
    {
        set {  }
    }

    public XHTMLResolver()
    {

    }

    public override Uri ResolveUri(Uri baseUri, String relativeUri)
    {
        //Map the well-known XHTML public identifiers to their DTD URIs
        if (String.Compare(relativeUri, 
            "-//W3C//DTD XHTML 1.0 Transitional//EN", true) == 0)
        {
            return new Uri(
                "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd");
        }
        else if (String.Compare(relativeUri, 
            "-//W3C//DTD XHTML 1.0 Strict//EN", true) == 0)
        {
            return new Uri(
                "http://www.w3.org/tr/xhtml1/DTD/xhtml1-strict.dtd");
        }
        else if (String.Compare(relativeUri, 
            "-//W3C//DTD XHTML 1.0 Frameset//EN", true) == 0)
        {
            return new Uri(
                "http://www.w3.org/tr/xhtml1/DTD/xhtml1-frameset.dtd");
        }
        else if (String.Compare(relativeUri, 
            "-//W3C//DTD XHTML 1.1//EN", true) == 0)
        {
            return new Uri("http://www.w3.org/tr/xhtml11/DTD/xhtml11.dtd");
        }

        return base.ResolveUri(baseUri,relativeUri);
    }
    override public object GetEntity(Uri absoluteUri, string role, 
        Type ofObjectToReturn)
    {
        Object entityObj = null;
        String strURI = absoluteUri.AbsoluteUri;
        System.IO.MemoryStream msStream=null;


        switch (strURI.ToLower())
        {
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd":
            msStream = new MemoryStream(Resource.xhtml1_transitional);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml1.dcl":
            msStream = new MemoryStream(Resource.xhtml1);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml-lat1.ent":
            msStream = new MemoryStream(Resource.xhtml_lat1);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml-special.ent":
            msStream = new MemoryStream(Resource.xhtml_special);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml-symbol.ent":
            msStream = new MemoryStream(Resource.xhtml_symbol);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-strict.dtd":
            msStream = new MemoryStream(Resource.xhtml1_strict);
            break;
        case "http://www.w3.org/tr/xhtml1/dtd/xhtml1-frameset.dtd":
            msStream = new MemoryStream(Resource.xhtml1_frameset);
            break;
        case "http://www.w3.org/tr/xhtml11/dtd/xhtml11.dtd":
            msStream = new MemoryStream(Resource.xhtml11);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlstyle-1.mod":
            msStream = new MemoryStream(Resource.xhtml_inlstyle_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-framework-1.mod":
            msStream = new MemoryStream(Resource.xhtml_framework_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-datatypes-1.mod":
            msStream = new MemoryStream(Resource.xhtml_datatypes_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-qname-1.mod":
            msStream = new MemoryStream(Resource.xhtml_qname_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-events-1.mod":
            msStream = new MemoryStream(Resource.xhtml_events_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-attribs-1.mod":
            msStream = new MemoryStream(Resource.xhtml_attribs_1);
            break;
        case "http://www.w3.org/tr/xhtml11/dtd/xhtml11-model-1.mod":
            msStream = new MemoryStream(Resource.xhtml11_model_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-charent-1.mod":
            msStream = new MemoryStream(Resource.xhtml_charent_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-lat1.ent":
            msStream = new MemoryStream(Resource.xhtml_lat11);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-symbol.ent":
            msStream = new MemoryStream(Resource.xhtml_symbol11);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-special.ent":
            msStream = new MemoryStream(Resource.xhtml_special11);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-text-1.mod":
            msStream = new MemoryStream(Resource.xhtml_text_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlstruct-1.mod":
            msStream = new MemoryStream(Resource.xhtml_inlstruct_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlphras-1.mod":
            msStream = new MemoryStream(Resource.xhtml_inlphras_1);
            break;
        case "http://www.w3.org/tr/ruby/xhtml-ruby-1.mod":
            msStream = new MemoryStream(Resource.xhtml_ruby_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkstruct-1.mod":
            msStream = new MemoryStream(Resource.xhtml_blkstruct_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkphras-1.mod":
            msStream = new MemoryStream(Resource.xhtml_blkphras_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-hypertext-1.mod":
            msStream = new MemoryStream(Resource.xhtml_hypertext_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-list-1.mod":
            msStream = new MemoryStream(Resource.xhtml_list_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-edit-1.mod":
            msStream = new MemoryStream(Resource.xhtml_edit_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-bdo-1.mod":
            msStream = new MemoryStream(Resource.xhtml_bdo_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-pres-1.mod":
            msStream = new MemoryStream(Resource.xhtml_pres_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-inlpres-1.mod":
            msStream = new MemoryStream(Resource.xhtml_inlpres_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-blkpres-1.mod":
            msStream = new MemoryStream(Resource.xhtml_blkpres_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-link-1.mod":
            msStream = new MemoryStream(Resource.xhtml_link_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-meta-1.mod":
            msStream = new MemoryStream(Resource.xhtml_meta_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-base-1.mod":
            msStream = new MemoryStream(Resource.xhtml_base_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-script-1.mod":
            msStream = new MemoryStream(Resource.xhtml_script_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-style-1.mod":
            msStream = new MemoryStream(Resource.xhtml_style_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-image-1.mod":
            msStream = new MemoryStream(Resource.xhtml_image_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-csismap-1.mod":
            msStream = new MemoryStream(Resource.xhtml_csismap_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-ssismap-1.mod":
            msStream = new MemoryStream(Resource.xhtml_ssismap_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-param-1.mod":
            msStream = new MemoryStream(Resource.xhtml_param_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-object-1.mod":
            msStream = new MemoryStream(Resource.xhtml_object_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-table-1.mod":
            msStream = new MemoryStream(Resource.xhtml_table_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-form-1.mod":
            msStream = new MemoryStream(Resource.xhtml_form_1);
            break;
        case "http://www.w3.org/tr/xhtml-modularization/dtd/xhtml-struct-1.mod":
            msStream = new MemoryStream(Resource.xhtml_struct_1);
            break;
        }


        if (msStream != null)
        {
            entityObj = msStream;
        }
        else
        {
            XmlUrlResolver xur = new XmlUrlResolver();
            entityObj = xur.GetEntity(absoluteUri, role, ofObjectToReturn);
        }
        return entityObj;
    }
}
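
As an alternative to the hand-written resolver, .NET Framework 4.0 and later include System.Xml.Resolvers.XmlPreloadedResolver, which already embeds the XHTML 1.0 DTDs and entity files. This is a different technique from the XHTMLResolver class above, and it does not cover XHTML 1.1. The sketch below assumes a hypothetical helper named LoadXhtml and a small inline test document.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Resolvers;

public static class PreloadedResolverSketch
{
    // Hypothetical helper: loads XHTML 1.0 content without fetching the
    // DTD files from www.w3.org.
    public static XmlDocument LoadXhtml(String strXhtml)
    {
        XmlPreloadedResolver resolver =
            new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);

        XmlReaderSettings settings = new XmlReaderSettings();
        // DTD processing must be enabled so that entities such as &nbsp;
        // can be expanded.
        settings.DtdProcessing = DtdProcessing.Parse;
        settings.XmlResolver = resolver;

        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.XmlResolver = resolver;
        using (XmlReader reader =
            XmlReader.Create(new StringReader(strXhtml), settings))
        {
            xmlDoc.Load(reader);
        }
        return xmlDoc;
    }

    public static void Main()
    {
        String strXhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<head><title>t</title></head>" +
            "<body><p>a&nbsp;b</p></body></html>";
        XmlDocument xmlDoc = LoadXhtml(strXhtml);
        Console.WriteLine(xmlDoc.DocumentElement.LocalName);
    }
}
```

The custom XHTMLResolver remains useful when XHTML 1.1 documents have to be loaded, or when the code must run on a framework version older than 4.0.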

Using the code

By using the HTML2XHTML method, you can convert an HTML file to an XHTML file.

System.Net.WebClient webClient = new System.Net.WebClient();
String strHTMLContent = webClient.DownloadString("http://www.codeproject.com");
//Passing null as the second argument means the system temp path is used
String strXHTMLContent = ChinaCars.Util.XMLUtil.HTML2XHTML(strHTMLContent, null);

By using the XHTMLResolver, you can resolve the XHTML file as XML very quickly.

System.Xml.XmlDocument xmlDoc = new System.Xml.XmlDocument();
xmlDoc.XmlResolver = new ChinaCars.Util.XHTMLResolver();
xmlDoc.LoadXml(strXHTMLContent);
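
Once the document is loaded, information can be extracted with XPath. The sketch below collects all link targets. It assumes that the converted elements are in the XHTML namespace (http://www.w3.org/1999/xhtml); the class name, the helper ExtractLinks and the prefix "x" are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class LinkExtractorSketch
{
    // Hypothetical helper: returns the href value of every <a> element.
    public static List<String> ExtractLinks(XmlDocument xmlDoc)
    {
        // XPath needs a prefix for the default XHTML namespace; "x" is an
        // arbitrary choice.
        XmlNamespaceManager nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
        nsMgr.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        List<String> links = new List<String>();
        foreach (XmlNode node in xmlDoc.SelectNodes("//x:a/@href", nsMgr))
        {
            links.Add(node.Value);
        }
        return links;
    }

    public static void Main()
    {
        // A small document without a DOCTYPE, so no resolver is needed here.
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<a href=\"http://www.codeproject.com\">CodeProject</a>" +
            "</body></html>");

        foreach (String strLink in ExtractLinks(xmlDoc))
        {
            Console.WriteLine(strLink); // http://www.codeproject.com
        }
    }
}
```

Without the namespace manager, a query such as //a/@href returns nothing for a document whose elements are in the XHTML namespace, which is a common source of confusion.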

History

Mar 12th, 2007: Published the first version

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Jinjun Xie
Web Developer
China
Jinjun Xie is the technical director of ChinaCars Co. Ltd., with expertise in application architecture, performance tuning, and VLDB design.
email:jinjun@jinjun.com.
office phone:8610-64014646 ext. 779

