Licence CPOL
First Posted 24 Jun 2005

Convert HTML to XHTML and Clean Unnecessary Tags and Attributes

By Omar Al Zabir | 24 Jun 2005 | Article
Convert HTML to XHTML while applying tag and attribute filters in order to produce nice and clean HTML for web posting.

Introduction

This is a class library that helps you produce valid XHTML from HTML. It also supports tag and attribute filtering: you specify exactly which tags and attributes are allowed in the output, and everything else is filtered out. You can use this library to clean the bulky HTML that Microsoft Word produces when a document is saved as HTML. You can also use it to clean up HTML before posting to blog sites so that it is not rejected by blog engines such as WordPress, B2evolution, etc.

How it Works

There are two classes: HtmlReader and HtmlWriter.

HtmlReader extends the well-known SgmlReader by Chris Lovett. While reading the HTML, it skips any element whose name has a prefix. As a result, tags such as <o:p>, <o:Document>, <st1:personname> and hundreds of others are filtered out, and the HTML you read contains only core HTML tags.

HtmlWriter extends the regular XmlTextWriter, so it produces XML. XHTML is basically HTML in XML format. Familiar tags that have no closing tag, such as <img>, <br> and <hr>, must be written in empty-element format in XHTML, i.e. <img .. />, <br/> and <hr/>. Because XHTML is a well-formed XML document, you can read it with any XML parser, which also lets you search it with XPath.
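To illustrate the XPath point, here is a minimal sketch that loads a small XHTML fragment with the standard XmlDocument class and collects the src attribute of every <img> element. The class and method names (XhtmlXPathDemo, ImageSources) are hypothetical and not part of the library. The sketch assumes the fragment declares no default namespace and contains no undeclared entities such as &nbsp;, since a plain XML parser rejects those without a DTD.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the article's library.
public static class XhtmlXPathDemo
{
    // Returns the src attribute of every <img> element in a
    // well-formed XHTML fragment that uses no namespaces.
    public static string[] ImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // XPath: all src attributes of img elements, at any depth.
        XmlNodeList nodes = doc.SelectNodes("//img/@src");

        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
            result[i] = nodes[i].Value;
        return result;
    }
}
```

For example, calling XhtmlXPathDemo.ImageSources on "<div><img src=\"a.gif\" alt=\"\" /><br/><img src=\"b.gif\" alt=\"\" /></div>" yields the two file names in document order.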

HtmlReader

HtmlReader is pretty simple. Here's the entire class:

/// <summary>
/// This class skips all nodes which have some
/// kind of prefix. This trick does the job
/// of cleaning up MS Word/Outlook HTML markup.
/// </summary>
public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader( TextReader reader ) : base( )
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }
    public HtmlReader( string content ) : base( )
    {
        base.InputStream = new StringReader( content );
        base.DocType = "HTML";
    }
    public override bool Read()
    {
        bool status = base.Read();
        if( status )
        {
            if( base.NodeType == XmlNodeType.Element )
            {
                // Got a node with prefix. This must be one
                // of those "<o:p>" or something else.
                // Skip this node entirely. We want prefix-less
                // nodes so that the resultant XML
                // requires no namespace.
                if( base.Name.IndexOf(':') > 0 )
                    base.Skip();
            }
        }
        return status;
    }
}

HtmlWriter

This class is a bit trickier. Here are the tricks that have been used:

  • Overrides the WriteString method of XmlWriter and prevents it from encoding content using regular XML encoding. The encoding is done manually for HTML documents.
  • WriteStartElement is overridden so that tags that are not allowed are never written to the output; they are replaced by a harmless replacement tag.
  • WriteAttributes is overridden to prevent unwanted attributes.

Let's take a look at the entire class part-by-part:

Configurability

You can configure HtmlWriter by modifying the following block:

public class HtmlWriter : XmlTextWriter
{
    /// <summary>
    /// If set to true, the output is filtered
    /// using tag and attribute filtering,
    /// space reduction, etc.
    /// </summary>
    public bool FilterOutput = false;
    /// <summary>
    /// If true, consecutive &nbsp; entities are replaced with one instance
    /// </summary>
    public bool ReduceConsecutiveSpace = true;
    /// <summary>
    /// Set the tag names in lower case which are allowed to go to output
    /// </summary>
    public string [] AllowedTags = 
        new string[] { "p", "b", "i", "u", "em", "big", "small", 
        "div", "img", "span", "blockquote", "code", "pre", "br", "hr", 
        "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt"};
    /// <summary>
    /// Any tag that is not allowed is replaced by this tag.
    /// Specify a tag which has the least impact on the output
    /// </summary>
    public string ReplacementTag = "dd";
    /// <summary>
    /// New lines \r\n are replaced with space 
    /// which saves space and makes the
    /// output compact
    /// </summary>
    public bool RemoveNewlines = true;
    /// <summary>
    /// Specify which attributes are allowed. 
    /// Any other attribute will be discarded
    /// </summary>
    public string [] AllowedAttributes = new string[] 
    { 
        "class", "href", "target", "border", "src", 
        "align", "width", "height", "color", "size" 
    };
}

WriteString Method

/// <summary>
/// We override this method because we do not want
/// the output to be XML-encoded for text inside attributes
/// and inside elements. For example, every &nbsp;
/// would be converted to &amp;nbsp; in the output. That is not
/// what we want in HTML, where &nbsp; must stay as it is.
/// </summary>
/// <param name="text"></param>
public override void WriteString(string text)
{
    // Write non-breaking space characters (U+00A0) as the &nbsp; entity
    text = text.Replace( "\u00A0", "&nbsp;" );
    // When you are reading an RSS feed and writing HTML,
    // these lines remove the CDATA markers
    text = text.Replace("<![CDATA[","");
    text = text.Replace("]]>", "");

    // Do some encoding of our own because
    // we are going to use WriteRaw which won't
    // do any of the necessary encoding
    text = text.Replace( "<", "&lt;" );
    text = text.Replace( ">", "&gt;" );
    text = text.Replace( "'", "&apos;" );
    text = text.Replace( "\"", "&quot;" );

    if( this.FilterOutput )
    {
        text = text.Trim();

        // Replace each run of three &nbsp; entities
        // with one in order to save horizontal width
        if( this.ReduceConsecutiveSpace ) 
            text = text.Replace("&nbsp;&nbsp;&nbsp;", "&nbsp;");
        if( this.RemoveNewlines ) 
            text = text.Replace(Environment.NewLine, " ");

        base.WriteRaw( text );
    }
    else
    {
        base.WriteRaw( text );
    }
}
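The manual encoding step can be tried in isolation. The following is a minimal sketch, not the library's code: the names HtmlTextEncoder, Encode and CollapseNbsp are hypothetical, and it assumes the first replacement in WriteString targets the U+00A0 character. CollapseNbsp is a possible refinement that loops until no two &nbsp; entities are adjacent, whereas the method above replaces runs of three in a single pass.

```csharp
using System;

// Hypothetical helper, not part of the article's library.
public static class HtmlTextEncoder
{
    // Encodes text for raw output. The ampersand is deliberately left
    // alone so that entities already present (e.g. &nbsp;) survive.
    public static string Encode(string text)
    {
        text = text.Replace("\u00A0", "&nbsp;"); // assumption: U+00A0 is meant
        text = text.Replace("<", "&lt;");
        text = text.Replace(">", "&gt;");
        text = text.Replace("'", "&apos;");
        text = text.Replace("\"", "&quot;");
        return text;
    }

    // Collapses any run of adjacent &nbsp; entities into a single one.
    public static string CollapseNbsp(string text)
    {
        while (text.IndexOf("&nbsp;&nbsp;", StringComparison.Ordinal) >= 0)
            text = text.Replace("&nbsp;&nbsp;", "&nbsp;");
        return text;
    }
}
```

Because the ampersand is not encoded, a literal & in the input passes through unchanged; that is the trade-off for keeping existing entities intact.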

WriteStartElement: Applying Tag Filtering

public override void WriteStartElement(string prefix, 
    string localName, string ns)
{
    if( this.FilterOutput ) 
    {
        bool canWrite = false;
        string tagLocalName = localName.ToLower();
        foreach( string name in this.AllowedTags )
        {
            if( name == tagLocalName )
            {
                canWrite = true;
                break;
            }
        }
        if( !canWrite ) 
            localName = this.ReplacementTag;
    }
    base.WriteStartElement(prefix, localName, ns);
}
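The tag-filtering idea can be tried on its own with a small writer that does nothing else. This is a sketch under stated assumptions, not the library's HtmlWriter: the names TagFilteringWriter and TagFilterDemo are hypothetical, the input is assumed to be well-formed XML already (so a plain XmlReader stands in for HtmlReader), and a HashSet lookup replaces the array scan.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical minimal writer, not the article's HtmlWriter.
public class TagFilteringWriter : XmlTextWriter
{
    private readonly HashSet<string> allowed;
    private readonly string replacement;

    public TagFilteringWriter(TextWriter output,
        IEnumerable<string> allowedTags, string replacementTag)
        : base(output)
    {
        // Case-insensitive set, so "P" and "p" are treated alike.
        allowed = new HashSet<string>(allowedTags,
            StringComparer.OrdinalIgnoreCase);
        replacement = replacementTag;
    }

    public override void WriteStartElement(string prefix,
        string localName, string ns)
    {
        // Allowed tags are written in lower case; every other
        // tag is swapped for the replacement tag.
        localName = allowed.Contains(localName)
            ? localName.ToLowerInvariant()
            : replacement;
        base.WriteStartElement(prefix, localName, ns);
    }
}

public static class TagFilterDemo
{
    // Copies well-formed markup through the filtering writer.
    public static string Clean(string xml, string[] allowedTags,
        string replacementTag)
    {
        var sw = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (var writer = new TagFilteringWriter(sw, allowedTags,
            replacementTag))
        {
            writer.WriteNode(reader, false); // copies every node
            writer.Flush();
            return sw.ToString();
        }
    }
}
```

With allowed tags "p" and "b" and the replacement tag "dd", a <font> element inside a paragraph comes out as <dd> while its text content is kept.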

WriteAttributes Method: Applying Attribute Filtering

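// Excerpt from the WriteAttributes override: this runs once for each
// attribute while the reader is positioned on that attribute
bool canWrite = false;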
string attributeLocalName = reader.LocalName.ToLower();
foreach( string name in this.AllowedAttributes )
{
    if( name == attributeLocalName )
    {
        canWrite = true;
        break;
    }
}
// If allowed, write the attribute
if( canWrite ) 
    this.WriteStartAttribute(reader.Prefix, 
    attributeLocalName, reader.NamespaceURI);
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if( canWrite ) this.WriteEntityRef(reader.Name);
        continue;
    }
    if( canWrite )this.WriteString(reader.Value);
}
if( canWrite ) this.WriteEndAttribute();
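Since only the body of the attribute loop is shown above, here is one way a complete override could look. It is a simplified sketch, not the library's code: AttributeFilteringWriter and AttributeFilterDemo are hypothetical names, the input is assumed to be well-formed XML, and each attribute is written with WriteAttributeString from the reader's resolved value instead of walking entity references one by one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical minimal writer, not the article's HtmlWriter.
public class AttributeFilteringWriter : XmlTextWriter
{
    private readonly HashSet<string> allowed;

    public AttributeFilteringWriter(TextWriter output,
        IEnumerable<string> allowedAttributes)
        : base(output)
    {
        allowed = new HashSet<string>(allowedAttributes,
            StringComparer.OrdinalIgnoreCase);
    }

    public override void WriteAttributes(XmlReader reader, bool defattr)
    {
        if (reader == null) throw new ArgumentNullException("reader");

        // Only element attributes are filtered; other cases
        // fall back to the default behaviour.
        if (reader.NodeType != XmlNodeType.Element)
        {
            base.WriteAttributes(reader, defattr);
            return;
        }
        if (!reader.MoveToFirstAttribute()) return;
        do
        {
            // Write the attribute only if its name is allowed.
            if (allowed.Contains(reader.LocalName))
                WriteAttributeString(
                    reader.LocalName.ToLowerInvariant(), reader.Value);
        } while (reader.MoveToNextAttribute());
        reader.MoveToElement(); // reposition on the owning element
    }
}

public static class AttributeFilterDemo
{
    // Copies well-formed markup, dropping attributes not in the list.
    public static string Clean(string xml, string[] allowedAttributes)
    {
        var sw = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (var writer = new AttributeFilteringWriter(sw,
            allowedAttributes))
        {
            writer.WriteNode(reader, false);
            writer.Flush();
            return sw.ToString();
        }
    }
}
```

With "class", "href" and "src" allowed, inline style and onclick attributes are dropped while the element and its text are written unchanged.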

Conclusion

The sample application is a utility that you can use right now to clean HTML files. You can use this class in applications like blogging tools where you need to post HTML to some web service.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Author

Omar Al Zabir

Architect
BT, UK (ex British Telecom)
United Kingdom

I am: Chief Architect, SaaS Platform, BT (ex British Telecom). Visual C# MVP '05-'07, ASP.NET/IIS MVP '08-'12
I was: Co-founder & CTO, Pageflakes(www.pageflakes.com)
I like: Performance and Scalability Challenges.
My Book: Building a Web 2.0 portal using ASP.NET 3.5. Also on Amazon
My Blog: http://omaralzabir.com
My Specialization: Web 2.0 Rich AJAX Applications, Level 4 SaaS, Performance and Scalability of Web Apps.
My Email: OmarALZabir at gmail dot com
 
Follow Me: twitter.com/omaralzabir
 

Article Copyright 2005 by Omar Al Zabir