Introduction
This is a class library that helps you produce valid XHTML from HTML. It also provides tag and attribute filtering support: you can specify exactly which tags and attributes are allowed in the output, and the other tags are filtered out. You can use this library to clean up the bulky HTML that Microsoft Word produces when a document is converted to HTML. You can also use it to clean up HTML before posting to blog sites, so that your HTML does not get rejected by blog engines like WordPress, B2evolution, etc.

How it Works
There are two classes: HtmlReader and HtmlWriter.

HtmlReader
using System.IO;
using System.Xml;

/// <summary>
/// This class skips all nodes which have some
/// kind of prefix. This trick does the job
/// of cleaning up MS Word/Outlook HTML markup.
/// </summary>
public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader( TextReader reader ) : base( )
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }

    public HtmlReader( string content ) : base( )
    {
        base.InputStream = new StringReader( content );
        base.DocType = "HTML";
    }

    public override bool Read()
    {
        bool status = base.Read();
        if( status )
        {
            if( base.NodeType == XmlNodeType.Element )
            {
                // Got a node with a prefix. This must be one
                // of those "<o:p>" tags or something similar.
                // Skip this node entirely. We want prefix-less
                // nodes so that the resultant XML
                // requires no namespace.
                if( base.Name.IndexOf(':') > 0 )
                    base.Skip();
            }
        }
        return status;
    }
}

HtmlWriter
This class is a bit trickier. Let's take a look at the class part by part.

Configurability
You can configure HtmlWriter through the following public fields:

public class HtmlWriter : XmlTextWriter
{
    /// <summary>
    /// If set to true, it will filter the output
    /// by using tag and attribute filtering,
    /// space reduction, etc.
    /// </summary>
    public bool FilterOutput = false;

    /// <summary>
    /// If true, it will replace consecutive spaces with a single space
    /// </summary>
    public bool ReduceConsecutiveSpace = true;

    /// <summary>
    /// Set the tag names, in lower case, which are allowed to go to output
    /// </summary>
    public string [] AllowedTags =
        new string[] { "p", "b", "i", "u", "em", "big", "small",
            "div", "img", "span", "blockquote", "code", "pre", "br", "hr",
            "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt"};

    /// <summary>
    /// If a tag is found which is not allowed, it is replaced by this tag.
    /// Specify a tag which has the least impact on output
    /// </summary>
    public string ReplacementTag = "dd";

    /// <summary>
    /// New lines (\r\n) are replaced with a space,
    /// which saves space and makes the
    /// output compact
    /// </summary>
    public bool RemoveNewlines = true;

    /// <summary>
    /// Specify which attributes are allowed.
    /// Any other attribute will be discarded
    /// </summary>
    public string [] AllowedAttributes = new string[]
    {
        "class", "href", "target", "border", "src",
        "align", "width", "height", "color", "size"
    };
}

WriteString Method

/// <summary>
/// We override this method because we do not want
/// the output to be encoded for text inside attributes
/// and inside node elements. For example, every &nbsp;
/// gets converted to &amp;nbsp; in the output. That does not
/// suit HTML. In HTML, we need to have &nbsp; as it is.
/// </summary>
/// <param name="text">The text to write</param>
public override void WriteString(string text)
{
    // Change every non-breaking space (U+00A0) to a normal space
    text = text.Replace( "\u00A0", " " );

    // When you are reading an RSS feed and writing HTML,
    // these lines help remove the CDATA markers
    text = text.Replace("<![CDATA[","");
    text = text.Replace("]]>", "");

    // Do some encoding of our own because
    // we are going to use WriteRaw, which won't
    // do any of the necessary encoding
    text = text.Replace( "<", "&lt;" );
    text = text.Replace( ">", "&gt;" );
    // The numeric form is valid in both HTML and XHTML
    text = text.Replace( "'", "&#39;" );
    text = text.Replace( "\"", "&quot;" );

    if( this.FilterOutput )
    {
        text = text.Trim();

        // We want to replace consecutive spaces with
        // one space in order to save horizontal width.
        // A single Replace pass only halves a run of spaces,
        // so repeat until no double space is left.
        if( this.ReduceConsecutiveSpace )
        {
            while( text.IndexOf("  ") >= 0 )
                text = text.Replace("  ", " ");
        }

        if( this.RemoveNewlines )
            text = text.Replace(Environment.NewLine, " ");

        base.WriteRaw( text );
    }
    else
    {
        base.WriteRaw( text );
    }
}
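
To make the text handling above easier to experiment with, here is a minimal standalone sketch of the same idea. It is not the article's HtmlWriter: the class name TextCleanupSketch and the method Clean are hypothetical, it covers only the "<", ">" and double-quote replacements, and it assumes newlines should be turned into spaces before consecutive spaces are collapsed, so that a line break next to a space does not leave a double space behind.

```csharp
using System;

// Hypothetical helper for illustration only; it is not part of the library.
static class TextCleanupSketch
{
    public static string Clean(string text, bool reduceSpaces, bool removeNewlines)
    {
        // Encode markup characters by hand, as WriteString does before WriteRaw.
        text = text.Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;");
        text = text.Trim();

        // Assumption: newlines are replaced first, then spaces are collapsed.
        if (removeNewlines)
            text = text.Replace("\r\n", " ").Replace("\n", " ");

        if (reduceSpaces)
        {
            // Repeat until no run of two spaces remains.
            while (text.IndexOf("  ", StringComparison.Ordinal) >= 0)
                text = text.Replace("  ", " ");
        }
        return text;
    }

    static void Main()
    {
        // prints: a &lt;b&gt; c
        Console.WriteLine(Clean("  a    <b>\r\n  c  ", true, true));
    }
}
```

With both flags set to false, the method only encodes and trims the text, which roughly corresponds to running HtmlWriter with its space and newline options turned off.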

WriteStartElement: Applying Tag Filtering

public override void WriteStartElement(string prefix,
    string localName, string ns)
{
    if( this.FilterOutput )
    {
        bool canWrite = false;
        string tagLocalName = localName.ToLower();
        foreach( string name in this.AllowedTags )
        {
            if( name == tagLocalName )
            {
                canWrite = true;
                break;
            }
        }
        // A tag that is not allowed is swapped for the
        // configured replacement tag instead of being dropped
        if( !canWrite )
            localName = this.ReplacementTag;
    }
    base.WriteStartElement(prefix, localName, ns);
}
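
The tag filter can be tried in isolation with a cut-down writer. The sketch below is an illustration under stated assumptions rather than the library's own class: TagFilterWriter and TagFilterDemo are hypothetical names, the allowed list is shortened to three tags, and filtering is always on.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical, cut-down stand-in for HtmlWriter: it only filters tags.
class TagFilterWriter : XmlTextWriter
{
    private readonly string[] allowedTags = { "p", "b", "i" };
    private readonly string replacementTag = "dd";

    public TagFilterWriter(TextWriter output) : base(output) { }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        // Any tag outside the allowed list is written as the replacement tag.
        if (Array.IndexOf(allowedTags, localName.ToLowerInvariant()) < 0)
            localName = replacementTag;
        base.WriteStartElement(prefix, localName, ns);
    }
}

static class TagFilterDemo
{
    public static string Run()
    {
        StringWriter output = new StringWriter();
        TagFilterWriter writer = new TagFilterWriter(output);
        writer.WriteStartElement("p");
        writer.WriteStartElement("table");   // not allowed, becomes <dd>
        writer.WriteString("x");
        writer.WriteEndElement();
        writer.WriteEndElement();
        writer.Flush();
        return output.ToString();
    }

    static void Main()
    {
        // prints: <p><dd>x</dd></p>
        Console.WriteLine(Run());
    }
}
```

Because XmlTextWriter remembers the name it was given when the element was opened, the matching end tag is written as the replacement tag too, so the output stays well-formed.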

WriteAttributes Method: Applying Attribute Filtering

bool canWrite = false;
string attributeLocalName = reader.LocalName.ToLower();
foreach( string name in this.AllowedAttributes )
{
    if( name == attributeLocalName )
    {
        canWrite = true;
        break;
    }
}
// If allowed, write the attribute
if( canWrite )
    this.WriteStartAttribute(reader.Prefix,
        attributeLocalName, reader.NamespaceURI);
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if( canWrite ) this.WriteEntityRef(reader.Name);
        continue;
    }
    if( canWrite ) this.WriteString(reader.Value);
}
if( canWrite ) this.WriteEndAttribute();
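
The same whitelist idea can be exercised on its own with the standard XML classes. The sketch below is a simplified illustration, not the library's WriteAttributes: AttributeFilterSketch and Filter are hypothetical names, the input is assumed to be a single well-formed element, and attribute values are copied whole instead of being read piece by piece with ReadAttributeValue.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration only; it is not part of the library.
static class AttributeFilterSketch
{
    static readonly string[] Allowed = { "href", "class" };

    // Copies the root element of a small XML fragment, keeping only allowed attributes.
    public static string Filter(string xml)
    {
        StringWriter output = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlTextWriter writer = new XmlTextWriter(output))
        {
            reader.MoveToContent();                      // position on the root element
            writer.WriteStartElement(reader.LocalName);
            while (reader.MoveToNextAttribute())
            {
                // Attribute names are compared in lower case, as in the article.
                string name = reader.LocalName.ToLowerInvariant();
                if (Array.IndexOf(Allowed, name) >= 0)
                    writer.WriteAttributeString(name, reader.Value);
            }
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    static void Main()
    {
        // onclick is dropped; href and CLASS (lower-cased) are kept.
        Console.WriteLine(Filter("<a href=\"x.html\" onclick=\"evil()\" CLASS=\"c\"/>"));
    }
}
```
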

Conclusion
The sample application is a utility that you can use right now to clean HTML files. You can use this class in applications like blogging tools, where you need to post HTML to some web service.

Last Updated: 24 Jun 2005. Editor: Genevieve Sovereign
Copyright 2005 by Omar Al Zabir. Everything else Copyright © CodeProject, 1999-2008.