EfTidyNet: .NET Wrapper for Tidy library

Mar 5, 2008

GPL3

Free component for parsing HTML, .NET version of EfTidyCom

Introduction

Before going into details, let me explain what EfTidy actually is. EfTidy is a wrapper component for the Tidy library. If you don't know what Tidy is, here is a short description:

"TidyLib is an open source utility for tidying up HTML. Tidy is composed from an HTML parser and an HTML pretty printer. The parser goes to considerable lengths to correct common markup errors. It also provides advice on how to make your pages more accessible to people with disabilities, and can be used to convert HTML content into XML as XHTML. Tidy is W3C open source and available free. It has been successfully compiled on a large number of platforms, and is being integrated into many HTML authoring tools."

- By Mr. Dave Raggett

This is the .NET version of the EfTidyCom component (also present on The Code Project). Before moving further: this library is dedicated to the memory of my mother, the late Mrs. Saroj Gupta, whom I lost recently (29th January, 2008). I just want to say: Mummy, I love you.

I have had many requests for a .NET version of the EfTidyCom library, as COM is losing focus and .NET seems to be the future. This library is written in VC++.NET (mixing managed and unmanaged code). A reference and test cases follow in this article. Thank you, and please pray that my mother is happy wherever she is.

This is also an updated version of EfTidyCom. Some features (Node and Attribute classes) have been removed as I think they are not of much use!

Library Reference

EfTidy contains two classes:

  • TidyNet [under EfTidyNet namespace]
  • TidyNetOpt [under EfTidyNet::EfTidyOpt namespace]

EfTidy also contains four enumerations:

  • ECharEncodingType
  • EOutputType
  • EIndentScheme
  • EDoctypeModes

Now, let's take each class one by one.

1. TidyNet

First, let's look at every method and property of this class, and what each one does:

| Property/Method name | Parameters | Get/Put | Description |
|---|---|---|---|
| TidyFiletoMem | const String^ SFileName, String^ % SResult | n/a | Take input from a file and write the output to memory. |
| TidyFileToFile | const String^ SsourceFileName, const String^ SDestFile | n/a | Take input from a file and write the output to a file. |
| TidyMemToMem | String^ SsourceData, String^ % SResult | n/a | Take input as a buffer and write the output to memory. |
| TidyMemtoFile | String^ SBuffer, String^ SDestFile | n/a | Take input as a buffer and write the output to a file. |
| TotalWarnings | long %pVal | Get | Return the total number of warnings after any of the above four operations. |
| TotalErrors | long %pVal | Get | Return the total number of errors after any of the above four operations. |
| ErrorWarning | void | Get (String^) | Return the buffer that contains the human-readable errors/warnings. |
| Option | void | Get (EfTidyOpt::TidyNetOpt^) | The options object used to set options for the Tidy library. |

2. TidyNetOpt

Here is a list of properties and methods of the TidyNetOpt class:

| Property/Method name | Parameter | Get/Put | Description |
|---|---|---|---|
| LoadConfigFile | String^ | n/a | Load option settings from a configuration file. |
| ResetToDefaultValue | void | n/a | Reset options to default settings. |
| Doctype | String^ | Both | Doctype declaration generated by Tidy. |
| TidyMark | BOOL | Both | Add a meta element indicating a tidied document. |
| HideEndTag | BOOL | Both | Suppress optional end tags. |
| EncloseText | BOOL | Both | If yes, text in the body is wrapped in <p>. |
| EncloseBlockText | BOOL | Both | If yes, text in blocks is wrapped in <p>. |
| LogicalEmphasis | BOOL | Both | Replace i by em and b by strong. |
| DefaultAltText | String^ | Both | Default text for the alt attribute. |
| Clean | BOOL | Both | Replace presentational clutter by style rules. |
| DropFontTags | BOOL | Both | Discard presentation tags. |
| DropEmptyParas | BOOL | Both | Discard empty p elements. |
| Word2000 | BOOL | Both | Draconian cleaning for Word2000. |
| FixBadComment | BOOL | Both | Fix comments with adjacent hyphens. |
| FixBackslash | BOOL | Both | Fix URLs by replacing \ with /. |
| NewEmptyTags | String^ | Both | Declared empty tags. |
| NewInlineTags | String^ | Both | Declared inline tags. |
| NewBlockLevelTags | String^ | Both | Declared block tags. |
| NewPreTags | String^ | Both | Declared pre tags. |
| OutputType | EOutputType | Both | Set the output type: XML, XHTML or pure HTML. |
| InputAsXML | BOOL | Both | Treat input as XML. |
| ADDXmlDecl | BOOL | Both | Add <?xml ?> for XML docs. |
| AddXmlSpace | BOOL | Both | If set to yes, adds the xml:space attribute as needed. |
| Bare | BOOL | Both | Make bare HTML. |
| AssumeXmlProcins | BOOL | Both | If set to yes, PIs must end with ?>. |
| CharEncoding | ECharEncodingType | Both | Set/Get in/out character encoding. |
| InCharEncoding | ECharEncodingType | Both | Input character encoding (if different). |
| OutCharEncoding | ECharEncodingType | Both | Output character encoding (if different). |
| NumericsEntities | BOOL | Both | Use numeric entities for symbols. |
| QuoteMarks | BOOL | Both | Output " marks as &quot;. |
| QuoteNBSP | BOOL | Both | Output non-breaking space as entity. |
| QuoteAmpersand | BOOL | Both | Output naked ampersand as &amp;. |
| OutputTagInUpperCase | BOOL | Both | Output tags in upper not lower case. |
| OutputAttrInUpperCase | BOOL | Both | Output attributes in upper not lower case. |
| WrapScriptlets | BOOL | Both | Wrap within JavaScript string literals. |
| WrapAttVals | BOOL | Both | Wrap within attribute values. |
| WrapSection | BOOL | Both | Wrap within section tags. |
| WrapAsp | BOOL | Both | Wrap within ASP pseudo elements. |
| WrapJste | BOOL | Both | Wrap within JSTE pseudo elements. |
| WrapPhp | BOOL | Both | Wrap within PHP pseudo elements. |
| Indent | EIndentScheme | Both | Indent the content of appropriate tags. |
| IndentSpace | long | Both | Indentation of n spaces. |
| WrapLen | long | Both | Set wrap margin for output. |
| TabSize | long | Both | Expand tabs to n spaces. |
| IndentAttributes | long | Both | New-line + indent before each attribute. |
| BreakBeforeBR | BOOL | Both | Output new-line before <br> or not. |
| LiteralAttribs | BOOL | Both | If true, attributes may use new-lines. |
| MarkUp | BOOL | Both | |
| ShowWarnings | BOOL | Both | Turn warnings on or off. |
| Quiet | BOOL | Both | No 'Parsing X', guessed DTD or summary. |
| KeepTime | BOOL | Both | If yes, last modified time is preserved. |
| ErrorFile | String^ | Both | File name to write errors to. |
| GnuEmacs | BOOL | Both | If true, format error output for GNU Emacs. |
| FixUrl | BOOL | Both | Applies URI encoding if necessary. |
| BodyOnly | BOOL | Both | Output BODY content only. |
| HideComments | BOOL | Both | Hides all (real) comments in output. |
| DoctypeMode | EDoctypeModes | Both | Sets the doctype mode for output. |
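
LoadConfigFile takes the path of a configuration file. As a rough sketch, and assuming the call hands the file to Tidy's own configuration loader (the article does not confirm this), such a file would use HTML Tidy's "name: value" syntax. The option names below are standard HTML Tidy options; the file name and the way each option maps onto a TidyNetOpt property are assumptions:

```
// tidy.cfg (hypothetical file name)
clean: yes
output-xhtml: yes
new-inline-tags: tidy
indent: auto
wrap: 72
tidy-mark: no
```

Under that assumption, these lines would correspond roughly to Clean, OutputType, NewInlineTags, Indent, WrapLen and TidyMark in the table above.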

Using the Code

I have used the Test.htm (included with the project) to test EfTidyNet responses. Here is what Test.htm contains:

<html>
    <head><title>tidy Library</title></head>
    <body>
      <blockquote>
        <p> </p> --(1)

        <p><fontsize="5"color=
      "#FF00FF">TidyLibrary</font></p>
      </blockquote>
      <P><p><fontsize="5"color="#FF00FF"></font></p>

      <table border="1" cellpadding="0" cellspacing="0"
         style="border-collapse: collapse"
         bordercolor="#111111" width="100%" id="AutoNumber1">

       <tr>
         <td width="50%" style="border-left-style: solid;
           border-left-width: 1; border-right-style: none;
           border-right-width: medium; border-top-style: solid;
           border-top-width: 1; border-bottom-style:
           none; border-bottom-width: medium"> --(2)
         </td>
         <td width="50%" style="border-left-style: none;
           border-left-width: medium; border-right-style:solid;
           border-right-width: 1; border-top-style: solid;
           border-top-width: 1;border-bottom-style: none;
           border-bottom-width: medium">

         </td>
       </tr>
      </table>
      <b>Tidy  --- (3)
      </h1> <tidy> ---(4)

    </body>
</html>

Test.htm contains the following deliberate mistakes:

  • An empty paragraph <p> tag at (1)
  • An unclosed <b> tag at (3)
  • A stray </h1> closing tag with no matching <h1> at (4)
  • A dummy <tidy> tag at (4)

Test Case # 1 using TidyNet

First, create an object of our component. Here is a listing of how to achieve that:

TidyNet objTidyNet = new TidyNet(); 

Now, clean the test.htm file using this object. The code listing for that is given below:

private void button1_Click(object sender, EventArgs e)
{
    int iTotalWarn = 0, iTotalErrs = 0;
    String SReturnData = "";
    String SError = "";

    // Tidy the file and receive the cleaned markup in SReturnData.
    TidyNet objTidyNet = new TidyNet();
    objTidyNet.TidyFiletoMem("C:\\MyProjects\\Test\\hello.htm",
        ref SReturnData);

    // Collect the warning count, the human-readable report and the error count.
    objTidyNet.TotalWarnings(ref iTotalWarn);
    SError = objTidyNet.ErrorWarning();
    objTidyNet.TotalErrors(ref iTotalErrs);
}

Here is the result produced by Tidy, showing what test1.htm (created by EfTidyNet) contains:

<html>
<head>
 <meta name="generator"
       content="HTML Tidy for Windows (vers 1st September 2004),
                see www.w3.org">
    <title>tidy Library</title>

</head>
<body>
    <blockquote>
        <p> </p>
        <p><font size="5" color="#FF00FF">Tidy Library</font>

        </p>
    </blockquote>
    <p><font size="5" color= "#FF00FF"> </font></p>

    <table border="1" cellpadding="0" cellspacing="0"
         style= "border-collapse: collapse" bordercolor="#111111"

         width="100%" id= "AutoNumber1">
     <tr>
        <td width="50%" style= "border-left-style: solid;
           border-left-width: 1; border-right-style: none;
           border-right-width: medium; border-top-style: solid;
           border-top-width: 1; border-bottom-style: none;
           border-bottom-width: medium">

        </td>
        <td width="50%"
           style= "border-left-style: none;border-left-width: medium;
           border-right-style: solid; border-right-width: 1;
           border-top-style: solid; border-top-width: 1;
           border-bottom-style: none;border-bottom-width: medium">
        </td>
     </tr>

    </table>
    <b>Tidy</b> --(1)
</body>
</html>

In the cleaned HTML page above, the dummy <tidy> tag and the stray </h1> have been removed, and a closing </b> has been added after "Tidy" at (1).

Here is a summary of the errors/warnings produced by EfTidyNet, showing you the details of each action it has performed:

line 1 column 1   - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1  - Error: <tidy> is not recognized!
line 23 column 1  - Warning: discarding unexpected <tidy>

line 15 column 1  - Warning: <table> proprietary attribute
                    "bordercolor"
line 15 column 1  - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary

5 warnings, 1 error were found!
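
The report above is plain text, so it can be turned into structured records when a program needs to act on individual messages. The following is a minimal sketch that assumes the report keeps the "line N column M - Severity: message" layout shown above; TidyReport, Entry and Parse are hypothetical names used for illustration and are not part of EfTidyNet:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of EfTidyNet.
public static class TidyReport
{
    // Matches lines such as "line 22 column 10 - Warning: discarding unexpected </h1>".
    private static readonly Regex LinePattern = new Regex(
        @"^line\s+(\d+)\s+column\s+(\d+)\s*-\s*(Warning|Error):\s*(.*)$");

    public sealed class Entry
    {
        public int Line;
        public int Column;
        public string Severity;
        public string Message;
    }

    public static List<Entry> Parse(string report)
    {
        var entries = new List<Entry>();
        foreach (string raw in report.Split('\n'))
        {
            Match m = LinePattern.Match(raw.Trim());
            if (!m.Success)
                continue; // skips "Info:" lines, blank lines, wrapped text and the summary

            entries.Add(new Entry
            {
                Line = int.Parse(m.Groups[1].Value),
                Column = int.Parse(m.Groups[2].Value),
                Severity = m.Groups[3].Value,
                Message = m.Groups[4].Value
            });
        }
        return entries;
    }
}
```

The string returned by ErrorWarning() could be passed to TidyReport.Parse. Note that this sketch ignores continuation lines, such as the wrapped "bordercolor" text above, so the message of a wrapped entry is cut at the line break.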

Test Case # 2 using TidyNet with TidyNetOpt

Now, let's apply some options to Test.htm to get customized output. I am using these options:

  • Clean = TRUE (moves inline styles into separate style classes)
  • DoctypeMode = DoctypeUser (enables a user-supplied doctype string)
  • Doctype = "Ef Tidy library" (the doctype string to display)
  • OutputType = XhtmlOut (output type)
  • NewInlineTags = "tidy" (makes our dummy <tidy> tag legal)

Here is the code listing to achieve the above:

private void TestCase2_Click(object sender, EventArgs e)
{
    int iTotalWarn = 0, iTotalErrs = 0;
    String SReturnData = "";
    String SError = "";

    TidyNet objTidyNet = new TidyNet();

    // Set the options before tidying.
    objTidyNet.Option.Clean(true);
    objTidyNet.Option.NewInlineTags("tidy");
    objTidyNet.Option.OutputType(EfTidyNet.EfTidyOpt.EOutputType.XhtmlOut);
    objTidyNet.Option.DoctypeMode(EfTidyNet.EfTidyOpt.EDoctypeModes.DoctypeUser);
    objTidyNet.Option.Doctype("Ef Tidy Library");

    // Tidy the file, then read back the counts and the report.
    objTidyNet.TidyFiletoMem("C:\\MyProjects\\Test\\hello.htm", ref SReturnData);
    objTidyNet.TotalWarnings(ref iTotalWarn);
    SError = objTidyNet.ErrorWarning();
    objTidyNet.TotalErrors(ref iTotalErrs);
}

Here is the result produced by Tidy, showing what test1.htm (created by EfTidyNet) contains after applying our options:

<!DOCTYPE html PUBLIC "Ef Tidy library" ""> --(1)

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="generator"

    content="HTML Tidy for Windows (vers 1st September 2004),
            see www.w3.org" />

  <title>tidy Library</title>
  <style type="text/css">  --(2)

     /*<![CDATA[*/
       table.c4 {border-collapse: collapse}
       td.c3 {border-left-style: none;
          border-left-width: medium; border-right-style: solid;
          border-right-width: 1; border-top-style: solid;
          border-top-width: 1;
          border-bottom-style: none; border-bottom-width: medium}
       td.c2 {border-left-style: solid; border-left-width: 1;
          border-right-style: none;
          border-right-width: medium; border-top-style: solid;
          border-top-width: 1;
          border-bottom-style: none; border-bottom-width: medium}
       h2.c1 {color: #FF00FF}
     /*]]>*/
  </style>

  </head>
  <body>
    <blockquote>
      <p> </p>

      <h2 class="c1">Tidy Library</h2>

    </blockquote>
    <h2 class="c1">
    </h2>
    <table border="1" cellpadding="0" cellspacing="0" class="c4"

           bordercolor="#111111" width="100%" id="AutoNumber1">
        <tr>
            <td width="50%" class="c2"> </td> ----(3)

            <td width="50%" class="c3"> </td>
        </tr>
    </table>
    <b>Tidy <tidy></tidy></b> ----(4)

  </body>
</html>

Now, let us see what Tidy has cleaned for us:

  • At (1), our custom doctype string "Ef Tidy Library" is visible.
  • At (2) and (3), the inline styles have been moved into generated style classes.
  • At (4), our <tidy> tag is now accepted as legal, though it does nothing in the actual HTML page.

Here is a summary of all the errors/warnings:

line 1 column 1  - Warning: missing <!DOCTYPE> declaration
line 22 column 10- Warning: discarding unexpected </h1>
line 23 column 1 - Warning: <tidy> is not approved by W3C
line 23 column 1 - Warning: missing </tidy> before </body>

line 22 column 2 - Warning: missing </b> before </body>

line 15 column 1 - Warning: <table> proprietary attribute
                   "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary

7 warnings, 0 errors were found!

All I have given here is a small overview of the Tidy library and EfTidyNet. For more information on the Tidy library, visit the Tidy home page.

Author Comment

I know there is much scope for improvement in this component. I promise these improvements will be there in the next version/update of the library. If you encounter any bugs, please let me know so that I can improve the code further.

Files Listed with the Project

EfTidy Version 1.0.2.0
  • Source zip contains:
    • TidyLib (original Tidy library) March 2009 release source code
    • EfTidyNet source code with multilingual support
    • Source code updated for Visual Studio 2010 
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0.1.3
  • Source zip contains:
    • TidyLib (original Tidy library) March 2009 release source code
    • EfTidyNet source code with multilanguage support
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0.1.2

  • Source zip contains:
    • TidyLib (original Tidy library) 2008 release source code
    • EfTidyNet source code with multilanguage support
    • Thanks to Wingogo and megger83 for bug reporting!
  • Project zip contains:
    • Release version of EfTidyNet Library

EfTidy Version 1.0.1.1

  • Source zip contains:
    • TidyLib (original Tidy library) 2008 release source code
    • EfTidyNet source code with multilanguage support
    • EfTidyNetx64 version by Spike!
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0

  • Source zip contains:
    • TidyLib (original Tidy library) source code
    • EfTidyNet source code
  • Project zip contains:
    • Release version of EfTidyNet library
    • C# Test project (with source)
    • Test.htm

Special Thanks

  • Mr. Saurabh Gupta [Director Efextra eSolutions Pvt. Ltd.]
  • Mr. Spike! for creating the x64 version of EfTidyNet
  • Tidy SourceForge group for Tidy library

Update History

  • 06 September 2013: EfTidyNet version 1.0.2.0 
  • 20 July, 2009: EfTidyNet version 1.0.1.3
  • 23rd June, 2008: EfTidyNet version 1.0.1.2
  • 5th March, 2008: EfTidyNet version 1.0.1.1
  • 15th February, 2008: EfTidyNet version 1.0