EfTidyNet: .NET Wrapper for Tidy library

ThatsAlok, 6 Sep 2013, GPL3
Free component for parsing HTML, .NET version of EfTidyCom

Introduction

Before I go into details, I want you to know what EfTidy actually is. EfTidy is a wrapper component around the Tidy library. If you don't know what Tidy is, here is a short description:

"TidyLib is an open source utility for tidying up HTML. Tidy is composed from an HTML parser and an HTML pretty printer. The parser goes to considerable lengths to correct common markup errors. It also provides advice on how to make your pages more accessible to people with disabilities, and can be used to convert HTML content into XML as XHTML. Tidy is W3C open source and available free. It has been successfully compiled on a large number of platforms, and is being integrated into many HTML authoring tools."

- By Mr. Dave Raggett

This is the .NET version of the EfTidyCom component (also present on The Code Project). Before moving further: this library is dedicated to the memory of my late mother, Mrs. Saroj Gupta, whom I lost recently (29th January, 2008). I just want to say: Mummy, I love you.

I have had many requests for a .NET version of the EfTidyCom library, as COM is losing focus and .NET seems to be the future. This library is written in VC++.NET (mixing managed and unmanaged code). You will find a reference and test cases in this article. Thanks, and please pray for my mother, that she is happy wherever she is.

This is also an updated version of EfTidyCom. Some features (Node and Attribute classes) have been removed as I think they are not of much use!

Library Reference

EfTidy contains two classes:

  • TidyNet [under EfTidyNet namespace]
  • TidyNetOpt [under EfTidyNet::EfTidyOpt namespace]

EfTidy also contains four enumerations:

  • ECharEncodingType
  • EOutputType
  • EIndentScheme
  • EDoctypeModes

Now, let's take each class one by one.

1. TidyNet

First, let's go through every method and property of this class and what each one does:

Property/Method name  Parameters                                              Get/Put or return type  Description
TidyFiletoMem         const String^ SFileName, String^ % SResult              n/a                     Take input from a file and write output to memory.
TidyFileToFile        const String^ SsourceFileName, const String^ SDestFile  n/a                     Take input from a file and write output to a file.
TidyMemToMem          String^ SsourceData, String^ % SResult                  n/a                     Take input as a buffer and write output to memory.
TidyMemtoFile         String^ SBuffer, String^ SDestFile                      n/a                     Take input as a buffer and write output to a file.
TotalWarnings         long %pVal                                              Get                     Return the total number of warnings after any of the above four operations.
TotalErrors           long %pVal                                              Get                     Return the total number of errors after any of the above four operations.
ErrorWarning          void                                                    String^                 Return the buffer containing the human-readable errors/warnings.
Option                void                                                    EfTidyOpt::TidyNetOpt^  The options object used to configure the Tidy library.
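
The test cases later in this article only exercise TidyFiletoMem. As a rough sketch of the in-memory variant, the snippet below calls TidyMemToMem with the parameter list from the table above. It assumes that the project references the EfTidyNet library, that TidyNet is reachable through `using EfTidyNet;`, and that the C++/CLI `String^ %` and `long %` parameters surface in C# as `ref string` and `ref int` (which is how the test cases call them). The class name MemToMemSketch and the sample markup are made up for illustration.

```csharp
using System;
using EfTidyNet; // assumption: namespace that exposes the TidyNet class

static class MemToMemSketch
{
    static void Main()
    {
        // Hypothetical input: an unclosed <b> tag for Tidy to repair.
        string source =
            "<html><head><title>demo</title></head><body><b>Tidy</body></html>";
        string cleaned = "";
        int warnings = 0, errors = 0;

        TidyNet tidy = new TidyNet();

        // Input and output both stay in memory; no files are involved.
        tidy.TidyMemToMem(source, ref cleaned);

        // The counters and the readable report refer to the call above.
        tidy.TotalWarnings(ref warnings);
        tidy.TotalErrors(ref errors);
        string report = tidy.ErrorWarning();

        Console.WriteLine(cleaned);
        Console.WriteLine("{0} warnings, {1} errors", warnings, errors);
        Console.WriteLine(report);
    }
}
```

This variant is convenient when the markup comes from a database field or a web response rather than from disk.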

2. TidyNetOpt

Here is a list of properties and methods of the TidyNetOpt class:

Property/Method name   Parameter          Get/Put  Description
LoadConfigFile         String^            n/a      Load option settings from a configuration file.
ResetToDefaultValue    void               n/a      Reset options to default settings.
Doctype                String^            Both     Doctype declaration generated by Tidy.
TidyMark               BOOL               Both     Add meta element indicating tidied doc.
HideEndTag             BOOL               Both     Suppress optional end tags.
EncloseText            BOOL               Both     If yes, text in the body is wrapped in <p>.
EncloseBlockText       BOOL               Both     If yes, text in blocks is wrapped in <p>.
LogicalEmphasis        BOOL               Both     Replace <i> by <em> and <b> by <strong>.
DefaultAltText         String^            Both     Default text for alt attribute.
Clean                  BOOL               Both     Replace presentational clutter by style rules.
DropFontTags           BOOL               Both     Discard presentation tags.
DropEmptyParas         BOOL               Both     Discard empty p elements.
Word2000               BOOL               Both     Draconian cleaning for Word 2000.
FixBadComment          BOOL               Both     Fix comments with adjacent hyphens.
FixBackslash           BOOL               Both     Fix URLs by replacing \ with /.
NewEmptyTags           String^            Both     Declared empty tags.
NewInlineTags          String^            Both     Declared inline tags.
NewBlockLevelTags      String^            Both     Declared block tags.
NewPreTags             String^            Both     Declared pre tags.
OutputType             EOutputType        Both     Output type: XML, XHTML or plain HTML.
InputAsXML             BOOL               Both     Treat input as XML.
ADDXmlDecl             BOOL               Both     Add <?xml ?> for XML docs.
AddXmlSpace            BOOL               Both     If set to yes, adds xml:space attr as needed.
Bare                   BOOL               Both     Make bare HTML.
AssumeXmlProcins       BOOL               Both     If set to yes, PIs must end with ?>.
CharEncoding           ECharEncodingType  Both     Set/Get in/out character encoding.
InCharEncoding         ECharEncodingType  Both     Input character encoding (if different).
OutCharEncoding        ECharEncodingType  Both     Output character encoding (if different).
NumericsEntities       BOOL               Both     Use numeric entities for symbols.
QuoteMarks             BOOL               Both     Output " marks as &quot;.
QuoteNBSP              BOOL               Both     Output non-breaking space as entity.
QuoteAmpersand         BOOL               Both     Output naked ampersand as &amp;.
OutputTagInUpperCase   BOOL               Both     Output tags in upper not lower case.
OutputAttrInUpperCase  BOOL               Both     Output attributes in upper not lower case.
WrapScriptlets         BOOL               Both     Wrap within JavaScript string literals.
WrapAttVals            BOOL               Both     Wrap within attribute values.
WrapSection            BOOL               Both     Wrap within section tags.
WrapAsp                BOOL               Both     Wrap within ASP pseudo elements.
WrapJste               BOOL               Both     Wrap within JSTE pseudo elements.
WrapPhp                BOOL               Both     Wrap within PHP pseudo elements.
Indent                 EIndentScheme      Both     Indent the content of appropriate tags.
IndentSpace            long               Both     Indentation of n spaces.
WrapLen                long               Both     Set wrap margin for output.
TabSize                long               Both     Expand tabs to n spaces.
IndentAttributes       long               Both     New-line + indent before each attribute.
BreakBeforeBR          BOOL               Both     Output new-line before <br> or not.
LiteralAttribs         BOOL               Both     If true, attributes may use new-lines.
MarkUp                 BOOL               Both
ShowWarnings           BOOL               Both     On/Off
Quiet                  BOOL               Both     No 'Parsing X', guessed DTD or summary.
KeepTime               BOOL               Both     If yes, last modified time is preserved.
ErrorFile              String^            Both     File name to write errors to.
GnuEmacs               BOOL               Both     If true, format error output for GNU Emacs.
FixUrl                 BOOL               Both     Applies URI encoding if necessary.
BodyOnly               BOOL               Both     Output BODY content only.
HideComments           BOOL               Both     Hides all (real) comments in output.
DoctypeMode            EDoctypeModes      Both     Sets the doctype mode for output.
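
LoadConfigFile takes the path of a configuration file. Assuming it accepts the standard HTML Tidy configuration syntax (one `name: value` pair per line), which this article does not confirm, a file mirroring a few of the properties above could be generated as sketched below. The file name eftidy.cfg and the names TidyConfigSketch and BuildConfig are hypothetical; the option names are those of the underlying Tidy library.

```csharp
using System;
using System.IO;

static class TidyConfigSketch
{
    // Builds the text of a Tidy-style configuration file.
    public static string BuildConfig()
    {
        string[] lines =
        {
            "clean: yes",             // counterpart of Clean
            "output-xhtml: yes",      // counterpart of OutputType = XhtmlOut
            "new-inline-tags: tidy",  // counterpart of NewInlineTags
            "indent: auto",           // counterpart of Indent
            "wrap: 72"                // counterpart of WrapLen
        };
        return string.Join(Environment.NewLine, lines) + Environment.NewLine;
    }

    static void Main()
    {
        // Hypothetical file name, written to the temp directory.
        string path = Path.Combine(Path.GetTempPath(), "eftidy.cfg");
        File.WriteAllText(path, BuildConfig());
        Console.WriteLine("Wrote " + path);

        // With EfTidyNet referenced, the file would then be loaded roughly as:
        //   objTidyNet.Option.LoadConfigFile(path);
    }
}
```

Keeping the settings in a file lets the same options be shared between runs without recompiling the calling code.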

Using the Code

I used Test.htm (included with the project) to test EfTidyNet's responses. Here is what Test.htm contains:

<html>
    <head><title>tidy Library</title></head>
    <body>
      <blockquote>
        <p> </p> --(1)

        <p><font size="5" color=
      "#FF00FF">Tidy Library</font></p>
      </blockquote>
      <P><p><font size="5" color="#FF00FF"></font></p>

      <table border="1" cellpadding="0" cellspacing="0"
         style="border-collapse: collapse"
         bordercolor="#111111" width="100%" id="AutoNumber1">

       <tr>
         <td width="50%" style="border-left-style: solid;
           border-left-width: 1; border-right-style: none;
           border-right-width: medium; border-top-style: solid;
           border-top-width: 1; border-bottom-style:
           none; border-bottom-width: medium"> --(2)
         </td>
         <td width="50%" style="border-left-style: none;
           border-left-width: medium; border-right-style:solid;
           border-right-width: 1; border-top-style: solid;
           border-top-width: 1;border-bottom-style: none;
           border-bottom-width: medium">

         </td>
       </tr>
      </table>
      <b>Tidy  --- (3)
      </h1> <tidy> ---(4)

    </body>
</html>

In Test.htm, I have added the following mistakes:

  • An empty paragraph <p> tag at (1)
  • An unclosed <b> tag at (3)
  • A stray </h1> closing tag with no opening <h1> at (4)
  • A dummy <tidy> tag at (4)

Test Case # 1 using TidyNet

First, create an instance of the component:

TidyNet objTidyNet = new TidyNet(); 

Now, clean the test file using this object. The code listing for that is given below:

private void button1_Click(object sender, EventArgs e)
{
    int iTotalWarn = 0, iTotalErrs = 0;
    String SReturnData = "";
    String SError = "";

    TidyNet objTidyNet = new TidyNet();

    // Tidy the input file; the cleaned markup is returned in SReturnData.
    objTidyNet.TidyFiletoMem("C:\\MyProjects\\Test\\hello.htm",
        ref SReturnData);

    // Collect the warning/error counts and the human-readable report.
    objTidyNet.TotalWarnings(ref iTotalWarn);
    SError = objTidyNet.ErrorWarning();
    objTidyNet.TotalErrors(ref iTotalErrs);
}

And here is the cleaned output produced by Tidy (saved here as test1.htm):

<html>
<head>
 <meta name="generator"
       content="HTML Tidy for Windows (vers 1st September 2004),
                see www.w3.org">
    <title>tidy Library</title>

</head>
<body>
    <blockquote>
        <p> </p>
        <p><font size="5" color="#FF00FF">Tidy Library</font>

        </p>
    </blockquote>
    <p><font size="5" color= "#FF00FF"> </font></p>

    <table border="1" cellpadding="0" cellspacing="0"
         style= "border-collapse: collapse" bordercolor="#111111"

         width="100%" id= "AutoNumber1">
     <tr>
        <td width="50%" style= "border-left-style: solid;
           border-left-width: 1; border-right-style: none;
           border-right-width: medium; border-top-style: solid;
           border-top-width: 1; border-bottom-style: none;
           border-bottom-width: medium">

        </td>
        <td width="50%"
           style= "border-left-style: none;border-left-width: medium;
           border-right-style: solid; border-right-width: 1;
           border-top-style: solid; border-top-width: 1;
           border-bottom-style: none;border-bottom-width: medium">
        </td>
     </tr>

    </table>
    <b>Tidy</b> --(1)
</body>
</html>

In the cleaned HTML above, the dummy <tidy> tag and the stray </h1> have been removed near (1), and a closing </b> has been added after "Tidy" at (1).

Here is a summary of the errors/warnings produced by EfTidyNet, showing you the details of each action it has performed:

line 1 column 1   - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1  - Error: <tidy> is not recognized!
line 23 column 1  - Warning: discarding unexpected <tidy>

line 15 column 1  - Warning: <table> proprietary attribute
                    "bordercolor"
line 15 column 1  - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary

5 warnings, 1 error were found!

Test Case # 2 using TidyNet with TidyNetOpt

Now, let's apply some options to Test.htm to get custom output. I am using these options:

  • Clean = TRUE (to move inline styles into separate style classes)
  • DoctypeMode = DoctypeUser (to enable a user-defined doctype string)
  • Doctype = "Ef Tidy Library" (the doctype string)
  • OutputType = XhtmlOut (output type)
  • NewInlineTags = "tidy" (to make our dummy <tidy> tag legal)

Here is the code listing to achieve the above:

private void TestCase2_Click(object sender, EventArgs e)
{
    int iTotalWarn = 0, iTotalErrs = 0;
    String SReturnData = "";
    String SError = "";

    TidyNet objTidyNet = new TidyNet();

    // Configure Tidy before running it.
    objTidyNet.Option.Clean(true);
    objTidyNet.Option.NewInlineTags("tidy");
    objTidyNet.Option.OutputType(EfTidyNet.EfTidyOpt.EOutputType.XhtmlOut);
    objTidyNet.Option.DoctypeMode(EfTidyNet.EfTidyOpt.EDoctypeModes.DoctypeUser);
    objTidyNet.Option.Doctype("Ef Tidy Library");

    // Tidy the input file; the cleaned markup is returned in SReturnData.
    objTidyNet.TidyFiletoMem("C:\\MyProjects\\Test\\hello.htm", ref SReturnData);

    // Collect the warning/error counts and the human-readable report.
    objTidyNet.TotalWarnings(ref iTotalWarn);
    SError = objTidyNet.ErrorWarning();
    objTidyNet.TotalErrors(ref iTotalErrs);
}

And here is the cleaned output produced by Tidy after applying our options (saved here as test1.htm):

<!DOCTYPE html PUBLIC "Ef Tidy library" ""> --(1)

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="generator"

    content="HTML Tidy for Windows (vers 1st September 2004),
            see www.w3.org" />

  <title>tidy Library</title>
  <style type="text/css">  --(2)

     /*<![CDATA[*/
       table.c4 {border-collapse: collapse}
       td.c3 {border-left-style: none;
          border-left-width: medium; border-right-style: solid;
          border-right-width: 1; border-top-style: solid;
          border-top-width: 1;
          border-bottom-style: none; border-bottom-width: medium}
       td.c2 {border-left-style: solid; border-left-width: 1;
          border-right-style: none;
          border-right-width: medium; border-top-style: solid;
          border-top-width: 1;
          border-bottom-style: none; border-bottom-width: medium}
       h2.c1 {color: #FF00FF}
     /*]]>*/
  </style>

  </head>
  <body>
    <blockquote>
      <p> </p>

      <h2 class="c1">Tidy Library</h2>

    </blockquote>
    <h2 class="c1">
    </h2>
    <table border="1" cellpadding="0" cellspacing="0" class="c4"

           bordercolor="#111111" width="100%" id="AutoNumber1">
        <tr>
            <td width="50%" class="c2"> </td> ----(3)

            <td width="50%" class="c3"> </td>
        </tr>
    </table>
    <b>Tidy <tidy></tidy></b> ----(4)

  </body>
</html>

Now, let us see what Tidy has cleaned for us:

  • In (1), our custom string "Ef Tidy Library" is visible.
  • In (2) and (3), the inline styles have been moved into style classes.
  • In (4), our <tidy> tag is now accepted as legal, though it does nothing in the actual HTML page.

Here is a summary of all the errors/warnings:

line 1 column 1  - Warning: missing <!DOCTYPE> declaration
line 22 column 10- Warning: discarding unexpected </h1>
line 23 column 1 - Warning: <tidy> is not approved by W3C
line 23 column 1 - Warning: missing </tidy> before </body>

line 22 column 2 - Warning: missing </b> before </body>

line 15 column 1 - Warning: <table> proprietary attribute
                   "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary

7 warnings, 0 errors were found!
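
TotalWarnings and TotalErrors already return these totals. When only the text returned by ErrorWarning is at hand (for example in a saved log), the entries can also be picked apart with a regular expression. This is a minimal sketch, assuming every entry keeps the "line N column M - Severity: message" shape seen in the two reports above; TidyReportSketch and Count are illustrative names, not part of EfTidyNet.

```csharp
using System;
using System.Text.RegularExpressions;

static class TidyReportSketch
{
    // Matches entries such as "line 22 column 10 - Warning: discarding unexpected </h1>".
    static readonly Regex Entry = new Regex(
        @"^line\s+(\d+)\s+column\s+(\d+)\s*-\s*(Warning|Error):\s*(.*)$",
        RegexOptions.Multiline);

    // Returns the number of entries with the given severity ("Warning" or "Error").
    public static int Count(string report, string severity)
    {
        int count = 0;
        foreach (Match m in Entry.Matches(report))
        {
            if (m.Groups[3].Value == severity) count++;
        }
        return count;
    }

    static void Main()
    {
        // Excerpt of the report from test case 1.
        string report =
            "line 1 column 1   - Warning: missing <!DOCTYPE> declaration\n" +
            "line 22 column 10 - Warning: discarding unexpected </h1>\n" +
            "line 23 column 1  - Error: <tidy> is not recognized!\n" +
            "line 23 column 1  - Warning: discarding unexpected <tidy>\n";

        Console.WriteLine("{0} warnings, {1} errors",
            Count(report, "Warning"), Count(report, "Error"));
    }
}
```

Continuation lines (such as the wrapped "bordercolor" line) and the "Info:" line do not start with "line N column M", so they are skipped rather than counted.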

All I have given here is a small overview of the Tidy library and EfTidyNet. For more information on the Tidy library, visit the Tidy home page.

Author Comment

I know there is much scope for improvement in this component, and I promise these improvements will come in the next version/update of the library. If you encounter any bugs, please let me know so that I can improve the code further.

Files Listed with the Project

EfTidy Version 1.0.2.0
  • Source zip contains:
    • TidyLib (original Tidy library) March 2009 release source code
    • EfTidyNet source code with multilingual support
    • Source code updated for Visual Studio 2010 
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0.1.3
  • Source zip contains:
    • TidyLib (original Tidy library) March 2009 release source code
    • EfTidyNet source code with multilanguage support
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0.1.2

  • Source zip contains:
    • TidyLib (original Tidy library) 2008 release source code
    • EfTidyNet source code with multilanguage support
    • Thanks to Wingogo and megger83 for bug reporting!
  • Project zip contains:
    • Release version of EfTidyNet Library

EfTidy Version 1.0.1.1

  • Source zip contains:
    • TidyLib (original Tidy library) 2008 release source code
    • EfTidyNet source code with multilanguage support
    • EfTidyNetx64 version by Spike!
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0

  • Source zip contains:
    • TidyLib (original Tidy library) source code
    • EfTidyNet source code
  • Project zip contains:
    • Release version of EfTidyNet library
    • C# Test project (with source)
    • Test.htm

Special Thanks

  • Mr. Saurabh Gupta [Director Efextra eSolutions Pvt. Ltd.]
  • Mr. Spike! for creating the x64 version of EfTidyNet
  • Tidy SourceForge group for Tidy library

Update History

  • 06 September 2013: EfTidyNet version 1.0.2.0 
  • 20 July, 2009: EfTidyNet version 1.0.1.3
  • 23rd June, 2008: EfTidyNet version 1.0.1.2
  • 5th March, 2008: EfTidyNet version 1.0.1.1
  • 15th February, 2008: EfTidyNet version 1.0

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


About the Author

ThatsAlok
Software Developer (Senior)
India
He used to have a biography here :), but now he will hire someone (for free, of course :-D) to write his biography on his behalf :)

He is a great fan of Mr. Johan Rosengren (his idol), Lim Bio Liong, Nishant S and DavidCrow, and believes that he will EXCEL in life by following in their steps!!!

For a good 8 years he was a Visual CPP MS MVP!


Article Copyright 2008 by ThatsAlok