
Pdfizer, a dumb HTML to PDF converter, in C#

By Jonathan de Halleux, 17 Jan 2004

Introduction

This article presents a basic HTML to PDF converter: with this library, you can transform simple HTML pages into clean, printable PDF files.

The HTML cleaning is done with NTidy (see [1]), a .NET wrapper for the HTML Tidy library (see [2]). The PDF generation is done with iTextSharp, a PDF generation library (see [3]).

Transformation Pipe

Transforming HTML documents to PDF is a fairly complex task. Fortunately, there are powerful tools available on the web that helped me accomplish it.

Parsing HTML

The first problem to handle was that HTML is usually "dirty": the markup is often not well-formed XML, so trying to parse HTML pages with XmlDocument will usually fail.
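To make the failure concrete, here is a minimal sketch that uses only the standard System.Xml classes. The class name DirtyHtmlDemo, the helper IsWellFormedXml and both sample strings are made up for this illustration and are not part of Pdfizer.

```csharp
using System;
using System.Xml;

public static class DirtyHtmlDemo
{
    // Hypothetical helper: returns true when the markup loads as well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Unclosed tags, unquoted attribute values, etc. end up here.
            return false;
        }
    }

    public static void Main()
    {
        // Typical "dirty" HTML: unquoted attribute value, unclosed <p> and <br>.
        string dirty = "<html><body><p align=center>Hello<br>world</body></html>";
        // The same content written as well-formed XHTML.
        string clean = "<html><body><p align=\"center\">Hello<br/>world</p></body></html>";

        Console.WriteLine("dirty parses: " + IsWellFormedXml(dirty));
        Console.WriteLine("clean parses: " + IsWellFormedXml(clean));
    }
}
```

The first string is the kind of markup browsers accept without complaint, which is why a cleaning step is needed before any XML-style parsing.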

To overcome this problem, I had to write a .NET wrapper around HTML Tidy (see [2]). HTML Tidy is a very useful application that takes "dirty" HTML and returns it cleaned as much as possible. The .NET wrapper exposes a DOM-like class structure so that you can use it much like XmlDocument.

Hence, with NTidy, we can safely parse HTML documents.

Creating PDF

The PDF creation is done by iTextSharp (see [3]), a .NET library hosted on SourceForge that gives you the tools to create PDF files easily. Hence, the PDF creation problem is solved.

Reading, Traversing

With NTidy and iTextSharp in my toolset, I could start to create the generator. The generator works like this: it first reads the input with NTidy, then traverses the DOM tree and generates the PDF fragments with iTextSharp.
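The read-then-traverse idea can be sketched as follows. This is not the Pdfizer source: XmlDocument stands in for the NTidy DOM (so the input must already be well formed), and a list of strings stands in for the iTextSharp fragments. The names TraversalSketch and Collect, and the choice of handling only h1 and p, are assumptions made for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TraversalSketch
{
    // Depth-first walk: emit one "fragment" per heading or paragraph,
    // and descend into every other node looking for more.
    public static void Collect(XmlNode node, List<string> fragments)
    {
        if (node.NodeType == XmlNodeType.Element)
        {
            switch (node.Name.ToLowerInvariant())
            {
                case "h1":
                    fragments.Add("[title] " + node.InnerText.Trim());
                    return; // the heading text is already captured
                case "p":
                    fragments.Add("[paragraph] " + node.InnerText.Trim());
                    return; // inline children are flattened into the text
            }
        }

        foreach (XmlNode child in node.ChildNodes)
        {
            Collect(child, fragments);
        }
    }

    public static void Main()
    {
        // Stand-in for the cleaned document that NTidy would provide.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>");

        List<string> fragments = new List<string>();
        Collect(doc.DocumentElement, fragments);

        foreach (string fragment in fragments)
        {
            Console.WriteLine(fragment);
        }
    }
}
```

In a real generator, each case would build the corresponding PDF element instead of a string, and more tags would be handled.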

Quick Example

The library is used through the HtmlToPdfConverter class. Creating a PDF file takes the following steps, as illustrated in the example:

  1. Create a converter.
  2. Open a new PDF file using the Open method.
  3. Add a chapter.
  4. Feed HTML to the converter.
  5. If you want another chapter, go back to step 3.
  6. When finished, close the PDF file by calling Close.

// create converter
HtmlToPdfConverter html2pdf = new HtmlToPdfConverter();

// open new pdf file
html2pdf.Open(@"test");
// start a chapter
html2pdf.AddChapter(@"Dummy Chapter");
string html = ...;
// convert string
html2pdf.Run(html);
// add a new chapter
html2pdf.AddChapter(@"Boost page");
// read web page
html2pdf.Run(new Uri(@"http://www.boost.org/libs/libraries.htm"));
// close and finish pdf file.
html2pdf.Close();

What to expect and not expect

Don't expect too much from this tool: it gives fairly good results with simple HTML pages, but it will not work with complex ones. In particular, tables are not yet supported.

References

  1. NTidy, a .NET wrapper around Tidy.
  2. HTML Tidy home page.
  3. iTextSharp, PDF generation tool.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt, please contact the author.

About the Author

Jonathan de Halleux
Engineer, United States

Jonathan de Halleux is a Civil Engineer in Applied Mathematics. He finished his PhD in 2004 in the rainy country of Belgium. After 2 years in the Common Language Runtime (i.e. .NET), he is now working at Microsoft Research on Pex (http://research.microsoft.com/pex).

Last Updated 18 Jan 2004
Article Copyright 2004 by Jonathan de Halleux