Pdfizer, a dumb HTML to PDF converter, in C#

By Jonathan de Halleux

This library converts simple HTML documents to PDF.
C#, Windows, .NET 1.0, .NET 1.1, .NET, ASP.NET, Visual Studio, VS.NET2003, Dev

Posted: 17 Jan 2004
Updated: 17 Jan 2004

Introduction

This article presents a basic HTML to PDF converter. With this library, you can transform simple HTML pages into clean, printable PDF files.

The HTML cleaning is done with NTidy (see [1]), a .NET wrapper for the HTML Tidy library (see [2]). The PDF generation is done with iTextSharp, a PDF generation library (see [3]).

Transformation Pipe

Transforming HTML documents to PDF is a fairly complex task. Fortunately, there are powerful tools available on the web that helped me accomplish it.

Parsing HTML

The first problem to handle is that HTML is usually "dirty": the markup is often not well-formed XML, so trying to parse an HTML page with XmlDocument will typically fail.
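
To make the failure mode concrete, here is a minimal sketch using only System.Xml. The class and method names (DirtyHtmlDemo, TryLoadAsXml) and the sample markup are made up for illustration and are not part of Pdfizer or NTidy.

```csharp
using System;
using System.Xml;

public static class DirtyHtmlDemo
{
    // Returns true if XmlDocument accepts the markup as well-formed XML;
    // otherwise returns false and reports the parser's error message.
    public static bool TryLoadAsXml(string markup, out string error)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(markup);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        // Browser-tolerated HTML: <p> and <br> are never closed.
        string dirty = "<html><body><p>Hello<br>world</body></html>";
        // The same content written as well-formed XHTML.
        string clean = "<html><body><p>Hello<br/>world</p></body></html>";

        string error;
        Console.WriteLine(TryLoadAsXml(dirty, out error)); // False
        Console.WriteLine(TryLoadAsXml(clean, out error)); // True
    }
}
```

A browser renders both strings the same way, but only the second one survives an XML parser, which is why a cleaning step is needed before any DOM traversal.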

To overcome this problem, I had to write a .NET wrapper around HTML Tidy (see [2]). HTML Tidy is a very useful application that takes "dirty" HTML and returns it cleaned as much as possible. The .NET wrapper exposes a DOM-like class structure so that you can use it much like XmlDocument.

Hence, with NTidy, we can safely parse HTML documents.

Creating PDF

The PDF creation is done by iTextSharp (see [3]), a .NET library hosted on SourceForge that gives you the tools to create PDF files easily. Hence, the PDF creation problem is solved.

Reading, Traversing

With NTidy and iTextSharp in my toolset, I could start writing the generator. The generator works like this: it first reads the input with NTidy, then traverses the DOM tree and generates the PDF fragments with iTextSharp.
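
The read-then-traverse idea can be sketched as follows. This is an illustration only, under stated assumptions: it uses XmlDocument as a stand-in for the NTidy DOM and appends plain text to a StringBuilder where the real generator would create iTextSharp objects. The names TraversalSketch, Render and Visit are hypothetical and do not come from the library.

```csharp
using System;
using System.Text;
using System.Xml;

public static class TraversalSketch
{
    // Walks the tree depth-first and returns the collected text fragments.
    public static string Render(XmlNode root)
    {
        StringBuilder output = new StringBuilder();
        Visit(root, output);
        return output.ToString();
    }

    private static void Visit(XmlNode node, StringBuilder output)
    {
        // Text nodes are the leaves that carry visible content.
        if (node.NodeType == XmlNodeType.Text)
        {
            output.Append(node.Value);
            return;
        }

        // Visit the children in document order.
        foreach (XmlNode child in node.ChildNodes)
        {
            Visit(child, output);
        }

        // Stand-in for "emit a PDF fragment": end the line after
        // block-level elements (only <p> and <h1> in this sketch).
        if (node.Name == "p" || node.Name == "h1")
        {
            output.Append('\n');
        }
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>");
        Console.Write(Render(doc.DocumentElement));
    }
}
```

The point of the recursive shape is that each element decides what to emit once its children have been processed, so supporting a new tag means adding one more case rather than restructuring the walk.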

Quick Example

The library is used through the HtmlToPdfConverter class. Creating a PDF file takes the following steps, as illustrated in the example:

  1. Create a converter,
  2. Open a new PDF file using the Open method,
  3. Add a chapter,
  4. Feed HTML to the converter,
  5. If you want another chapter, go to 3.
  6. When finished, close the PDF file by calling Close.

// create converter
HtmlToPdfConverter html2pdf = new HtmlToPdfConverter();

// open new pdf file
html2pdf.Open(@"test");

// start a chapter
html2pdf.AddChapter(@"Dummy Chapter");

string html = ...;
// convert string
html2pdf.Run(html);

// add a new chapter
html2pdf.AddChapter(@"Boost page");

// read web page
html2pdf.Run(new Uri(@"http://www.boost.org/libs/libraries.htm"));

// close and finish pdf file.
html2pdf.Close();

What to expect and not expect

Don't expect too much from this tool: it gives fairly good results with simple HTML pages, but it will not work with complex ones. In particular, tables are not yet supported.
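
Since tables are not handled, it may be worth checking the input before handing it to the converter. The following is a rough sketch of such a pre-check. HtmlPrecheck and ContainsTable are hypothetical helpers, not part of Pdfizer, and the regular expression is a simple heuristic rather than a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlPrecheck
{
    // Hypothetical helper: returns true if the markup contains an
    // opening <table> tag, in any letter case.
    public static bool ContainsTable(string html)
    {
        return Regex.IsMatch(html, @"<\s*table\b", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsTable("<p>plain paragraph</p>"));             // False
        Console.WriteLine(ContainsTable("<TABLE border=1><tr><td>x</td></tr>")); // True
    }
}
```

A caller could use this to warn the user or skip a page instead of producing a PDF with missing content.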

Reference

  1. NTidy, a .NET wrapper around Tidy.
  2. HTML Tidy home page.
  3. iTextSharp, PDF generation tool.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Jonathan de Halleux


Jonathan de Halleux is a Civil Engineer in Applied Mathematics. He finished his PhD in 2004 in the rainy country of Belgium. After two years working on the Common Language Runtime (i.e. .NET), he is now working at Microsoft Research on Pex (http://research.microsoft.com/pex).

Occupation: Engineer
Location: United States
