Conversion of PDF to EPUB Format

By André van heerwaarde

This application converts a PDF file into the EPUB format.
C#, Windows, WPF
Posted:22 Nov 2009

Introduction

This application converts a PDF into an EPUB file ready for an e-book reader. EPUB is an open and free standard format (http://en.wikipedia.org/wiki/EPUB). I was not satisfied with most of the applications I tried (Stanza and Calibre); they seemed to drop some or all of the pictures, so I wrote something of my own. Some people showed interest in the application, which is why it shows up here. But remember that this is a weekend project, so don't expect fully refactored code complete with unit tests.

EPUB

An EPUB file is a zip archive that contains HTML files plus a few extra files with meta information. The extra files are not that special; see the Wikipedia link if you really want to know the details. To get from PDF to HTML, I started out with an open source project from SourceForge (Pdf2Html). This gives us a single HTML file and all the images.

EPUB, however, requires the HTML to be XHTML, so we need to convert all the tags. Luckily only a few tags are used, so the conversion is fairly simple. Tags must be in lower case and must either have a closing tag or be self-closed with '/>'. On anchor (a) tags, the name attribute must be replaced with the id attribute.
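The three rules above can be sketched with a single regular-expression pass. The application itself is written in C#; this is a minimal Python illustration, not the article's code. The function name to_xhtml and the VOID_TAGS set are assumptions made for the example, and only tag names (not attribute names in general) are lower-cased.

```python
import re

# Assumed set of tags that never have a closing tag; adjust to the input.
VOID_TAGS = {"br", "hr", "img"}

def to_xhtml(html: str) -> str:
    """Lower-case tag names, self-close void tags, rename name= to id= on anchors."""
    def fix_tag(match):
        slash = match.group(1)          # "/" for a closing tag, "" otherwise
        name = match.group(2).lower()   # tag name in lower case
        rest = match.group(3)           # attributes, copied as-is
        if name == "a" and not slash:
            # Anchors: replace the name attribute with id (case-insensitive).
            rest = re.sub(r"(?i)\bname\s*=", "id=", rest)
        if name in VOID_TAGS and not slash:
            # Drop any existing trailing slash, then self-close.
            rest = rest.rstrip().rstrip("/").rstrip()
            return f"<{name}{rest} />"
        return f"<{slash}{name}{rest}>"
    return re.sub(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>", fix_tag, html)
```

A regular expression is enough here only because the input uses a handful of simple tags; arbitrary HTML would need a real parser.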

Next we want to strip the page headers and footers, which makes reading on an e-book reader more enjoyable. To accomplish this, we split the source file into separate pages (identified by the <hr> tag). The best way to correctly identify a header or a footer is to take a fixed number of lines from the start and end of each page and match them against a regular expression. Whatever matches is stripped, and the rest is added to a new output file.
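As a rough illustration of this step (a Python sketch under stated assumptions, not the application's C# code): the function name strip_headers_footers, the window parameter and the two patterns are hypothetical and would have to be tuned for each book.

```python
import re

def strip_headers_footers(html, header_re, footer_re, window=2):
    """Split on <hr>, then drop matching lines near the top/bottom of each page."""
    pages = re.split(r"(?i)<hr\s*/?>", html)
    cleaned = []
    for page in pages:
        lines = page.strip("\n").splitlines()
        total = len(lines)
        kept = []
        for i, line in enumerate(lines):
            near_top = i < window               # within the first `window` lines
            near_bottom = i >= total - window   # within the last `window` lines
            if near_top and header_re.search(line):
                continue                        # looks like a page header
            if near_bottom and footer_re.search(line):
                continue                        # looks like a page footer
            kept.append(line)
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)
```

Restricting the match to a small window at each page edge keeps a pattern such as "Page 12" from removing a line in the body text that happens to look the same.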

The HTML file has lines of a fixed length, so we need to collect multiple lines into paragraphs. This allows the e-book reader to reflow the text more easily. Currently we detect three types of lines:

  • Normal line: when it is neither a heading nor a code line
  • Heading: if it starts with a tag (any tag)
  • Code line: if the current or the next line starts with a space

All normal lines are grouped together into paragraphs. Each heading gets its own paragraph. Each code line is wrapped in a pre tag, which makes the reader show the code in a monospaced font such as Courier.
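The classification and grouping rules can be sketched as follows. This is a Python illustration, not the application's C# code. Two points are assumptions that the text does not state: the heading check takes priority over the code check, and a blank line ends the current paragraph. The names classify and build_paragraphs are hypothetical.

```python
def classify(line: str, next_line: str) -> str:
    """Return 'heading', 'code' or 'normal' for one line."""
    if line.startswith("<"):                       # starts with a (any) tag
        return "heading"
    if line.startswith(" ") or next_line.startswith(" "):
        return "code"                              # current or next line is indented
    return "normal"

def build_paragraphs(lines):
    """Group normal lines into <p> blocks; wrap headings in <p> and code lines in <pre>."""
    out, buffer = [], []

    def flush():
        # Emit the collected normal lines as one paragraph.
        if buffer:
            out.append("<p>" + " ".join(buffer) + "</p>")
            buffer.clear()

    for i, line in enumerate(lines):
        if not line.strip():
            flush()                                # assumption: blank line ends a paragraph
            continue
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        kind = classify(line, next_line)
        if kind == "normal":
            buffer.append(line.strip())
        else:
            flush()
            tag = "p" if kind == "heading" else "pre"
            out.append(f"<{tag}>{line}</{tag}>")
    flush()
    return out
```

Note that the "next line starts with a space" rule makes the line just before an indented block count as code too, which keeps a function signature together with its indented body.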

Next we need to add the extra files with meta info. Most can be copied, but some need to be manipulated. This is done by replacing markers in template files.
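Marker replacement of this kind can be as simple as a string substitution. The sketch below is in Python and is not the application's C# code. The {{title}} marker syntax and the fill_template name are assumptions; the article does not say what the markers in the template files look like.

```python
from xml.sax.saxutils import escape

def fill_template(template: str, values: dict) -> str:
    """Replace each {{key}} marker (hypothetical syntax) with its XML-escaped value."""
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", escape(value))
    return template
```

For example, fill_template("<dc:title>{{title}}</dc:title>", {"title": "Tom & Jerry"}) yields a title element containing "Tom &amp; Jerry". Escaping matters because the meta files are XML, and a bare ampersand in a book title would make them invalid.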

Finally we need to zip it. I tried several command-line zip tools, but they all rearranged the file order, and EPUB requires the first file in the archive to be mimetype, stored uncompressed. Fortunately .NET also contains some compression code, so that takes care of the final step.
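The ordering requirement is easy to satisfy when the archive is written programmatically. The application does this with .NET; the following Python sketch uses the standard zipfile module to show the same idea. The write_epub name and the files mapping are hypothetical.

```python
import zipfile

def write_epub(target, files):
    """Write an EPUB archive.

    target: a file path or a writable binary file object.
    files:  mapping of archive name -> bytes, excluding the mimetype entry.
    """
    with zipfile.ZipFile(target, "w") as epub:
        # mimetype must be the first entry and must be stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        # Everything else can be compressed normally.
        for name, data in files.items():
            epub.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

Writing the mimetype entry explicitly before anything else avoids the reordering problem seen with the command-line tools, because entries land in the archive in the order they are written.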

I hope somebody finds this application useful. If you want a change or a new feature, you are free to add it and spread it around the world.

History

  • 22nd November, 2009: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

André van heerwaarde


I am a software developer who has worked with Microsoft technologies for the last 20 years. I have seen almost all versions of DOS and Windows and have programmed in Assembly, C, C++, C# and VB.NET. I currently focus completely on everything .NET.
Occupation: Software Developer
Company: Eclectic (www.eclectic.nl)
Location: Netherlands

Last Updated: 22 Nov 2009
Editor: Deeksha Shenoy
Copyright 2009 by André van heerwaarde