Licence CPOL
First Posted 22 Nov 2009

Conversion of PDF to EPUB Format

By André van heerwaarde | 22 Nov 2009
This application converts a PDF file into the EPUB format


Introduction

This application converts a PDF into an EPUB file ready for an e-book reader. EPUB is an open and free standard format (http://en.wikipedia.org/wiki/EPUB). I was not satisfied with the applications I tried (Stanza and Calibre): they seemed to drop some or all of the pictures, so I wrote something of my own. Some people showed interest in the application, and that is why it shows up here. But remember that this is a weekend project, so don't expect fully refactored code complete with unit tests.

EPUB

An EPUB file is a zip archive containing HTML files plus a few extra files with meta information. The extra files are not that special; see the Wikipedia link if you really want to know. To get from PDF to HTML, I started out with an open source project from SourceForge (Pdf2Html). This gives us a single HTML file and all the images.

But EPUB requires the HTML to be XHTML, so we need to convert all the tags. Luckily only a few tags were used, so the conversion is fairly simple. Tag names must be in lower case, and every tag must either have a closing tag or be self-closed with ‘/>’. On a (anchor) tags, the name attribute must be replaced with the id attribute.
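The conversion rules above can be sketched in a few lines. This is an illustration in Python, not the application's .NET code; the function name to_xhtml and the set of tags treated as self-closing are assumptions made for the example, and it only handles quoted attribute values.

```python
import re

# Assumed set of tags that have no closing tag in the source HTML.
VOID_TAGS = {"br", "hr", "img"}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
ATTR = re.compile(r"""\s+([A-Za-z][-A-Za-z0-9]*)\s*=\s*("[^"]*"|'[^']*')""")

def _fix_attrs(tag, attrs):
    """Lower-case attribute names; on <a>, rename name to id."""
    def fix(m):
        key = m.group(1).lower()
        if tag == "a" and key == "name":
            key = "id"
        return f" {key}={m.group(2)}"
    return ATTR.sub(fix, attrs)

def to_xhtml(html):
    """Lower-case tag names and self-close the tags listed in VOID_TAGS."""
    def fix_tag(m):
        slash, name = m.group(1), m.group(2).lower()
        attrs = _fix_attrs(name, m.group(3))
        if not slash and name in VOID_TAGS:
            # Drop an existing trailing slash so we never emit two.
            return f"<{name}{attrs.rstrip(' /')} />"
        return f"<{slash}{name}{attrs}>"
    return TAG.sub(fix_tag, html)

print(to_xhtml('<A NAME="top">Top</A><BR><IMG SRC="a.png">'))
```

A regular expression is enough here only because the input uses a handful of simple tags; arbitrary HTML would need a real parser.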

Next we want to strip the page headers and footers, which makes reading on an e-book reader more enjoyable. To accomplish this, we split the source file into separate pages (identified by the <hr> tag). The most reliable way I found to identify a header or a footer is to take a fixed number of lines from the start and end of each page and match them against a regular expression. If a line matches, we strip it; the rest of the page is added to a new output file.
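As a rough sketch of this step, again in Python rather than the application's own .NET code: the function name strip_headers_footers, the edge_lines parameter and the sample pattern are all assumptions for illustration.

```python
import re

def strip_headers_footers(html, pattern, edge_lines=2):
    """Split on <hr> and drop matching lines near the top/bottom of each page."""
    regex = re.compile(pattern)
    pages = re.split(r"<hr\s*/?>", html, flags=re.IGNORECASE)
    cleaned = []
    for page in pages:
        lines = page.strip("\n").split("\n")
        kept = []
        for i, line in enumerate(lines):
            # Only the first and last edge_lines lines are candidates.
            near_edge = i < edge_lines or i >= len(lines) - edge_lines
            if near_edge and regex.search(line):
                continue
            kept.append(line)
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)

sample = (
    "My Book\nfirst page text\nmore text\nPage 1\n"
    "<hr>\n"
    "My Book\nsecond page text\nMy Book is mentioned here\nlast line\nPage 2\n"
)
print(strip_headers_footers(sample, r"^(My Book|Page \d+)$", edge_lines=1))
```

Restricting the match to the edges of a page is what keeps a body line that happens to resemble a header from being stripped.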

The HTML file has lines of a fixed length, but we need to collect multiple lines into a paragraph. This allows the e-book reader to flow the text more easily. Currently we detect 3 types of lines:

Normal line: when it is not a heading or a code line
Heading: if it starts with a tag (any tag)
Code line: if the current or the next line starts with a space

All normal lines are grouped together into paragraphs. Each heading gets its own paragraph. We surround each code line with a pre tag, so the e-book reader shows the code in a monospaced font such as Courier.
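The classification and grouping could look roughly like the Python sketch below. It is not the application's .NET code. The names classify and group_lines are made up for the example, and two choices are assumptions: the heading test is applied before the code test, and a blank line ends the current paragraph.

```python
def classify(line, next_line):
    """Return 'heading', 'code' or 'normal' for one line of the HTML file."""
    if line.startswith("<"):
        return "heading"  # assumption: the heading test wins over the code test
    if line.startswith(" ") or next_line.startswith(" "):
        return "code"
    return "normal"

def group_lines(lines):
    """Join normal lines into <p> blocks; wrap each code line in <pre>."""
    out = []
    paragraph = []

    def flush():
        if paragraph:
            out.append("<p>" + " ".join(paragraph) + "</p>")
            paragraph.clear()

    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if not line.strip():
            flush()  # assumption: a blank line ends the paragraph
            continue
        kind = classify(line, next_line)
        if kind == "normal":
            paragraph.append(line.strip())
            continue
        flush()
        if kind == "heading":
            out.append("<p>" + line + "</p>")
        else:
            out.append("<pre>" + line + "</pre>")
    flush()
    return out

for block in group_lines([
    "<b>Chapter 1</b>",
    "This is the first",
    "line of a paragraph.",
    "for (;;)",
    "  run();",
    "Back to text.",
]):
    print(block)
```

Looking one line ahead is what lets the unindented first line of a code fragment ("for (;;)" in the example) be treated as code too.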

Next we need to add the extra files with meta info. Most can be copied as they are, but some must be adapted to the book. This is done by replacing markers in template files.
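A minimal Python sketch of marker replacement follows. The article does not say what the markers look like, so the {{NAME}} syntax, the marker names and the function name fill_template are purely hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical marker syntax: {{NAME}}. The real templates may differ.
MARKER = re.compile(r"\{\{(\w+)\}\}")

def fill_template(template, values):
    """Replace every {{NAME}} marker with the XML-escaped value for NAME."""
    def replace(match):
        # A marker without a value raises KeyError instead of passing silently.
        return escape(values[match.group(1)])
    return MARKER.sub(replace, template)

template = "<dc:title>{{TITLE}}</dc:title><dc:creator>{{AUTHOR}}</dc:creator>"
print(fill_template(template, {"TITLE": "Tom & Jerry", "AUTHOR": "A. Writer"}))
```

Escaping the values matters because the meta files are XML, and a title containing & or < would otherwise make them invalid.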

Finally we need to zip it. I tried several command-line zip tools, but they all rearranged the file order, and EPUB wants the first file in the archive to be mimetype (uncompressed). Fortunately .NET also contains some compression code, so that’s the final step done.
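The ordering requirement can be shown with Python's standard zipfile module; the application itself uses .NET, so this is only an illustration. The function name write_epub and the OEBPS file name are assumptions, while the mimetype content application/epub+zip comes from the EPUB standard.

```python
import io
import zipfile

def write_epub(target, files):
    """Write an EPUB archive; files maps archive names to bytes (without mimetype)."""
    with zipfile.ZipFile(target, "w") as zf:
        # mimetype must be the first entry and must be stored, not compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)

# target may be a file name or a file-like object; a buffer keeps the demo in memory.
buffer = io.BytesIO()
write_epub(buffer, {
    "META-INF/container.xml": b"<container/>",   # placeholder content
    "OEBPS/content.html": b"<html/>",            # illustrative file name
})
with zipfile.ZipFile(buffer) as zf:
    for info in zf.infolist():
        print(info.filename, info.compress_type == zipfile.ZIP_STORED)
```

Writing the entries yourself, rather than zipping a folder, is what guarantees both the order and the per-file compression setting.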

I hope somebody finds this application useful. If you want a change or a new feature, you are free to add it and spread it around the world.

History

  • 22nd November, 2009: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

André van heerwaarde

Software Developer
Eclectic (www.eclectic.nl)
Netherlands

I am a software developer who has worked with Microsoft technologies for the last 20 years. I have seen almost all versions of DOS and Windows and have programmed in Assembly, C / C++ / C# and VB.NET. Currently focusing completely on everything .NET.


Article Copyright 2009 by André van heerwaarde