Conversion of PDF to EPUB Format

22 Nov 2009
This application converts a file from PDF into EPUB format

Introduction

This application converts a PDF into an EPUB file ready for an e-book reader. EPUB is a free and open standard format (http://en.wikipedia.org/wiki/EPUB). I was not satisfied with most applications I tried (Stanza and Calibre); they seemed to drop some or all of the pictures, so I wrote something of my own. Some people showed interest in the application, which is why it shows up here. But remember this is a weekend project, so don't expect fully refactored code complete with unit tests.

EPUB

An epub file is a zip archive containing HTML files plus a few extra files with meta information. The extra files are not that special; see the Wikipedia link if you really want to know. To get from PDF to HTML, I started out with an open source project from SourceForge (Pdf2Html). This gives us a single HTML file and all the images.

But epub requires the HTML to be XHTML, so we need to convert all the tags. Luckily only a few tags were used, so the conversion is fairly simple. Tags must be in lower case and must either have a closing tag or be self-closed with '/>'. In anchor (a) tags, the name attribute must be replaced with the id attribute.
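The conversion rules above can be sketched in a few lines of Python. This is only an illustration, not the code from the application (which is written for .NET): the function names, the set of void tags, and the regular expressions are all assumptions, and the attribute handling is deliberately naive.

```python
import re

# Assumption: these are the only void (never closed) tags in the input.
VOID_TAGS = {"br", "hr", "img"}
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")
ATTR_NAME_RE = re.compile(r"(\s)([A-Za-z][A-Za-z0-9-]*)(\s*=)")


def fix_tag(match):
    """Rewrite one tag: lower-case names, self-close void tags, name -> id."""
    slash, name, attrs = match.group(1), match.group(2).lower(), match.group(3)
    # Lower-case attribute names. Naive: assumes attribute values never
    # contain text that looks like ' something='.
    attrs = ATTR_NAME_RE.sub(
        lambda m: m.group(1) + m.group(2).lower() + m.group(3), attrs)
    if name == "a":
        # Anchors: the name attribute becomes id.
        attrs = re.sub(r"(\s)name(\s*=)", r"\1id\2", attrs)
    if name in VOID_TAGS and not slash:
        # Drop an existing trailing slash, then self-close the tag.
        attrs = attrs.rstrip().rstrip("/").rstrip()
        return f"<{name}{attrs} />"
    return f"<{slash}{name}{attrs}>"


def to_xhtml(html):
    """Apply fix_tag to every opening and closing tag in the text."""
    return TAG_RE.sub(fix_tag, html)


print(to_xhtml('<A NAME="p1"></A>Hello<BR><IMG SRC="a.png">'))
```

A regular expression is good enough here only because the generated HTML uses a handful of simple tags; arbitrary HTML would need a real parser.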

Next we want to strip the page header and footer. This makes reading on an e-book reader more enjoyable. To accomplish this, we split the source file into separate pages (identified by the <hr> tag). The best way to correctly identify a header or a footer is to select a fixed number of lines from the start and end of each page and match them against a regular expression. If they match, we strip them and add the rest of the page to a new output file.
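As a rough illustration of this step, the Python sketch below splits the text on <hr> and drops matching lines near the top and bottom of each page. The function name, the window parameter, and the line-by-line matching are assumptions made for the example; the actual application may match the selected lines differently.

```python
import re


def strip_headers_and_footers(html, pattern, window=3):
    """Drop lines matching `pattern` from the first/last `window` lines of each page."""
    regex = re.compile(pattern)
    # Pages are separated by <hr> (also accept <HR>, <hr/> and <hr />).
    pages = re.split(r"<hr\s*/?>", html, flags=re.IGNORECASE)
    cleaned = []
    for page in pages:
        if not page.strip():
            continue  # skip empty pages
        lines = page.strip("\n").split("\n")
        kept = []
        for i, line in enumerate(lines):
            near_edge = i < window or i >= len(lines) - window
            if near_edge and regex.search(line):
                continue  # header or footer line: strip it
            kept.append(line)
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)


sample = ("My Book 1\nfirst page text\nmore text\n<hr>\n"
          "My Book 2\nsecond page text\nPage 2 of 9\n")
print(strip_headers_and_footers(
    sample, r"^(My Book \d+|Page \d+ of \d+)$", window=1))
```

Restricting the match to a small window at the page edges keeps a body line that happens to look like a header from being removed.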

The HTML file has lines of a fixed length, but we need to collect multiple lines into a paragraph. This allows the e-book reader to flow the text more easily. Currently we detect 3 types of lines:

  • Normal line: when it is not a heading or a code line
  • Heading: if it starts with a tag (any tag)
  • Code line: if the current or next line starts with a space

All normal lines are grouped together into paragraphs. Each heading gets its own paragraph. We surround each code line with a pre tag, so the e-book reader automatically shows the code in a monospaced font such as Courier.
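The classification and grouping described above might look roughly like this in Python. The article does not say which rule wins when a line is both a heading and a code line, or how a paragraph ends, so this sketch assumes the heading test comes first and that a blank line closes the current paragraph. All names are hypothetical.

```python
def classify(line, next_line):
    """Return 'heading', 'code' or 'normal' for one line."""
    if line.startswith("<"):
        return "heading"  # assumption: the heading test takes priority
    if line.startswith(" ") or next_line.startswith(" "):
        return "code"
    return "normal"


def group_lines(lines):
    """Merge normal lines into <p> blocks; wrap headings and code lines."""
    out, paragraph = [], []

    def flush():
        # Emit the collected normal lines as a single paragraph.
        if paragraph:
            out.append("<p>" + " ".join(paragraph) + "</p>")
            paragraph.clear()

    for i, line in enumerate(lines):
        if not line.strip():
            flush()  # assumption: a blank line ends the paragraph
            continue
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        kind = classify(line, next_line)
        if kind == "normal":
            paragraph.append(line.strip())
            continue
        flush()
        if kind == "heading":
            out.append("<p>" + line + "</p>")
        else:
            out.append("<pre>" + line + "</pre>")
    flush()
    return out


for block in group_lines(["<b>Chapter 1</b>", "This is the first",
                          "line of a paragraph.", "", "for x in y:",
                          "    print(x)", "Done."]):
    print(block)
```

Note that the "next line" rule marks the unindented first line of an indented block as code too, which keeps the opening line of a code snippet together with its body.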

Next we need to add the extra files with meta info. Most can be copied, but some need to be manipulated. This is done by replacing markers in template files.
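Marker replacement of this kind can be as simple as a string substitution. In the Python sketch below, the %%NAME%% marker syntax, the template fragment, and the function name are invented for illustration; the markers used in the application's own template files may look different.

```python
from xml.sax.saxutils import escape

# Hypothetical fragment of a metadata template with two markers.
TEMPLATE = ("<dc:title>%%TITLE%%</dc:title>\n"
            "<dc:creator>%%AUTHOR%%</dc:creator>\n")


def fill_template(template, values):
    """Replace each %%KEY%% marker with its XML-escaped value."""
    for key, value in values.items():
        template = template.replace(f"%%{key}%%", escape(value))
    return template


print(fill_template(TEMPLATE, {"TITLE": "Tom & Jerry", "AUTHOR": "A. Writer"}))
```

Escaping the values matters because the meta files are XML: a title containing & or < would otherwise produce a file the e-book reader cannot parse.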

Finally we need to zip it. I tried several command-line zip tools, but they all rearranged the file order, and epub wants the first file in the archive to be mimetype, stored uncompressed. Fortunately .NET also contains some compression code, so that's the final step done.
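For readers who want to reproduce the packaging step outside .NET, here is a minimal sketch using Python's standard zipfile module. It is not the application's code; the function name and the directory layout are assumptions. It writes mimetype first and uncompressed, then adds everything else with normal compression.

```python
import os
import zipfile


def write_epub(source_dir, epub_path):
    """Zip source_dir into epub_path with mimetype as the first, stored entry."""
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The first entry must be named mimetype and must not be compressed.
        archive.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        for root, dirs, files in os.walk(source_dir):
            dirs.sort()  # deterministic order
            for name in sorted(files):
                full = os.path.join(root, name)
                # Zip entries always use forward slashes.
                rel = os.path.relpath(full, source_dir).replace(os.sep, "/")
                if rel == "mimetype":
                    continue  # already written above
                archive.write(full, rel, compress_type=zipfile.ZIP_DEFLATED)
```

Writing the mimetype entry explicitly, rather than relying on directory order, is what guarantees it lands first in the archive.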

I hope somebody finds this application useful and if you want a change or a new feature, you are free to add it and spread it around the world.

History

  • 22nd November, 2009: Initial post

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
