Introduction
This application converts a PDF into an EPUB file ready for an e-book reader. EPUB is a free and open standard format (http://en.wikipedia.org/wiki/EPUB). I was not satisfied with the applications I tried (Stanza and Calibre): they seemed to drop some or all of the pictures. So I wrote something of my own. Some people showed interest in the application, and that’s why it shows up here. But remember that this is a weekend project, so don't expect fully refactored code complete with unit tests.
EPUB
An EPUB file is a zip archive containing HTML files plus a few extra files with meta information. The extra files are not that special; see the Wikipedia link if you really want to know. To get from PDF to HTML, I started out with an open source project from SourceForge (Pdf2Html). This gives us a single HTML file and all the images.
But EPUB requires the HTML to be in XHTML format, so we need to convert all the tags. Luckily only a few tags were used, so the conversion is fairly simple. The tags must be in lower case and must either have a closing tag or be self-closed with ‘/>’. In anchor (a) tags, the name attribute must be replaced with the id attribute.
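The three rules above can be sketched in a few lines. This is only an illustration in Python, not the application's actual .NET code; the function name to_xhtml and the list of void tags are assumptions, and attribute names other than name are left untouched.

```python
import re

# Assumption: the only tags without a closing tag in the converter output.
VOID_TAGS = ("br", "hr", "img")

def to_xhtml(html):
    """Lower-case tag names, self-close void tags, rename name= to id= on anchors."""

    def fix_tag(match):
        slash = match.group(1)
        name = match.group(2).lower()
        rest = match.group(3)
        if slash:
            # Closing tag such as </B> becomes </b>.
            return "</" + name + ">"
        if name == "a":
            # XHTML anchors are identified by id instead of name.
            rest = re.sub(r"\bname\s*=", "id=", rest, flags=re.IGNORECASE)
        if name in VOID_TAGS:
            # Drop an existing trailing slash, then self-close consistently.
            rest = rest.rstrip().rstrip("/").rstrip()
            return "<" + name + rest + " />"
        return "<" + name + rest + ">"

    # A tag starts with "<", an optional "/", and a letter; plain "<" in text is skipped.
    return re.sub(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>", fix_tag, html)
```

A regular expression is good enough here only because the converter emits a small, predictable set of tags; arbitrary HTML would call for a real parser.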
Next we want to strip the page headers and footers. This makes reading on an e-book reader more enjoyable. To accomplish this, we split the source file into separate pages (identified by the <hr> tag). The best way to correctly identify a header or a footer is to take a fixed number of lines from the start/end of each page and match them against a regular expression. If they match, we strip them and add the rest to a new output file.
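As a rough sketch of this step (in Python for illustration, not the application's own code): the function name, the window parameter, and the idea of passing in precompiled header and footer patterns are all assumptions. The patterns themselves depend entirely on the book being converted.

```python
import re

def strip_headers_and_footers(html, header_re, footer_re, window=2):
    """Split on <hr>, then drop matching lines near the top and bottom of each page."""
    pages = re.split(r"<hr\s*/?>", html, flags=re.IGNORECASE)
    cleaned = []
    for page in pages:
        lines = page.strip("\n").split("\n")
        kept = []
        for i, line in enumerate(lines):
            near_top = i < window
            near_bottom = i >= len(lines) - window
            # Only lines inside the window are candidates, so body text
            # that happens to look like a header is left alone.
            if near_top and header_re.search(line):
                continue
            if near_bottom and footer_re.search(line):
                continue
            kept.append(line)
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)
```

For a book whose pages start with the title and end with "Page 12", the call could look like strip_headers_and_footers(text, re.compile(r"^My Book Title$"), re.compile(r"^Page \d+$")).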
The HTML file has lines of a fixed length, but we need to collect multiple lines into a paragraph. This allows the e-book reader to flow the text more easily. Currently we detect 3 types of lines:
Line type | Detected when
Normal line | It is not a heading or a code line
Heading | It starts with a tag (any tag)
Code line | The current or next line starts with a space
All normal lines are grouped together into paragraphs. Each heading gets its own paragraph. We surround each code line with a pre tag. This will automatically show the code in a monospaced font such as Courier.
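The classification and grouping could look roughly like the following Python sketch. The function names are made up for illustration, and two details are assumptions not stated above: the heading test is applied before the code test, and blank lines end the current paragraph.

```python
def classify(line, next_line):
    """Return 'heading', 'code' or 'normal' for one line of converter output."""
    if line.startswith("<"):
        return "heading"  # assumption: the heading rule wins over the code rule
    if line.startswith(" ") or next_line.startswith(" "):
        return "code"
    return "normal"

def build_blocks(lines):
    """Group normal lines into <p> blocks; wrap headings in <p> and code lines in <pre>."""
    blocks = []
    paragraph = []

    def flush():
        # Emit the paragraph collected so far, if any.
        if paragraph:
            blocks.append("<p>" + " ".join(paragraph) + "</p>")
            paragraph.clear()

    for i, line in enumerate(lines):
        if not line.strip():
            flush()  # assumption: a blank line ends the paragraph
            continue
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        kind = classify(line, next_line)
        if kind == "normal":
            paragraph.append(line.strip())
            continue
        flush()
        if kind == "heading":
            blocks.append("<p>" + line + "</p>")
        else:
            blocks.append("<pre>" + line + "</pre>")
    flush()
    return blocks
```

Looking at the next line as well as the current one means the first, unindented line of a code fragment is still treated as code when the line after it is indented.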
Next we need to add the extra files with meta info. Most can be copied, but some need to be manipulated. This is done by replacing markers in template files.
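Marker replacement is simple enough to show in a few lines. The sketch below is Python and purely illustrative: the {{TITLE}} marker syntax and the fill_template name are assumptions, not the markers the application really uses. Values are XML-escaped because the meta files are XML.

```python
from xml.sax.saxutils import escape

def fill_template(template, values):
    """Replace each marker in the template text with its XML-escaped value."""
    for marker, value in values.items():
        template = template.replace(marker, escape(value))
    return template
```

For example, fill_template("<dc:title>{{TITLE}}</dc:title>", {"{{TITLE}}": "My Book"}) fills in the title element of a metadata file.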
Finally we need to zip it. I tried several command-line zip tools, but they all rearranged the file order, and EPUB requires the first file in the archive to be mimetype, stored uncompressed. Fortunately, .NET also contains some compression code, so that’s the final step done.
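The ordering requirement is easy to meet when the archive is written programmatically. The sketch below uses Python's zipfile module instead of the .NET compression code the application relies on; the write_epub name and the dictionary of files are assumptions for illustration.

```python
import zipfile

def write_epub(target, files):
    """Write an EPUB archive; files maps archive names to bytes, without mimetype."""
    with zipfile.ZipFile(target, "w") as archive:
        # mimetype must be the first entry and must not be compressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # Everything else follows in insertion order and may be compressed.
        for name, data in files.items():
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

Because entries are written one by one, the order in the archive is exactly the order of the calls, which is what the command-line tools did not guarantee.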
I hope somebody finds this application useful. If you want a change or a new feature, you are free to add it and spread it around the world.
History
- 22nd November, 2009: Initial post