Blend PDF with HTML5

Modesty Zhang

Sep 26, 2012

CPOL

Rendering PDF binary stream for interactive forms by extending PDF.JS in HTML5

Introduction

The Web has evolved a lot since I posted the Blend PDF with Silverlight technique about four years ago. Besides the shift in RIA (Rich Internet Application) toward fewer or no browser plugin dependencies and the rise of RJA (Rich JavaScript Application), one particular trend is that mobile web traffic is exploding and tablet usage on the web increases every day. If your web application is centered on electronic forms, your fillable forms are developed in PDF, and you also need to support mobile devices, tablets in particular, then it's time for a tech refresh: we need a practical solution that lets web apps not just render fillable PDF forms directly within the browser without a plugin, but also integrate user interactions and exchange data seamlessly. With the new capabilities in HTML5 and the browsers, plus PDF.JS, a powerful open source JavaScript PDF viewer library from Mozilla Labs, parsing, rendering, laying out and interacting with PDF electronic forms directly within the browser becomes possible.

This article discusses a practical yet simple client-side solution for parsing, rendering, laying out and data binding PDF interactive forms directly in the browser by extending PDF.JS, with no browser plugin involved.

The PDF.JS library does the heavy lifting: it processes the PDF file on the client side as a binary stream and renders the PDF's read-only parts, including text, shapes, lines, fills, etc., in HTML5 Canvas. I'll discuss in detail where to extend PDF.JS to support parsing form elements: not only text input boxes and check boxes, but also radio buttons, radio button groups, push buttons and dropdown lists (comboboxes). I'll also discuss the techniques used in the prototype to lay out those interactive form elements together with the PDF content. A working instance of the prototype can be found here: http://www.hanray.com/sites/BlendPDFWithHTML5/.

Background

The PDF format became an ISO standard in 2008, and it's well defined and documented. We're not going to discuss the PDF format itself; instead, we'll focus on parsing and rendering interactive form elements on top of PDF.JS. The main idea is that PDF electronic forms can be parsed, rendered and integrated with user interactions and user data by JavaScript in the client. The benefit of this approach is documented in Andreas Gal's post; he mainly talked about viewing PDF in read-only mode, while our goal is to add form interactivity and data binding.

Rendering interactive PDF forms in the browser relies on some new features in HTML5. PDF.JS uses JavaScript Typed Arrays and XHR Level 2 for PDF binary stream processing, and utilizes Web Workers and HTML5 Canvas to render PDF in read-only mode. This means not all browsers are compatible with this approach, IE being the primary exception: IE9 has canvas support, but Typed Arrays don't arrive until IE10. For browsers that support these HTML5 capabilities, including the latest versions of Google Chrome, Firefox and Safari, the prototype works great.
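
The four capabilities listed above can be checked up front before attempting to load a form. The following is a minimal sketch, not code from the sample project: detectPdfFormSupport and the keys of its result are hypothetical names, and the window-like object is passed in as a parameter so the check can also be exercised outside a browser.

```javascript
// Sketch only: detectPdfFormSupport is a hypothetical helper, not part of
// PDF.JS or the sample project. `env` is a window-like object.
function detectPdfFormSupport(env) {
    var features = {
        // Typed Arrays: needed to hold the PDF binary stream
        typedArray: !!env.Uint8Array,
        // XHR Level 2: responseType is what allows binary responses
        xhr2: !!env.XMLHttpRequest &&
              'responseType' in new env.XMLHttpRequest(),
        // Web Workers: used for parsing off the UI thread
        webWorker: !!env.Worker,
        // Canvas: where the read-only PDF content is drawn
        canvas: !!(env.document &&
                   typeof env.document.createElement === 'function' &&
                   env.document.createElement('canvas').getContext)
    };
    features.supported = features.typedArray && features.xhr2 &&
                         features.webWorker && features.canvas;
    return features;
}
```

In a browser this would be called as detectPdfFormSupport(window); when supported is false, the individual flags show which capability is missing, so the app can display a meaningful message instead of failing halfway through rendering.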

Since PDF.JS focuses on building a JavaScript PDF viewer for the browser, which is read-only, it doesn't support parsing and rendering form elements. One of its samples provides some basic support for AcroForm, but it only covers text inputs and check boxes, so the extension to PDF.JS is mainly about parsing radio buttons (and groups), push buttons and dropdowns (comboboxes). Once we have the form element data, we'll lay out HTML forms on top of the PDF content on canvas for user interactions and data integration.

The way to lay out interactive form elements is to use the canvas drawing generated by PDF.JS as the form background, then put form controls in an absolutely positioned HTML form tag on top of it. The size and position of each PDF page and form element come from the PDF stream, while the z-index of the canvas and of the form layer is controlled by CSS. The goal is that we can drop a fillable PDF form file on the web server and, with no server-side processing and no client-side plugins, our prototype web app renders the PDF content and the interactive form without code changes.

One more note: this approach doesn't address PDF printing, and the sample code doesn't have database support either. Although we address user data persistence while navigating back and forth, the current data store is DOM storage via HTML5 data-* attributes, just for proof-of-concept purposes. You can check out the sample project at http://www.hanray.com/sites/BlendPDFWithHTML5/ to see how it works.

Sample Project Structure

Before we delve into form element parsing, let's briefly go over the sample project's structure to better describe how the entire app works. This web app is built with backbonejs + underscorejs + requirejs + AMD + jQuery. I'm using backbone as the MVC framework and make heavy use of the functional programming capabilities provided by underscorejs, which backbone also depends on, and obviously jQuery helps a lot with DOM manipulation.

Although all sample code is AMD (Asynchronous Module Definition) compatible and loaded via requireJS, the PDF.JS library files are not loaded asynchronously; they are referenced by script tags in index.html. As in any AMD compatible project, the sample project's bootstrap is defined in main.js. The following shows how requireJS is configured to load the other non-AMD compatible JS libraries:

    requirejs.config({
        paths:{
            // Major libraries
            underscore:'lib/underscore', // https://github.com/amdjs
            backbone:'lib/backbone', // https://github.com/amdjs
            text:'lib/text', // Require.js plugins
            template: '../template'
        },
        shim:{
            underscore: {
                exports:'_'
            },
            backbone: {
                deps:['underscore'],
                exports:'Backbone'
            }
        },
        waitSeconds: 90,
        urlArgs: "v=1.0.5"
    });

main.js instantiates an AppView (defined in view/app.js) and an AppRouter (router.js) as the entry point to the app. The AppRouter monitors URL changes, then determines which view to render based on the URL hash:

    define([
        'underscore',
        'backbone',
        'vm'
    ], function(_, Backbone, Vm) {
        "use strict";
        var AppRouter = Backbone.Router.extend({
            routes: {
                "pdf/:formid": "renderFedPDFForm",
                "*actions": "defaultRoute"
            },
            renderFedPDFForm: function(formid) {
                var mcView = Vm.getChildView(this.options.appView, 'MainContentView');
                mcView.render({viewName:'PDFViewer', id:'fed/' + formid});
            },
            defaultRoute: function(actions){
                if (!actions) {
                    this.navigate("#/pdf/f1040ezt");
                }
            },
            initialize : function(options){
                this.options = options;
                Backbone.history.start();
            }
        });
        return AppRouter;
    });

As the defaultRoute function shows, the app tries to render the f1040ezt.pdf file when no PDF form id is specified in the URL. The renderFedPDFForm function is the main entry to all PDF form rendering; it calls through MainContentView (defined in view/MainContentView.js) to instantiate and render the PDFView class. Here is the code for PDFView's render function:

    render:function (options) {
        var self = this;

        self.forms = [];
        self.pdfPageTemplate = _.template(VM.getTemplate('tmpl-pdfpage'));

        require(['model/PDFParser'], function(PDFParser){
            self.pdfParser = new PDFParser();
            self.pdfParser.set('formInstanceID', options.id.toLowerCase() + ".pdf");
            self.pdfParser.on('pdfDocumentReady', _renderPDF, self);
            self.pdfParser.on('pdfDocumentError', _errorPDF, self);
            self.pdfParser.getPDFDocument("data/");
        });
    }
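
Putting the router and the render function together, the path from URL hash to PDF file can be sketched as a single pure function. This is an illustration only: resolvePdfUrl is a hypothetical helper, and it assumes that getPDFDocument simply prepends the base folder it receives to formInstanceID, which the snippets above suggest but do not show.

```javascript
// Hypothetical helper, not part of the sample project.
// Assumption: getPDFDocument(baseDir) requests baseDir + formInstanceID.
function resolvePdfUrl(formid, baseDir) {
    var viewId = 'fed/' + formid;                       // as in renderFedPDFForm
    var formInstanceID = viewId.toLowerCase() + '.pdf'; // as in PDFView's render
    return {
        formInstanceID: formInstanceID,
        url: baseDir + formInstanceID
    };
}
```

Under that assumption, the default hash #/pdf/f1040ezt maps to data/fed/f1040ezt.pdf, and formInstanceID doubles as the key for the user data of that form.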

PDFView's render tells its model, pdfParser (an instance of the PDFParser class), to load the PDF file based on the URL hash and to call back the _renderPDF function when the PDF binary stream is ready. When this callback is invoked, it first tells PDF.JS to render the read-only parts on canvas, then calls back to render the interactive form elements. The following _setupForm function is invoked after each PDF page is rendered:

    var _setupForm = function(page, formIdx, callback) {
        var self = this;
        self.ffs = $('<div></div>') ;
        page.getAnnotations().then(function(fields){
            _.each(fields, _addField, self);
            var $form = self.forms[formIdx];
            $form.append(self.ffs.html());
            _handleUserData.call(self, formIdx);
            callback();
        });
    };

Notice the self.ffs variable: it lets all form elements be appended to an in-memory jQuery object, and once all elements are inserted, its content is appended to the form tag in the DOM. This in-memory operation is mainly for improving rendering performance.

The _handleUserData call is what ties user data to the form elements laid out on top of the PDF content in canvas; it assumes all items in fields are already parsed out. We need to extend PDF.JS to support all types of form elements in addition to text inputs and check boxes. Let's discuss how those elements are parsed before talking about user interactions for data exchange.

Extending PDF.JS for Form Elements

Based on Adobe's PDF Reference document, interactive form elements are defined as Widget Annotations in section 8.6, page 640. PDF.JS implements getAnnotations in the core.js file. The first thing I changed there was to remove the code that joins item names with "." and use the "T" value in the annotation dictionary instead:

    item.fullName = stringToPDFString(getInheritableProperty(annotation,'T') || '');

Here fullName will serve as the form element's name attribute in the HTML tag, which is critical for user data integration (we'll discuss it later). Furthermore, based on the Widget Annotation type, we need to parse out additional attributes that are not currently supported in PDF.JS. My extensions start at line 411 in core.js:

    //MQZ.Sep.19.2012: adding field value
      if (item.fieldType == 'Btn') { //PDF Spec p.675
          if (item.flags & 32768) {
              setupRadioButton(annotation, item);
          }
          else if (item.flags & 65536) {
              setupPushButton(annotation, item);
          }
          else {
              setupCheckBox(annotation, item);
          }
      }
      else if (item.fieldType == 'Ch') {
          setupDropDown(annotation, item);
      }

The above 4 setup... functions are defined as closure functions inside getAnnotations to keep them private. Let's look at them one by one.
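
The magic numbers 32768 and 65536 in the dispatch code are bit masks for the field flags: 32768 is bit position 16 (radio) and 65536 is bit position 17 (push button). As a sketch of the same dispatch with the masks named, here is a standalone version; classifyField and the constant names are hypothetical and not part of PDF.JS, and the 'Tx' branch for text fields is added only for completeness.

```javascript
// Sketch only: names are illustrative, not from PDF.JS or the sample project.
var FF_RADIO = 1 << 15;      // bit position 16, decimal 32768
var FF_PUSHBUTTON = 1 << 16; // bit position 17, decimal 65536

function classifyField(fieldType, flags) {
    if (fieldType === 'Btn') {
        if (flags & FF_RADIO) { return 'radio'; }
        if (flags & FF_PUSHBUTTON) { return 'pushbutton'; }
        return 'checkbox'; // a button with neither flag set
    }
    if (fieldType === 'Ch') { return 'dropdown'; }
    if (fieldType === 'Tx') { return 'text'; }
    return 'other';
}
```

Because the flags are tested with a bitwise AND, other bits set on the same field do not affect the result.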

setupRadioButton

Usually, radio buttons come in a group so that only one can be selected at a time. Based on the PDF spec, each radio button in a radio group is defined as a Widget Annotation, and their group name is the item.fullName we talked about earlier. The only additional information we need is the "value" associated with each radio button. Here is how we get it:

    function setupRadioButton(annotation, item) {
        //PDF Spec p.606: get appearance dictionary
        var ap = annotation.get('AP');
        //PDF Spec p.614 get normal appearance
        var nVal = ap.get('N');
        //PDF Spec p.689
        var i = 0;
        nVal.forEach(function(key, value){
            i++;
            if (i == 2) {
                item.value = key; //value if selected for the radio button
            }
        });
    }

We'll talk about how item.value and item.fullName are laid out as an HTML radio button input tag later, in the user interactions section. Let's move on to PushButton parsing.

setupPushButton

As for PushButton, we need 2 pieces of information when placing one in an interactive form: the button label and the button action. In my use case, the button action is always to navigate to a URL when clicked, so the 2nd piece of data is the URL string that's set in the AcroForm. More details are in the code:

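    function setupPushButton(annotation, item) {
        //button label: PDF Spec p.640
        var mk = annotation.get('MK');
        //MK may be absent, so guard before reading the caption
        item.value = (mk && mk.get('CA')) || '';

        //button action: url when mouse up: PDF Spec:p.642
        item.FL = "";
        var action = annotation.get('A');
        if (action) {
            //S names the action type; the entry with that name holds the target
            var sp = action.get('S');
            if (sp) {
                item.FL = action.get(sp.name);
            }
        }
    }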

setupCheckBox

The purpose of extending CheckBox parsing is to get the "value" associated with the field. Here is the code:

    function setupCheckBox(annotation, item) {
        //PDF Spec p.606: get appearance dictionary
        var ap = annotation.get('AP');
        //PDF Spec p.614 get normal appearance
        var nVal = ap.get('N');
        //PDF Spec p.689
        var i = 0;
        nVal.forEach(function(key, value){
            i++;
            if (i == 1) //value when selected
                item.value = key;
        });
    }
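
Both setupRadioButton and setupCheckBox pick the "on" value by its position among the entries of the normal appearance dictionary. An alternative that does not depend on entry order is to look for the state name that is not "Off", since the PDF reference uses Off as the name of the unchecked appearance state. The following is a sketch under that assumption; findOnState is a hypothetical helper, and the state names are assumed to have been collected into a plain array first, for example with nVal.forEach(function(key) { names.push(key); }).

```javascript
// Hypothetical helper, not part of PDF.JS or the sample project.
// stateNames: the keys of the normal appearance (N) dictionary.
function findOnState(stateNames) {
    for (var i = 0; i < stateNames.length; i++) {
        if (stateNames[i] !== 'Off') {
            return stateNames[i]; // the name used when the button is selected
        }
    }
    return null; // no "on" appearance defined
}
```

With this approach the same helper would serve both radio buttons and check boxes, at the cost of one extra pass to collect the keys.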

setupDropDown

The dropdown or combobox element has a list of items; each item's label is shown to the user, and the corresponding "value" is the actual data representation of the selection. PDF stores this information in one dictionary entry as a string array, which makes dropdown parsing the simplest of all:

    function setupDropDown(annotation, item) {
        //PDF Spec p.688
        item.value = annotation.get('Opt') || [];
    }
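
Per the PDF reference, each entry of the Opt array is either a single text string or a two-element array holding the export value and the display text. A sketch of turning such an array into option tags might look like the following. The function names are hypothetical, and the entries are assumed to be already decoded into plain JavaScript strings and arrays, which may need an extra step on real PDF.JS objects.

```javascript
// Sketch only: helper names are illustrative, not from the sample project.
function escapeHtml(text) {
    return String(text).replace(/&/g, '&amp;').replace(/</g, '&lt;')
        .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}

// Normalizes Opt entries to {value, label} pairs.
function normalizeOptions(opt) {
    var result = [];
    for (var i = 0; i < opt.length; i++) {
        var entry = opt[i];
        if (Object.prototype.toString.call(entry) === '[object Array]') {
            // [export value, display text]
            result.push({ value: entry[0], label: entry[1] });
        } else {
            // a single string serves as both value and label
            result.push({ value: entry, label: entry });
        }
    }
    return result;
}

// Builds the inner HTML of a <select> from an Opt array.
function renderOptions(opt) {
    var options = normalizeOptions(opt);
    var html = '';
    for (var i = 0; i < options.length; i++) {
        html += '<option value="' + escapeHtml(options[i].value) + '">' +
                escapeHtml(options[i].label) + '</option>';
    }
    return html;
}
```

Escaping matters here because option text comes straight from the PDF file and should not be trusted as HTML.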

With the help of PDF.JS, we can render the PDF's read-only parts in canvas, and we have also extended it to parse out form element information. Let's see how we use this information to lay out HTML forms on top of canvas.

Interactive Form Layout

As we discussed in the Background section, the way to lay out form elements is to make sure the form tags have a higher z-index than the canvas (where the PDF content is drawn), then absolutely position each element based on the coordinates we got from PDF.JS. Specifically, each PDF page is inserted into the DOM based on an underscoreJS HTML template (defined in template/template.html):

    <script id='tmpl-formviewer' type='text/template'>
        <canvas id="formViewer" width="<%= width %>px" height="<%= height %>px"></canvas>
        <form class="formFields" width="<%= width %>px" height="<%= height %>px"></form>
    </script>

And the following CSS rules will make sure z-index and positioning are right:

    .formFields input, .formFields button, .formFields select, .formFields div { position: absolute; }
    .pdfpage { position: relative; top: 0; left: 0; border: solid 1px black; margin: 0; }
    .pdfpage > canvas { position: absolute; top: 0; left: 0; background-color: #F4F3EA; z-index: 0; }
    .pdfpage > form { position: relative; z-index: 1; top: 0; left: 0; }

The above CSS rules are defined in css/layout.css, which is dynamically loaded by sohaBase.js when AppView loads. sohaBase.js is an AMD compatible version of the basic common functionality in Service Oriented HTML Application; it honors the version number in the requireJS configuration to make sure the updated CSS file is used instead of the cached one when the version string changes.

Now that we have the HTML container tags, CSS rules and field item data, the next step is simply to take the concrete field data, apply it to the corresponding input/button/select templates (again, defined in template.html), then insert the resulting HTML into the form tags. Let's take the checkbox as an example to see how simple it is.

The checkbox tag shares the same HTML input template with text inputs, radio buttons and push buttons, since their only difference is the value of the type attribute. The sample project only handles 4 different input types, but it can be extended to support more HTML5 input types with the same code and template, like date, email, url, password, range, search, time, etc. Here is the definition of the input template:

    <script id='tmpl-inputbutton' type='text/template'>
        <input type="<%=type%>" name="<%=id%>" tabindex="<%=tabindex%>" value="<%=value%>" style="left:<%=x%>px;top:<%=y%>px;">
    </script>

For each individual CheckBox field item (we got the item object by extending PDF.JS, see the setupCheckBox code sample above), the view model is generated by invoking getCheckBoxData in model/PDFParser.js before it's applied to the template. The getCheckBoxData function extends the common field view model data from _getFieldBaseData and getFieldPosition:

    var getFieldPosition = function(field) {
        var viewPort = this.get("viewport");
        var fieldRect = viewPort.convertToViewportRectangle(field.rect);
        var rect = Util.normalizeRect(fieldRect);
        return {
            x: Math.floor(rect[0]),
            y: Math.floor(rect[1]),
            width: Math.floor(rect[2] - rect[0]),
            height: Math.floor(rect[3] - rect[1])
        };
    };

    var _getFieldBaseData = function(field) {
        return _.extend({
            id: field.fullName,
            tabindex: _tabIndex++
        }, getFieldPosition.call(this, field));
    };

    getCheckBoxData: function(field) {
        return _.extend({
            type: "checkbox",
            value: field.value
        }, _getFieldBaseData.call(this, field));
    }
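
The viewport conversion in getFieldPosition is needed because PDF coordinates grow upward from the bottom-left corner of the page, while CSS positions grow downward from the top-left. The following is a simplified stand-in for illustration, not the PDF.JS implementation: pdfRectToCss is a hypothetical function that assumes an unrotated page whose origin is at (0, 0), with pageHeight in PDF units and scale as the zoom factor.

```javascript
// Simplified sketch; the real work is done by the PDF.JS viewport.
// rect: [x1, y1, x2, y2] in PDF units, in either corner order.
function pdfRectToCss(rect, pageHeight, scale) {
    // normalize so that (x1, y1) is bottom-left and (x2, y2) is top-right
    var x1 = Math.min(rect[0], rect[2]);
    var x2 = Math.max(rect[0], rect[2]);
    var y1 = Math.min(rect[1], rect[3]);
    var y2 = Math.max(rect[1], rect[3]);
    return {
        x: Math.floor(x1 * scale),
        // flip the y axis: the top edge is measured from the top of the page
        y: Math.floor((pageHeight - y2) * scale),
        width: Math.floor((x2 - x1) * scale),
        height: Math.floor((y2 - y1) * scale)
    };
}
```

For example, on a 792 unit tall page at scale 1, the rectangle [100, 700, 200, 720] becomes a box 100 wide and 20 high, placed 100 pixels from the left and 72 pixels from the top.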

When getCheckBoxData returns, _addCheckBox in view/PDFViewer.js applies the returned data to the tmpl-inputbutton template, then inserts the resulting HTML into the in-memory jQuery object:

    var _addCheckBox = function(field) {
        var cbData = this.pdfParser.getCheckBoxData(field);
        this.ffs.append(this.inputButtonTemplate(cbData));
    };

All other form element types (text inputs, radio/push buttons and dropdowns) work in the same way: data is converted from the extended PDF.JS item object to a view model, the template is applied to the view model, and the end result is inserted into the in-memory jQuery object. Once every item completes the same process, we get an interactive form laid out on top of the PDF content drawn in canvas.
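
To make the view-model-to-HTML step concrete, here is a sketch of what applying the tmpl-inputbutton template amounts to, written as a plain function so that it does not depend on underscore. renderInput and escapeAttr are hypothetical names, and the field name and coordinates in the usage note are made up for illustration.

```javascript
// Sketch only: mirrors the tmpl-inputbutton template shown earlier.
function escapeAttr(text) {
    return String(text).replace(/&/g, '&amp;').replace(/</g, '&lt;')
        .replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}

// vm: a view model with type, id, tabindex, value, x and y properties.
function renderInput(vm) {
    return '<input type="' + escapeAttr(vm.type) + '"' +
        ' name="' + escapeAttr(vm.id) + '"' +
        ' tabindex="' + escapeAttr(vm.tabindex) + '"' +
        ' value="' + escapeAttr(vm.value) + '"' +
        ' style="left:' + Number(vm.x) + 'px;top:' + Number(vm.y) + 'px;">';
}
```

Calling renderInput({type: 'checkbox', id: 'c1_01', tabindex: 3, value: 'Yes', x: 120, y: 340}) yields an input tag positioned at left 120px, top 340px. Radio buttons of one group would simply share the same id (and therefore the same name attribute) while carrying different values.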

Now that we have the PDF's read-only parts rendered in canvas by PDF.JS and the interactive form parsed and laid out, the next step is to bind data to the form.

Data Binding to Forms

Since we are building a PDF form based web application, our goal is not just viewing PDF; we need to enable interactivity between forms and users for data exchange. In a real-world use case, the end user needs to be authenticated and then connected to a database to retrieve user-specific data. For demonstration purposes, I'll omit this step and just prove that data binding is viable and reliable in this approach by using DOM storage as the data store. You can simply replace the DOM data store with Ajax calls for a fully integrated web application.

Form data binding involves two aspects. One is to wire up the form elements' change events in a form-agnostic way and save the user data to the data store when the user navigates away. The other is that when a form layout completes, the data store needs to be checked: if user data is available for the currently loaded form, we grab it and populate it back into the right fields.

The above approach follows the "separation of concerns" principle, not only separating logic (user data processing, usually on the server) from presentation (client side only), but also separating template data (the PDF file) from user data within the presentation layer. Since the template data (those PDF forms) is public and the same for all users, while user data is user-specific and secure, this further separation allows content and logic to be developed in parallel, with the web app gluing every piece together in a loosely coupled way.

Event Binding

Let's take a look at the first part of form data binding: within view/PDFViewer.js, the input field event handlers are wired up after the in-memory jQuery object's HTML content is inserted into the DOM:

    var _handleUserData = function(formIdx) {
        var self = this;
        var formData = self.pdfParser.getFormUserData(formIdx);
        var $form = self.forms[formIdx];

        $form.find('input').bind('change', function(evt){
            if (this.type == 'checkbox')
                formData[this.name] = this.checked;
            else {//if (!this.type || this.type == 'text' || this.type == 'radio')
                formData[this.name] = this.value;
            }
        });

        $form.find('select').bind('change', function(evt){
            formData[this.name] = this.value;
        });

        _fillUserData.call(self, $form, formData);
    };

It finds all input and select tags within the current form and puts each changed field's value into a name-value pair JavaScript object, where "name" is the field's ID. This name-value pair object is managed by model/PDFParser.js. The following code in model/PDFParser.js shows how this object is initialized and saved to DOM storage:

    //save user data to DOM storage
    updateUserData: function() {
        var uD = this.get('userData');
        $('body').data(this.get('formInstanceID'), uD);
    },
    //to make sure each pdf page has a userData object associated
    initUserData: function(pageCount) {
        var uD = [];
        for (var i = 0; i < pageCount; i++) {
            uD.push({});
        }
        this.set({userData: uD}, {silent: true});
    }

When the user navigates to a different form, updateUserData is invoked from view/PDFViewer.js to save all collected user data to DOM storage under the current form ID. The form ID based storage key allows us to retrieve the user data when that form loads again.
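
Since the data store is meant to be replaceable, it can help to think of it as a small interface with two operations keyed by form ID. The following sketch shows that idea with an in-memory backing; createMemoryStore and its save/load methods are hypothetical names and not part of the sample project. An Ajax-backed store would expose the same two operations, although in practice it would need to be asynchronous.

```javascript
// Sketch only: a swappable store keyed by formInstanceID.
function createMemoryStore() {
    var saved = {};
    return {
        // userData: an array with one name-value object per PDF page
        save: function (formInstanceID, userData) {
            // store a serialized copy so later edits don't leak into the store
            saved[formInstanceID] = JSON.stringify(userData);
        },
        // returns saved data, or one empty object per page if none exists
        load: function (formInstanceID, pageCount) {
            if (Object.prototype.hasOwnProperty.call(saved, formInstanceID)) {
                return JSON.parse(saved[formInstanceID]);
            }
            var fresh = [];
            for (var i = 0; i < pageCount; i++) {
                fresh.push({});
            }
            return fresh;
        }
    };
}
```

The load fallback plays the same role as initUserData above: every page always has an object to collect values into, whether or not anything was saved before.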

Data Binding

When a form finishes rendering and layout, view/PDFViewer.js calls into its model in model/PDFParser.js to read the previously saved user data for the current form:

    //read user data from DOM storage by formID (set silently, so no change:userData event is raised)
    getFormUserData: function(formIDx) {
        var uD = $('body').data(this.get('formInstanceID'));
        if (uD) {
            this.set({userData: uD}, {silent: true});
        }
        else
            uD = this.get('userData');
        return uD[formIDx];
    }

Once the user data is returned, view/PDFViewer.js goes on to populate the form based on field IDs:

    var _fillUserData = function($form, formData) {
        var self = this;
        $form.find('input').each(function(index, inputEle){
            if (_.has(formData, this.name)) {
                if (this.type == 'checkbox') {
                    this.checked = formData[this.name];
                }
                else if (this.type == 'radio') {
                    this.checked = this.value === formData[this.name];
                }
                else
                    this.value = formData[this.name];
            }
        });

        $form.find('select').each(function(i, s){
            if (_.has(formData, this.name)) {
                this.value = formData[this.name];
            }
        });
    };
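
The save and restore rules above mirror each other, which is easiest to see when they are written as two pure functions over plain field objects. This is a sketch for illustration only: recordChange and applyValue are hypothetical names, and a field here is any object with type, name, value and checked properties, as a DOM input element has.

```javascript
// Sketch only: the same rules as _handleUserData and _fillUserData,
// without the DOM and jQuery.
function recordChange(formData, field) {
    // check boxes store their checked state, everything else its value
    formData[field.name] = (field.type === 'checkbox') ? field.checked : field.value;
}

function applyValue(formData, field) {
    if (!Object.prototype.hasOwnProperty.call(formData, field.name)) {
        return; // nothing saved for this field
    }
    var stored = formData[field.name];
    if (field.type === 'checkbox') {
        field.checked = stored;
    } else if (field.type === 'radio') {
        // only the radio whose value matches the stored one gets checked
        field.checked = (field.value === stored);
    } else {
        field.value = stored;
    }
}
```

For a radio group, recordChange is called with the newly selected button, so the group name maps to that button's value; applyValue then checks exactly one button of the group when the form is rendered again.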

Up to now, we have a fully functional, data-bound, PDF-based interactive form with navigation, running within the browser without a plugin. This PDF-based interactive form rendering, layout and data binding blends PDF with HTML5 in an easy, safe (browser sandbox) and generic way. You can search for some PDF AcroForm files and drop them into the data/fed folder to see how quickly new forms can be integrated into the application without code changes.

Wrap Up

The evolution of the web and technology always renders some techniques obsolete while providing new options that enable a great customer experience in a more efficient and secure way. Beyond the excitement about new capabilities in browsers and HTML5, there are some notes and issues I'd like to point out:

Although HTML5 is evolving fast, we always need to check the W3C and WHATWG HTML5 specs before starting to rely on new features in our projects. For example, binary data processing is not available in the current version of IE (IE9), while IE10 plans to support it. The same caution applies to other features: canvas is not available in IE8/IE7, Web Worker is not in IE either, etc. The good news is that those features are gradually being added, and where they are not, there are always alternative options, like Chrome Frame or HTML5 polyfills.

As for the PDF.JS library itself, it's powerful and well written, but its size is fairly big (I ran the build script once, and the combined output is still 800+k) and it needs work to become fully AMD compatible. Besides not supporting interactive forms, it renders all text content in canvas, which makes the text unselectable; since canvas is pixel based, this might be an issue when accessibility is required. I'd love to see PDF.JS start to provide configurable options to render in SVG or pure HTML text for selectability, readability and accessibility.

The sample project that comes with this article is really a first attempt to parse, render, lay out and data bind interactive forms. I only focused on AcroForm and haven't looked into XFA forms yet. Even for AcroForm, not all element attributes are fully parsed and utilized, such as input constraints (max length, digits only, etc.), validations and the different appearance settings in PDF. More work will follow later on.

With all that said, it's still exciting to see the browser's growing capabilities: processing binary data via typed arrays and XHR 2, running certain logic asynchronously in Web Workers, drawing graphics and text in canvas, and blending them seamlessly and interactively with other HTML content. At the current pace of development, HTML5 just gets better and more powerful every day.