RESTful Web Service for PDF2JSON

6 Apr 2013
Running the pdf2json module in a RESTful web service built with restify and node.js

Introduction

pdf2json extends pdf.js with support for interactive form elements and runs as a node.js module. It takes a PDF file as input, parses it, and converts it to in-memory objects in node.js. The command-line utility included in the pdf2json module takes the in-memory parsing results and writes them out as a JSON file. This article presents a different runtime context: running pdf2json in a RESTful web service.

When pdf2json runs behind a web service, the PDF file can be located and parsed on demand on the server side. The client application, whether a web client, a desktop app, or a mobile app, receives the PDF content in JSON format rather than as a PDF binary, so the client can focus on form presentation and data binding/integration without having to load and parse PDF binaries. This architecture separates data parsing from presentation, and it also separates the data template (the service JSON payload) from user data (what the user enters into the form). As a result, the session data and cache size on the app server can be reduced significantly, and a stateless, session-less service becomes possible for higher scalability and availability.

This project is open source on GitHub under the module name p2jsvc. It is built with pdf2json v0.1.23, restify v2.3.5 and node.js v0.10.1.

Background

To run pdf2json in a REST web service, p2jsvc leverages the node.js built-in web server and uses restify as the REST API framework. Although restify borrows heavily from express, it gives full control over HTTP interactions with a strictly RESTful service API. The service endpoints of p2jsvc are very simple:

	HTTP GET:   http://[host_name]:8001/p2jsvc/[folderName]/[pdfId]
	HTTP POST:  http://[host_name]:8001/p2jsvc
		content-type: application/json
		body: {"folderName":"", "pdfId":""} 
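
As a rough illustration of the two request shapes listed above, here is a minimal sketch. The helper name buildP2jRequest and the host value are assumptions made for this example; only the port, the paths and the body fields come from the endpoint description.

```javascript
'use strict';

// Hypothetical helper (not part of p2jsvc): describes the GET and POST
// request shapes of the endpoint. The host is an assumption; 8001 is the
// port shown in the endpoint description.
function buildP2jRequest(method, host, folderName, pdfId) {
    var base = 'http://' + host + ':8001/p2jsvc';
    if (method === 'GET') {
        return {
            method: 'GET',
            url: base + '/' + encodeURIComponent(folderName) + '/' + encodeURIComponent(pdfId)
        };
    }
    return {
        method: 'POST',
        url: base,
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ folderName: folderName, pdfId: pdfId })
    };
}

var getReq = buildP2jRequest('GET', 'localhost', 'data', 'xfa_1040ez');
var postReq = buildP2jRequest('POST', 'localhost', 'data', 'xfa_1040ez');
console.log(getReq.url);
console.log(postReq.body);
```

Both calls address the same PDF resource; the GET form carries the folder and id in the path, while the POST form carries them in the JSON body.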

The JSON format of the response body is well documented, so I won't repeat it here. Let's dive in to see how the service is built.

Context and Response Class

Before looking at the actual service code, let's briefly look at two helper classes. The first one is the response class:

 
'use strict';
var SvcResponse = (function () {
    // private static
    var _svcStatusMsg = {200: "OK", 400: "Bad Request", 404: "Not Found"};

    // constructor
    var cls = function (code, message, fieldName, fieldValue) {
        // public, this instance copies
        this.status = {
            code: code,
            message: message || _svcStatusMsg[code],

            fieldName: fieldName,
            fieldValue: fieldValue
        };
    };

    cls.prototype.setStatus = function(code, message, fieldName, fieldValue) {
        this.status.code = code;
        this.status.message = message || _svcStatusMsg[code];
        this.status.fieldName = fieldName;
        this.status.fieldValue = fieldValue;
    };

    cls.prototype.destroy = function() {
        this.status = null;
    };

    return cls;
})();

module.exports = SvcResponse;

The actual response class will derive from it, so that status is always part of the response payload in both success and error cases. The client always checks status.code before trying to read other properties. In case of an application error (as opposed to a network exception), the client code can construct a user-friendly message based on status.message, status.fieldName and status.fieldValue. One example is a failed user login: the HTTP status from XHR is 200, while in the response body status.code is 401, so the client shows a "try again" message.
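
The client-side check described above might look roughly like the following sketch. The function name readPayload and the sample payloads are assumptions for illustration; formImage is the property that the service attaches to a successful response later in this article.

```javascript
'use strict';

// Hypothetical client-side helper: inspect status.code before touching
// any other property of the parsed response body.
function readPayload(body) {
    var status = (body && body.status) || { code: 500, message: 'Malformed response' };
    if (status.code !== 200) {
        // Build a user-friendly message from the status fields
        var detail = status.fieldName
            ? ' (' + status.fieldName + ': ' + status.fieldValue + ')'
            : '';
        return { ok: false, userMessage: status.message + detail };
    }
    return { ok: true, formImage: body.formImage };
}

// Sample payloads, invented for this example
var failure = readPayload({
    status: { code: 404, message: 'Not Found', fieldName: 'pdfId', fieldValue: 'missing_form' }
});
var success = readPayload({ status: { code: 200, message: 'OK' }, formImage: {} });

console.log(failure.userMessage);
console.log(success.ok);
```

Because the application status travels in the body, this check works the same way regardless of the HTTP status the transport reports.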

The second helper class is the context class. It wraps the request and response objects and the next function from restify:

 
'use strict';
var SvcContext = (function () {
    // constructor
    var cls = function (req, res, next) {
        // public, this instance copies
        this.req = req;
        this.res = res;
        this.next = next;
    };

    cls.prototype.completeResponse = function(jsObj) {
        this.res.send(200, jsObj);
        this.next();
    };

    cls.prototype.destroy = function() {
        this.req = null;
        this.res = null;
        this.next = null;
    };

    return cls;
})();

module.exports = SvcContext;

Our web service layer sits on top of pdf2json, while pdf2json has, and should have, no knowledge of web service requests and responses, so communication between the two layers relies on node.js events for asynchronous operations. We instantiate a new instance of pdf2json and of SvcContext for each request, and the new SvcContext instance is injected into the pdf2json instance. When the parsing-complete event is raised, the event handler in the service layer uses the SvcContext instance from the event data to complete the response in node.js' non-blocking, asynchronous fashion, so the service can continue to serve other requests while waiting for events from earlier ones.

With SvcResponse and SvcContext, writing a REST service for pdf2json becomes a simple and fun task.
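
The pattern can be sketched with node.js' built-in EventEmitter alone. FakeParser, the 'dataReady' event name and the requestId field are stand-ins invented for this example; they are not pdf2json's actual API. The point is only that the parser carries the injected context without ever looking inside it.

```javascript
'use strict';
var EventEmitter = require('events').EventEmitter;
var nodeUtil = require('util');

// Stand-in for the parser layer: it stores the injected context as an
// opaque value and reports results through an event.
function FakeParser(context) {
    EventEmitter.call(this);
    this.context = context;
}
nodeUtil.inherits(FakeParser, EventEmitter);

FakeParser.prototype.load = function (path) {
    var self = this;
    // setImmediate defers the work, mimicking an asynchronous parse
    setImmediate(function () {
        self.emit('dataReady', { context: self.context, data: { path: path } });
    });
};

// Service layer: one parser and one context per request
var completed = [];
function handleRequest(requestId, path) {
    var parser = new FakeParser({ requestId: requestId });
    parser.on('dataReady', function (evtData) {
        // The handler gets its own request's context back from the event data
        completed.push(evtData.context.requestId + ':' + evtData.data.path);
    });
    parser.load(path);
}

handleRequest('req-1', 'a.pdf');
handleRequest('req-2', 'b.pdf');
console.log('requests accepted, completed so far: ' + completed.length);
```

Both calls to handleRequest return before either parse finishes, which is the non-blocking behaviour the service relies on.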

Create and Configure the Server

restify does the heavy lifting to create and configure the server:

var server = restify.createServer({
	name: self.get_name(),
	version: self.get_version()
});

server.use(restify.acceptParser(server.acceptable));
server.use(restify.authorizationParser());
server.use(restify.dateParser());
server.use(restify.queryParser());
server.use(restify.bodyParser());
server.use(restify.jsonp());
server.use(restify.gzipResponse());
server.pre(restify.pre.userAgentConnection());

Several restify built-in handlers are configured to handle requests, including:

  • Accept header parsing
  • Authorization header parsing
  • Date header parsing
  • JSONP support
  • Gzip Response
  • Query string parsing
  • Body parsing (JSON/URL-encoded/multipart form)

Since I'm using curl to test the service APIs, pre.userAgentConnection() is configured to check whether the user agent is curl. If it is, it sets the Connection header to "close" and removes the "Content-Length" header. Without it, curl uses Connection: keep-alive by default.

Route the Request and Start the Server

As discussed earlier, we'd like to support both GET and POST for a PDF resource. We also want to instantiate a new SvcContext for each request and then call pdf2json to parse the PDF asynchronously, which keeps our server unblocked while earlier requests are in progress:

	server.get('/p2jsvc/:folderName/:pdfId', function(req, res, next) {
		_gfilter(new SvcContext(req, res, next));
	});

	server.post('/p2jsvc', function(req, res, next) {
		_gfilter(new SvcContext(req, res, next));
	});

	server.get('/p2jsvc/status', function(req, res, next) {
		var jsObj = new SvcResponse(200, "OK", server.name, server.version);
		res.send(200, jsObj);
		return next();
	});

	server.listen(8001, function() {
		nodeUtil.log(nodeUtil.format('%s listening at %s', server.name, server.url));
	});

Each GET or POST request is routed to the same _gfilter function with a new instance of SvcContext. The '/p2jsvc/status' route simply returns an HTTP 200 response without parsing a PDF; it can be used for health check calls from service monitoring tools.

Process the Request

Every PDF parsing request is processed with a new instance of pdf2json, whose class name is PDFParser (referenced as PFParser in the code below):

 
	var _gfilter = function(svcContext) {
		var req = svcContext.req;
		var folderName = req.params.folderName;
		var pdfId = req.params.pdfId;
		nodeUtil.log(self.get_name() + " received request:" + req.method + ":" + folderName + "/" + pdfId);

		var pdfParser = new PFParser(svcContext);

		_customizeHeaders(svcContext.res);

		pdfParser.on("pdfParser_dataReady", _.bind(_onPFBinDataReady, self));
		pdfParser.on("pdfParser_dataError", _.bind(_onPFBinDataError, self));

		pdfParser.loadPDF(_pdfPathBase + folderName + "/" + pdfId + ".pdf");
	};

When a new instance of PDFParser is created, the svcContext instance is passed into it. When the "pdfParser_dataReady" or "pdfParser_dataError" event is raised, the event handler can access the original request and response objects to complete the response. This per-request instance, context and event-based setup is essential to the throughput and performance of our service.
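
Since the file path is assembled directly from request parameters, it may be worth checking them before calling loadPDF. The following is only a sketch of one possible check, not something the p2jsvc code is shown to do; resolvePdfPath, isSafeName and the './' base path are assumptions for this example.

```javascript
'use strict';

// Hypothetical check: accept only plain names made of letters, digits,
// underscores and hyphens, so values such as ".." or "a/b" are rejected.
function isSafeName(value) {
    return typeof value === 'string' && /^[A-Za-z0-9_-]+$/.test(value);
}

// Hypothetical helper: builds the path the same way the handler does,
// or returns null when either parameter fails the check.
function resolvePdfPath(pdfPathBase, folderName, pdfId) {
    if (!isSafeName(folderName) || !isSafeName(pdfId)) {
        return null;
    }
    return pdfPathBase + folderName + '/' + pdfId + '.pdf';
}

var good = resolvePdfPath('./', 'data', 'xfa_1040ez');
var bad = resolvePdfPath('./', '..', 'etc/passwd');
console.log(good);
console.log(bad);
```

A handler using such a helper could complete the response with a 400 status in the payload whenever the result is null, instead of handing the path to the parser.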

Complete the Response

The response is completed when either the "pdfParser_dataReady" or the "pdfParser_dataError" event is raised from the pdf2json instance. This is done via a new instance of SvcResponse:

 
    var _onPFBinDataReady = function(evtData) {
        var resData = new SvcResponse(200, "OK", evtData.pdfFilePath, "FormImage JSON");
        resData.formImage = evtData.data;
        evtData.context.completeResponse(resData);
    };

    var _onPFBinDataError = function(evtData){
        nodeUtil.log(this.get_name() + " 500 Error: " +  JSON.stringify(evtData.data));
        evtData.context.completeResponse(new SvcResponse(500, JSON.stringify(evtData.data)));
    };

If parsing is successful, the PDF parsing result is serialized to JSON when context.completeResponse(resData) is invoked. The service layer code handles all service-related tasks, including the server, request and response, invoking PDFParser asynchronously, and serializing the parsing result to JSON. The pdf2json instance works in a context-agnostic way, so it can be reused either in a web service project or as a command line tool.
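
To see the two handlers in isolation, the sketch below drives them with a stub response object instead of a real restify response. The trimmed-down SvcResponse and SvcContext, the stub, and the sample event data are all simplifications invented for this example; the handler bodies mirror the ones above, minus logging.

```javascript
'use strict';

// Minimal stand-ins for the article's helper classes
function SvcResponse(code, message, fieldName, fieldValue) {
    this.status = { code: code, message: message, fieldName: fieldName, fieldValue: fieldValue };
}
function SvcContext(req, res, next) {
    this.req = req;
    this.res = res;
    this.next = next;
}
SvcContext.prototype.completeResponse = function (jsObj) {
    this.res.send(200, jsObj);
    this.next();
};

// Stub response object that records what would be sent to the client
var sent = [];
var stubRes = {
    send: function (httpCode, body) { sent.push({ httpCode: httpCode, body: body }); }
};
function noop() {}

function onDataReady(evtData) {
    var resData = new SvcResponse(200, 'OK', evtData.pdfFilePath, 'FormImage JSON');
    resData.formImage = evtData.data;
    evtData.context.completeResponse(resData);
}
function onDataError(evtData) {
    evtData.context.completeResponse(new SvcResponse(500, JSON.stringify(evtData.data)));
}

// Sample event data, invented for this example
onDataReady({ context: new SvcContext({}, stubRes, noop), pdfFilePath: 'data/demo.pdf', data: { demo: true } });
onDataError({ context: new SvcContext({}, stubRes, noop), data: 'parse failed' });

console.log(sent[0].httpCode, sent[0].body.status.code);
console.log(sent[1].httpCode, sent[1].body.status.code);
```

Note that both paths go through completeResponse, so the HTTP status is 200 in each case and the outcome is carried by status.code in the payload, consistent with the response class design described earlier.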

Cross Domain Support

In my project, the web server and app server run on separate VMs with different host names and sub-domains. p2jsvc is deployed to the app server, while my Backbone-based web client is deployed to the web server and communicates with the app server through Ajax. To support this cross-domain (or cross-sub-domain) server configuration, Apache Proxy is configured in httpd.conf on the web server:

<IfModule proxy_module>
	ProxyRequests Off

	ProxyPass /p2jsvc/ http://app.server.host.ip:8001/ retry=0
	ProxyPassReverse /p2jsvc/ http://app.server.host.ip:8001/ retry=0
</IfModule>

Additionally, p2jsvc supports JSONP (in the server configuration) and Cross-Origin Resource Sharing (CORS):

    var _customizeHeaders = function(res) {
        // These headers comply with CORS and allow us to serve our response to any origin
        res.header("Access-Control-Allow-Origin", "*");
        res.header("Access-Control-Allow-Headers", "X-Requested-With");
        res.header("Cache-Control", "no-cache, must-revalidate");
    };

Run and Test the Service

Here is a quick command reference for running and testing the service. For installation:

    git clone https://github.com/modesty/p2jsvc
    cd p2jsvc
    npm install

To start the server for development:

    cd p2jsvc
    node index

If the server starts successfully, you should see this output in the console:

    [time_stamp] - PDFFORMServer1 listening at http://0.0.0.0:8001

When starting the server on a production machine, I use forever to run it as a background process:

    cd p2jsvc
    forever start index.js	

To run the test with HTTP GET:

curl -isv http://0.0.0.0:8001/p2jsvc/data/xfa_1040ez
curl -isv http://0.0.0.0:8001/p2jsvc/data/xfa_1040a
curl -isv http://0.0.0.0:8001/p2jsvc/data/xfa_1040

The xfa_xxx.pdf files are test PDFs; you can replace them with your own under the data directory. Similarly, you can test with POST:

curl -isv -H "Content-Type: application/json" -X POST -d '{"folderName":"data", "pdfId":"xfa_1040ez"}' http://0.0.0.0:8001/p2jsvc
curl -isv -H "Content-Type: application/json" -X POST -d '{"folderName":"data", "pdfId":"xfa_1040a"}' http://0.0.0.0:8001/p2jsvc
curl -isv -H "Content-Type: application/json" -X POST -d '{"folderName":"data", "pdfId":"xfa_1040"}' http://0.0.0.0:8001/p2jsvc

Lastly, here is the curl command to check the service status:

 curl -isv http://0.0.0.0:8001/p2jsvc/status

When the service is up and running correctly, the response JSON body should be:

 {"status":{"code":200,"message":"OK","fieldName":"PDFFORMServer1"}}

The following commands each send 10 concurrent requests to parse PDFs as a concurrency benchmark test:

 ab -n 10 -c 10 http://0.0.0.0:8001/p2jsvc/data/xfa_1040ez
 ab -n 10 -c 10 http://0.0.0.0:8001/p2jsvc/data/xfa_1040a
 ab -n 10 -c 10 http://0.0.0.0:8001/p2jsvc/data/xfa_1040

Wrap Up

Exposing pdf2json through a REST interface is fairly simple yet powerful with restify. Although this article is all about running pdf2json in a RESTful web service project, its context and event-based asynchronous model is applicable to other restify-based web service projects. I hope you find it useful too.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Modesty Zhang
Technical Lead
United States
Tech Lead of large scale consumer facing software offerings, specializing in Web and Mobile application architecting and development.
 
Specialties:
Web App/ iOS / Cocoa Touch / HTML5 / CSS3 / Ajax / jQuery / jQuery Mobile / jQuery UI / Node.js / Rich JavaScript Application / RESTful Web Services / Java EE 6 / Java 7 / PHP / Ruby on Rails / and Windows / .NET / RIA / Flex / Flash / Silverlight / Software Architecting / Front End Design and Development

Last Updated 6 Apr 2013
Article Copyright 2013 by Modesty Zhang