The Great PDF - Revealed

shiraztk

20 Aug 2004

A project to create PDF files using text editor and digitally sign them.

Introduction

I have always been curious about Adobe Reader and PDF files. Have you ever tried opening a PDF file in a text editor? It's amazing! In this project I try to bring to light what is hidden inside PDF files. This simple application lets you create PDF files just as you create txt files in Notepad (hence the name pdfpad). Type your text in the editor and save it as a PDF file. Of course, you need Acrobat Reader to view the created file. You cannot open an existing PDF file in this editor; you can only create one, and once it is created, it's done. The most interesting feature of this project is the digital signature: pdfpad.exe automatically adds an invisible digital signature to every PDF file it creates, which shows the very basics of how a signature is embedded.

Due to lack of time, many of the details couldn't be included. Please bear with us.

Background

Firstly, one should know the basics of the PDF format. I recommend downloading a copy of the PDF reference manual from PDF Reference and going through it (oops, it's 1000 pages!).

Download the application demo, enter some text, and save the file as PDF. Now open the file in Notepad and read on…

If one says C++ is object oriented, I would say PDF is even more object oriented. In a PDF everything is treated as an object; every object has its own properties and can refer to other objects. This lets a viewer navigate a large PDF file (say, a 1000-page book you have just downloaded) randomly and quickly.

A PDF file is read from the end. Near the end there is a keyword called startxref; this is where everything begins. A viewer application reads this entry to get the byte offset of a table called xref (the cross-reference table). The table lists the objects used in the file along with their byte offsets within the PDF file. The format of the entries matters greatly here: each entry must be exactly 20 bytes long, including the carriage return and the line feed.
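As a rough illustration of that first step, the sketch below searches backwards for the startxref keyword and reads the number that follows it. It assumes the whole file is already in a std::string; the function name findXrefOffset is hypothetical and is not part of pdfpad.

```cpp
#include <cstdlib>
#include <string>

// Returns the byte offset written after the last "startxref" keyword,
// or -1 if the keyword or the number after it is missing.
// Hypothetical helper for illustration; not taken from pdfpad.
long findXrefOffset(const std::string& pdf)
{
    const std::string keyword = "startxref";
    std::string::size_type pos = pdf.rfind(keyword);
    if (pos == std::string::npos)
        return -1;
    pos += keyword.size();
    // Skip the end-of-line characters (and any spaces) after the keyword.
    while (pos < pdf.size() &&
           (pdf[pos] == '\r' || pdf[pos] == '\n' || pdf[pos] == ' '))
        ++pos;
    if (pos >= pdf.size() || pdf[pos] < '0' || pdf[pos] > '9')
        return -1;
    return std::strtol(pdf.c_str() + pos, nullptr, 10);
}
```

A real viewer would read only the last few hundred bytes of the file instead of loading all of it; searching from the end with rfind mirrors that idea.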

Objects are usually numbered sequentially from 0 to n (though this is not required). If you look at the xref section you will find a '0' followed by a number n. This means that the table contains n entries, starting with object 0. Take a look at one of them: in 0000000074 00000 n, 0000000074 is the byte offset, 00000 is the generation number, and n means the object is in use. Only the first entry has a nonzero generation number (65535), and it is marked f (free). Read the reference manual for more details.
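The fixed 20-byte layout is easy to get wrong, so here is a minimal sketch of formatting one entry. The helper name xrefEntry is made up for this example, and it assumes the offset fits in ten decimal digits.

```cpp
#include <cstdio>
#include <string>

// Formats one cross-reference entry: a 10-digit byte offset, a space,
// a 5-digit generation number, a space, 'n' (in use) or 'f' (free),
// then CR LF. That adds up to exactly 20 bytes.
// Hypothetical helper for illustration; not taken from pdfpad.
std::string xrefEntry(unsigned long offset, unsigned generation, bool inUse)
{
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%010lu %05u %c\r\n",
                  offset, generation, inUse ? 'n' : 'f');
    return std::string(buf);
}
```

For example, xrefEntry(74, 0, true) yields the entry discussed above, and xrefEntry(0, 65535, false) yields the special first entry.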

A PDF document can be regarded as a hierarchy of objects contained in the body section of a PDF file. At the root of the hierarchy is the document's catalog dictionary. Most of the objects in the hierarchy are dictionaries. Each page of the document is represented by a page object, which is a dictionary that includes references to the page contents and other attributes such as its thumbnail image and any annotations associated with it. The individual page objects are tied together in a structure called the page tree, which in turn is located via an indirect reference in the document catalog.

The root of a document's object hierarchy is the catalog dictionary, located via the Root entry in the trailer of the PDF file. The catalog contains references to other objects that define the document's contents, outline, article threads, named destinations and other attributes.

Now to start with, the reader reads the value of the Root entry in the trailer. This is the root of all the references that follow. The reader looks up the byte offset of the root object in the xref table and moves to it. This object is the catalog dictionary, which in turn contains many other references. In our application only the minimum entries are made, so that it is easy to understand.
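To make the lookup concrete, the sketch below pulls the indirect reference out of a trailer like the one pdfpad writes (/Root 1 0 R). The ObjRef struct and findRoot function are hypothetical names, and the parsing is deliberately naive: it assumes the trailer text is already isolated and that /Root is followed directly by "number generation R".

```cpp
#include <cstdio>
#include <string>

// An indirect reference such as "1 0 R": object number and generation.
struct ObjRef {
    int number;
    int generation;
    bool valid;
};

// Finds the /Root entry in the trailer text and parses its reference.
// Hypothetical helper for illustration; not taken from pdfpad.
ObjRef findRoot(const std::string& trailer)
{
    ObjRef ref = {0, 0, false};
    std::string::size_type pos = trailer.find("/Root");
    if (pos == std::string::npos)
        return ref;
    char marker = 0;
    // "%d" skips leading whitespace, so "/Root 1 0 R" parses as 1, 0, 'R'.
    if (std::sscanf(trailer.c_str() + pos + 5, "%d %d %c",
                    &ref.number, &ref.generation, &marker) == 3 &&
        marker == 'R')
        ref.valid = true;
    return ref;
}
```

With the object number in hand, a viewer would take that object's byte offset from the xref table and start reading the catalog dictionary there.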

Now let's see what happens to the text that we enter in the edit box. Firstly, every end of line is replaced with the PDF operators that show the current string and move to the next line (Tj and T*). Then the operators for showing the text on the page are added to the page's contents. This content is added as a stream, called a content stream. To compress the text I have used zlib, a freely downloadable library. The Flate (deflate) compression algorithm is used to compress the text; it is supported by the Adobe viewer.
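The conversion of editor text into text-showing operators can be sketched in standard C++ as follows. The function name textToOperators is hypothetical. Unlike a plain search-and-replace, this sketch also escapes backslashes and parentheses, since those characters are special inside a PDF string.

```cpp
#include <cstddef>
#include <string>

// Converts editor text into PDF text-showing operators: each line becomes
// "(line)Tj", and lines are separated by T*, which moves to the next line.
// Backslashes and parentheses are escaped with a backslash.
// Hypothetical helper for illustration; not taken from pdfpad.
std::string textToOperators(const std::string& text)
{
    std::string out = "(";
    for (std::size_t i = 0; i < text.size(); ++i) {
        char c = text[i];
        if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
            out += ")Tj T*(";   // close this string, go to the next line
            ++i;                // skip the '\n' of the CR LF pair
        } else if (c == '(' || c == ')' || c == '\\') {
            out += '\\';
            out += c;
        } else {
            out += c;
        }
    }
    out += ")Tj";
    return out;
}
```

The result still has to be wrapped between BT and ET, with a font and a text matrix set first, before it forms a valid content stream.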

The most amazing thing is the digital signature. I haven't implemented a real-life digital signature using cryptographic libraries. All I intend to show is how to add a digital signature to a PDF document. The entries here are all dummy entries. The signature can be made a real digital signature by replacing the Contents entry in the signature dictionary with the real signed hash of the document.

I won’t be covering the details of the digital signature here. I will stick to the details of the PDF. PDF has two types of digital signatures, invisible and visible. Our application uses invisible signatures. The signature can be viewed in the signature panel. The entries in the signature dictionary can be changed to put your name, time of signing, location etc., programmatically using the user's inputs. This is left to you.

When a digital signature is added to a document, Adobe Acrobat's signature handler calculates a checksum based on the content of the document at that time and embeds the checksum in the signature. When the signature is validated, the handler recalculates the checksum for that signed version of the document and compares it with the value in the signature. If the signed version has changed in any way, the signature handler detects the change and marks the signature as invalid.

You can also use the Crypto API to create the hash, sign using digital certificates, and so on, which I hope to cover in my next article. While creating the hash, the byte range must be specified correctly. The ByteRange entry is an array of integer pairs, each pair giving a starting offset and a number of bytes. It is used to exclude the Contents entry of the signature dictionary from the hash. This entry is initially filled with a temporary value so that the total file size is known when the hash is calculated. After the hash is created, the real Contents value is written. This is why the byte range must exclude the Contents entry: otherwise the hash would cover the signature value itself, and the signature would be invalidated on verification.
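A minimal sketch of the byte range arithmetic, assuming the file is held in a std::string and the position of the Contents value is already known. The names ByteRange, makeByteRange and bytesToHash are hypothetical; a real signer would feed the selected bytes to a hash function rather than copy them.

```cpp
#include <cstddef>
#include <string>

// Two (offset, length) pairs: everything before the Contents value and
// everything after it.
struct ByteRange {
    std::size_t start1, length1;
    std::size_t start2, length2;
};

// contentsStart: offset of the '<' that opens the Contents value.
// contentsEnd:   offset just past the '>' that closes it.
// Hypothetical helper for illustration; not taken from pdfpad.
ByteRange makeByteRange(std::size_t fileSize,
                        std::size_t contentsStart,
                        std::size_t contentsEnd)
{
    ByteRange r;
    r.start1 = 0;
    r.length1 = contentsStart;
    r.start2 = contentsEnd;
    r.length2 = fileSize - contentsEnd;
    return r;
}

// Collects the bytes covered by the byte range, i.e. the data to hash.
std::string bytesToHash(const std::string& file, const ByteRange& r)
{
    return file.substr(r.start1, r.length1) +
           file.substr(r.start2, r.length2);
}
```

Written into the signature dictionary, the four numbers appear in order as /ByteRange [start1 length1 start2 length2].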

Once you get a grip on the reference manual you can modify the code below to add more pages, add drawings, and so on.

Using the code

The main function that creates the PDF files is added to the ***doc class; it's called CreatePdf(CString text). I prefer manipulating CString objects to using char buffers. You can modify this to make it more efficient. The code is commented to explain the details.

This is the part of the doc class that should be modified to write the files in PDF format:

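/*Derive the view class from CEditView and modify the
Serialize function as below.*/
void CPdfPadDoc::Serialize(CArchive& ar)
{
        if(ar.IsStoring())
        {
           CString strFull;
           CEdit &edit =((CEditView*)m_viewList.GetHead())->GetEditCtrl();
           edit.GetWindowText(strFull);
           /*The PDF may contain zero bytes (compressed stream), so write
           it by length; WriteString would stop at the first zero byte.
           An ANSI (non-Unicode) build is assumed here.*/
           CString pdf=CreatePdf(strFull);
           ar.Write((LPCTSTR)pdf,pdf.GetLength());
           ar.Flush();
        }
}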

The main function in the doc class is CreatePdf(). It takes the text and returns the formatted PDF to be written to the file:

CString CPdfPadDoc::CreatePdf(CString text)
{
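  /*Escape the characters that are special inside a PDF string
  (backslash and parentheses) so the text cannot break the syntax.*/
  text.Replace("\\","\\\\");
  text.Replace("(","\\(");
  text.Replace(")","\\)");
  /*Replace every end of line with the PDF operators that show the
  current string and move to the next line.*/
  text.Replace("\r\n",")Tj T*(");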
  /*accordingly to the inputs to the editor*/
  /*it will hold the byte offsets of the various 
  objects within the pdf file*/
  int objArray[10];
  int fontSize=1;
  /*start of the first line. This is the left margin*/
  int hPos=50;
  /*The top margin*/
  int vPos=750;
  CString fileBuff;
  /*This is version 1.5 of the pdf reference manual*/
  /*important for correct build*/
  CString header="%PDF-1.5\r%\xC3\xBE\r\n";
  fileBuff=header;
  /*It is called catalog dictionary*/
  objArray[0]=fileBuff.GetLength();
          CString catalog="1 0 obj<</Pages "
            "2 0 R/Type /Catalog/AcroForm 6 0 R>>\nendobj\r";
  fileBuff+=catalog;
  /*This example only contains one page. So kids contains 
  only one reference*/
  objArray[1]=fileBuff.GetLength();
          CString pageTree="2 0 obj<</Count 1/"
                "Kids [3 0 R]/Type /Pages>>\nendobj\r";
  fileBuff+=pageTree;
  /*The single page object.*/
  objArray[2]=fileBuff.GetLength();
          CString page="3 0 obj<</Annots[7 0 R]/"
               "Contents [5 0 R]/Type /Page/Parent 2 0 R/Rotate 0/"
               "MediaBox[0 0 612 792]/CropBox[0 0 612 792]/"
               "Resources<</Font<</T1_0 4 0 R>>/"
               "ProcSet[/PDF/Text]>>>>\nendobj\n";
  fileBuff+=page;
  objArray[3]=fileBuff.GetLength();
          CString font="4 0 obj<</Type/Font/BaseFont/"
                    "Times-Roman/Subtype/Type1>>\nendobj\n";
  fileBuff+=font;
  /*Every piece of text is positioned by a text matrix.
  Details of these operators can be
  found in the PDF reference manual*/
  objArray[4]=fileBuff.GetLength();
        CString stream;
        stream.Format("%s%d%s%d%s%d%s%s%s","0 g\r1 i \rBT\r/T1_0 ",
          fontSize," Tf\r0 Tc 0 Tw 0  Ts 100  Tz 0 Tr 1.2 TL 12 0 0 12 ",
          hPos," ",vPos," Tm \rT* (",text,")Tj \rET");
  /*Compress the stream using flate compression*/
          CString compressedStream=FlateCompress(stream);/*1*/
          int len=compressedStream.GetLength();
  /*If you don't want the stream to be compressed, replace
  the line marked 1 with the line below and remove the entry
  /Filter /FlateDecode from the string contents:
  CString compressedStream=stream;*/
          CString contents;
          /*Append the compressed bytes with += rather than through %s:
          they may contain zero bytes, where %s would stop copying.*/
          contents.Format("%s%d%s",
               "5 0 obj<</Filter /FlateDecode/Length ",len,
               ">>stream\r\n");
          contents+=compressedStream;
          contents+="\r\nendstream\rendobj\n";
  fileBuff+=contents;
  objArray[5]=fileBuff.GetLength();
          CString acroForm;
          acroForm="6 0 obj<</Fields[7 0 R]/SigFlags 3/"
                       "DA(/Helv 0 Tf 0 g )>>\nendobj\n";
          fileBuff+=acroForm;
  objArray[6]=fileBuff.GetLength();
          CString annotation;
          /*The Type of an annotation dictionary is Annot*/
          annotation="7 0 obj<</Type /Annot/Subtype /"
                    "Widget/FT /Sig/Rect[0 0 0 0]/P 3 0 R/"
                    "T(signature)/V 8 0 R/MK<<>>>>\n"
                    "endobj\n";
          fileBuff+=annotation;
  objArray[7]=fileBuff.GetLength();
          CString sign;
          sign="8 0 obj<</Type /Sig/Filter/ICM.SignDoc/Contents";
          int byteRange[2];
          fileBuff+=sign;
          byteRange[0]=fileBuff.GetLength();
  //This is just a dummy signature. Actually it should be 
  //taken after using cryptographic library.
  //In the next version...
          CString signature="<AE423B23FE56>";
          fileBuff+=signature;
          byteRange[1]=fileBuff.GetLength();
          //We don't know the length of the last byte range yet, i.e. the
          //distance to the end of file, so XXX is a placeholder for now.
          //It is replaced once the length of the file is known.
          sign.Format("%s%d%s%d%s","/ByteRange [0 ",byteRange[0],
             " ",byteRange[1],"XXX]/Name(Shiraz)/"
             "M(D:20040524100433+05'30')/"
             "Location(Cordiant)/Reason(ICM Library)"
             "/Date(Nov  3 200314:27:40)>>\nendobj\n");
          fileBuff+=sign;
  /*This table will contain the objects used in the
  file and their byte offsets*/
  objArray[8]=fileBuff.GetLength();
          CString xref;
          /*Please look in article*/
          xref.Format("%s%d%s","xref\r\n0 ",9,
                         "\r\n0000000000 65535 f\r\n");
  int numObj=8;
          CString offsets;
          for(int i=0;i<numObj;i++)
          {
                 //This field should be 20 bytes long.
                 offsets.Format("%0.10d",objArray[i]);
                 xref+=offsets+" 00000 n\r\n";
          }
  fileBuff+=xref;
          CString trailer;
          trailer.Format("%s%d%s","trailer\n<</Size 9/"
              "Root 1 0 R/ID[<5181383ede94727bcb32ac27ded71c68"
              "><5181383ede94727bcb32ac27ded71c68>]>>"
              "\r\nstartxref\r\n",objArray[8],"\r\n%%EOF\r\n");
  fileBuff+=trailer;
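  /*We have finished with the pdf file. One thing
  left is the actual byte range. Note that the placeholder is
  three characters long, so the startxref offset and the range
  length computed here stay exact only if the replacement also
  has three digits.*/
  CString byteRangeEnd;
  byteRangeEnd.Format("%d",fileBuff.GetLength()-byteRange[1]);
  fileBuff.Replace("XXX",byteRangeEnd);
  /*return the final string*/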
  return fileBuff;
}

This method is used to compress the content stream. Details on using the library can be found in the zlib documentation.

CString CPdfPadDoc::FlateCompress(CString inputStream)
{
  /*Zero the state so that zalloc, zfree and opaque default to Z_NULL*/
  z_stream zstream;
  memset(&zstream,0,sizeof(z_stream));
  if (deflateInit(&zstream, Z_DEFAULT_COMPRESSION) != Z_OK)
      return CString();
  /*Feed the whole input in one go. An ANSI (non-Unicode) build is
  assumed, so one character is one byte.*/
  zstream.next_in = (Bytef*)(LPCTSTR)inputStream;
  zstream.avail_in = (uInt)inputStream.GetLength();

  CMemFile output;
  BYTE zBufOut[4000];
  int error = Z_OK;
  /*Z_FINISH tells zlib that no more input follows, so it terminates
  the deflate stream properly. Keep calling deflate() with a fresh
  output buffer until it reports Z_STREAM_END.*/
  do
  {
      zstream.next_out = (Bytef*)zBufOut;
      zstream.avail_out = sizeof(zBufOut);
      error = deflate(&zstream, Z_FINISH);
      if (error != Z_OK && error != Z_STREAM_END)
          break;
      output.Write(zBufOut, sizeof(zBufOut) - zstream.avail_out);
  } while (error != Z_STREAM_END);
  deflateEnd(&zstream);
  if (error != Z_STREAM_END)
      return CString();

  /*Copy the compressed bytes into a CString by length; building the
  string from a char* would stop at the first zero byte.*/
  int szOutBuff=(int)output.GetLength();
  CString outStream;
  output.SeekToBegin();
  output.Read(outStream.GetBuffer(szOutBuff),szOutBuff);
  outStream.ReleaseBuffer(szOutBuff);
  return outStream;
}

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Written By

shiraztk

Engineer Hydenso Steel & Engineering Pvt Ltd

India

Settled in the beautiful city of Kochi in India.
Started working with VC++ and dotnet in the hometown in 2003, after searching hard for a research position in robotics and Artificial Intelligence.