Extract Bookmarks from a PDF File

Md Kamruzzaman Sarker

1 Nov 2012

Extracting bookmarks from PDF files and showing them in a TreeView.

Introduction

This article shows how to extract the bookmarks from a PDF file and display them as a tree.

Background

Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it. Bookmarks in a PDF file are stored as objects.

Using the code

This application uses the iTextSharp DLL to extract the raw bookmarks from PDF files, so you need to add a reference to the iTextSharp DLL in your application.

iTextSharp can return the bookmarks either in raw form or as XML. The raw form is the one processed here.

It can be obtained with the following code.

// requires: using System.Collections.Generic; using iTextSharp.text.pdf;
PdfReader reader = new iTextSharp.text.pdf.PdfReader(reader_name);
IList<Dictionary<string, object>> book_mark = SimpleBookmark.GetBookmark(reader);

The book_mark variable stores all the bookmarks as a list of dictionaries. Each bookmark has its own properties, such as colour, page number, whether it has children, and so on.

Here I will discuss only how to extract each bookmark's name and its corresponding page number.

If a bookmark has children, they are stored under the key Kids.

The bookmark's name is stored under the key Title.

The bookmark's page number is stored under the key Page.
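The Page value is a string rather than a number, which is why the code in this article pulls the digits out with a regular expression. A minimal sketch of that step as a helper is shown here. The class name BookmarkPage, the method name Parse, and the sample strings are made up for illustration and are not part of iTextSharp. Unlike the article's pattern, this one is anchored at the start of the string, on the assumption that the page number comes first.

```csharp
using System;
using System.Text.RegularExpressions;

static class BookmarkPage
{
    // Returns the leading page number of a bookmark "Page" value,
    // or -1 when the value does not start with digits.
    public static int Parse(string pageValue)
    {
        Match m = Regex.Match(pageValue ?? "", @"^\s*([0-9]+)");
        int page;
        return m.Success && int.TryParse(m.Groups[1].Value, out page) ? page : -1;
    }

    static void Main()
    {
        // Sample values are assumptions for illustration only.
        Console.WriteLine(Parse("12 XYZ 0 792 0")); // prints 12
        Console.WriteLine(Parse("3 Fit"));          // prints 3
        Console.WriteLine(Parse("none"));           // prints -1
    }
}
```

Returning -1 instead of throwing keeps a malformed bookmark from aborting the whole extraction; the caller can decide whether to skip it.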

The loop over the bookmarks then looks like this:

foreach (Dictionary<string, object> bk in book_mark)
{
    foreach (KeyValuePair<string, object> kvr in bk)
    {
        if (kvr.Key == "Kids" || kvr.Key == "kids")
        {
            // need to perform recursive search
        }
        else if (kvr.Key == "Title" || kvr.Key == "title")
        {
            // bookmark name
            string name = kvr.Value.ToString();
        }
        else if (kvr.Key == "Page" || kvr.Key == "page")
        {
            // saves page number: keep only the first run of digits in the value
            int pageNumber = int.Parse(Regex.Match(kvr.Value.ToString(), "[0-9]+").Value);
        }
    }
}

The recursive search follows the same pattern.
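To see the recursion on its own, without iTextSharp or a TreeView, here is a rough sketch that walks a hand-built list of dictionaries and produces one indented line per bookmark. The names BookmarkWalker, Walk, and Sample are hypothetical, and the sample data is invented; a real list would come from SimpleBookmark.GetBookmark. It looks the keys up directly with TryGetValue instead of looping over every key/value pair, which is a different technique from the one used in this article.

```csharp
using System;
using System.Collections.Generic;

static class BookmarkWalker
{
    // Appends one "indent + title (page value)" line per bookmark, depth first.
    public static void Walk(IList<Dictionary<string, object>> bookmarks,
                            int depth, List<string> lines)
    {
        foreach (Dictionary<string, object> bk in bookmarks)
        {
            object title, page, kids;
            bk.TryGetValue("Title", out title);
            bk.TryGetValue("Page", out page);
            lines.Add(new string(' ', depth * 2) + title + " (" + page + ")");

            // Children, when present, are another list of the same shape.
            if (bk.TryGetValue("Kids", out kids))
            {
                var children = kids as IList<Dictionary<string, object>>;
                if (children != null) Walk(children, depth + 1, lines);
            }
        }
    }

    // Hand-built sample data (an assumption for illustration only).
    public static IList<Dictionary<string, object>> Sample()
    {
        var section = new Dictionary<string, object>
        {
            { "Title", "Section 1.1" },
            { "Page", "3 Fit" }
        };
        var chapter = new Dictionary<string, object>
        {
            { "Title", "Chapter 1" },
            { "Page", "1 Fit" },
            { "Kids", new List<Dictionary<string, object>> { section } }
        };
        return new List<Dictionary<string, object>> { chapter };
    }

    static void Main()
    {
        var lines = new List<string>();
        Walk(Sample(), 0, lines);
        foreach (string line in lines) Console.WriteLine(line);
    }
}
```

Looking keys up by name means the result does not depend on the order in which the dictionary happens to enumerate its entries.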

Now each bookmark's name and corresponding page number have to be shown as a tree. I use .NET's TreeView for this.

Whenever I find a bookmark's name and page number, I add it to the TreeView, and whenever I find children, I perform a recursive search.

Searching and adding nodes to the TreeView can be slow, and the user interface may freeze while it runs.

So I used a BackgroundWorker to do the work.
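For readers who have not used BackgroundWorker before, the following console sketch shows the basic wiring: the slow work goes in DoWork, and the result is picked up in RunWorkerCompleted. The names WorkerSketch and RunInBackground are hypothetical, and the blocking wait exists only so the console demo can print the result; a Windows Forms application would not block, and there RunWorkerCompleted is raised on the UI thread, where it is safe to touch controls.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

static class WorkerSketch
{
    // Runs slowWork on a BackgroundWorker and returns its result.
    // Blocking on the event is for the console demo only.
    public static string RunInBackground(Func<string> slowWork)
    {
        string result = null;
        using (var done = new ManualResetEvent(false))
        using (var worker = new BackgroundWorker())
        {
            // DoWork runs on a thread-pool thread.
            worker.DoWork += (s, e) => e.Result = slowWork();

            // RunWorkerCompleted fires once DoWork has finished (or failed).
            worker.RunWorkerCompleted += (s, e) =>
            {
                if (e.Error == null) result = (string)e.Result;
                done.Set();
            };

            worker.RunWorkerAsync();
            done.WaitOne();
        }
        return result;
    }

    static void Main()
    {
        // Thread.Sleep stands in for the slow bookmark extraction.
        Console.WriteLine(RunInBackground(() =>
        {
            Thread.Sleep(100);
            return "bookmarks loaded";
        }));
    }
}
```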

The complete code is given below.

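// Adds every bookmark in ilist (and, recursively, its children) under the node tnt.
public void recursive_search(IList<Dictionary<string, object>> ilist, TreeNode tnt)
{
    foreach (Dictionary<string, object> bk in ilist)
    {
        // Node for the current bookmark. The loop below relies on the
        // Title and Page keys being enumerated before Kids.
        TreeNode tn = null;
        foreach (KeyValuePair<string, object> kvr in bk)
        {
            if (kvr.Key == "Kids" || kvr.Key == "kids")
            {
                IList<Dictionary<string, object>> child = 
                            (IList<Dictionary<string, object>>)kvr.Value;
                // the current node becomes the parent of its children
                recursive_search(child, tn);
            }
            else if (kvr.Key == "Title" || kvr.Key == "title")
            {
                tn = new System.Windows.Forms.TreeNode(kvr.Value.ToString());
            }
            else if (kvr.Key == "Page" || kvr.Key == "page")
            {
                // the page number is kept in the node's tooltip
                tn.ToolTipText = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
                tnt.Nodes.Add(tn);
            }
        }
    }
}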


// Runs on the background thread; every TreeView update goes through Invoke.
void bgw_DoWork(string reader_name)
{
    PdfReader reader = new iTextSharp.text.pdf.PdfReader(reader_name);
   
    IList<Dictionary<string, object>> book_mark = SimpleBookmark.GetBookmark(reader);

    foreach (Dictionary<string, object> bk in book_mark)
    {
        // Node for the current top-level bookmark. The loop below relies on
        // the Title and Page keys being enumerated before Kids.
        TreeNode tn = null;
        foreach (KeyValuePair<string, object> kvr in bk)
        {
            if (kvr.Key == "Kids" || kvr.Key == "kids")
            {
                IList<Dictionary<string, object>> child = 
                        (IList<Dictionary<string, object>>)kvr.Value;
                treeView1.Invoke((MethodInvoker)(() => recursive_search(child, tn)));
            }
            else if (kvr.Key == "Title" || kvr.Key == "title")
            {
                tn = new System.Windows.Forms.TreeNode(kvr.Value.ToString());
            }
            else if (kvr.Key == "Page" || kvr.Key == "page")
            {
                //saves page number
                tn.ToolTipText = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
                treeView1.Invoke((MethodInvoker)(() => treeView1.Nodes.Add(tn)));
            }
        }
    }

    // release the file once the bookmarks have been read
    reader.Close();
}

Points of Interest

The PDF file format is really interesting. It keeps all its data as objects. Extracting data from a PDF is easy, but you have to know the file format very well.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Md Kamruzzaman Sarker

Software Developer Samsung R&D Institute Bangladesh

Bangladesh