Extract Bookmarks from a PDF File

Md Kamruzzaman Sarker, 1 Nov 2012
Extracting bookmarks from PDF files and showing them in a TreeView.

Introduction

This article shows how to extract bookmarks from PDF files and display them as a tree.

Background

Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it. Bookmarks in a PDF file are stored as objects.

Using the code

This application uses the iTextSharp DLL to extract the raw bookmarks from PDF files. You need to add a reference to the iTextSharp DLL in your application.

iTextSharp can return the bookmarks either in raw form or as XML. The first step is to get the raw bookmarks so that they can be processed.

This can be done with the following code:

// needs: using System.Collections.Generic; using iTextSharp.text.pdf;
PdfReader reader = new PdfReader(reader_name);
IList<Dictionary<string, object>> book_mark = SimpleBookmark.GetBookmark(reader); // null if the PDF has no bookmarks

The book_mark variable holds all the bookmarks as a list of dictionaries. Each bookmark has its own properties, such as colour, page number, and whether it has children.

Here I will discuss how to extract only each bookmark's name and its corresponding page number.

If a bookmark has children, they are stored under the key Kids.

The bookmark's name is stored under the key Title.

The bookmark's page number is stored under the key Page.
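
To make the key layout concrete, here is a minimal sketch that reads the three keys from a single bookmark dictionary. The sample data is built by hand so the snippet runs without a PDF. The BookmarkFields class, its method names, and the sample "3 Fit" page value are illustrative assumptions, not part of iTextSharp.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BookmarkFields
{
    // Returns the bookmark's title, or an empty string if the key is missing.
    public static string GetTitle(Dictionary<string, object> bookmark)
    {
        object value;
        return bookmark.TryGetValue("Title", out value) ? value.ToString() : string.Empty;
    }

    // Returns the first run of digits in the "Page" value, or -1 if there is none.
    public static int GetPageNumber(Dictionary<string, object> bookmark)
    {
        object value;
        if (!bookmark.TryGetValue("Page", out value))
            return -1;
        Match m = Regex.Match(value.ToString(), "[0-9]+");
        return m.Success ? int.Parse(m.Value) : -1;
    }

    // True when the bookmark carries a non-empty "Kids" list.
    public static bool HasChildren(Dictionary<string, object> bookmark)
    {
        object value;
        if (!bookmark.TryGetValue("Kids", out value))
            return false;
        var kids = value as IList<Dictionary<string, object>>;
        return kids != null && kids.Count > 0;
    }

    // Hand-built sample in the assumed shape (hypothetical data, no PDF needed).
    public static Dictionary<string, object> Sample()
    {
        var child = new Dictionary<string, object>
        {
            { "Title", "1.1 Scope" },
            { "Page", "4 Fit" }
        };
        return new Dictionary<string, object>
        {
            { "Title", "1 Introduction" },
            { "Page", "3 Fit" },
            { "Kids", new List<Dictionary<string, object>> { child } }
        };
    }
}
```

Looking the keys up with TryGetValue avoids depending on the order in which the dictionary enumerates its entries, and a bookmark without a page entry simply reports -1.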

So the code will be:

foreach (Dictionary<string, object> bk in book_mark)
{
    foreach (KeyValuePair<string, object> kvr in bk)
    {
        if (kvr.Key == "Kids" || kvr.Key == "kids")
        {
            // child bookmarks: need to perform a recursive search
        }
        else if (kvr.Key == "Title" || kvr.Key == "title")
        {
            // bookmark name
            string name = kvr.Value.ToString();
        }
        else if (kvr.Key == "Page" || kvr.Key == "page")
        {
            // page number: take the first run of digits in the value
            int pageNumber = int.Parse(Regex.Match(kvr.Value.ToString(), "[0-9]+").Value);
        }
    }
}

The recursive search follows the same pattern.
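
As a minimal sketch of that recursion, the method below walks a bookmark list depth-first and collects one indented line per bookmark instead of building tree nodes. The BookmarkOutline class and its Flatten method are hypothetical names, and the output format is an arbitrary choice for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BookmarkOutline
{
    // Walks the bookmark list depth-first and appends one line per bookmark,
    // indented by two spaces per nesting level.
    public static void Flatten(IList<Dictionary<string, object>> bookmarks,
                               int depth, List<string> lines)
    {
        if (bookmarks == null)
            return;

        foreach (Dictionary<string, object> bookmark in bookmarks)
        {
            object title, page, kids;

            string name = bookmark.TryGetValue("Title", out title)
                ? title.ToString()
                : "(untitled)";

            // The page number is taken as the first run of digits in the "Page" value.
            string pageNumber = bookmark.TryGetValue("Page", out page)
                ? Regex.Match(page.ToString(), "[0-9]+").Value
                : "";

            string suffix = pageNumber.Length > 0 ? " (page " + pageNumber + ")" : "";
            lines.Add(new string(' ', depth * 2) + name + suffix);

            // Recurse into the children, one level deeper.
            if (bookmark.TryGetValue("Kids", out kids))
                Flatten(kids as IList<Dictionary<string, object>>, depth + 1, lines);
        }
    }
}
```

Calling Flatten(book_mark, 0, lines) with an empty List<string> fills it with the whole outline; the same traversal can create a TreeNode per bookmark instead of a string.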

Now the bookmark names and their corresponding page numbers have to be shown as a tree. I use .NET's TreeView for this.

Each time I find a bookmark's name and page number, I add it to the TreeView, and each time I find children, I perform a recursive search.

Searching the bookmarks and adding them to the TreeView is a slow process, and the user interface may freeze while it runs.

So I used a BackgroundWorker to do the work.
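
The general BackgroundWorker pattern can be sketched as follows. This is a variant for illustration, not the article's exact wiring: it hands the finished result back through RunWorkerCompleted rather than calling Invoke for each node. OutlineLoader, Start, and LoadTitles are hypothetical names, and LoadTitles is only a stand-in for the slow PDF work.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Threading;

public static class OutlineLoader
{
    // Stand-in for the slow part (opening the PDF and walking its bookmarks).
    // Hypothetical: it just makes up two titles from the file name.
    static List<string> LoadTitles(string fileName)
    {
        Thread.Sleep(50); // simulate slow work
        return new List<string> { fileName + ": Chapter 1", fileName + ": Chapter 2" };
    }

    // Starts the work on a background thread and hands the result to onDone.
    public static BackgroundWorker Start(string fileName, Action<List<string>> onDone)
    {
        var worker = new BackgroundWorker();

        worker.DoWork += (sender, e) =>
        {
            // Runs off the UI thread; do not touch controls here.
            e.Result = LoadTitles((string)e.Argument);
        };

        worker.RunWorkerCompleted += (sender, e) =>
        {
            // When the worker is started from the UI thread of a Windows Forms
            // app, this handler runs on that UI thread, so controls can be
            // updated here without Invoke.
            if (e.Error == null)
                onDone((List<string>)e.Result);
        };

        worker.RunWorkerAsync(fileName);
        return worker;
    }
}
```

In a form, onDone would be the place to add the prepared nodes to the TreeView. Building the data in DoWork and touching the control only once at the end keeps the number of cross-thread calls small.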

The complete code is given below.

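// Adds every bookmark in ilist (and, recursively, its children) under the node tnt.
public void recursive_search(IList<Dictionary<string, object>> ilist, TreeNode tnt)
{
    foreach (Dictionary<string, object> bk in ilist)
    {
        // Node for the current bookmark. This relies on the Title and Page
        // entries being enumerated before Kids.
        TreeNode tn = null;
        foreach (KeyValuePair<string, object> kvr in bk)
        {
            if (kvr.Key == "Kids" || kvr.Key == "kids")
            {
                IList<Dictionary<string, object>> child =
                            (IList<Dictionary<string, object>>)kvr.Value;
                // the current node becomes the parent of its children
                recursive_search(child, tn);
            }
            else if (kvr.Key == "Title" || kvr.Key == "title")
            {
                tn = new System.Windows.Forms.TreeNode(kvr.Value.ToString());
            }
            else if (kvr.Key == "Page" || kvr.Key == "page")
            {
                // keep the page number as the node's tooltip
                tn.ToolTipText = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
                tnt.Nodes.Add(tn);
            }
        }
    }
}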


void bgw_DoWork(string reader_name)
{
    var reader = new iTextSharp.text.pdf.PdfReader(reader_name);

    IList<Dictionary<string, object>> book_mark = SimpleBookmark.GetBookmark(reader);
    if (book_mark == null)
    {
        return; // the PDF has no bookmarks
    }

    foreach (Dictionary<string, object> bk in book_mark)
    {
        // Node for the current top-level bookmark. This relies on the Title
        // and Page entries being enumerated before Kids.
        TreeNode tn = null;
        foreach (KeyValuePair<string, object> kvr in bk)
        {
            if (kvr.Key == "Kids" || kvr.Key == "kids")
            {
                IList<Dictionary<string, object>> child =
                        (IList<Dictionary<string, object>>)kvr.Value;
                // tree nodes must be modified on the UI thread
                treeView1.Invoke((MethodInvoker)(() => recursive_search(child, tn)));
            }
            else if (kvr.Key == "Title" || kvr.Key == "title")
            {
                tn = new System.Windows.Forms.TreeNode(kvr.Value.ToString());
            }
            else if (kvr.Key == "Page" || kvr.Key == "page")
            {
                // saves page number
                tn.ToolTipText = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
                treeView1.Invoke((MethodInvoker)(() => treeView1.Nodes.Add(tn)));
            }
        }
    }
}

Points of Interest

The PDF file format is really interesting: it stores all of its data as objects. Extracting data from a PDF is easy, but you need to know the file format well.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Last Updated 1 Nov 2012
Article Copyright 2012 by Md Kamruzzaman Sarker