Extract Bookmark from PDF file

1 Nov 2012, CPOL, 2 min read
Extracting bookmarks from PDF files and showing them in a TreeView.

Introduction

This article shows how to extract the bookmarks from a PDF file and display them as a tree.

Background

Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it. Bookmarks in a PDF file are stored as objects.

Using the code

This application uses the iTextSharp DLL to extract the raw bookmarks from a PDF file, so you need to add a reference to the iTextSharp DLL to your project.

iTextSharp can return the bookmarks either in raw form or as XML. The raw form is the one processed here.

It can be obtained with this code:

C#
PdfReader reader = new iTextSharp.text.pdf.PdfReader(reader_name);
// book_mark is null when the document has no bookmarks
IList<Dictionary<string, object>> book_mark = SimpleBookmark.GetBookmark(reader);

The book_mark variable holds all the bookmarks as a list of dictionaries. Each bookmark has its own properties, such as colour, page number, and whether it has children.

Here I will discuss only how to extract each bookmark's name and its corresponding page number.

If a bookmark has children, they are stored under the key Kids.

The bookmark's name is stored under the key Title.

The bookmark's page number is stored under the key Page.

So the code will be:

C#
foreach (Dictionary<string, object> bk in book_mark)
{
    foreach (KeyValuePair<string, object> kvr in bk)
    {
        if (kvr.Key == "Kids" || kvr.Key == "kids")
        {
            // child bookmarks: need to perform a recursive search
        }
        else if (kvr.Key == "Title" || kvr.Key == "title")
        {
            // bookmark name
            string name = kvr.Value.ToString();
        }
        else if (kvr.Key == "Page" || kvr.Key == "page")
        {
            // page number: the first run of digits in the value
            int pageNumber;
            int.TryParse(Regex.Match(kvr.Value.ToString(), "[0-9]+").Value, out pageNumber);
        }
    }
}
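
As an alternative to looping over every key/value pair, the two fields discussed above can be looked up directly by key. The following is only a minimal sketch: the class and method names (BookmarkFields, GetTitle, GetPageNumber) are hypothetical and not part of iTextSharp, and the fallback values (an empty string and -1) are an assumption made for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class BookmarkFields
{
    // Returns the bookmark name, or an empty string when "Title" is absent.
    public static string GetTitle(IDictionary<string, object> bookmark)
    {
        object value;
        if (bookmark.TryGetValue("Title", out value) && value != null)
        {
            return value.ToString();
        }
        return string.Empty;
    }

    // Returns the first run of digits in the "Page" value as a number,
    // or -1 (an assumed fallback) when the key is absent or has no digits.
    public static int GetPageNumber(IDictionary<string, object> bookmark)
    {
        object value;
        if (!bookmark.TryGetValue("Page", out value) || value == null)
        {
            return -1;
        }
        Match match = Regex.Match(value.ToString(), "[0-9]+");
        int page;
        if (match.Success && int.TryParse(match.Value, out page))
        {
            return page;
        }
        return -1;
    }
}
```

Looking keys up this way does not depend on the order in which the dictionary enumerates its entries, and a bookmark without a Page entry is handled explicitly instead of being skipped silently.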

The recursive search for child bookmarks follows the same pattern.
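
To illustrate the recursion without any user interface, here is a small sketch that walks the same list-of-dictionaries structure and produces an indented text outline. The names BookmarkOutline, Flatten and Walk are hypothetical, and the output format is just one possible choice.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class BookmarkOutline
{
    // Returns one line per bookmark, indented two spaces per nesting level.
    public static List<string> Flatten(IList<Dictionary<string, object>> bookmarks)
    {
        var lines = new List<string>();
        Walk(bookmarks, 0, lines);
        return lines;
    }

    private static void Walk(IList<Dictionary<string, object>> bookmarks,
                             int depth, List<string> lines)
    {
        if (bookmarks == null)
        {
            return;
        }
        foreach (Dictionary<string, object> bk in bookmarks)
        {
            object title, page, kids;
            bk.TryGetValue("Title", out title);
            bk.TryGetValue("Page", out page);

            // Keep only the first run of digits of the page value.
            string pageNumber = page == null
                ? ""
                : Regex.Match(page.ToString(), "[0-9]+").Value;
            lines.Add(new string(' ', depth * 2) + title + " (page " + pageNumber + ")");

            // Children are stored under "Kids" with the same shape as the
            // top-level list, so the same method handles them one level deeper.
            if (bk.TryGetValue("Kids", out kids))
            {
                Walk(kids as IList<Dictionary<string, object>>, depth + 1, lines);
            }
        }
    }
}
```

Passing the result of SimpleBookmark.GetBookmark(reader) to Flatten should give a flat list of lines that can be written to the console or a log, which is handy for checking the structure before building any tree nodes.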

Now the bookmark names and their page numbers have to be shown as a tree. I use .NET's TreeView for this.

Each time I find a bookmark's name and page number, I add it to the TreeView, and each time I find children, I perform a recursive search.

Searching and adding to the TreeView is a slow process, and the user interface may freeze.

So I used a BackgroundWorker to do the work.
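
The general BackgroundWorker pattern can be sketched independently of the PDF code. In the sketch below, BookmarkLoader, Start, parse, onLoaded and onError are all hypothetical names: parse stands in for whatever slow work reads the bookmarks, and onLoaded receives its result once the work is done.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

// Hypothetical helper showing the BackgroundWorker pattern only.
public static class BookmarkLoader
{
    public static BackgroundWorker Start(
        string pdfPath,
        Func<string, List<string>> parse,   // slow work, runs on a worker thread
        Action<List<string>> onLoaded,      // called with the result on success
        Action<Exception> onError)          // called if parse throws
    {
        var worker = new BackgroundWorker();

        // DoWork runs off the calling thread; the result is handed over via e.Result.
        worker.DoWork += (sender, e) => e.Result = parse((string)e.Argument);

        worker.RunWorkerCompleted += (sender, e) =>
        {
            // e.Error must be checked first: reading e.Result after a failure throws.
            if (e.Error != null)
            {
                onError(e.Error);
            }
            else
            {
                onLoaded((List<string>)e.Result);
            }
        };

        worker.RunWorkerAsync(pdfPath);
        return worker;
    }
}
```

When RunWorkerAsync is called from the UI thread of a Windows Forms application, RunWorkerCompleted is raised back on that UI thread, so onLoaded can update a TreeView there without calling Invoke.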

The whole code is given below:

C#
public void recursive_search(IList<Dictionary<string, object>> ilist, TreeNode tnt)
{
    foreach (Dictionary<string, object> bk in ilist)
    {
        // Node for the current bookmark. This loop relies on "Title" being
        // enumerated before "Page" and "Kids" within each dictionary.
        TreeNode tn = null;
        foreach (KeyValuePair<string, object> kvr in bk)
        {
            if (kvr.Key == "Kids" || kvr.Key == "kids")
            {
                // Child bookmarks: add them under the current node
                IList<Dictionary<string, object>> child = 
                            (IList<Dictionary<string, object>>)kvr.Value;
                recursive_search(child, tn);
            }
            else if (kvr.Key == "Title" || kvr.Key == "title")
            {
                tn = new System.Windows.Forms.TreeNode(kvr.Value.ToString());
            }
            else if (kvr.Key == "Page" || kvr.Key == "page")
            {
                // Keep the page number as the tooltip, then attach the node to its parent
                tn.ToolTipText = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
                tnt.Nodes.Add(tn);
            }
        }
    }
}


void bgw_DoWork(string reader_name)
{
    reader = new iTextSharp.text.pdf.PdfReader(reader_name);

    IList<Dictionary<string, object>> book_mark = SimpleBookmark.GetBookmark(reader);

    // Nothing to show when the document has no bookmarks
    if (book_mark == null)
        return;

    foreach (Dictionary<string, object> bk in book_mark)
    {
        // Node for the current top-level bookmark. This loop relies on "Title"
        // being enumerated before "Page" and "Kids" within each dictionary.
        TreeNode tn = null;
        foreach (KeyValuePair<string, object> kvr in bk)
        {
            if (kvr.Key == "Kids" || kvr.Key == "kids")
            {
                // Child bookmarks: add them under the current node on the UI thread
                IList<Dictionary<string, object>> child = 
                        (IList<Dictionary<string, object>>)kvr.Value;
                treeView1.Invoke((MethodInvoker)(() => recursive_search(child, tn)));
            }
            else if (kvr.Key == "Title" || kvr.Key == "title")
            {
                tn = new System.Windows.Forms.TreeNode(kvr.Value.ToString());
            }
            else if (kvr.Key == "Page" || kvr.Key == "page")
            {
                // Keep the page number as the tooltip, then add the node on the UI thread
                tn.ToolTipText = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
                treeView1.Invoke((MethodInvoker)(() => treeView1.Nodes.Add(tn)));
            }
        }
    }
}

Points of Interest

The PDF file format is really interesting. It keeps all its data as objects. Extracting data from a PDF is easy, but you have to know the file format very well.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer Samsung R&D Institute Bangladesh
Bangladesh
