
Extract and Remove PDF Attachments with C#

By Lacy00, 28 Sep 2012
 

Editorial Note

This article appears in the Third Party Products and Tools section. Articles in this section are for the members only and must not be used to promote or advertise products in any way, shape or form. Please report any spam or advertising.

Introduction 

PDF differs from Microsoft Word and Excel documents in several ways, one of which is that a PDF file can carry other files as attachments. Sometimes we need to extract and save these attachments; at other times we may prefer to remove them, for example for security reasons or to make the document easier to reuse. This article shares a solution for programmers to extract PDF attachments and to remove them directly in C#, using a third-party PDF Platinum Pack that contains two components, PDF and PDFViewer, for .NET, Silverlight and WPF.

Article Structure     

  1. Extract PDF attachments 
  2. Remove PDF attachments
  3. Extract PDF text
  4. Extract PDF images 

Content 

Extract PDF attachments

In my solution, both plain PDF attachments and attachments that carry an attachment annotation are extracted with a .NET PDF Viewer component. When we extract an attachment that has an annotation, the viewer automatically jumps to the exact location of that annotation in the PDF. The properties of each attachment (file name and size) are also displayed.

Extracting PDF attachments takes three steps. First, load a PDF file. Second, get the attachment data from the PDF just loaded. Third, save the attachment data to disk. The steps are detailed below.

Step 1. Load a PDF file 

Loading a PDF file with C# is easy. First, show an open-file dialog and create an instance of the Spire.Pdf.PdfViewer.Forms.PdfDocumentViewer class. Then load the selected PDF file, as in the code here:

OpenFileDialog dialog = new OpenFileDialog();
dialog.Filter = "PDF document(*.pdf)|*.pdf";
dialog.Title = "Open PDF document with attachment";
DialogResult result = dialog.ShowDialog();
if (result == DialogResult.OK)
{
    string pdfFile = dialog.FileName;
    PdfDocumentViewer pdfViewer = new PdfDocumentViewer();
    pdfViewer.LoadFromFile(pdfFile);
}

Step 2. Extract PDF attachments

Extracting plain PDF attachments differs slightly from extracting attachments that carry an attachment annotation. Let's look at each case in turn.

2.1 Extract PDF attachments without attachment annotation

After the PDF file is loaded, we can get its attachments by calling the PdfDocumentViewer.GetAttachments() method. The name and the byte array of each attached file are available through the PdfDocumentAttachment.FileName and PdfDocumentAttachment.Data properties.

if (pdfViewer.IsDocumentLoaded)
{
    // List that shows one row per attachment: file name and size.
    ListView lstView = new ListView();
    lstView.View = View.Details;
    lstView.Columns.Add("FileName", 200);
    lstView.Columns.Add("Size", 120);

    PdfDocumentAttachment[] attachments = pdfViewer.GetAttachments();
    if (attachments != null && attachments.Length > 0)
    {
        for (int i = 0; i < attachments.Length; i++)
        {
            PdfDocumentAttachment attachment = attachments[i];
            ListViewItem item = new ListViewItem();
            item.Text = Path.GetFileName(attachment.FileName);
            // Integer division: the size is rounded down to whole kilobytes.
            string size = (attachment.Data.Length / 1024).ToString() + "KB";
            item.SubItems.Add(size);
            // Keep the attachment itself so it can be saved in step 3.
            item.Tag = attachment;
            lstView.Items.Add(item);
        }
    }
    // This if block stays open; the code in section 2.2 continues inside it.
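
The size column above uses integer division, so any attachment smaller than 1024 bytes is shown as 0. A minimal sketch of a friendlier formatter follows; AttachmentSize and Format are hypothetical names introduced here for illustration and are not part of the PDF Viewer component.

```csharp
using System;
using System.Globalization;

public static class AttachmentSize
{
    // Hypothetical helper (not part of the PDF component): formats a byte
    // count for the "Size" column without truncating small files to 0.
    public static string Format(long bytes)
    {
        if (bytes < 1024)
        {
            return bytes.ToString(CultureInfo.InvariantCulture) + " B";
        }
        // Floating-point division keeps one decimal place, e.g. 1536 -> 1.5.
        double kilobytes = bytes / 1024.0;
        return kilobytes.ToString("0.#", CultureInfo.InvariantCulture) + " KB";
    }
}
```

Assuming the loop above, the call would look like item.SubItems.Add(AttachmentSize.Format(attachment.Data.Length)).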

2.2 Extract PDF attachments with attachment annotation

For PDF attachments that carry an attachment annotation, we call the PdfDocumentViewer.GetAttachmentAnnotaions() method instead. The file name and the byte array of each attachment are available through the PdfDocumentAttachmentAnnotation.FileName and PdfDocumentAttachmentAnnotation.Data properties.

// Still inside the if (pdfViewer.IsDocumentLoaded) block from section 2.1.
// The method name is spelled as given in the text above.
PdfDocumentAttachmentAnnotation[] annotations = pdfViewer.GetAttachmentAnnotaions();
if (annotations != null && annotations.Length > 0)
{
    for (int i = 0; i < annotations.Length; i++)
    {
        PdfDocumentAttachmentAnnotation annotation = annotations[i];
        ListViewItem item = new ListViewItem(annotation.FileName);
        string size = (annotation.Data.Length / 1024).ToString() + "KB";
        item.SubItems.Add(size);
        item.Tag = annotation;
        lstView.Items.Add(item);
    }
}
// The double-click handler in step 3 reads the viewer back from this Tag.
lstView.Tag = pdfViewer;
if (lstView.Items.Count > 0)
{
    lstView.DoubleClick += new EventHandler(lstView_DoubleClick);
}

// Lay out the viewer (top 80%) above the attachment list (bottom 20%).
TableLayoutPanel panel = new TableLayoutPanel();
this.Controls.Clear();
this.Controls.Add(panel);
panel.Dock = DockStyle.Fill;
panel.ColumnCount = 1;
panel.RowCount = 2;
panel.RowStyles.Add(new System.Windows.Forms.RowStyle(System.Windows.Forms.SizeType.Percent, 80F));
panel.RowStyles.Add(new System.Windows.Forms.RowStyle(System.Windows.Forms.SizeType.Percent, 20F));
panel.Controls.Add(pdfViewer, 0, 0);
pdfViewer.Dock = DockStyle.Fill;
panel.Controls.Add(lstView, 0, 1);
lstView.Dock = DockStyle.Fill;
} // closes the if (pdfViewer.IsDocumentLoaded) block opened in section 2.1

Step 3. Save PDF attachments data to disk

In this step, double-clicking an item in the ListView saves the data of the corresponding attachment to disk.

private void lstView_DoubleClick(Object sender, EventArgs e)
{
    ListView lstView = sender as ListView;
    // Nothing to do if the event fired without a selected row.
    if (lstView == null || lstView.SelectedItems.Count == 0)
    {
        return;
    }
    ListViewItem item = lstView.SelectedItems[0];
    PdfDocumentViewer viewer = lstView.Tag as PdfDocumentViewer;
    SaveFileDialog dialog = new SaveFileDialog();
    DialogResult result = dialog.ShowDialog();
    if (result == DialogResult.OK && viewer != null)
    {
        // Pick the attachment bytes according to the type stored in the Tag.
        byte[] data = null;
        PdfDocumentAttachmentAnnotation annotation = item.Tag as PdfDocumentAttachmentAnnotation;
        PdfDocumentAttachment attachment = item.Tag as PdfDocumentAttachment;
        if (annotation != null)
        {
            // Scroll the viewer to the place where the annotation sits.
            viewer.GotoAttachmentAnnotation(annotation);
            data = annotation.Data;
        }
        else if (attachment != null)
        {
            data = attachment.Data;
        }
        if (data != null)
        {
            // Write the file in one call; no stream is left open on failure.
            File.WriteAllBytes(dialog.FileName, data);
            System.Diagnostics.Process.Start(dialog.FileName);
        }
    }
}
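
The handler above saves one attachment at a time under a name the user picks. If the embedded file name were used as the output name instead, it would be worth stripping any directory part first. The following is only a sketch under that assumption; AttachmentSaver, Save and the fallback name "attachment.bin" are hypothetical and use only System.IO, not the PDF component.

```csharp
using System;
using System.IO;

public static class AttachmentSaver
{
    // Hypothetical helper: writes one attachment's bytes into targetDirectory
    // and returns the full path of the file that was written.
    public static string Save(string targetDirectory, string embeddedName, byte[] data)
    {
        if (data == null)
        {
            throw new ArgumentNullException("data");
        }

        // Keep only the file-name part, so an embedded name that contains
        // directory parts cannot place the file outside the target directory.
        string safeName = Path.GetFileName(embeddedName ?? string.Empty);
        if (safeName.Length == 0)
        {
            safeName = "attachment.bin";
        }

        // CreateDirectory does nothing if the directory already exists.
        Directory.CreateDirectory(targetDirectory);
        string fullPath = Path.Combine(targetDirectory, safeName);
        File.WriteAllBytes(fullPath, data);
        return fullPath;
    }
}
```

With the names from the article, a call might look like AttachmentSaver.Save(folder, attachment.FileName, attachment.Data), where folder is a directory chosen by the user.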

Remove PDF Attachments

PDF attachments can be removed easily by calling PdfDocument.Attachments.Clear(), a method provided by the .NET PDF component in the PDF Platinum Pack.

private void button1_Click(object sender, EventArgs e)
{
    OpenFileDialog dialog = new OpenFileDialog();
    dialog.Filter = "PDF document(*.pdf)|*.pdf";
    DialogResult result = dialog.ShowDialog();
    if (result == DialogResult.OK)
    {
        PdfDocument doc = new PdfDocument();
        doc.LoadFromFile(dialog.FileName);
        PdfAttachmentCollection attachments = doc.Attachments;
        if (attachments.Count > 0)
        {
            // Remove every attachment from the document.
            doc.Attachments.Clear();
        }
        // Save the result as a new file in the working directory and open it.
        doc.SaveToFile("test.pdf");
        System.Diagnostics.Process.Start("test.pdf");
    }
}
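
The example above always writes to "test.pdf" in the current working directory. One possible alternative, sketched here as an assumption rather than as part of the original solution, is to derive the output name from the input file so that the cleaned copy lands next to the source. OutputPath, ForCleanedCopy and the "_no_attachments" suffix are hypothetical names.

```csharp
using System.IO;

public static class OutputPath
{
    // Hypothetical helper: builds "<name>_no_attachments.pdf" in the same
    // directory as the input file, so the original is never overwritten.
    public static string ForCleanedCopy(string inputPath)
    {
        string directory = Path.GetDirectoryName(Path.GetFullPath(inputPath));
        string baseName = Path.GetFileNameWithoutExtension(inputPath);
        return Path.Combine(directory, baseName + "_no_attachments.pdf");
    }
}
```

In the click handler, doc.SaveToFile(OutputPath.ForCleanedCopy(dialog.FileName)) would then replace the fixed file name.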

Extract PDF text

In this example, we extract text from a PDF. This covers both plain text and less common text, such as embedded text and text in Hebrew and Latin. The code is shown below:

string pdfFile = @"D:\PDF\Hebrew.pdf";
PdfDocumentViewer pdfViewer = new PdfDocumentViewer();
pdfViewer.LoadFromFile(pdfFile);
StringBuilder buffer = new StringBuilder();
// Total number of pages, used in the "Page x of y" line.
int totalPage = pdfViewer.PageCount;
for (int i = 0; i < totalPage; i++)
{
    // Page indexes are zero-based; the label below is one-based.
    string pageText = pdfViewer.ExtractText(i);
    if (!string.IsNullOrEmpty(pageText))
    {
        buffer.AppendLine(pageText);
        buffer.AppendLine("Page " + (i + 1).ToString() +
                          " of " + totalPage.ToString());
    }
}
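
The loop above only collects the text in memory. A minimal sketch of writing it to disk follows, assuming the per-page strings have already been obtained; PageTextWriter, Join and Save are hypothetical names, and only standard .NET classes are used. Unlike the loop above, this sketch puts the page label before the page text, which is a design choice and not something the original code does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class PageTextWriter
{
    // Hypothetical helper: joins per-page text and labels each non-empty page.
    public static string Join(IList<string> pageTexts)
    {
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (string.IsNullOrEmpty(pageTexts[i]))
            {
                continue; // skip pages without extractable text
            }
            buffer.AppendLine("Page " + (i + 1) + " of " + pageTexts.Count);
            buffer.AppendLine(pageTexts[i]);
        }
        return buffer.ToString();
    }

    // Writes the joined text as UTF-8 so that Hebrew and other non-Latin
    // characters are stored without loss.
    public static void Save(string path, IList<string> pageTexts)
    {
        File.WriteAllText(path, Join(pageTexts), Encoding.UTF8);
    }
}
```

Writing with an explicit UTF-8 encoding matters for the Hebrew sample file, because a single-byte code page without Hebrew letters would replace them with question marks.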

Extract PDF images

With the following code, we can extract the images from a PDF. The extracted images can then be saved in formats such as JPG, BMP, PNG, TIFF or GIF.

string pdfFile = @"D:\PDF\Hebrew.pdf";
PdfDocumentViewer pdfViewer = new PdfDocumentViewer();
pdfViewer.LoadFromFile(pdfFile);
List<Image> images = new List<Image>();
for (int i = 0; i < pdfViewer.PageCount; i++)
{
    Image[] pageImages = pdfViewer.ExtractImages(i);
    if (pageImages != null && pageImages.Length > 0)
    {
        images.AddRange(pageImages);
    }
}
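
The code above fills the images list but does not show the saving step. A minimal sketch of that step follows, assuming the extracted objects are System.Drawing.Image instances as the list type suggests; ImageSaver, FormatFor, SaveAll and the "image_N" naming scheme are hypothetical. Image.Save and ImageFormat come from System.Drawing, not from the PDF component.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public static class ImageSaver
{
    // Hypothetical helper: maps a file extension to a System.Drawing format.
    public static ImageFormat FormatFor(string extension)
    {
        switch (extension.ToLowerInvariant())
        {
            case ".jpg":
            case ".jpeg": return ImageFormat.Jpeg;
            case ".bmp": return ImageFormat.Bmp;
            case ".png": return ImageFormat.Png;
            case ".tif":
            case ".tiff": return ImageFormat.Tiff;
            case ".gif": return ImageFormat.Gif;
            default: throw new ArgumentException("Unsupported extension: " + extension);
        }
    }

    // Saves every image as image_1, image_2, ... in the given directory.
    public static void SaveAll(IList<Image> images, string directory, string extension)
    {
        ImageFormat format = FormatFor(extension);
        Directory.CreateDirectory(directory);
        for (int i = 0; i < images.Count; i++)
        {
            string path = Path.Combine(directory, "image_" + (i + 1) + extension);
            images[i].Save(path, format);
        }
    }
}
```

With the list built above, ImageSaver.SaveAll(images, @"D:\PDF\images", ".png") would write one PNG file per extracted image; the output folder here is only an example.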

Conclusion

This article focuses on how to extract and remove PDF attachments. Solutions for extracting PDF text (plain and special) and PDF images are also introduced. The commercial C# tool Spire.PDF Platinum Pack helped me a lot, and you can give it a try.

Friendly reminder: only removing PDF attachments requires the PDF component; all the extraction tasks are done with the PDF Viewer component.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Lacy00
United States
Member
No Biography provided


Comments and Discussions

 
Question: it packs all the PDF extraction (member pandaloveu, 1 Oct '12 - 22:48)
thanks for sharing, all the PDF extract function are packed in it.

General: Good (member Sam.S129, 27 Sep '12 - 20:03)
Good conclusion on attachment in PDF

General: Re: Good (member Lacy00, 27 Sep '12 - 23:43)
Thanks, hope it helps.

Article Copyright 2012 by Lacy00