Introduction
Compared with Microsoft Word and Excel files, PDF has a distinctive feature: a PDF document can carry other files as attachments.
Sometimes we need to extract and save those attachments; at other times we prefer to remove them, for example for security reasons or to make the document easier to reuse.
For programmers, this article shares a solution for extracting and removing PDF attachments in C# with a third-party library, the PDF
Platinum Pack, which contains two professional components, PDF and PDFViewer, for .NET, Silverlight and WPF.
Article Structure
- Extract PDF attachments
- Remove PDF attachments
- Extract PDF text
- Extract PDF images
Content
Extract PDF attachments
In my solution, both plain PDF attachments and attachments with annotations are extracted with a .NET PDF Viewer component.
When we extract an attachment that has an annotation, the viewer automatically jumps to the exact location of that attachment annotation
in the PDF. The properties of the PDF attachments are also displayed.
Extracting PDF attachments takes three steps. First, load a PDF file. Second, get the attachment data from the loaded PDF. Third,
save the attachment data to disk. The detailed steps are shown below.
Step 1. Load a PDF file
Loading a PDF file with C# is easy. First, show an open-file dialog and create an instance of the class
Spire.Pdf.PdfViewer.Forms.PdfDocumentViewer. Then load the selected PDF file from disk, as in the code here:
OpenFileDialog dialog = new OpenFileDialog();
dialog.Filter = "PDF document(*.pdf)|*.pdf";
dialog.Title = "Open PDF document with attachment";
DialogResult result = dialog.ShowDialog();
if (result == DialogResult.OK)
{
    string pdfFile = dialog.FileName;
    PdfDocumentViewer pdfViewer = new PdfDocumentViewer();
    pdfViewer.LoadFromFile(pdfFile);
}
Step 2. Extract PDF attachments
Extracting plain PDF attachments differs slightly from extracting attachments with annotations. Let us look at each in turn.
2.1 Extract PDF attachments without attachment annotation
After loading the PDF file, we can get its attachments by calling the PdfDocumentViewer.GetAttachments() method. The name and byte array
of each attachment are available through the PdfDocumentAttachment.FileName and PdfDocumentAttachment.Data properties.
if (pdfViewer.IsDocumentLoaded)
{
    // List view that shows the name and size of every attachment.
    ListView lstView = new ListView();
    lstView.View = View.Details;
    lstView.Columns.Add("FileName", 200);
    lstView.Columns.Add("Size", 120);
    PdfDocumentAttachment[] attachments = pdfViewer.GetAttachments();
    if (attachments != null && attachments.Length > 0)
    {
        for (int i = 0; i < attachments.Length; i++)
        {
            PdfDocumentAttachment attachment = attachments[i];
            string fileName = attachment.FileName;
            byte[] data = attachment.Data;
            ListViewItem item = new ListViewItem();
            item.Text = Path.GetFileName(fileName);
            // Integer division: the size is rounded down to whole kilobytes.
            string size = (data.Length / 1024).ToString() + "KB";
            item.SubItems.Add(size);
            // Keep the attachment on the item so it can be saved later.
            item.Tag = attachment;
            lstView.Items.Add(item);
        }
    }
2.2 Extract PDF attachments with attachment annotation
For PDF attachments with attachment annotations, we call the PdfDocumentViewer.GetAttachmentAnnotaions() method to get the attachments. The attachment file name and byte array are available
through the PdfDocumentAttachmentAnnotation.FileName and PdfDocumentAttachmentAnnotation.Data properties.
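    // Declaration added for completeness; the array type is assumed from how "annotations" is used below.
    PdfDocumentAttachmentAnnotation[] annotations = pdfViewer.GetAttachmentAnnotaions();
    if (annotations != null && annotations.Length > 0)
    {
        for (int i = 0; i < annotations.Length; i++)
        {
            PdfDocumentAttachmentAnnotation annotation = annotations[i];
            ListViewItem item = new ListViewItem(annotation.FileName);
            string size = (annotation.Data.Length / 1024).ToString() + "KB";
            item.SubItems.Add(size);
            item.Tag = annotation;
            lstView.Items.Add(item);
        }
    }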
    // Keep the viewer on the list view so the double-click handler can reach it.
    lstView.Tag = pdfViewer;
    if (lstView.Items.Count > 0)
    {
        lstView.DoubleClick += new EventHandler(lstView_DoubleClick);
    }
    // Lay out the viewer (80% of the height) above the attachment list (20%).
    TableLayoutPanel panel = new TableLayoutPanel();
    this.Controls.Clear();
    this.Controls.Add(panel);
    panel.Dock = DockStyle.Fill;
    panel.ColumnCount = 1;
    panel.RowCount = 2;
    panel.RowStyles.Add(new System.Windows.Forms.RowStyle(System.Windows.Forms.SizeType.Percent, 80F));
    panel.RowStyles.Add(new System.Windows.Forms.RowStyle(System.Windows.Forms.SizeType.Percent, 20F));
    panel.Controls.Add(pdfViewer, 0, 0);
    pdfViewer.Dock = DockStyle.Fill;
    panel.Controls.Add(lstView, 0, 1);
    lstView.Dock = DockStyle.Fill;
} // closes if (pdfViewer.IsDocumentLoaded), opened in section 2.1
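The Size column above is computed with integer division, so any attachment smaller than 1024 bytes is shown as 0. As a minimal sketch, a small formatting helper could produce a friendlier label. The class and method names (AttachmentSize.Format) are hypothetical and not part of Spire.PDF.

```csharp
using System;
using System.Globalization;

public static class AttachmentSize
{
    // Hypothetical helper (not part of Spire.PDF): formats a byte count for display.
    public static string Format(long byteCount)
    {
        if (byteCount < 0)
        {
            throw new ArgumentOutOfRangeException("byteCount");
        }
        if (byteCount < 1024)
        {
            return byteCount.ToString(CultureInfo.InvariantCulture) + " B";
        }
        double kilobytes = byteCount / 1024.0;
        if (kilobytes < 1024)
        {
            // At most one decimal place, e.g. 1536 bytes becomes "1.5 KB".
            return kilobytes.ToString("0.#", CultureInfo.InvariantCulture) + " KB";
        }
        return (kilobytes / 1024.0).ToString("0.#", CultureInfo.InvariantCulture) + " MB";
    }
}
```

In the listing above, the size string could then be built with AttachmentSize.Format(attachment.Data.Length).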
Step 3. Save PDF attachments data to disk
In this step, double-clicking a ListViewItem saves the corresponding attachment data to disk.
private void lstView_DoubleClick(Object sender, EventArgs e)
{
    ListView lstView = sender as ListView;
    // Guard against a double click when no row is selected.
    if (lstView == null || lstView.SelectedItems.Count == 0)
    {
        return;
    }
    ListViewItem item = lstView.SelectedItems[0];
    SaveFileDialog dialog = new SaveFileDialog();
    DialogResult result = dialog.ShowDialog();
    PdfDocumentViewer viewer = lstView.Tag as PdfDocumentViewer;
    if (result == DialogResult.OK && viewer != null)
    {
        string fileName = dialog.FileName;
        byte[] data = null;
        PdfDocumentAttachmentAnnotation annotation = item.Tag as PdfDocumentAttachmentAnnotation;
        PdfDocumentAttachment attachment = item.Tag as PdfDocumentAttachment;
        if (annotation != null)
        {
            // Jump to the annotation in the viewer before saving its data.
            viewer.GotoAttachmentAnnotation(annotation);
            data = annotation.Data;
        }
        else if (attachment != null)
        {
            data = attachment.Data;
        }
        if (data != null)
        {
            // WriteAllBytes creates (or overwrites) the file and closes it again.
            File.WriteAllBytes(fileName, data);
            System.Diagnostics.Process.Start(fileName);
        }
    }
}
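The handler above lets the user choose the target file name. If we instead want to save attachments straight into a folder under their embedded names, the names come from the PDF and should not be trusted as paths. The following is only a sketch under that assumption, using standard System.IO calls; the class and method names (AttachmentSaver.Save) are hypothetical and not part of Spire.PDF.

```csharp
using System;
using System.IO;

public static class AttachmentSaver
{
    // Hypothetical helper: writes attachment bytes into targetDirectory
    // and returns the full path of the file that was written.
    public static string Save(string targetDirectory, string attachmentName, byte[] data)
    {
        if (data == null)
        {
            throw new ArgumentNullException("data");
        }
        // Keep only the last name component, so a name such as
        // "../../report.txt" cannot escape the target folder.
        string safeName = Path.GetFileName(attachmentName ?? string.Empty);
        foreach (char c in Path.GetInvalidFileNameChars())
        {
            safeName = safeName.Replace(c, '_');
        }
        // Fall back to a fixed name when nothing usable is left.
        if (safeName.Length == 0 || safeName == "." || safeName == "..")
        {
            safeName = "attachment.bin";
        }
        Directory.CreateDirectory(targetDirectory);
        string fullPath = Path.Combine(targetDirectory, safeName);
        File.WriteAllBytes(fullPath, data);
        return fullPath;
    }
}
```

With the PdfDocumentAttachment objects from Step 2, a call would look like AttachmentSaver.Save(folder, attachment.FileName, attachment.Data).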
Remove PDF attachments
PDF attachments can be removed easily by calling the PdfDocument.Attachments.Clear() method of the .NET PDF component
in the PDF Platinum Pack.
private void button1_Click(object sender, EventArgs e)
{
    OpenFileDialog dialog = new OpenFileDialog();
    dialog.Filter = "PDF document(*.pdf)|*.pdf";
    DialogResult result = dialog.ShowDialog();
    if (result == DialogResult.OK)
    {
        PdfDocument doc = new PdfDocument();
        doc.LoadFromFile(dialog.FileName);
        PdfAttachmentCollection attachments = doc.Attachments;
        if (attachments.Count > 0)
        {
            doc.Attachments.Clear();
        }
        doc.SaveToFile("test.pdf");
        System.Diagnostics.Process.Start("test.pdf");
    }
}
Extract PDF text
In this example, we extract text from a PDF, including both plain text and less common text such as embedded text and text in Hebrew and Latin scripts.
The code is shown below:
string pdfFile = @"D:\PDF\Hebrew.pdf";
PdfDocumentViewer pdfViewer = new PdfDocumentViewer();
pdfViewer.LoadFromFile(pdfFile);
StringBuilder buffer = new StringBuilder();
for (int i = 0; i < pdfViewer.PageCount; i++)
{
    string pageText = pdfViewer.ExtractText(i);
    if (!string.IsNullOrEmpty(pageText))
    {
        buffer.AppendLine(pageText);
        // Label the page; the total comes from the viewer's page count.
        buffer.AppendLine("Page " + (i + 1).ToString() +
            " of " + pdfViewer.PageCount.ToString());
    }
}
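The loop above only collects the text in memory. As a rough sketch of the next step, the per-page text could be written to a UTF-8 file so that Hebrew and other non-ASCII characters survive. The helper below uses only standard .NET calls; its name (ExtractedTextWriter.Save) and the idea of passing the pages as a list of strings are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ExtractedTextWriter
{
    // Hypothetical helper: joins per-page text with "Page i of n" labels,
    // saves the result as UTF-8 and returns the text that was written.
    public static string Save(IList<string> pages, string path)
    {
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < pages.Count; i++)
        {
            if (string.IsNullOrEmpty(pages[i]))
            {
                continue; // skip pages without text
            }
            buffer.AppendLine("Page " + (i + 1) + " of " + pages.Count);
            buffer.AppendLine(pages[i]);
        }
        string text = buffer.ToString();
        // UTF-8 keeps Hebrew and other non-ASCII characters intact.
        File.WriteAllText(path, text, Encoding.UTF8);
        return text;
    }
}
```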
Extract PDF images
With the following code, we can extract the images from a PDF; they can then be saved in formats such as JPG, BMP, PNG, TIFF and GIF.
string pdfFile = @"D:\PDF\Hebrew.pdf";
PdfDocumentViewer pdfViewer = new PdfDocumentViewer();
pdfViewer.LoadFromFile(pdfFile);
List<Image> images = new List<Image>();
for (int i = 0; i < pdfViewer.PageCount; i++)
{
    Image[] pageImages = pdfViewer.ExtractImages(i);
    if (pageImages != null && pageImages.Length > 0)
    {
        images.AddRange(pageImages);
    }
}
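The code above collects the images in a list but does not write them to disk. A minimal sketch of the saving step, assuming the standard System.Drawing API (Image.Save with an ImageFormat), might look like this. The class name ImageSaver, the method names and the "image-N" file naming are hypothetical.

```csharp
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

public static class ImageSaver
{
    // Maps a file extension to a System.Drawing image format; PNG is the default.
    public static ImageFormat FormatFor(string extension)
    {
        switch (extension.ToLowerInvariant())
        {
            case ".jpg":
            case ".jpeg": return ImageFormat.Jpeg;
            case ".bmp": return ImageFormat.Bmp;
            case ".tif":
            case ".tiff": return ImageFormat.Tiff;
            case ".gif": return ImageFormat.Gif;
            default: return ImageFormat.Png;
        }
    }

    // Saves every image as image-1, image-2, ... with the given extension.
    public static void SaveAll(IList<Image> images, string directory, string extension)
    {
        Directory.CreateDirectory(directory);
        ImageFormat format = FormatFor(extension);
        for (int i = 0; i < images.Count; i++)
        {
            string path = Path.Combine(directory, "image-" + (i + 1) + extension);
            images[i].Save(path, format);
        }
    }
}
```

For the list built above, ImageSaver.SaveAll(images, @"D:\PDF\images", ".png") would write one PNG file per extracted image; the output folder here is only an example.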
Conclusion
This article mainly focuses on how to extract and remove PDF attachments. Solutions for extracting PDF text
(plain and special) and PDF images are also introduced. The professional C# tool Spire.PDF Platinum Pack for commercial use helped me a lot.
You can give it a try here.
Friendly reminder: only removing PDF attachments requires the PDF component; all the extraction tasks are done with the PDF Viewer component.