Converting PDF to Text in C#

19 Apr 2015, CPOL, 3 min read
Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).
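
using System;
using System.Diagnostics;
using System.IO;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

namespace Pdf2Text
{
	class Program
	{
		/// <summary>
		/// The main entry point for the application.
		/// Expects two arguments: the input PDF path and the output text path.
		/// </summary>
		[STAThread]
		static void Main(string[] args)
		{
			if (args.Length < 2)
			{
				Console.WriteLine("Usage: PDF2TEXT <input filename (PDF)> <output filename (text)>");
				return;
			}

			// Stopwatch measures elapsed time independently of system clock changes.
			Stopwatch timer = Stopwatch.StartNew();

			// Extract first, so a parsing failure does not leave an empty output file behind.
			string text = parseUsingPDFBox(args[0]);

			using (StreamWriter sw = new StreamWriter(args[1]))
			{
				sw.WriteLine(text);
			}

			Console.WriteLine("Done. Took " + timer.Elapsed);
		}

		/// <summary>
		/// Loads the PDF at the given path and returns the text of all its pages.
		/// </summary>
		private static string parseUsingPDFBox(string input)
		{
			PDDocument doc = null;
			try
			{
				doc = PDDocument.load(input);
				PDFTextStripper stripper = new PDFTextStripper();
				return stripper.getText(doc);
			}
			finally
			{
				// PDDocument holds file resources; close it even if extraction throws.
				if (doc != null)
				{
					doc.close();
				}
			}
		}
	}
}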
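
The listing above extracts every page of the document. When only part of a PDF is needed, a narrower extraction might look like the following sketch. It assumes the same PDFBox 1.x build compiled with IKVM.NET that the listing uses, and relies on the `setStartPage`, `setEndPage` and `getNumberOfPages` methods of that API. The `PageRangeExtractor` class and its `ClampPage` and `ExtractPages` members are hypothetical names introduced here for illustration; they are not part of the original article or of PDFBox.

```csharp
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Hypothetical helper class, not part of the original article.
static class PageRangeExtractor
{
	// Clamps a requested 1-based page number into the range [1, pageCount].
	// For an empty document (pageCount == 0) this returns 1.
	public static int ClampPage(int page, int pageCount)
	{
		return Math.Max(1, Math.Min(page, pageCount));
	}

	// Returns the text of pages firstPage..lastPage (1-based, inclusive).
	// Assumption: PDFBox 1.x via IKVM.NET, as in the listing above.
	public static string ExtractPages(string input, int firstPage, int lastPage)
	{
		PDDocument doc = PDDocument.load(input);
		try
		{
			int pageCount = doc.getNumberOfPages();
			PDFTextStripper stripper = new PDFTextStripper();
			stripper.setStartPage(ClampPage(firstPage, pageCount));
			stripper.setEndPage(ClampPage(lastPage, pageCount));
			return stripper.getText(doc);
		}
		finally
		{
			// Release the file resources held by the document.
			doc.close();
		}
	}
}
```

A call such as `PageRangeExtractor.ExtractPages(args[0], 1, 3)` would then stand in for `parseUsingPDFBox(args[0])` when only the first three pages are of interest. Clamping is one possible policy for out-of-range requests; throwing an `ArgumentOutOfRangeException` instead would be equally reasonable.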

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Czech Republic
My open-source event calendar/scheduling web UI components:

DayPilot for JavaScript, Angular, React and Vue
