Click here to Skip to main content
16,017,373 members
Please Sign up or sign in to vote.
1.00/5 (1 vote)
See more:
I need to convert PDF to Excel using SSIS package. I tried using script task, but I didn't get the proper output. Can you help me to solve this issue?

What I have tried:

C#
using System;
using System.IO;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Dts.Runtime;

namespace ST_eec8239112bd4e70893cef9eeae84e23
{    [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        public void Main()
        {
            try
            {
                string pdfFilePath = 
                Dts.Variables["User::inputpath"].Value.ToString();
                string csvFilePath = 
                Dts.Variables["User::outputpath"].Value.ToString(); 

                using (StreamReader reader = new StreamReader(pdfFilePath))
                {
                    string pdfContent = reader.ReadToEnd();
                    string extractedText = ExtractTextFromPdf(pdfContent);

                    using (StreamWriter writer = new 
                           StreamWriter(csvFilePath))
                    {
                        writer.WriteLine(extractedText);
                    }
                } 

                Dts.TaskResult = (int)ScriptResults.Success;
            }
            catch (Exception ex)
            {
                Dts.Events.FireError(0, "PDF to CSV Error", 
                                     ex.Message, string.Empty, 0);
                Dts.TaskResult = (int)ScriptResults.Failure;
            }
        } 

        private string ExtractTextFromPdf(string pdfContent)
        {
            // Replace line breaks and other common formatting
            string cleanText = pdfContent
                .Replace("\r\n", " ")
                .Replace("\n", " ")
                .Replace("\r", " ")
                .Replace("\t", " ")
                .Replace(",", " "); 

            // Remove multiple spaces
            cleanText = Regex.Replace(cleanText, @"\s+", " "); 

            return cleanText;
        } 

        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        }
    }
}
Posted
Updated 13-Sep-23 4:45am
v3
Comments
Andre Oosthuizen 29-Aug-23 10:14am    
Quote: but i didn't get the proper output - does not help us at all. What output are you getting and what output do you expect?
PIEBALDconsult 13-Sep-23 11:19am    
Writing Excel is not an easy task. I know of no way to do it with SSIS itself. At best, you could write a C# procedure to do it, but then why involve SSIS at all?
Excel can read CSV, so frequently we write CSV instead.

You cannot extract the content of PDF files by using a Streamreader. The content of PDF files is not simple lines of text, but structured representaions of what you see when you read it. So what you will see from the above code is most likely garbage. You will need to use a PDF library such as iTextSharp | iText PDF[^].
 
Share this answer
 
To add to what Richard has said, once you have read the PDF, you will need a second package to write the content to a format that Excel understands. This may help: C#: Read and Write Excel (*.xls and *.xlsx) Files Content without Excel Automation (using NPOI and ADO.NET)[^]
 
Share this answer
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900