I want to read the content of a PDF file correctly

Question

I have developed a WinForms application that reads the content of text (.txt) files correctly. When I use the same code to read PDF files, it runs, but the content comes out like this:

"횶땐擇몎态㺛갿籕因뚜靐⨎ᴪ䣌塥並ࠧ町뫳俫黶뫜ﭪ଻혫᭍떼㌵ꇨ㯽☐녴샯﹯蛪髚☐㉾翐☐䜓☐幄뤄ꇥል貑꒥⣔☐⭸쨧렅½캽泜빳燗⁇圷춪⏖뚍鳀餡ꊾᴦ뗖詒Ꝅ퍃怮鐙좽聗逋麟☐ധш♉℩邝䥎ᒼ翏狲Ꮘ쮛旾睬譚칺馵ว퀑뒷ꞹ䰛涉죢㐆蓮捥ﳂ濼跛ᬹ䲷妞ఞ".

This content is not readable. I observed the result by stepping through the code with breakpoints.

The code is as follows:

C#

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.IO;
using System.Collections;
using System.Windows.Forms;

namespace test
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
        public static string StringFromBytes(byte[] arr)
        {
            char[] ch = new char[arr.Length / 2];
            for (int i = 0; i < ch.Length; ++i)
            {
                ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
            }
            return new String(ch);
        }

        private void button1_Click(object sender, EventArgs e)
        {
            ArrayList fileStatistics = new ArrayList();
            String datasetPath = @"D:\Data Sets\Enron";
            DirectoryInfo d = new DirectoryInfo(datasetPath);
            FileInfo[] files = d.GetFiles("*.pdf");
            MessageBox.Show(files.Length.ToString());

            foreach (FileInfo file in files)
            {
                // create instance of data class
                fileAtt f = new fileAtt();

                f.fFullName = file.FullName;
                f.fName = file.Name;
                f.FileSize = file.Length;
                f.fExtension = file.Extension;
                byte[] bytes = File.ReadAllBytes(file.FullName);
                f.content = Form1.StringFromBytes(bytes);
                //f.content = Encoding.ASCII.GetString(bytes);
                f.lastaccesstime = file.LastAccessTime;
                fileStatistics.Add(f);
                //   StreamReader r = new StreamReader(datasetPath);
                //foreach
            }
            gvStatistics.DataSource = fileStatistics;
        }
    }
}

fileAtt is a plain property class:

C#

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace test
{
    class fileAtt
    {
        public long FileSize { get; set; }
        public string fName { get; set; }
        public string fFullName { get; set; }
        public string fExtension { get; set; }

        public string content { get; set; }

        public DateTime lastaccesstime { get; set; }
    }
}

I want to read the content of PDFs correctly, i.e. as text that the user can understand. That is my requirement, and I would like a solution based on the code above.

Please help me.

Thank you.

Posted 25-Jan-15 18:48pm

Krishna Veni

Updated 25-Jan-15 19:06pm


Comments

Leo Chapiro 26-Jan-15 0:56am

What is fileAtt here: fileAtt f = new fileAtt();

Krishna Veni 26-Jan-15 1:07am

fileAtt is a property class. I have added that class to the question.

1 solution


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Wendelius · Answer 1 · 2015-01-25T19:06:00

Solution 1

PDF files are not plain text. They are binary files with a fairly complex internal structure, so you cannot simply read the raw bytes and expect to see the text of the document.
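To see why the output in the question looks like random CJK characters, it may help to look at just the first bytes of a PDF. The sketch below is only an illustration: the class and method names are made up for this example, and it assumes the usual "%PDF-" signature at the start of the file. Pairing bytes into 16-bit characters, as StringFromBytes in the question does, is the same as decoding the bytes as little-endian UTF-16.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class PdfHeaderDemo
{
    // Decodes raw bytes as little-endian UTF-16, which is what
    // StringFromBytes in the question does by hand.
    public static string DecodeAsUtf16(byte[] bytes)
    {
        return Encoding.Unicode.GetString(bytes);
    }

    static void Main()
    {
        // A PDF file begins with the ASCII signature "%PDF-" and a version.
        byte[] header = Encoding.ASCII.GetBytes("%PDF-1.4");

        string wrong = DecodeAsUtf16(header);

        // '%' is 0x25 and 'P' is 0x50; paired together they form U+5025,
        // a CJK ideograph, instead of the two ASCII characters.
        Console.WriteLine(((int)wrong[0]).ToString("X4")); // prints 5025

        // Decoding the same bytes as ASCII recovers the signature.
        Console.WriteLine(Encoding.ASCII.GetString(header)); // prints %PDF-1.4
    }
}
```

Even with the right encoding, only fragments such as this signature are readable, because the page text inside a PDF is usually stored in compressed streams.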

I think the easiest approach is to use a ready-made library such as iTextSharp to explore the content of the PDF and extract the text from it.
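As a rough sketch of that approach, assuming the iTextSharp 5.x assembly is referenced in the project, the text of each page can be pulled out with PdfTextExtractor. The class name PdfText and the method name ReadPdfText are placeholders chosen for this example, not part of the library.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper; assumes iTextSharp 5.x is referenced.
public static class PdfText
{
    // Returns the extracted text of all pages, one page after another.
    public static string ReadPdfText(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            StringBuilder text = new StringBuilder();

            // iTextSharp numbers pages starting at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return text.ToString();
        }
        finally
        {
            // Release the file handle even if extraction fails.
            reader.Close();
        }
    }
}
```

In the loop from the question, the two lines that call File.ReadAllBytes and StringFromBytes could then be replaced with f.content = PdfText.ReadPdfText(file.FullName);. Note that a PDF made only of scanned images contains no text to extract, so the result may be empty for such files.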

Posted 25-Jan-15 19:06pm

Wendelius

Comments

BillWoodruff 27-Jan-15 3:32am

+5

Wendelius 5-Feb-15 0:30am

Thanks :)