I have developed a WinForms application that reads the content of text files (.txt) correctly. When I use the same code to read PDF files, it runs without errors, but the content looks like this:

"횶땐擇몎态㺛갿籕因뚜靐⨎ᴪ䣌塥並ࠧ町뫳俫黶뫜ﭪ଻혫᭍떼㌵ꇨ㯽☐녴샯﹯蛪髚☐㉾翐☐䜓☐幄뤄ꇥል貑꒥⣔☐⭸쨧렅½캽泜빳燗⁇圷춪⏖뚍鳀餡ꊾᴦ뗖詒Ꝅ퍃怮鐙좽聗逋麟☐ധш♉℩邝䥎ᒼ翏狲Ꮘ쮛旾睬譚칺馵ว퀑뒷ꞹ䰛涉죢㐆蓮捥ﳂ濼跛ᬹ䲷妞ఞ".

This content is unreadable. I found this result by stepping through the code with breakpoints.

The code is as follows:

C#
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.IO;
using System.Collections;
using System.Windows.Forms;

namespace test
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
        public static string StringFromBytes(byte[] arr)
        {
            char[] ch = new char[arr.Length / 2];
            for (int i = 0; i < ch.Length; ++i)
            {
                ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
            }
            return new String(ch);
        }

        private void button1_Click(object sender, EventArgs e)
        {
            ArrayList fileStatistics = new ArrayList();
            String datasetPath = @"D:\Data Sets\Enron";
            DirectoryInfo d = new DirectoryInfo(datasetPath);
            FileInfo[] files = d.GetFiles("*.pdf");
            MessageBox.Show(files.Length.ToString());

            foreach (FileInfo file in files)
            {
                // Create an instance of the data class.
                fileAtt f = new fileAtt();

                f.fFullName = file.FullName;
                f.fName = file.Name;
                f.FileSize = file.Length;
                f.fExtension = file.Extension;
                byte[] bytes = File.ReadAllBytes(file.FullName);
                f.content = Form1.StringFromBytes(bytes);
                //f.content = Encoding.ASCII.GetString(bytes);
                f.lastaccesstime = file.LastAccessTime;
                fileStatistics.Add(f);
                //StreamReader r = new StreamReader(datasetPath);
                //foreach
            }
            gvStatistics.DataSource = fileStatistics;
        }
    }
}


fileAtt is a property class:

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace test
{
    class fileAtt
    {
        public long FileSize { get; set; }
        public string fName { get; set; }
        public string fFullName { get; set; }
        public string fExtension { get; set; }

        public string content { get; set; }

        public DateTime lastaccesstime { get; set; }
    }
}



I want to read the content of the PDFs correctly, i.e. as text that the user can understand. That is my requirement, and I would like a solution based on the code above.

Please help me.

Thank you.
Updated 25-Jan-15 19:06pm
Comments
Leo Chapiro 26-Jan-15 0:56am    
What is fileAtt here: fileAtt f = new fileAtt();
Krishna Veni 26-Jan-15 1:07am    
fileAtt is a property class. I have added that class to the question.

1 solution

PDF files are not pure text; they are binary files with a quite complex structure. So you cannot just read the raw bytes and expect to see the text inside a PDF document.
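To illustrate the point, here is a minimal sketch that only uses the .NET base class library. It checks for the `%PDF-` signature that a PDF file starts with, and shows that pairing the raw bytes into UTF-16 characters (which is effectively what `StringFromBytes` in the question does) yields characters unrelated to the document's text. The class name `PdfHeaderDemo` and the method `LooksLikePdf` are made up for this example.

```csharp
using System;
using System.Text;

public static class PdfHeaderDemo
{
    // Hypothetical helper: returns true when the buffer starts with "%PDF-".
    public static bool LooksLikePdf(byte[] data)
    {
        byte[] magic = Encoding.ASCII.GetBytes("%PDF-");
        if (data == null || data.Length < magic.Length)
        {
            return false;
        }
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i])
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        // The first bytes of a typical PDF file, written here as ASCII.
        byte[] header = Encoding.ASCII.GetBytes("%PDF-1.4");
        Console.WriteLine(LooksLikePdf(header)); // True

        // Treating every two bytes as one UTF-16 code unit turns the
        // 8 header bytes into 4 unrelated characters.
        string paired = Encoding.Unicode.GetString(header);
        Console.WriteLine(paired.Length); // 4
    }
}
```

The rest of a PDF is mostly compressed streams and object tables, so no choice of text encoding will turn the raw bytes into the readable content.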

I think the easiest approach is to use a ready-made library such as iTextSharp to explore the content of the PDF and extract the text from it.
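As a rough sketch of that approach, assuming iTextSharp 5.x is referenced in the project (for example through NuGet), the text could be extracted page by page as shown here. The class name `PdfTextReader` and the method `ExtractText` are hypothetical names chosen for this example; `PdfReader` and `PdfTextExtractor.GetTextFromPage` come from iTextSharp.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextReader
{
    // Hypothetical helper: returns the text of all pages of the PDF at 'path'.
    public static string ExtractText(string path)
    {
        StringBuilder text = new StringBuilder();
        PdfReader reader = new PdfReader(path);
        try
        {
            // iTextSharp numbers pages starting at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            // Release the file handle even if extraction fails.
            reader.Close();
        }
        return text.ToString();
    }
}
```

In the loop from the question, `f.content = PdfTextReader.ExtractText(file.FullName);` would then take the place of the `File.ReadAllBytes` and `StringFromBytes` calls. Note that PDFs which contain only scanned images have no text to extract, so the result can be empty for such files.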
 
 
Comments
BillWoodruff 27-Jan-15 3:32am    
+5
Wendelius 5-Feb-15 0:30am    
Thanks :)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


