See more: PDF string file Text
Hello all,
 
I am currently writing a program to search for files of various kinds (e.g. ECRs, RMSs, drawings) on a system. One part of the program (searching for ECRs) requires me to read around 450 PDF files and search within them for a specified product name, for example "NPS-0243", because the file names bear no relation to the products they reference. I have achieved this with a background worker that starts when the program launches, reads the files, and adds their text to a string array that the program can search. The problem is that the array is not completely filled with the text from the files until about 2 minutes after the program starts, which is a problem for a user who does not want to sit and wait 2 minutes before searching for a certain type of file.
 
My question, therefore, is: is there any way of speeding this process up? I have tried writing the text from every file to a single .txt file, then reading it back and searching for the end of each file's text before adding it to the array; this is, if anything, slower than the previous method. In my opinion, I'm going to have to live with the 2-minute wait, but then again I'm not as clever as some of you!
 
Any help would be greatly appreciated.
 
Here is the background worker code I am currently using:
 
        public void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
        {
            int i = 0;
            foreach (string thisfile in filePaths)
            {
                // Skip the master log and the Windows thumbnail cache.
                if (thisfile.Contains(".MASTER ECR LOG") || thisfile.Contains("Thumbs.db"))
                    continue;

                i++;
                // Open each PDF once. The original opened a second PdfReader for
                // every page and never closed the first one.
                PdfReader reader = new PdfReader(thisfile);
                try
                {
                    StringBuilder text = new StringBuilder();
                    for (int page = 1; page <= reader.NumberOfPages; page++)
                    {
                        ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
                        string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
                        // Carried over from the original: round-trips the text through the default code page.
                        s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
                        text.Append(s);
                    }
                    // Store the result once per file instead of once per page.
                    data[i, 1] = text.ToString();
                    data[i, 2] = thisfile;
                }
                finally
                {
                    reader.Close();
                }
            }
        }
Posted 4-Aug-11 6:19am
Edited 4-Aug-11 6:27am

1 solution


Solution 1

Have you considered indexing them as a background task?
That way, you just have to search the index.
Comments
Member 8113010 at 4-Aug-11 11:47am
   
I am fairly new to C#, and I'm unsure what is involved in indexing them. Could you elaborate slightly?
 
(I can almost see the pained "not another noob" expression on your face!)
OriginalGriff at 4-Aug-11 14:29pm
   
You presumably are not looking for that much information. If you know where in each file it is located (or, say, it has a unique string just before or just after it), then you can pre-process these files and store the content in a separate file, with a reference to which file contains it. This is similar to an index in a book, which lists a word or phrase together with the pages it is found on. It is quicker to search (because it tends to be a lot smaller, and can be in a sensible order, or hashed) and it only needs updating when files are added or altered. Google does this with websites: they have a (bloody enormous) index of each word, linked to the pages that contain it.
 
I haven't tried it myself, but you might want to read up on the Windows Indexing Service, since that is what it (in theory) does.
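
As a rough illustration of the book-index idea described above, here is a minimal sketch in C#. It assumes the text of each file has already been extracted; the class name `WordIndex` and its methods are hypothetical, not part of any library. It maps every word-like token to the set of files that contain it, so a search becomes a dictionary lookup rather than a scan of all the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: maps each token to the set of files that contain it.
public static class WordIndex
{
    // A token is a run of letters, digits and dashes, so "NPS-0243" stays in one piece.
    private static readonly Regex Token = new Regex(@"[A-Za-z0-9\-]+");

    // textByFile: file path -> text already extracted from that file.
    public static Dictionary<string, HashSet<string>> Build(IDictionary<string, string> textByFile)
    {
        var index = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (KeyValuePair<string, string> entry in textByFile)
        {
            foreach (Match m in Token.Matches(entry.Value))
            {
                HashSet<string> files;
                if (!index.TryGetValue(m.Value, out files))
                {
                    files = new HashSet<string>();
                    index[m.Value] = files;
                }
                files.Add(entry.Key);
            }
        }
        return index;
    }

    // Returns the files that mention the term, or an empty sequence if none do.
    public static IEnumerable<string> Lookup(Dictionary<string, HashSet<string>> index, string term)
    {
        HashSet<string> files;
        return index.TryGetValue(term, out files) ? (IEnumerable<string>)files : new string[0];
    }
}
```

The index would only need rebuilding (or patching) when a file is added or altered, which is the saving the comment above describes. The case-insensitive comparer is a design choice made here, not something the thread specifies.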
Member 8113010 at 5-Aug-11 5:25am
   
That sounds like a great idea, and I've begun work on it. One thing I'm struggling to work out is how to search for the string. It does not occur at a specific position in each file; it could be anywhere, but it always has the layout of 3 letters, then a dash, then 4 numbers, e.g. NOR-6537 or FYT-0759. Is there a way to search for a string in the file with that layout? Thanks for all your help so far.
Member 8113010 at 5-Aug-11 5:26am
   
I tried searching for "***-****", but of course that literally searches for 3 *'s, then a dash, then 4 *'s.
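
A layout like "3 letters, a dash, 4 numbers" is usually matched with a regular expression rather than a literal search. The sketch below uses .NET's `System.Text.RegularExpressions.Regex`; the class name `ProductCodes` is hypothetical, and it assumes the letters are always upper case, as in the examples given in the thread.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that pulls product codes such as NOR-6537 out of a block of text.
public static class ProductCodes
{
    // [A-Z]{3} = three capital letters, [0-9]{4} = four digits,
    // \b = word boundary, so "ABCD-1234" and "NOR-65378" are not matched.
    private static readonly Regex Pattern = new Regex(@"\b[A-Z]{3}-[0-9]{4}\b");

    // Returns each distinct code in the order it first appears.
    public static List<string> FindAll(string text)
    {
        var codes = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            if (!codes.Contains(m.Value))
            {
                codes.Add(m.Value);
            }
        }
        return codes;
    }
}
```

For example, `ProductCodes.FindAll("See NOR-6537 and FYT-0759, also NOR-6537.")` returns a list containing "NOR-6537" and "FYT-0759". If lower-case codes can occur, passing `RegexOptions.IgnoreCase` to the `Regex` constructor would cover them.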
Member 8113010 at 8-Aug-11 7:40am
   
I managed to simply add all the text in each PDF file to a .txt file in a separate location using a simple program I wrote. Now the original program just adds the text in each of those text files to an array, which has reduced the wait time by roughly 95%.
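
The approach in the comment above (extract once to .txt, then read the .txt files) could be folded into the main program along the following lines. This is only a sketch under assumptions: `TextCache`, `cacheDir` and the `extract` delegate are hypothetical names, and `extract` stands in for whatever PDF-to-text routine is already in use. A cached file is reused only if it is at least as new as the PDF, so altered PDFs are re-extracted.

```csharp
using System;
using System.IO;

// Hypothetical helper: keeps one .txt file per PDF and re-extracts only when the PDF changes.
public static class TextCache
{
    // extract: stand-in for the existing PDF text extraction routine.
    public static string GetText(string pdfPath, string cacheDir, Func<string, string> extract)
    {
        string cachePath = Path.Combine(cacheDir, Path.GetFileNameWithoutExtension(pdfPath) + ".txt");

        // Reuse the cached text when it is at least as new as the PDF.
        if (File.Exists(cachePath) &&
            File.GetLastWriteTimeUtc(cachePath) >= File.GetLastWriteTimeUtc(pdfPath))
        {
            return File.ReadAllText(cachePath);
        }

        // Otherwise do the slow extraction once and store the result.
        string text = extract(pdfPath);
        Directory.CreateDirectory(cacheDir);
        File.WriteAllText(cachePath, text);
        return text;
    }
}
```

Note that this sketch names the cache file after the PDF's file name alone, so two PDFs with the same name in different folders would collide; a real version would need a naming scheme that avoids that.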

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 4 Aug 2011
Copyright © CodeProject, 1999-2014
All Rights Reserved. Terms of Service