Hello all,

I am currently writing a program to search for files relating to different things (e.g. ECRs, RMSs, drawings, etc.) on a system. One part of the program (searching for ECRs) requires me to read around 450 PDF files and search within them for a specified product name, for example "NPS-0243", because the name of a file bears no relation to the products it references. I have achieved this with a background worker that starts when the program launches, reads the files, and adds their text to a string array that the program can search. The problem is that the array is not completely filled with the text from the files until about 2 minutes after the program starts, which is a problem for a user who does not want to sit and wait 2 minutes to search for a certain type of file.

My question therefore is: is there any way of speeding this process up? I have tried writing the text from each file to a single .txt file, then reading it back and searching for the end of each file's text before adding it to the array; this is, if anything, slower than the previous method. In my opinion, I'm going to have to live with the 2 minute wait, but then again I'm not as clever as some of you!

Any help would be greatly appreciated.

Here is the background worker code I am currently using:

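// Requires: using System.ComponentModel; using System.Text;
//           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    int i = 0;
    foreach (string thisfile in filePaths)
    {
        // Skip the master log and the Windows thumbnail cache.
        if (thisfile.Contains(".MASTER ECR LOG") || thisfile.Contains("Thumbs.db"))
            continue;

        i++;
        // Open each PDF once, rather than once per page.
        PdfReader reader = new PdfReader(thisfile);
        try
        {
            // StringBuilder avoids re-copying the whole text on every page.
            StringBuilder text = new StringBuilder();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, so text from earlier pages is not repeated.
                ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
                string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
                s = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
                text.Append(s);
            }
            // Store the result once per file, after all pages have been read.
            data[i, 1] = text.ToString();
            data[i, 2] = thisfile;
        }
        finally
        {
            reader.Close();
        }
    }
}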
Posted
Updated 4-Aug-11 5:27am
v2

1 solution

Have you considered indexing them as a background task?
That way, you just have to search the index.
 
 
Comments
Member 8113010 4-Aug-11 11:47am    
I am fairly new to C#, and I'm unsure what's involved in indexing them. Could you elaborate slightly?

(I can almost see the pained "not another noob" expression on your face!)
OriginalGriff 4-Aug-11 14:29pm    
You presumably are not looking for that much information: if you know where in each file it is located (or it has, say, a unique string just before or just after it), then you can pre-process these files and store the content in a separate file, with a reference to which file contains it. This is similar to an index in a book, which lists the word or phrase together with a list of the pages it is found on. It's quicker to search (because it tends to be a lot smaller, and can be in a sensible order, or hashed) and it just needs updating when files are added or altered. Google do it with websites: they have a (bloody enormous) index of each word, linked to the pages that contain it.

I haven't tried it myself, but you might want to read up on the Windows Indexing Service, since that is what it (in theory) does.
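
To make the book-index idea concrete, here is a minimal sketch of an in-memory word index in C#. It is only an illustration under assumptions: the class name WordIndex, its methods, and the separator list are hypothetical and do not come from this thread, and it expects the text of each file to have been extracted already (file path mapped to text).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the thread): maps each word to the set of
// files whose text contains it, like the index at the back of a book.
public static class WordIndex
{
    // Assumed separators; '-' is deliberately not one, so "NPS-0243" stays whole.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', ',', ';', '.', '(', ')' };

    // textByFile: file path -> text already extracted from that file.
    public static Dictionary<string, HashSet<string>> Build(IDictionary<string, string> textByFile)
    {
        var index = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (KeyValuePair<string, string> entry in textByFile)
        {
            string[] words = entry.Value.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                HashSet<string> files;
                if (!index.TryGetValue(word, out files))
                {
                    files = new HashSet<string>();
                    index[word] = files;
                }
                files.Add(entry.Key);
            }
        }
        return index;
    }

    // Returns the files containing the word, or an empty sequence if none do.
    public static IEnumerable<string> Lookup(Dictionary<string, HashSet<string>> index, string word)
    {
        HashSet<string> files;
        return index.TryGetValue(word, out files) ? (IEnumerable<string>)files : new string[0];
    }
}
```

Once built, a search becomes a single dictionary lookup rather than a scan of every file's text. The index would still need rebuilding or updating when files are added or altered.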
Member 8113010 5-Aug-11 5:25am    
That sounds like a great idea, and I've begun work on it. One thing I'm struggling to work out is how to search for the string. It does not occur at a specific position in each file; it could be anywhere, but it always has the layout of 3 letters, then a dash, then 4 numbers, e.g. NOR-6537 or FYT-0759. Is there a way to search for a string in the file with that layout? Thanks for all your help so far.
Member 8113010 5-Aug-11 5:26am    
I tried searching for "***-****", but of course that literally searches for 3 *'s, then a dash, then 4 *'s.
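
A layout like "3 letters, a dash, 4 numbers" is usually matched with a regular expression rather than wildcards. The sketch below uses .NET's System.Text.RegularExpressions; the class and method names are hypothetical, and it assumes the letters are always uppercase, as in the examples NOR-6537 and FYT-0759.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the thread): finds product codes in extracted text.
public static class ProductCodes
{
    // Three uppercase letters, a dash, four digits. The \b anchors stop the
    // pattern from matching inside a longer run of letters or digits.
    private static readonly Regex CodePattern = new Regex(@"\b[A-Z]{3}-[0-9]{4}\b");

    // Returns each distinct code, in order of first appearance.
    public static List<string> FindAll(string text)
    {
        var codes = new List<string>();
        foreach (Match m in CodePattern.Matches(text))
        {
            if (!codes.Contains(m.Value))
            {
                codes.Add(m.Value);
            }
        }
        return codes;
    }
}
```

If lowercase codes can also occur, passing RegexOptions.IgnoreCase to the Regex constructor would be one way to allow them.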
Member 8113010 8-Aug-11 7:40am    
I managed to simply add all the text in each PDF file to a .txt file in a separate location using a simple program I wrote. Now the original program just adds the text in each file to an array; this has reduced the wait time by roughly 95%.
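
For anyone wanting to try the same approach, here is a rough sketch of one way such a text cache could look. It is not the poster's actual program: the names TextCache and GetText are hypothetical, and extractText stands in for whatever PDF text extraction is used (for example the iTextSharp loop in the question). It assumes the cached .txt file is still valid as long as it is at least as new as the PDF.

```csharp
using System;
using System.IO;

// Hypothetical helper (not from the thread): caches extracted PDF text as .txt files.
public static class TextCache
{
    // extractText is a stand-in for the slow PDF extraction step.
    public static string GetText(string pdfPath, string cacheDir, Func<string, string> extractText)
    {
        Directory.CreateDirectory(cacheDir);
        // Assumes PDF file names are unique; same-named files in different
        // folders would share one cache entry.
        string cachePath = Path.Combine(cacheDir, Path.GetFileName(pdfPath) + ".txt");

        // Reuse the cached text unless the PDF has changed since it was written.
        if (File.Exists(cachePath) &&
            File.GetLastWriteTimeUtc(cachePath) >= File.GetLastWriteTimeUtc(pdfPath))
        {
            return File.ReadAllText(cachePath);
        }

        // Cache miss: do the slow extraction once and store the result.
        string text = extractText(pdfPath);
        File.WriteAllText(cachePath, text);
        return text;
    }
}
```

With this shape, only new or modified PDFs pay the extraction cost on later runs; everything else is a plain text read.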

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


