Reading a lot of files, need a quicker way

Hello all,

I am currently writing a program to search a system for files relating to different things (e.g. ECRs, RMSs, drawings, etc.). One part of the program (searching for ECRs) requires me to read around 450 PDF files and search within them for a specified product name, for example "NPS-0243", because the name of a file bears no relation to the products it references. I have achieved this with a background worker that starts when the program starts; it reads the files and adds their text to a string array that the program can then search. The problem is that the array is not completely filled with the text from the files until about 2 minutes after the program starts, which is a problem for a user who does not want to sit and wait 2 minutes to search for a certain type of file.

My question, therefore, is: is there any way of speeding this process up? I have tried writing the text from each file to one .txt file, then reading it back and searching for the end of each file's text before adding it to the array, but this is, if anything, slower than the previous method. In my opinion, I'm going to have to live with the 2 minute wait, but then again I'm not as clever as some of you!

Any help would be greatly appreciated.

Here is the background worker code I am currently using:

// Requires: using System.ComponentModel; using System.Text;
//           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    int i = 0;
    foreach (string thisfile in filePaths)
    {
        // Skip the master log and the Explorer thumbnail cache.
        if (thisfile.Contains(".MASTER ECR LOG") || thisfile.Contains("Thumbs.db"))
            continue;

        i++;
        // Open each PDF once, not once per page.
        PdfReader reader = new PdfReader(thisfile);
        try
        {
            StringBuilder text = new StringBuilder();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, so text from earlier pages is not repeated.
                ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
                string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
                s = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
                text.Append(s);
            }
            // Store the full text and the path once all pages have been read.
            data[i, 1] = text.ToString();
            data[i, 2] = thisfile;
        }
        finally
        {
            reader.Close();
        }
    }
}

Posted 4-Aug-11 5:19am

Member 8113010

Updated 4-Aug-11 5:27am

Not Active

v2

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Accepted Answer · 2011-08-04T05:28:00

Solution 1

Have you considered indexing them as a background task?
That way, you just have to search the index.

Posted 4-Aug-11 5:28am

OriginalGriff

Comments

Member 8113010 4-Aug-11 11:47am

I am fairly new to C#, and I'm unsure what's involved in indexing them. Could you elaborate slightly?

(I can almost see the pained "not another noob" expression on your face!)

OriginalGriff 4-Aug-11 14:29pm

You presumably are not looking for that much information: if you know where in each file it is located (or, say, it has a unique string just before or just after it), then you can pre-process these files and store the content in a separate file, with a reference to which file contains it. It is similar to an index in a book: it lists the word or phrase, together with a list of the pages it is found on. It's quicker to search (because it tends to be a lot smaller, and can be in a sensible order, or hashed) and it just needs updating when files are added or altered. Google do it with websites: they have a (bloody enormous) index of each word, linked to the pages that contain it.

I haven't tried it myself, but you might want to read up on the Windows Indexing Service, since that is what it (in theory) does.
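
As a rough illustration of the book-index idea described above, here is a minimal sketch in C#. It assumes the text of each file has already been extracted into a dictionary keyed by file path; the class name WordIndex and the choice of separator and punctuation characters are made up for this example and are not from the original program.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps each word to the set of files that contain it.
public static class WordIndex
{
    public static Dictionary<string, HashSet<string>> Build(IDictionary<string, string> textByFile)
    {
        // Case-insensitive keys, so "nps-0243" and "NPS-0243" find the same entry.
        var index = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
        char[] separators = { ' ', '\t', '\r', '\n' };
        char[] punctuation = { '.', ',', ';', ':', '(', ')', '"' };

        foreach (KeyValuePair<string, string> entry in textByFile)
        {
            foreach (string raw in entry.Value.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                // Strip leading/trailing punctuation so "NPS-0243," is indexed as "NPS-0243".
                string word = raw.Trim(punctuation);
                if (word.Length == 0)
                    continue;

                HashSet<string> files;
                if (!index.TryGetValue(word, out files))
                {
                    files = new HashSet<string>();
                    index[word] = files;
                }
                files.Add(entry.Key);
            }
        }
        return index;
    }
}
```

A search then becomes a single lookup, for example index.TryGetValue("NPS-0243", out files), instead of a scan through the text of every file.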

Member 8113010 5-Aug-11 5:25am

That sounds like a great idea, and I've begun work on it. One thing I'm struggling to work out is how to search for the string. It does not occur at a specific position in each file; it could be anywhere, but it always has the layout of 3 letters, then a dash, then 4 numbers, e.g. NOR-6537 or FYT-0759. Is there a way to search for a string in the file with that layout? Thanks for all your help so far.

Member 8113010 5-Aug-11 5:26am

I tried searching for "***-****", but of course that literally searches for 3 *'s, then a dash, then 4 *'s.
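
One common way to match a layout like "3 letters, a dash, 4 numbers" is a regular expression. The sketch below is only an illustration: the class name ProductCodes is invented here, and it assumes the letters are always upper case, as they are in the examples given (NOR-6537, FYT-0759).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: finds product codes such as NOR-6537 anywhere in a block of text.
public static class ProductCodes
{
    // Three capital letters, a dash, four digits. The \b word boundaries stop the
    // pattern from matching inside a longer run of letters or digits.
    private static readonly Regex Pattern = new Regex(@"\b[A-Z]{3}-[0-9]{4}\b");

    public static List<string> FindAll(string text)
    {
        var found = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            // Keep each code once, in order of first appearance.
            if (!found.Contains(m.Value))
                found.Add(m.Value);
        }
        return found;
    }
}
```

Calling ProductCodes.FindAll on the text of one file gives the distinct codes it mentions, which could then be stored against that file's path.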

Member 8113010 8-Aug-11 7:40am

I managed to simply add all the text in each PDF file into a .txt file in a separate location using a simple program I wrote. Now the original program just adds the text in each .txt file to an array, which has reduced the wait time by roughly 95%.
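
For anyone wanting to try the same approach, here is a minimal sketch of the loading side, not the poster's actual code. It assumes one .txt file per PDF, all stored in a single folder; the class name TextCache and both method names are invented for this example. The IsStale check is one possible way to decide when a PDF needs extracting again after it has been added or altered.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for a folder of pre-extracted .txt files.
public static class TextCache
{
    // Reads every .txt file in cacheDir into memory, keyed by the .txt file's path.
    public static Dictionary<string, string> Load(string cacheDir)
    {
        var texts = new Dictionary<string, string>();
        foreach (string path in Directory.GetFiles(cacheDir, "*.txt"))
        {
            texts[path] = File.ReadAllText(path);
        }
        return texts;
    }

    // True when the cached text is missing or older than the PDF it was extracted from.
    public static bool IsStale(string pdfPath, string txtPath)
    {
        return !File.Exists(txtPath)
            || File.GetLastWriteTimeUtc(txtPath) < File.GetLastWriteTimeUtc(pdfPath);
    }
}
```

Reading plain text avoids parsing the PDFs on every start, so only files for which IsStale returns true would need to go through the PDF reader again.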