Hi, I'm currently implementing a program that searches for certain text in PDF, Word and text files in a specified directory and its subdirectories. However, something seems to be off in the code: the program runs but hangs for quite some time before any result is displayed. The code I implemented is shown below.

What I have tried:

static string[] keywords = {"a", "restricted", "the"};

static void Main(string[] args)
{
    search_file(@"D:\Documents", keywords);
    Console.ReadKey();
}
static void search_file(string path, string[] keywords)
{
    var files = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories);
    Console.WriteLine("Files count: {0}", files.Count().ToString());
    string Content = string.Empty;

    foreach (string file in files)
    {
        switch (System.IO.Path.GetExtension(file))
        {
            case ".txt":
                Content = GetFileText(file);
                SearchForContent(Content, file);
                break;

            case ".pdf":
                Content = GetTextFromPDF(file);
                SearchForContent(Content, file);
                break;

            case ".docx":
                Content = GetTextFromWord(file);
                SearchForContent(Content, file);
                break;
        }

    }
}
static void SearchForContent(string Contents, string file)
{
    foreach (string key in keywords)
    {
        if (Contents.Contains(key))
        {
            Console.WriteLine("key: " + key + " " + file);
        }
    }
}

static string GetFileText(string name)
{
    string fileContents = String.Empty;

    // If the file has been deleted since we took
    // the snapshot, ignore it and return the empty string.
    if (System.IO.File.Exists(name))
    {
        fileContents = File.ReadAllText(name);
    }
    return fileContents;
}
static string GetTextFromPDF(string path)
{
    PdfReader reader = new PdfReader(path);
    using (StringWriter output = new StringWriter())
    {

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
        }
        reader.Close();

        return output.ToString();
    }
}
static string GetTextFromWord(string direct)
{
    StringBuilder text = new StringBuilder();
    Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
    object miss = System.Reflection.Missing.Value;
    object path = direct;
    object readOnly = true;
    Microsoft.Office.Interop.Word.Document docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);

    for (int i = 0; i < docs.Paragraphs.Count; i++)
    {
        text.Append(" \r\n " + docs.Paragraphs[i + 1].Range.Text.ToString());
    }

    return text.ToString();
}


The program eventually outputs the results, but I found that not every subdirectory and file was being searched. (I created a test folder within the Documents directory and it didn't appear in the output.) Any help is greatly appreciated. Thanks!
Updated 14-Mar-22 20:48pm

1 solution

We can't help you - we have no access to your file system, and that is probably quite relevant.

So, it's going to be up to you.
Fortunately, you have a tool available to you which will help you find out what is going on: the debugger. If you don't know how to use it then a quick Google for "Visual Studio debugger" should give you the info you need.

Put a breakpoint on the first line in the function, and run your code through the debugger. Then look at your code, and at your data and work out what should happen manually. Then single step each line checking that what you expected to happen is exactly what did. When it isn't, that's when you have a problem, and you can back-track (or run it again and look more closely) to find out why.

I'd also suggest two other things:
1) Remove this line:
Console.WriteLine("Files count: {0}", files.Count().ToString());
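Directory.EnumerateFiles is lazy, so calling Count() on it forces the system to enumerate every file - i.e. to walk every file and folder under the path in your case - and then throw the result away. The foreach loop then has to walk the whole tree a second time before it gets to actually use the file names!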

2) Add logging to your code to record every file it tries to process and every file it does process inside the switch. After the run, you can identify all the "missed files" and see if they have anything in common.
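
As a rough sketch of both suggestions together, assuming you only want to see which files the switch would pick up: the class name ScanLog and the helpers IsHandled and Scan are made up for illustration and are not part of your code. The extension check is deliberately case-sensitive, the same as a C# switch on a string, so the log reflects what your original loop would do.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ScanLog
{
    // The extensions handled by the switch in the question.
    // HashSet<string> compares case-sensitively by default, like the switch does.
    static readonly HashSet<string> Handled =
        new HashSet<string> { ".txt", ".pdf", ".docx" };

    // Hypothetical helper: true if the original switch would process this file.
    public static bool IsHandled(string file)
    {
        return Handled.Contains(Path.GetExtension(file));
    }

    // Hypothetical helper: walks the tree once and logs every file it sees.
    public static void Scan(string root, TextWriter log)
    {
        // ToList() enumerates the directory tree exactly once;
        // the count and the loop below both reuse the same list.
        List<string> files = Directory
            .EnumerateFiles(root, "*.*", SearchOption.AllDirectories)
            .ToList();
        log.WriteLine("Files count: {0}", files.Count);

        foreach (string file in files)
        {
            // One line per file: PROCESS if the switch has a case for it, SKIP otherwise.
            log.WriteLine("{0}: {1}", IsHandled(file) ? "PROCESS" : "SKIP", file);
        }
    }

    static void Main(string[] args)
    {
        // Scans the folder given on the command line, or the current folder.
        Scan(args.Length > 0 ? args[0] : ".", Console.Out);
    }
}
```

Every line marked SKIP is a file the switch has no case for, so that is the list to look through for something in common. Since the comparison is case-sensitive, a file named with an upper-case extension such as ".PDF", or an older ".doc" file, would show up there.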

Sorry, but we can't do that for you - time for you to learn a new (and very, very useful) skill: debugging!
 
Comments
Maciej Los 15-Mar-22 9:52am    
5ed!

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)