I have a list of words that need to be searched for in documents to see whether they exist. The problem is that two of the words are very similar: "confidential" and "co-confidential". How can I make sure the search does not mix these two up? "secret" and "restricted" cause no issues.

What I have tried:

static List<string> keywords = new List<string> {"secret", "restricted", "confidential", "co-confidential"};

static bool ContainsAny(this string haystack, params string[] needles)
{
    foreach (string needle in needles)
    {
        if (haystack.Contains(needle))
        {
            if (haystack.Contains(needle))
            {
                if ((needle.StartsWith("co-")) && needle.EndsWith("ial"))
                    return true;
                else if ((needle.StartsWith("co")) && needle.EndsWith("ial"))
                    return true;
                else if (haystack.Contains(needle) && !(needle.StartsWith("co")) && needle.EndsWith("ial"))
                    return true;
                else
                    return false;
            }
            else
                return false;
        }
        else
            return false;
    }

    return false;
}

static string GetUntilOrEmpty(this string text, string stopAt = "-")
{
    if (!String.IsNullOrWhiteSpace(text))
    {
        int charLocation = text.IndexOf(stopAt, StringComparison.Ordinal);

        if (charLocation > 0)
        {
            return text.Substring(0, charLocation);
        }
    }

    return String.Empty;
}

public static bool Validate_Text(this string source, string toCheck)
{
    bool bl_found = false;
    try
    {
        //if either strings are null or empty
        if (string.IsNullOrEmpty(toCheck) || string.IsNullOrEmpty(source))
        {
            return false;
        }

        bl_found = ContainsAny(source, toCheck);
    }
    catch (ArgumentNullException e)
    {
        Console.WriteLine("Keyword/ Content is null" + e.Message.ToString());
        return false;
    }
    catch (ArgumentException e)
    {
        Console.WriteLine("Keyword/ Content: Error" + e.Message.ToString());
        return false;
    }

    return bl_found;
}



The output is as follows:

====Normal Directory======
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\C.docx
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\C.pptx
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\C.xlsx
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\Cc.docx
Keyword: co-confidential  File path: c:\users\stk2017\desktop\testfolder\Cc.docx
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\Cc.pptx
Keyword: co-confidential  File path: c:\users\stk2017\desktop\testfolder\Cc.pptx
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\Co-confid.docx
Keyword: co-confidential  File path: c:\users\stk2017\desktop\testfolder\Co-confid.docx
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\Co.pptx
Keyword: co-confidential  File path: c:\users\stk2017\desktop\testfolder\Co.pptx
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\doc.xlsx
Keyword: co-confidential  File path: c:\users\stk2017\desktop\testfolder\doc.xlsx
Keyword: confidential  File path: c:\users\stk2017\desktop\testfolder\Testing.docx
============Done===============
Updated 10-Jun-22 10:06am
Comments
PIEBALDconsult 10-Jun-22 22:10pm    
"co-confidential" is not a word. Done.

Search for " confidential " and " co-confidential ".
Note the spaces before and after the words.

Punctuation is a problem there, though.

I would probably try using a regex to do this, not a simple Contains method.
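
As a rough illustration of the regex idea, here is a minimal sketch. The class and method names (KeywordRegexSketch, FindKeywords) are made up for this example, and it assumes a keyword should only count when it is not directly preceded or followed by a letter, digit, underscore or hyphen. That assumption is what keeps "confidential" from matching inside "co-confidential", and it also copes with punctuation such as a trailing full stop.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeywordRegexSketch
{
    // Hypothetical helper: returns the keywords that occur in the text as whole tokens.
    public static List<string> FindKeywords(string text, IEnumerable<string> keywords)
    {
        var found = new List<string>();
        foreach (string keyword in keywords)
        {
            // (?<![\w-]) : not preceded by a word character or a hyphen
            // (?![\w-])  : not followed by a word character or a hyphen
            string pattern = @"(?<![\w-])" + Regex.Escape(keyword) + @"(?![\w-])";

            if (Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase))
                found.Add(keyword);
        }
        return found;
    }
}
```

One consequence of treating the hyphen as part of a word is that "secret" would not be reported for text such as "top-secret". If that matters for your documents, the lookarounds would need adjusting.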
 
Comments
pohcb_sonic 10-Jun-22 0:08am    
Any examples as a reference? I did try regex, but it didn't work.
Dave Kreskowiak 10-Jun-22 7:40am    
I don't have an example, but you're looking to build a regex expression that matches your search terms with whitespace characters at the start of the word.

Google for "Expresso regex" and you'll find a tool for building regex expressions and be able to test them against sample data.
Hi, is it correct that you only want to know whether any of these words occurs, without needing to know which one? In that case, matching "confidential" is enough to catch both "confidential" and "co-confidential".
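
For that simpler case, a sketch could be as small as the following. The names AnyMarkingSketch and MentionsConfidential are invented for illustration, and the check is assumed to be case-insensitive.

```csharp
using System;

public static class AnyMarkingSketch
{
    // Hypothetical helper: true if the text contains "confidential" anywhere,
    // which also covers "co-confidential" because it is a substring of it.
    public static bool MentionsConfidential(string text)
    {
        if (string.IsNullOrEmpty(text))
            return false;

        return text.IndexOf("confidential", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Keep in mind that a plain substring check like this will also fire on longer words such as "confidentiality".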

If you need to know exactly which words occur in the document, you can use a regex (fast), but you can also use a simple split on the text. Something like:

C#
var keywords = new List<string> {"secret", "restricted", "confidential", "co-confidential"};
var words = documentText.Split(' ');
var hits = new Dictionary<string, int>();
foreach (var word in words) {
  //***** Strip surrounding punctuation before comparing
  var keyword = word.Trim('.', ',', '!', '?');
  if (keywords.Contains(keyword))
    if (hits.ContainsKey(keyword))
      hits[keyword]++;
    else
      hits.Add(keyword, 1);
}


HTH
 
Comments
pohcb_sonic 13-Jun-22 4:48am    
Actually, it's important to identify which of "confidential" and "co-confidential" appears. Sometimes both appear in the same file.
ludosoep 13-Jun-22 5:11am    
Okay, then the code posted previously should be enough. The only thing I forgot was to lowercase each word before matching it against the keywords. So instead of var keyword = word.Trim('.', ',', '!', '?'); it should be var keyword = word.Trim('.', ',', '!', '?').ToLower();. Of course, you can add more punctuation characters if necessary. Let me know if it works! Cheers
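
Putting the split, trim and lowercase steps together, a self-contained sketch might look like this. The class name KeywordCounterSketch, the method CountKeywords and the extra punctuation characters are assumptions added for illustration, and it splits on any whitespace rather than only on single spaces.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordCounterSketch
{
    private static readonly HashSet<string> Keywords = new HashSet<string>
    {
        "secret", "restricted", "confidential", "co-confidential"
    };

    // Assumed punctuation set; extend it to suit the documents being scanned.
    private static readonly char[] Punctuation =
        { '.', ',', '!', '?', ';', ':', '"', '\'', '(', ')' };

    // Hypothetical helper: counts how often each keyword occurs as a whole word.
    public static Dictionary<string, int> CountKeywords(string documentText)
    {
        var hits = new Dictionary<string, int>();

        // A null separator array makes Split break on any whitespace.
        string[] words = documentText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            // Strip surrounding punctuation and normalize casing before comparing.
            string token = word.Trim(Punctuation).ToLowerInvariant();
            if (!Keywords.Contains(token))
                continue;

            hits.TryGetValue(token, out int count); // count is 0 when the key is absent
            hits[token] = count + 1;
        }
        return hits;
    }
}
```

Because each word is compared as a whole, "co-confidential" is counted only under its own key and never as "confidential". A file that contains both words gets two separate entries in the dictionary.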

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


