ScanR - Text file string search and replace engine

3 Jul 2003
A set of classes that allows you to search for strings in text files and replace them.

The ScanR Dialog

Introduction

The idea for this utility came to me after I received hundreds of e-mails from people asking for help cleaning systems infected with Win32.Redlof.A, an HTML-based script virus. The e-mails followed my own article on the virus at NewOrder Security, titled "Paper on the Win32.Redlof.A virus" (or so).

The virus code is written in VBScript and is appended to every infected file. Anti-virus programs can detect the presence of the virus, but I am NOT aware of any that REMOVES the offending code. So until now every victim has had to manually search for and remove the offending string in each file, and with even ONE missed file you get infected again!

I set out to write a small utility which would find the infected files and remove the offending code, and so ScanR was born.

What this utility does is search for occurrences of a specific string inside TEXT (ASCII) files and replace them with another string, which may also be NULL (which amounts to removing the search string entirely).
Thus, to remove the virus string, I set the search string to the virus script and the replacement string to NULL. Thankfully this approach worked, and everybody lived happily ever after.
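
To make the "replace with NULL" idea concrete, here is a minimal sketch of a replace-all routine. It is not the CCleanR code: the helper name ReplaceAll is hypothetical, and it works on a std::string already held in memory instead of on a file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of ScanR: replaces every occurrence of
// `search` in `text` with `replacement` and returns the number of matches.
// An empty `replacement` simply deletes each match.
std::size_t ReplaceAll(std::string& text,
                       const std::string& search,
                       const std::string& replacement)
{
    if (search.empty())
        return 0;                       // nothing sensible to search for

    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = text.find(search, pos)) != std::string::npos)
    {
        text.replace(pos, search.length(), replacement);
        pos += replacement.length();    // continue after the inserted text
        ++count;
    }
    return count;
}
```

Calling ReplaceAll(text, virusScript, "") strips every copy of the script from text, which is the same effect as the empty replacement string described above. Advancing pos past the replacement keeps the loop from re-scanning text it has just inserted.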

This program is that very utility, with a few bugs removed (and more introduced) and some redundant code eliminated, which REALLY has boosted program performance.

Note: I have made a MORE feature-rich implementation of this utility called CleanR, which operates on a single file, thus allowing me to focus more on text search-replace performance than on file I/O performance. It should be available at CodeProject itself, if Chris lets it be so ;)
Though BOTH programs share the SAME search-replace engine code base, I had to make minor modifications in each to optimize the code where it was pinching most.

Background

Though not absolutely essential, you can check out my article detailing the virus. It's at NewOrder Security; you can get to it by clicking on the "..older posts" link (I have actually forgotten the direct link).
The paper should ALSO be available here.

This tool actually consists of two parts:

  • The file list generation class - CGetFileList
  • The string search and replace engine class - CCleanR

The CGetFileList class is declared as:

class CGetFileList  
{
public:
    CGetFileList();
    CGetFileList(const char* szStartingDir,const char* szDirWildCard,
                 const char* szFileWildCard,DWORD dwMaxFileSize,
                 CCleanRboolSet theSet);
    DWORD GetFileCount();
    virtual ~CGetFileList();


private:
    CCleanRboolSet m_theSet;
    DWORD m_dwCount,m_dwFiles,m_dwMaxFileSize;
    char* m_szFileWildCard;
    DWORD FindnRecurseDir(const char* szStartingDir,
                          const char* szDirWildCard,
                          const char* szFileWildCard);
    DWORD FindFileMatching(const char*szPathToFindFiles,
                           const char* szFileWildCard);
    BOOL ShouldIReadThisFile(WIN32_FIND_DATA *FileData,
                             BOOL bJudgeByExtension);
    BOOL ProcessThisFile(const char* szPathName,const char* szFileName);
};

As is evident from the class constructor, the class takes five arguments. The last one, CCleanRboolSet theSet, carries the option flags; the first four are:

  • const char* szStartingDir : Defines the directory/drive under which the program should start searching.

  • const char* szDirWildCard : Defines the directory wildcard

    Note: If szDirWildCard is NULL, then the root directory is scanned as well. ('*' DOES NOT resolve to NULL, i.e. G:\*\*.ABC does not resolve to G:\*.ABC. In my program, however, searching for G:\*.ABC amounts to searching both G:\*.ABC AND G:\*\*.ABC. What do you think about it?)

  • const char* szFileWildCard : Defines the filename wildcard

  • DWORD dwMaxFileSize : Defines the maximum file size that the program should process.
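
For readers unfamiliar with wildcard patterns such as *.ABC, the sketch below shows one way '*' and '?' can be matched against a single file or directory name. It is only an illustration: the function name WildcardMatch is hypothetical and does not appear in the ScanR sources.

```cpp
// Hypothetical helper: returns true if `name` matches `pattern`, where
// '*' matches any run of characters (including none) and '?' matches
// exactly one character. The comparison is case-sensitive.
bool WildcardMatch(const char* pattern, const char* name)
{
    const char* star = nullptr;   // position of the last '*' seen in pattern
    const char* retry = nullptr;  // where in name to resume after backtracking

    while (*name)
    {
        if (*pattern == '*')
        {
            star = pattern++;     // remember the star, try matching zero chars
            retry = name;
        }
        else if (*pattern == '?' || *pattern == *name)
        {
            ++pattern;
            ++name;
        }
        else if (star)
        {
            pattern = star + 1;   // let the last star swallow one more char
            name = ++retry;
        }
        else
        {
            return false;
        }
    }
    while (*pattern == '*')       // trailing stars match the empty remainder
        ++pattern;
    return *pattern == '\0';
}
```

File names on Windows are normally compared without regard to case, so a drop-in replacement would need to fold case before comparing; this sketch leaves that out to stay short.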

This class simply iterates through the child directories of szStartingDir that match szDirWildCard, searching for filenames that match szFileWildCard. For each file found, it decides whether the file should be processed by calling

BOOL ShouldIReadThisFile(WIN32_FIND_DATA *FileData, BOOL bJudgeByExtension);

This function checks whether the filename or its extension is blacklisted. If bJudgeByExtension is TRUE, the file extension is checked; otherwise only the filename is compared.
You can set custom blacklists by modifying the appropriate variables in the Settings.h file and recompiling the application.
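
The blacklist test can be pictured roughly as follows. This is a sketch under assumptions, not the code in ShouldIReadThisFile: the function name ShouldRead, the two list names and their contents are all made up for illustration, and the real lists in Settings.h may look quite different.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical blacklists; the real ones live in Settings.h and may differ.
static const std::vector<std::string> kBlockedExtensions = { "exe", "dll", "zip" };
static const std::vector<std::string> kBlockedFileNames  = { "pagefile.sys" };

// Lower-cases an ASCII string so comparisons ignore case.
static std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns true if the file is NOT blacklisted. With judgeByExtension set,
// only the extension is tested; otherwise the whole filename is compared.
bool ShouldRead(const std::string& fileName, bool judgeByExtension)
{
    const std::string lower = ToLower(fileName);
    if (judgeByExtension)
    {
        const std::string::size_type dot = lower.rfind('.');
        if (dot == std::string::npos)
            return true;                    // no extension, nothing to reject
        const std::string ext = lower.substr(dot + 1);
        return std::find(kBlockedExtensions.begin(), kBlockedExtensions.end(), ext)
               == kBlockedExtensions.end();
    }
    return std::find(kBlockedFileNames.begin(), kBlockedFileNames.end(), lower)
           == kBlockedFileNames.end();
}
```

With these example lists, ShouldRead("SETUP.EXE", true) rejects the file, while ShouldRead("index.htm", true) accepts it.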

If ShouldIReadThisFile returns TRUE, then BOOL ProcessThisFile(const char* szPathName,const char* szFileName); is called, which in turn calls CCleanR to do the dirty work of text search and replace.

The CCleanR class is declared as:

class CCleanR  
{
public:
    DWORD Process(); //Initiates start of processing, and returns the number
                     //of matches found
    BOOL SetReplacementString(const char *szReplacementString);
    BOOL SetSearchString(const char *szSearchString);
    BOOL SetFileName(const char *szFileName);
    CCleanR();
    CCleanR(CCleanRboolSet boolSet,LPCTSTR szOutputFileName);
    virtual ~CCleanR();

private:
    BOOL IsCharBelongingToSet(char cCharToTest,char *szSetValues);
    BOOL m_bReplaceCRLFWithLF;
    BOOL m_bReplaceLFWithCRLF;
    BOOL m_bLastWordMayNotHavePunct;
    BOOL m_bCheckOnlyWords;
    BOOL m_bStrict;
    BOOL m_bStrainFound;
    BOOL m_bSearchIsFinished;
    BOOL m_bCaseSensetive;
    BOOL m_bOverWriteFile;
    CString m_sSearchString;
    CString m_sReplacementString;
    CString m_sTempSearchString;
    CString m_sInputFileName;
    CString m_sOutputFileName;
};

I think the names are good enough to tell you what they do. For a detailed discussion, please refer to my article on the CCleanR class, which I will soon be submitting to CodeProject.
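
As a rough illustration of what two of these options imply, the sketch below shows a search that can ignore case and a CRLF-to-LF conversion. Both helpers (FindText and CrLfToLf) are hypothetical names written for this article; they are not taken from CCleanR, whose actual behavior for m_bCaseSensetive and m_bReplaceCRLFWithLF may differ in detail.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: finds `needle` in `haystack` starting at offset
// `from`. When `caseSensitive` is false, ASCII letters compare equal
// regardless of case. Returns std::string::npos if there is no match.
std::string::size_type FindText(const std::string& haystack,
                                const std::string& needle,
                                std::string::size_type from,
                                bool caseSensitive)
{
    if (needle.empty() || from > haystack.size())
        return std::string::npos;
    if (caseSensitive)
        return haystack.find(needle, from);

    std::string::const_iterator it = std::search(
        haystack.begin() + from, haystack.end(),
        needle.begin(), needle.end(),
        [](unsigned char a, unsigned char b)
        { return std::tolower(a) == std::tolower(b); });
    return it == haystack.end()
        ? std::string::npos
        : static_cast<std::string::size_type>(it - haystack.begin());
}

// Hypothetical helper: returns a copy of `text` with every "\r\n" pair
// replaced by a single '\n'. A lone '\r' is left untouched.
std::string CrLfToLf(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
            continue;                   // drop the '\r', keep the '\n'
        out += text[i];
    }
    return out;
}
```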

The source code is also well commented in case you want to know more.

Using the code

It's recommended that you first check out the CCleanR engine on which this program is based, and its accompanying demonstration program. It's here at CodeProject itself!

You should have realized by now that the application is based on two standalone classes, which are ready to be used in any application with little or no modification.

However, as with any code, these two classes may contain things that could have been avoided, added, removed or simplified. I leave it to you to review my code and send in your constructive criticism, bug reports and suggestions for improving it.

To make it easier for you, I have packaged a sample MFC application with source code implementing all the code we just discussed.

Text files that contain the supplied search string have the matching string replaced, and the resulting file is copied into this program's directory. You can also choose to overwrite the original file with the modified one by setting bOverWriteFile of the CCleanRboolSet object theR, defined in GenFileList.cpp, to TRUE.

If the original file is left untouched, all modifications are reflected in the copy of the file present in this program's directory.

The names of these files are also logged in a text file named InfectedFiles.log. You can change this name by editing Settings.h and recompiling the program. Filenames that could NOT be processed are logged in IgnoredFiles.txt along with the reason they were ignored.

Comments have been added generously wherever applicable, and if they are not enough I will be happy to update this article with more discussion of the code.

History

  • 14th June 2003 - Replaced edit box showing list of matching files with an edit box where you can enter the replacing string (to be put in place of the matching string)
  • 21st June 2003 - Added more search options which had not been implemented previously.
    Many features of the CCleanR class are STILL not implemented here; check out my dedicated article on the CCleanR engine and its abilities. It's here at CodeProject!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By
Web Developer
United States United States
Kamal Shankar is a programming freak (or so he feels). He currently lives in Salt Lake City and loves doing what he has been doing since 1990: coding horribly ;)
