Click here to Skip to main content
11,714,345 members (93,304 online)
Click here to Skip to main content

A multi-pass file extension comparison algorithm

, 23 Jan 2009 LGPL3 21K 102 7
Rate this:
Please Sign up or sign in to vote.
This is what happened when I needed a file extension comparison algorithm, this should save you quite some time thinking one up.

Screenshot - ExtParseAlgorithm.png

Introduction

This algorithm came about when I was in need of a quick way of finding files in a list that had the same extension, and then remove them from the list of extensions to be compared/parsed by a for-looped system, and then only the extensions that were supported to be returned via a ref (this is % in Managed C++) variable back into the main processing line of my program (a media player).

Background

The code and ideas I am sharing in this article are to do with how this sort of complex comparison can be done quickly, and without too-many variables / for-loops. It was needed to be speedy as it was being used in a wrapper in an media player application I was writing, and it happened every time new file(s) were opened. The application specs I wrote specified that it should not hang at this stage unless there was an extremely good reason for it to, and parsing the file extensions is not, in this case, a good reason for the application to hang.

Using the code

There is really only one method to this algorithm - a function called ScanFileExts.

The definition of the function can be found in the file "ExtParseAlgorithm.h" (in the accompanying source zip file). It looks like:

bool ScanFileExts(array<String^>^ Files, array<String^>^ ComparisonExts,
     array<String^>^% retExts);

This accepts two input arguments: Files - the files (with extensions) to parse, and ComparisonExts - which speaks for itself, the extensions to use in the parsing process.

The third argument - retExts - is used to return the extensions that are supported and in the original file list.

The actual definition of the implementation is found in the top of the next file, and looks the same as the above, but without the ";", and it has a comment describing the return behavior.

In the other file with the name "ExtParseAlgorithm", which is a C++ source file, resides the actual implementation of the function.

The first section (stage) in this code is:

if (Files->Length == 0) return false;
array<String^>^ ret = gcnew array<String^>(0);
array<String^>^ LoopFileExts = gcnew array<String^>(Files->Length);

// Initial Data Collection
for (int i = 0; i < Files->Length; i++)
{
    String^ Ext = "";
    for (int j = (Files[i]->Length - 1); j > 0; j--)
    {
        if (Files[i][j] == '.')
        {
            Ext = Files[i]->Substring(j);
            j = 0;
        }
    }
    LoopFileExts[i] = Ext;
}
array<String^>^ internFiles = (array<String^>^)LoopFileExts->Clone();

This tests if the file list you have provided is empty, if so, returns false for an error, because this function should not have empty parameters. Then, it initializes two internal variables: ret and LoopFileExts.

The code then does an initial collection of extensions from the files in that list; it puts them into another variable - internFiles. For behavioral aspects with this code that might need to be addressed, see "Points of interest".

The next stage of code does the first pass of parsing / comparing, the code goes as follows:

// Checking for sims and diffs (Parse 1)
array< array<bool^>^ >^ Matches = gcnew array< array<bool^>^ >
(Files->Length);
for (int i = 0; i < Files->Length; i++)
{
    // Get matches for each Ext in turn
    array<bool^>^ Matched;
    Matched = gcnew array<bool^>(Files->Length);
    for (int j = 0; j < Files->Length; j++)
    {
        if (j != i)
        {
            if (LoopFileExts[i] == LoopFileExts[j]) Matched[j] = true;
            else Matched[j] = false;
        }
    }
    Matched[i] = false;
    Matches[i] = (array<bool^>^)Matched->Clone();
}

This code starts by initializing an array of boolean (bool as declaring type) arrays to the number of files in the file array. The next step is to start stepping through this array and filling it; this is achieved by making a temporary variable called Matched and setting each element to true if the extension of the current file in the list (indicated by i) is the same as that of the file at j. j is the variable created by running another loop through the files, and then comparing the number in j against i, if they are different, then saying if they are the same or not. The current file indicated by i is set to false in the array; this means that when the file array is cleaned, it is not deleted if it the first occurrence of the extension.

After this, we need to shorten the file list; this is achieved with the following:

// Shorten the internal file list to just the single extentions
for (int i = 0; ; i++)
{
    array<String^>^ FileList = gcnew array<String^>(0);
    int CurrentFileListLen = internFiles->Length;
    for (int j = 0; j < CurrentFileListLen; j++)
    {
        if (Matches[i][j]->ToString() == bool::FalseString)
        {
            FileList->Resize(FileList, (FileList->Length + 1));
            FileList[(FileList->Length - 1)] = LoopFileExts[j];
        }
    }
    internFiles->Resize(internFiles, 0);
    internFiles->Resize(internFiles, FileList->Length);
    internFiles = (array<String^>^)FileList->Clone();
    break;
}

Using the array previously created, it loops through, again using that double for-loop setup, removing extraneous entries in the file list.

Right, enough chasing the file list and cleaning it up. Now, to do the second pass of parsing that I mentioned earlier this code did, the code goes as follows:

// Do the comparison with the _posible_ extensions (Parse 2)
for (int i = 0; i < internFiles->Length; i++)
{
    for (int j = 0; j < ComparisonExts->Length; j++)
    {
        // Remove excess "\0"'s from the end of the strings.
        if (ComparisonExts[j][(ComparisonExts[j]->Length - 1)] == '\0')
        {
            ComparisonExts[j] = ComparisonExts[j]->Remove((ComparisonExts[j]->Length - 1));
        }
        
        // Perform the comparison
        if (ComparisonExts[j] == internFiles[i])
        {
            ret->Resize(ret, (ret->Length + 1));
            ret[(ret->Length - 1)] = (String^)ComparisonExts[j]->Clone();
        }
    }
}

There, now what does this do? Well, first off, using that double for-loop system that I keep using (Some must be wondering if this is my favourite for-loop setup. Well, it is not, I prefer to use single for-loops if any, and it is actually the one I hate the most. The reason it is used here is because it does the job in hand better than other systems, like do-loops, and while-loops, oh, and my favourite setup of for-loops, singular for-loops), we start the comparison / parse process.

Yet another array is created here, or should I say, the return array we created earlier and initialized to null is resized and the newly created index filled with an extension if the extension matches the one in the file list.

The next bit of code I am not going to bother with, as it only adds some verbosity to the application, and can be removed without the algorithm braking, but I will bother with the bit after it, and that is:

// Pass the info back
retExts = ret;

return true; 

This code is simply putting the newly created and cleaned file extension list into that variable in the header that got mentioned as the only return variable: retExts.

The only reason this function is non-void in the return type is because I needed some way of determining whether or not the function succeeded in doing the parse. It will return true for succeeded, at this point.

When you wish to call this algorithm, you will need to have an un-initialized variable to recieve the data, and two variables with the type array<String^>^ filled with the appropriate data; an example of this is included in the source, and is compiled in the demo. The source file for this is "Main.cpp", it keeps in line with the style and spec I set with the media player code.

Points of interest

Within the initial collection, it counts backward till it hits the "." of the start of an extension. If you wish to use this on files without an extension, it will throw an error by not putting anything into the Ext temporary variable; instead, it will be null which causes .NET to throw the null variable exception when the clone happens. (Please do not do this as it is a tough to find error if you use it as I do - inside a pre-compiled library (.dll file) - that is being called into, and you cannot use a debug version because it does not even work one bit).

History

  • [19 July 2007] - Initial article.
  • [23 January 2009] - Updated the downloads to something that can be read by normal Windows users.

License

This article, along with any associated source code and files, is licensed under The GNU Lesser General Public License (LGPLv3)

Share

About the Author

Rachel Mant
Student University of York
United Kingdom United Kingdom
Having finished college, gaining 3 A-Levels, I am now at University. Successfully I have completed my second year of a 3 year BEng course.

Since starting programming when I was 7, I have worked with many languages and platforms. Most noteably: BBC BASIC on a RISC-OS desktop (my first language), PHP/MySQL for web, and C/C++ and Assembly on Windows and Linux in the last 6 years. I know many more languages and can use any of them on the fly but the list is too extensive for this.

My first version of C/C++ was Microsoft Visual C++ 6, and I was lucky enough to (one year later) have copies of the Visual Studio 2005 Express Editions. Since then I have aquired Visual Studio 2005 Pro which I use as my default environment on Windows. I use the GCC toolchain on Linux along with Makefiles that I write myself.

In the last couple of years I have built my own distribution of Linux from scratch, based on Cross-compiled Linux From Scratch. This distribution, designed to be development-stable, is called Time-Bomb Linux.

You may also be interested in...

Comments and Discussions

 
GeneralMy vote of 1 Pin
Tomas Rapkauskas27-Jan-09 22:33
memberTomas Rapkauskas27-Jan-09 22:33 
GeneralRe: My vote of 1 Pin
RichardMant15-Sep-09 7:12
memberRichardMant15-Sep-09 7:12 
GeneralJust been checking Pin
RichardMant7-Nov-07 12:48
memberRichardMant7-Nov-07 12:48 
GeneralRe: Just been checking Pin
RichardMant10-Nov-08 3:34
memberRichardMant10-Nov-08 3:34 
GeneralRe: Just been checking Pin
RichardMant23-Jan-09 1:09
memberRichardMant23-Jan-09 1:09 
GeneralCMaps Pin
NoExperience24-Jul-07 4:54
memberNoExperience24-Jul-07 4:54 
GeneralRe: CMaps Pin
Richardrichardmant6525-Jul-07 1:49
memberRichardrichardmant6525-Jul-07 1:49 
Generalhmmm... Pin
KellyLeahy19-Jul-07 11:57
memberKellyLeahy19-Jul-07 11:57 
GeneralRe: hmmm... Pin
Richardrichardmant6520-Jul-07 4:55
memberRichardrichardmant6520-Jul-07 4:55 
GeneralRe: hmmm... Pin
KellyLeahy20-Jul-07 5:31
memberKellyLeahy20-Jul-07 5:31 
GeneralRe: hmmm... Pin
Richardrichardmant6520-Jul-07 5:53
memberRichardrichardmant6520-Jul-07 5:53 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web01 | 2.8.150819.1 | Last Updated 23 Jan 2009
Article Copyright 2007 by Rachel Mant
Everything else Copyright © CodeProject, 1999-2015
Layout: fixed | fluid