Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

File System Enumerator using lazy matching

0.00/5 (No votes)
23 Feb 2006 1  
An efficient solution for finding files.

Introduction

Many programmers have come across the problem of finding files matching a certain specification that exist in one or more directories or subdirectories. This article demonstrates a technique of enumerating files that integrates well with .NET programming idioms and that takes advantage of some of the new features of C# 2.0. The result is a class that's very easy to use and is more efficient (in time and space) than many other solutions.

Background

On Windows, finding files is done by calling the Win32 functions FindFirstFile{Ex} and FindNextFile. A call to FindFirstFile{Ex} locates an initial match, and subsequent calls to FindNextFile return subsequent matches. When all matches have been found, FindClose must be called to reclaim the resources used by Windows to maintain the enumeration state machine.

Unfortunately, many real-world programs need more than what FindFirstFile and its siblings can provide. Features such as searching subdirectories, searching for files matching any of the several specifications, and searching from multiple starting points all require additional programming. FindFirstFileEx, which first appeared in Windows 2000, offers a bit more functionality, but it's insufficient for most needs that go beyond what FindFirstFile provides, and enhances only a couple of very specialized scenarios.

The .NET Framework, starting with version 1.0, provides a more abstract interface for enumerating file names in directories in the form of System.IO.Directory class, specifically the GetFiles and GetDirectories methods of that class. GetFiles allows the programmer to get, in a single call, all the files matching a "wildcard" that exist in a specified directory, optionally including subdirectories. For many uses, this is a huge improvement and is often sufficient, but it's not a panacea. It still falls short in a few places:

  • Since the return is a string[], your program cannot start working with any results until the entire list has been built. If you're searching a network path, or a directory with 10's of 1000's of files, that can be several seconds.
  • There's no provision for searching files matching several specifications in a single search.
  • There's no provision for searching from several starting paths in a single search.
  • If there are a large number of matches, a large amount of memory may be consumed since the function returns the full path to each and every match in a single array.

A better solution

Without getting into the class design, imagine that you could write code like:

using System.IO;
// ...

foreach (FileInfo fi in SomeObject.SomeMethod())
{
    // ... do something with this match

}

Here, we're taking advantage of the C# compiler's built in support for .NET's System.Collections.IEnumerable pattern. In this hypothetical case, SomeObject.SomeMethod would return IEnumerable, or better yet, System.Collections.Generic.IEnumerable<FileInfo>, allowing the caller to iterate over the matches one at a time.

This model offers a number of advantages:

  • The results of the search are presented one at a time, the way you probably wanted them anyway.
  • The IEnumerable pattern facilitates lazy evaluation.
  • Returning FileInfo provides easy access to other information about the file beyond just the name.
  • This pattern doesn't require the use of a large amount of memory, like returning a string[] does.

Lazy evaluation refers to a technique that puts off doing work until the last possible moment. In this case, scanning directories for files can be delayed as long as possible. An implementation can delay calls to FindFirstFile and FindNextFile until the consumption of the IEnumerable<FileInfo>. If your program only needs the first 10 results, 10 calls to MoveNext on the IEnumerator will find those matches and you won't pay the cost of finding all the matches. Equally important, no large array of matching files is ever constructed, reducing the memory pressure that your application puts on the CLR.

The FileSystemEnumerator class

The FileSystemEnumerator class has the following public interface (C#):

using System;
using System.Collections.Generic;
using System.IO;

/// <summary>

/// File system enumerator. This class provides an easy to use,

/// efficient mechanism for searching a list of

/// directories for files matching a list of file

/// specifications. The search is done incrementally as matches

/// are consumed, so the overhead before processing

/// the first match is always kept to a minimum.

/// </summary>

public sealed class FileSystemEnumerator : IDisposable
{
    /// <summary>

    /// IDisposable.Dispose

    /// </summary>

    public void Dispose() 
    { 
        // ... 

    }

    /// <summary>

    /// Constructor.

    /// </summary>

    /// <param name="pathsToSearch">Semicolon- or 

    ///   comma-delimitted list of paths to search.</param>

    /// <param name="fileTypesToMatch">Semicolon- or

    ///   comma-delimitted list of wildcard filespecs to match.</param>

    /// <param name="includeSubDirs">If true, 

    ///   subdirectories are searched.</param>

    public FileSystemEnumerator(string pathsToSearch, 
           string fileTypesToMatch, bool includeSubDirs)
    {
        // ...

    }

    /// <summary>

    /// Get an enumerator that returns all

    /// of the files that match the wildcards that

    /// are in any of the directories to be searched.

    /// </summary>

    /// <returns>An IEnumerable that returns

    /// all matching files one by one.</returns>

    /// <remarks>The enumerator that is returned

    /// finds files using a lazy algorithm that

    /// searches directories incrementally

    /// as matches are consumed.</remarks>

    public IEnumerable<FileInfo> Matches()
    {
        // ...

    }
}

Using the class

Using the FileSystemEnumerator class is very simple:

string pathsToSearch = "your paths here, separated by semicolons";
string filesToMatch = 
      "your file specifications here, separated by semicolons";
bool includeSubDirectories = true;  // or false if you'd prefer


using (FileSystemEnumerator fse = new 
   FileSystemEnumerator(pathsToSearch, 
   filesToMatch,includeSubdirectories))
{
    foreach (FileInfo fi in fse.Matches())
    {
        // Do something with fi

    }
}

The demo project

The included demo project is a simple Windows Forms (VS 2005) project that does directory listings using the FileSystemEnumerator class:

About the implementation

The implementation of FileSystemEnumerator makes use of the new iterator support built into C# 2.0. See the C# Programmer's Guide for information about the new yield return construct which makes it very easy to wrap an arbitrarily complex state machine (such as the recursive directory searching) in an iterator interface. The implementation also makes use of simulated recursion. This technique makes use of an explicit stack of structures that store information about the context of the search at a particular level of directory nesting. The use of simulated recursion allows FileSystemEnumerator to make use of the C# 2.0 yield return construct, which would not be possible if a truly recursive search routine was used.

Interfacing with Win32

To get information about the files, we still need to use FindFirstFile and the related functions from Win32. Unfortunately (and inexplicably), the .NET Framework still doesn't provide access to these functions directly, we have to use P/Invoke to get access to these functions. All this P/Invoke code is wrapped up in a nested namespace to make it clear that it's not part of the official FileSystemEnumerator interface, but rather an implementation detail:

namespace Win32
{
  /// <summary>

  /// Structure that maps to WIN32_FIND_DATA

  /// </summary>

  [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
  class FindData
  {
    // ... other members  

    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
    public String fileName;
    // ... other members

  }

  /// <summary>

  /// SafeHandle class for holding find handles

  /// </summary>

  class SafeFindHandle : 
    Microsoft.Win32.SafeHandles.SafeHandleMinusOneIsInvalid
  {
    /// <summary>

    /// Constructor

    /// </summary>

    public SafeFindHandle() { ... }

    /// <summary>

    /// Release the find handle

    /// </summary>

    /// <returns>true if the handle was released</returns>

    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
    protected override bool ReleaseHandle() { ... }
  }

  /// <summary>

  /// Wrapper for P/Invoke methods used by FileSystemEnumerator

  /// </summary>

  [SecurityPermissionAttribute(SecurityAction.Demand, 
                               UnmanagedCode = true)]
  static class SafeNativeMethods
  {
    [DllImport("Kernel32.dll", CharSet = CharSet.Auto)]
    public static extern SafeFindHandle 
      FindFirstFile(String fileName, 
      [In, Out] FindData findFileData);

    [DllImport("kernel32", CharSet = CharSet.Auto)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool FindNextFile(SafeFindHandle hFindFile, 
                         [In, Out] FindData lpFindFileData);

    [DllImport("kernel32", CharSet = CharSet.Auto)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool FindClose(IntPtr hFindFile);
  }
}

In this namespace, you'll find three components:

  • FindData - a struct that receives the output from Find{First|Next}File.
  • SafeNativeMethods - a static class containing the P/Invoke declarations for the Win32 file find functions.
  • SafeFindHande - a SafeHandle that holds onto a find handle and ensures that it's closed even in the face of thread aborts and other asynchronous exceptions.

FindData and SafeNativeMethods

These two components are a straightforward P/Invoke wrapper of the FindFirstFile family of functions. In case you don't know, you can find 100's of P/Invoke declarations pre-built on the site pinvoke.net. For example, here's the declaration for FindFirstFile from that site. The declarations used in the implementation are derived from an MSDN sample.

SafeFindHandle

The SafeFindHandle class wraps the unmanaged HANDLE that's returned by FindFirstFile. This class derives (indirectly) from the System.Runtime.InteropServices.SafeHandle class and provides a key safety mechanism to ensure that unmanaged resources are always cleaned up. SafeHandles have what are known as Critical Finalizers, which means that the CLR will make every attempt to run the finailizer even in the face of asynchronous exceptions like ThreadAbortException and OutOfMemoryException. Having the unmanaged resources protected by a critical finalizer helps make FileSystemEnumerator usable in critical hosted environments such as SQL Server 2005.

Interfacing with the programmer

The programmer deals with FileSystemEnumerator primarily through two functions: the constructor, and the Matches. The constructor takes the paths and file specifications that the programmer supplies, and does some pre-processing on them to simplify the implementation of Matches:

private bool m_includeSubDirs;
private string[] m_paths;
private List<Regex> m_fileSpecs;

// ...


public FileSystemEnumerator(string pathsToSearch, 
       string fileTypesToMatch, bool includeSubDirs)
{
  // ...


  m_includeSubDirs = includeSubDirs;
  m_paths = 
    pathsToSearch.Split(new char[] { ';', ',' });
  
  string[] specs = 
     fileTypesToMatch.Split(new char[] { ';', ',' });
  m_fileSpecs = new List<Regex>(specs.Length);
  foreach (string spec in specs)
  {
    // trim whitespace off file spec and convert

    // Win32 wildcards to regular expressions

    string pattern = spec
      .Trim()
      .Replace(".", @"\.")
      .Replace("*", @".*")
      .Replace("?", @".?")
      ;
    m_fileSpecs.Add(new Regex("^" + pattern + 
            "$", RegexOptions.IgnoreCase));
  }
}

Here, the pathsToSearch value is split into an array of strings using string.Split. Note that if you'd like to change the delimiters that are valid in pathsToSplit, simply change the array of char that's passed to the parameter of Split.

The parameter fileTypesToMatch gets a little more processing. Instead of storing the file specifications as supplied by the programmer, we convert them into regular expressions, as implemented by the .NET System.Text.RegularExpressions.Regex class. We discard leading and trailing whitespaces, then make a few substitutions to convert the DOS-like "wildcard" into a valid regular expression.

Finding files recursively without using recursion

Of course, the real meat of FileSystemEnumerator is in the Matches function. Here it is, working from "the outside in":

public IEnumerable<FileInfo> Matches()
{
  foreach (string rootPath in m_paths)
  {
    string path = rootPath.Trim();

    // .. search "path" for files matching

    // any of the regular expressions in m_fileSpecs

  }
}

Pretty simple so far - we iterate over the root paths the caller supplied, and look for files in each one. Let's dig into the code that goes in that loop:

// find all the files in 'path'

// that match any of the regular expressions

Win32.FindData findData = new Win32.FindData();
Win32.SafeFindHandle handle = 
  Win32.SafeNativeMethods.FindFirstFile(
            Path.Combine(path, "*"), findData);

if (!handle.IsInvalid)
{
  do
  {
    // don't match . or ..

    if (findData.fileName.Equals(@".") || 
        findData.fileName.Equals(@".."))
      continue;

    if ((findData.fileAttributes & 
        (int)FileAttributes.Directory) != 0)
    {
      if (m_includeSubDirs)
      {
        // it's a directory - recurse into it

        // ... we'll come back to this

      }
    }
    else
    {
      // it's a file, see if any of the filespecs matches it

      foreach (Regex fileSpec in m_fileSpecs)
      {
        // if this spec matches, return this file's info

        if (fileSpec.IsMatch(findData.fileName))
          yield return new 
            FileInfo(Path.Combine(path, findData.fileName));
      }
    }
  } while (Win32.SafeNativeMethods.FindNextFile(
           handle, findData));

  // close this find handle

  handle.Close();
}

Ignoring the fact that we're not yet handling subdirectories, this is a typical find/next/close loop that you'll see in any discussion of listing files on Windows operating systems. A couple of things are worth noting at this point:

  • The implementation always passes '*' to FindFirstFile so that directories are returned regardless of their name. File entries that are returned are tested against the file specifications that were passed in to the constructor, by using the Regex.IsMatch function to determine if a file name matches a particular file specification.
  • The new C# 2.0 yield return construct is used to return matches from the middle of a conventional procedural loop. To the caller, this results in a return from IEnumerator<FileInfo>.MoveNext(), and the next time the programmer calls MoveNext, our loop picks up right after the yield return statement. Cool, isn't it? (Not to knock the C# designers at all, but the idea of iterators as a language construct isn't new. Metaware's Pascal compiler implemented even fancier iterator support over 20 years ago, as did their "High C" compiler.)

Now, a typical recursive file find routine (such as the example from pinvoke.net), would simply call itself recursively to find files in subdirectories. Unfortunately, it's not possible to write a recursive function that uses yield return - the function return type and parameters simply aren't "right" to use recursively, and yield return can't be used indirectly - it must appear directly in the body of a function that returns an enumerator or enumerable type.

All is not lost though. In data structures and algorithms classes, we're taught that recursion can always be replaced by iteration (interesting, the reverse is also true). How do you convert recursion to iteration? It's simple - you identify the variables that are unique to each activation of the recursive function, and build your own stack to store those variables:

/// <summary>

/// Information that's kept in our

/// stack for simulated recursion

/// </summary>

private struct SearchInfo
{
  /// <summary>

  /// Find handle returned by FindFirstFile

  /// </summary>

  public Win32.SafeFindHandle Handle;

  /// <summary>

  /// Path that was searched to yield the find handle.

  /// </summary>

  public string Path;

  /// <summary>

  /// Constructor

  /// </summary>

  /// <param name="h">Find handle returned by FindFirstFile.</param>

  /// <param name="p">Path corresponding to find handle.</param>

  public SearchInfo(Win32.SafeFindHandle h, string p)
  {
    Handle = h;
    Path = p;
  }
}

/// <summary>

/// Stack of open scopes.

/// This is a member (instead of a local variable)

/// to allow Dispose to close any open

/// find handles if the object is disposed

/// before the enumeration is completed.

/// </summary>

private Stack<SearchInfo> m_scopes;

Here, we have a simple structure that holds two pieces of information that are unique to each "recursion": the full path being searched and the find handle for that path. We maintain a stack of these structures using the new System.Collections.Generic.Stack<T> class that the 2.0 .NET Framework supplies.

We implement our simulated recursion by manually maintaining a stack:

public IEnumerable<FileInfo> Matches()
{
  foreach (string rootPath in m_paths)
  {
    string path = rootPath.Trim();

    // we "recurse" into a new directory by jumping to this spot

  top:
    // ...

    Win32.SafeFindHandle handle = // ...

    m_scopes.Push(new SearchInfo(handle, path));
    bool restart = false;

    // we "return" from a sub-directory by jumping to this spot

  restart:
    if (!handle.IsInvalid)
    {
      do
      {
        // if we restarted the loop

        // (unwound a recursion), fetch the next match

        if (restart)
        {
          restart = false;
          continue;
        }
        
        // ...


        if ((findData.fileAttributes & 
            (int)FileAttributes.Directory) != 0)
        {
          if (m_includeSubDirs)
          {
            // it's a directory - recurse into it

            path = Path.Combine(path, findData.fileName);
            goto top;
          }
        }
        else
        {
            // ...

        }
      } while (Win32.SafeNativeMethods.FindNextFile(
               handle.DangerousGetHandle(), findData));

      // ...


      // unwind the stack - are we still in a recursion?

      m_scopes.Pop();
      if (m_scopes.Count > 0)
      {
        SearchInfo si = m_scopes.Peek();
        handle = si.Handle;
        path = si.Path;
        restart = true;
        goto restart;
      }
    }
  }
}

Yes, you read right, I used goto. Not once, but twice. There's a reason why it's still in the language this many decades after the formalization of structured programming - sometimes it's just the only way. Note that, if C# allowed goto into the middle of a loop, the variable restart could be eliminated. All in all, not being able to jump into the middle of a loop is a good thing, so we'll have to be content with having an extra bool to compensate.

Some readers have commented that it's possible to eliminate the simulated recursion (and hence the goto) by restructuring the code (see the comments, below). I've kept the original form because it produces matches in what many users will find to be the most "natural" order.

Cleaning up afterwards

Since FileSystemEnumerator implements IDisposable, it should generally be used in a using block in the C# code, since it guarantees that Dispose will be called. The implementation of Dispose makes sure that no find handles are still open (remember that stack? it's a member variable of the class so Dispose can get at it):

public void Dispose()
{
  while (m_scopes.Count > 0)
  {
    SearchInfo si = m_scopes.Pop();
    si.Handle.Close();
  }
}

Note that FileSystemEnumerator is not finalizable. Since the handles are held by SafeFindHandle, which is itself finalizable, there's no need for FileSystemEnumerator to have a finalizer.

Limitations and caveats

The current implementation comes with a couple caveats:

  • Win32 wildcard matching is actually a lot more complex than "? matches 1 character and * matches any number". This class doesn't make any attempt to match all of the subtlety of the actual Windows matching rules. For most uses, this simple interpretation is sufficient, and is much closer to how most people think ? and * work anyway. Supporting all of the special cases is left as an exercise for the reader. See this newsgroup posting for a discussion of the rules that Windows really uses.
  • No support for concurrent searches using a single instance of this class. If two threads call Matches, or a single thread calls Matches recursively on a single instance, the results will be wrong as the two searches will interfere with one another. Handling concurrent searches with a single instance probably would require a change in design. Adding code to defend against attempts at concurrent searches (e.g., throw an exception) is left as an exercise for the reader. Note that concurrent searches using separate instances of this class will work just fine with the present implementation.

Thanks to

Richard Deeming for his suggestions on making better use of SafeHandle.

History

  • 21st Feb, 2006
    • Changed FindFirstFile to return SafeFindHandle directly.
  • 20th Feb, 2006
    • Added FileIOPermission checks, added checks for invalid arguments.
  • 15th Jan, 2006
    • Significantly expanded the implementation discussion.
  • 14th Jan, 2006
    • Initial version.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here