Introduction
Many programmers have come across the problem of finding files matching a certain specification that exist in one or more directories or subdirectories. This article demonstrates a technique of enumerating files that integrates well with .NET programming idioms and that takes advantage of some of the new features of C# 2.0. The result is a class that's very easy to use and is more efficient (in time and space) than many other solutions.
Background
On Windows, finding files is done by calling the Win32 functions FindFirstFile{Ex}
and FindNextFile
. A call to FindFirstFile{Ex}
locates an initial match, and subsequent calls to FindNextFile
return subsequent matches. When all matches have been found, FindClose
must be called to reclaim the resources used by Windows to maintain the enumeration state machine.
Unfortunately, many real-world programs need more than what FindFirstFile
and its siblings can provide. Features such as searching subdirectories, searching for files matching any of the several specifications, and searching from multiple starting points all require additional programming. FindFirstFileEx
, which first appeared in Windows 2000, offers a bit more functionality, but it's insufficient for most needs that go beyond what FindFirstFile
provides, and enhances only a couple of very specialized scenarios.
The .NET Framework, starting with version 1.0, provides a more abstract interface for enumerating file names in directories in the form of System.IO.Directory
class, specifically the GetFiles
and GetDirectories
methods of that class. GetFiles
allows the programmer to get, in a single call, all the files matching a "wildcard" that exist in a specified directory, optionally including subdirectories. For many uses, this is a huge improvement and is often sufficient, but it's not a panacea. It still falls short in a few places:
- Since the return is a
string[]
, your program cannot start working with any results until the entire list has been built. If you're searching a network path, or a directory with 10's of 1000's of files, that can be several seconds.
- There's no provision for searching files matching several specifications in a single search.
- There's no provision for searching from several starting paths in a single search.
- If there are a large number of matches, a large amount of memory may be consumed since the function returns the full path to each and every match in a single array.
A better solution
Without getting into the class design, imagine that you could write code like:
using System.IO;
foreach (FileInfo fi in SomeObject.SomeMethod())
{
}
Here, we're taking advantage of the C# compiler's built in support for .NET's System.Collections.IEnumerable
pattern. In this hypothetical case, SomeObject.SomeMethod
would return IEnumerable
, or better yet, System.Collections.Generic.IEnumerable<FileInfo>
, allowing the caller to iterate over the matches one at a time.
This model offers a number of advantages:
- The results of the search are presented one at a time, the way you probably wanted them anyway.
- The
IEnumerable
pattern facilitates lazy evaluation.
- Returning
FileInfo
provides easy access to other information about the file beyond just the name.
- This pattern doesn't require the use of a large amount of memory, like returning a
string[]
does.
Lazy evaluation refers to a technique that puts off doing work until the last possible moment. In this case, scanning directories for files can be delayed as long as possible. An implementation can delay calls to FindFirstFile
and FindNextFile
until the consumption of the IEnumerable<FileInfo>
. If your program only needs the first 10 results, 10 calls to MoveNext
on the IEnumerator
will find those matches and you won't pay the cost of finding all the matches. Equally important, no large array of matching files is ever constructed, reducing the memory pressure that your application puts on the CLR.
The FileSystemEnumerator class
The FileSystemEnumerator
class has the following public interface (C#):
using System;
using System.Collections.Generic;
using System.IO;
public sealed class FileSystemEnumerator : IDisposable
{
public void Dispose()
{
}
public FileSystemEnumerator(string pathsToSearch,
string fileTypesToMatch, bool includeSubDirs)
{
}
public IEnumerable<FileInfo> Matches()
{
}
}
Using the class
Using the FileSystemEnumerator
class is very simple:
string pathsToSearch = "your paths here, separated by semicolons";
string filesToMatch =
"your file specifications here, separated by semicolons";
bool includeSubDirectories = true;
using (FileSystemEnumerator fse = new
FileSystemEnumerator(pathsToSearch,
filesToMatch,includeSubdirectories))
{
foreach (FileInfo fi in fse.Matches())
{
}
}
The demo project
The included demo project is a simple Windows Forms (VS 2005) project that does directory listings using the FileSystemEnumerator
class:
About the implementation
The implementation of FileSystemEnumerator
makes use of the new iterator support built into C# 2.0. See the C# Programmer's Guide for information about the new yield return
construct which makes it very easy to wrap an arbitrarily complex state machine (such as the recursive directory searching) in an iterator interface. The implementation also makes use of simulated recursion. This technique makes use of an explicit stack of structures that store information about the context of the search at a particular level of directory nesting. The use of simulated recursion allows FileSystemEnumerator
to make use of the C# 2.0 yield return
construct, which would not be possible if a truly recursive search routine was used.
Interfacing with Win32
To get information about the files, we still need to use FindFirstFile
and the related functions from Win32. Unfortunately (and inexplicably), the .NET Framework still doesn't provide access to these functions directly, we have to use P/Invoke to get access to these functions. All this P/Invoke code is wrapped up in a nested namespace to make it clear that it's not part of the official FileSystemEnumerator
interface, but rather an implementation detail:
namespace Win32
{
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
class FindData
{
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public String fileName;
}
class SafeFindHandle :
Microsoft.Win32.SafeHandles.SafeHandleMinusOneIsInvalid
{
public SafeFindHandle() { ... }
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
protected override bool ReleaseHandle() { ... }
}
[SecurityPermissionAttribute(SecurityAction.Demand,
UnmanagedCode = true)]
static class SafeNativeMethods
{
[DllImport("Kernel32.dll", CharSet = CharSet.Auto)]
public static extern SafeFindHandle
FindFirstFile(String fileName,
[In, Out] FindData findFileData);
[DllImport("kernel32", CharSet = CharSet.Auto)]
[return: MarshalAs(UnmanagedType.Bool)]
public static extern bool FindNextFile(SafeFindHandle hFindFile,
[In, Out] FindData lpFindFileData);
[DllImport("kernel32", CharSet = CharSet.Auto)]
[return: MarshalAs(UnmanagedType.Bool)]
public static extern bool FindClose(IntPtr hFindFile);
}
}
In this namespace, you'll find three components:
FindData
- a struct
that receives the output from Find{First|Next}File
.
SafeNativeMethods
- a static
class containing the P/Invoke declarations for the Win32 file find functions.
SafeFindHande
- a SafeHandle
that holds onto a find handle and ensures that it's closed even in the face of thread aborts and other asynchronous exceptions.
FindData and SafeNativeMethods
These two components are a straightforward P/Invoke wrapper of the FindFirstFile
family of functions. In case you don't know, you can find 100's of P/Invoke declarations pre-built on the site pinvoke.net. For example, here's the declaration for FindFirstFile
from that site. The declarations used in the implementation are derived from an MSDN sample.
SafeFindHandle
The SafeFindHandle
class wraps the unmanaged HANDLE
that's returned by FindFirstFile
. This class derives (indirectly) from the System.Runtime.InteropServices.SafeHandle
class and provides a key safety mechanism to ensure that unmanaged resources are always cleaned up. SafeHandle
s have what are known as Critical Finalizers, which means that the CLR will make every attempt to run the finailizer even in the face of asynchronous exceptions like ThreadAbortException
and OutOfMemoryException
. Having the unmanaged resources protected by a critical finalizer helps make FileSystemEnumerator
usable in critical hosted environments such as SQL Server 2005.
Interfacing with the programmer
The programmer deals with FileSystemEnumerator
primarily through two functions: the constructor, and the Matches
. The constructor takes the paths and file specifications that the programmer supplies, and does some pre-processing on them to simplify the implementation of Matches
:
private bool m_includeSubDirs;
private string[] m_paths;
private List<Regex> m_fileSpecs;
public FileSystemEnumerator(string pathsToSearch,
string fileTypesToMatch, bool includeSubDirs)
{
m_includeSubDirs = includeSubDirs;
m_paths =
pathsToSearch.Split(new char[] { ';', ',' });
string[] specs =
fileTypesToMatch.Split(new char[] { ';', ',' });
m_fileSpecs = new List<Regex>(specs.Length);
foreach (string spec in specs)
{
string pattern = spec
.Trim()
.Replace(".", @"\.")
.Replace("*", @".*")
.Replace("?", @".?")
;
m_fileSpecs.Add(new Regex("^" + pattern +
"$", RegexOptions.IgnoreCase));
}
}
Here, the pathsToSearch
value is split into an array of strings using string.Split
. Note that if you'd like to change the delimiters that are valid in pathsToSplit
, simply change the array of char
that's passed to the parameter of Split
.
The parameter fileTypesToMatch
gets a little more processing. Instead of storing the file specifications as supplied by the programmer, we convert them into regular expressions, as implemented by the .NET System.Text.RegularExpressions.Regex
class. We discard leading and trailing whitespaces, then make a few substitutions to convert the DOS-like "wildcard" into a valid regular expression.
Finding files recursively without using recursion
Of course, the real meat of FileSystemEnumerator
is in the Matches
function. Here it is, working from "the outside in":
public IEnumerable<FileInfo> Matches()
{
foreach (string rootPath in m_paths)
{
string path = rootPath.Trim();
}
}
Pretty simple so far - we iterate over the root paths the caller supplied, and look for files in each one. Let's dig into the code that goes in that loop:
Win32.FindData findData = new Win32.FindData();
Win32.SafeFindHandle handle =
Win32.SafeNativeMethods.FindFirstFile(
Path.Combine(path, "*"), findData);
if (!handle.IsInvalid)
{
do
{
if (findData.fileName.Equals(@".") ||
findData.fileName.Equals(@".."))
continue;
if ((findData.fileAttributes &
(int)FileAttributes.Directory) != 0)
{
if (m_includeSubDirs)
{
}
}
else
{
foreach (Regex fileSpec in m_fileSpecs)
{
if (fileSpec.IsMatch(findData.fileName))
yield return new
FileInfo(Path.Combine(path, findData.fileName));
}
}
} while (Win32.SafeNativeMethods.FindNextFile(
handle, findData));
handle.Close();
}
Ignoring the fact that we're not yet handling subdirectories, this is a typical find/next/close loop that you'll see in any discussion of listing files on Windows operating systems. A couple of things are worth noting at this point:
- The implementation always passes '*' to
FindFirstFile
so that directories are returned regardless of their name. File entries that are returned are tested against the file specifications that were passed in to the constructor, by using the Regex.IsMatch
function to determine if a file name matches a particular file specification.
- The new C# 2.0
yield return
construct is used to return matches from the middle of a conventional procedural loop. To the caller, this results in a return from IEnumerator<FileInfo>.MoveNext()
, and the next time the programmer calls MoveNext
, our loop picks up right after the yield return
statement. Cool, isn't it? (Not to knock the C# designers at all, but the idea of iterators as a language construct isn't new. Metaware's Pascal compiler implemented even fancier iterator support over 20 years ago, as did their "High C" compiler.)
Now, a typical recursive file find routine (such as the example from pinvoke.net), would simply call itself recursively to find files in subdirectories. Unfortunately, it's not possible to write a recursive function that uses yield return
- the function return type and parameters simply aren't "right" to use recursively, and yield return
can't be used indirectly - it must appear directly in the body of a function that returns an enumerator or enumerable type.
All is not lost though. In data structures and algorithms classes, we're taught that recursion can always be replaced by iteration (interesting, the reverse is also true). How do you convert recursion to iteration? It's simple - you identify the variables that are unique to each activation of the recursive function, and build your own stack to store those variables:
private struct SearchInfo
{
public Win32.SafeFindHandle Handle;
public string Path;
public SearchInfo(Win32.SafeFindHandle h, string p)
{
Handle = h;
Path = p;
}
}
private Stack<SearchInfo> m_scopes;
Here, we have a simple structure that holds two pieces of information that are unique to each "recursion": the full path being searched and the find handle for that path. We maintain a stack of these structures using the new System.Collections.Generic.Stack<T>
class that the 2.0 .NET Framework supplies.
We implement our simulated recursion by manually maintaining a stack:
public IEnumerable<FileInfo> Matches()
{
foreach (string rootPath in m_paths)
{
string path = rootPath.Trim();
top:
Win32.SafeFindHandle handle =
m_scopes.Push(new SearchInfo(handle, path));
bool restart = false;
restart:
if (!handle.IsInvalid)
{
do
{
if (restart)
{
restart = false;
continue;
}
if ((findData.fileAttributes &
(int)FileAttributes.Directory) != 0)
{
if (m_includeSubDirs)
{
path = Path.Combine(path, findData.fileName);
goto top;
}
}
else
{
}
} while (Win32.SafeNativeMethods.FindNextFile(
handle.DangerousGetHandle(), findData));
m_scopes.Pop();
if (m_scopes.Count > 0)
{
SearchInfo si = m_scopes.Peek();
handle = si.Handle;
path = si.Path;
restart = true;
goto restart;
}
}
}
}
Yes, you read right, I used goto
. Not once, but twice. There's a reason why it's still in the language this many decades after the formalization of structured programming - sometimes it's just the only way. Note that, if C# allowed goto
into the middle of a loop, the variable restart
could be eliminated. All in all, not being able to jump into the middle of a loop is a good thing, so we'll have to be content with having an extra bool
to compensate.
Some readers have commented that it's possible to eliminate the simulated recursion (and hence the goto
) by restructuring the code (see the comments, below). I've kept the original form because it produces matches in what many users will find to be the most "natural" order.
Cleaning up afterwards
Since FileSystemEnumerator
implements IDisposable
, it should generally be used in a using
block in the C# code, since it guarantees that Dispose
will be called. The implementation of Dispose
makes sure that no find handles are still open (remember that stack? it's a member variable of the class so Dispose
can get at it):
public void Dispose()
{
while (m_scopes.Count > 0)
{
SearchInfo si = m_scopes.Pop();
si.Handle.Close();
}
}
Note that FileSystemEnumerator
is not finalizable. Since the handles are held by SafeFindHandle
, which is itself finalizable, there's no need for FileSystemEnumerator
to have a finalizer.
Limitations and caveats
The current implementation comes with a couple caveats:
- Win32 wildcard matching is actually a lot more complex than "? matches 1 character and * matches any number". This class doesn't make any attempt to match all of the subtlety of the actual Windows matching rules. For most uses, this simple interpretation is sufficient, and is much closer to how most people think ? and * work anyway. Supporting all of the special cases is left as an exercise for the reader. See this newsgroup posting for a discussion of the rules that Windows really uses.
- No support for concurrent searches using a single instance of this class. If two threads call
Matches
, or a single thread calls Matches
recursively on a single instance, the results will be wrong as the two searches will interfere with one another. Handling concurrent searches with a single instance probably would require a change in design. Adding code to defend against attempts at concurrent searches (e.g., throw an exception) is left as an exercise for the reader. Note that concurrent searches using separate instances of this class will work just fine with the present implementation.
Thanks to
Richard Deeming for his suggestions on making better use of SafeHandle
.
History
- 21st Feb, 2006
- Changed
FindFirstFile
to return SafeFindHandle
directly.
- 20th Feb, 2006
- Added
FileIOPermission
checks, added checks for invalid arguments.
- 15th Jan, 2006
- Significantly expanded the implementation discussion.
- 14th Jan, 2006