
A small Content Detection Library

By Uwe Keim, 1 May 2007

Screenshot - ContentDetectorLib_01.png

Introduction

My recently published article about the Zeta Uploader application (in short, a website for uploading files and sending e-mail messages with links to them) sparked a discussion (thanks to Phil.Benson) about the need to administer uploaded files in order to avoid copyright infringement.

This article introduces a library that I wrote yesterday evening and this morning (so it is really fresh) as a first step in that direction.

What the library does

Since I wanted to avoid (at least for now) forcing users of the Zeta Uploader to register and log in before using the service, I decided to try a different approach:

After a file is uploaded, it is checked to see whether it is "prohibited", that is, whether it is a kind of file that may not be uploaded with Zeta Uploader. Currently, I treat movie files (AVI, MOV, etc.) and music files (MP3, WAV, etc.) as prohibited.

How the library works

The detection algorithm uses the following mechanisms to test a file for being prohibited or allowed:

  • File extension

    Look at the file extension. If it matches a given extension on the prohibited list, the file is considered "prohibited".

  • File content

    Look inside the first few bytes of the file for known binary patterns ("magic bytes") and compare them against a list of prohibited patterns.

  • Archive extraction

    If the file is detected to be an archive, it is temporarily extracted and the extracted files are scanned as well (recursively, if they contain archives themselves).
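
The three mechanisms above can be chained roughly as in the following sketch. It is only an illustration under my own assumptions: the class name, the method name and the three callbacks are hypothetical and are not part of the library, whose real engine is the ContentDetectorEngine class used later in this article.

```csharp
using System;
using System.IO;

// Hypothetical sketch of the three-stage check described above.
// All names are illustrative; this is not the library's real engine.
public static class DetectionPipelineSketch
{
    public static bool IsProhibited(
        FileInfo file,
        Predicate<FileInfo> hasProhibitedExtension,
        Predicate<FileInfo> hasProhibitedContent,
        Func<FileInfo, DirectoryInfo> extractIfArchive)
    {
        // Stage 1: cheap check of the file extension.
        if (hasProhibitedExtension(file)) return true;

        // Stage 2: look at the content of the file.
        if (hasProhibitedContent(file)) return true;

        // Stage 3: assumed contract of the callback: if the file is an
        // archive, extract it into a temporary folder and return that
        // folder; otherwise return null.
        DirectoryInfo extracted = extractIfArchive(file);
        if (extracted == null) return false;

        try
        {
            // Scan every extracted file, recursing into nested archives.
            foreach (FileInfo inner in
                extracted.GetFiles("*", SearchOption.AllDirectories))
            {
                if (IsProhibited(inner, hasProhibitedExtension,
                        hasProhibitedContent, extractIfArchive))
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            // Always remove the temporary extraction folder.
            extracted.Delete(true);
        }
    }
}
```

A production version would probably also limit the recursion depth and the total extracted size, so that a deliberately nested archive cannot exhaust the disk.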

The next section briefly discusses these different mechanisms.

File extension checking

This check looks only at the extension of the file name. Since an extension is easy to fake, it serves only as a quick first check. If the extension matches, detection for that file is finished and the file is considered prohibited.

If not, a content analysis is done, as described next.
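
A minimal sketch of such an extension check could look like this. The class name and the hard-coded list are assumptions made for illustration; the list contains only the four extensions mentioned earlier in this article, and the library itself may store and compare its list differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch; not the library's actual extension check.
public static class ExtensionCheckSketch
{
    // Case-insensitive set, so ".MP3" and ".mp3" are treated alike.
    private static readonly HashSet<string> ProhibitedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".avi", ".mov", ".mp3", ".wav"
        };

    public static bool HasProhibitedExtension(FileInfo file)
    {
        // FileInfo.Extension includes the leading dot and is an
        // empty string when the file name has no extension.
        return ProhibitedExtensions.Contains(file.Extension);
    }
}
```

Renaming a file is enough to get past this test, which is why a non-matching extension leads on to the content analysis instead of clearing the file.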

Content analysis

The main work of the library is to apply simple "pattern matching" to the content of a file. Through an extensible ISignatureChecker interface, more complex tests can be added later. I've included a simple check for MP3s that does a little bit more than just pattern matching (class Mp3SignatureChecker).
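
To illustrate what "a little bit more than just pattern matching" can mean for MP3 files, here is a sketch of a common heuristic. It is not the code of Mp3SignatureChecker; the class and method names are hypothetical. It accepts a buffer that either starts with an ID3v2 tag or starts with something that looks like an MPEG Layer III frame header.

```csharp
using System;

// Hypothetical heuristic; not the article's Mp3SignatureChecker.
public static class Mp3HeuristicSketch
{
    public static bool LooksLikeMp3(byte[] buffer)
    {
        if (buffer == null || buffer.Length < 3) return false;

        // Case 1: the data starts with an ID3v2 tag ("ID3" in ASCII).
        if (buffer[0] == 0x49 && buffer[1] == 0x44 && buffer[2] == 0x33)
        {
            return true;
        }

        // Case 2: an MPEG audio frame header starts with 11 set sync bits.
        bool sync = buffer[0] == 0xFF && (buffer[1] & 0xE0) == 0xE0;
        if (!sync) return false;

        // A plain pattern match would stop here. The extra step is to
        // reject field values that cannot occur in an MP3 frame header.
        int version = (buffer[1] >> 3) & 0x03;      // 1 is reserved
        int layer = (buffer[1] >> 1) & 0x03;        // 1 means Layer III
        int bitrateIndex = (buffer[2] >> 4) & 0x0F; // 15 is invalid

        return version != 1 && layer == 1 && bitrateIndex != 15;
    }
}
```

This remains a heuristic: a text file that happens to begin with the letters "ID3" would also be flagged, and an MP3 file with leading junk before the first frame would be missed.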

The ISignatureChecker interface is defined as follows:

/// <summary>
/// Interface to implement when checking a buffer
/// for a certain signature.
/// </summary>
internal interface ISignatureChecker
{
    /// <summary>
    /// Check whether a given buffer matches the signature.
    /// </summary>
    /// <param name="buffer">The buffer.</param>
    /// <returns><c>true</c> if the buffer matches the signature;
    /// otherwise, <c>false</c>.</returns>
    bool MatchesSignature(
        byte[] buffer );

    /// <summary>
    /// Gets the first number of bytes to read.
    /// </summary>
    /// <value>The first number of bytes to read.</value>
    int FirstNumberOfBytesToRead
    {
        get;
    }

    /// <summary>
    /// Gets the minimum length of the required buffer.
    /// </summary>
    /// <value>The minimum length of the required buffer.</value>
    int MinimumRequiredBufferLength
    {
        get;
    }
}

Through this interface, the check engine communicates with the individual signature checker implementations. See the source files for details and examples.
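
As an illustration, a checker that does plain pattern matching at the start of the buffer might be written as follows. The class PatternSignatureChecker is hypothetical and not taken from the download, and the interface is repeated (without comments) only so that the sketch compiles on its own. I am also assuming that the engine reads FirstNumberOfBytesToRead bytes from the start of the file and only calls MatchesSignature with a buffer of at least MinimumRequiredBufferLength bytes; the sketch checks the length itself anyway.

```csharp
using System;

// Repeated from the article so that the sketch is self-contained.
internal interface ISignatureChecker
{
    bool MatchesSignature(byte[] buffer);
    int FirstNumberOfBytesToRead { get; }
    int MinimumRequiredBufferLength { get; }
}

// Hypothetical checker: matches a fixed byte pattern at offset 0.
internal sealed class PatternSignatureChecker : ISignatureChecker
{
    private readonly byte[] _pattern;

    public PatternSignatureChecker(byte[] pattern)
    {
        if (pattern == null || pattern.Length == 0)
        {
            throw new ArgumentException("Pattern must not be empty.");
        }
        // Copy the array so later changes by the caller have no effect.
        _pattern = (byte[])pattern.Clone();
    }

    // Only the bytes covered by the pattern are needed.
    public int FirstNumberOfBytesToRead
    {
        get { return _pattern.Length; }
    }

    public int MinimumRequiredBufferLength
    {
        get { return _pattern.Length; }
    }

    public bool MatchesSignature(byte[] buffer)
    {
        if (buffer == null || buffer.Length < _pattern.Length)
        {
            return false;
        }
        for (int i = 0; i < _pattern.Length; i++)
        {
            if (buffer[i] != _pattern[i]) return false;
        }
        return true;
    }
}
```

For example, new PatternSignatureChecker( new byte[] { 0x52, 0x49, 0x46, 0x46 } ) matches the ASCII characters "RIFF", which open both WAV and AVI files.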

Archive extraction

Since many uploaded files are compressed archives, it is important to extract and scan these, too.

Again, I've built an extensible mini-framework based on the IArchiveExtractor interface to allow for adding more archive extractors in the future.

The interface is defined as follows:

/// <summary>
/// Interface for archive extractors.
/// </summary>
internal interface IArchiveExtractor
{
    /// <summary>
    /// Extracts the specified file path.
    /// </summary>
    /// <param name="filePath">The file path.</param>
    /// <param name="folderPathToExtractInto">The folder path
    /// to extract into.</param>
    void Extract(
        FileInfo filePath,
        DirectoryInfo folderPathToExtractInto );
}

Currently, I am using SharpZipLib to provide extractors for ZIP, gzip and bzip2.
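
To show the shape of an extractor, here is a sketch of a ZIP implementation. Note that it deliberately swaps SharpZipLib for the ZipFile class from System.IO.Compression (available since .NET Framework 4.5), so that it needs no external reference; it is therefore not the extractor from the download. The class name is hypothetical, and the interface is repeated only to keep the sketch self-contained.

```csharp
using System.IO;
using System.IO.Compression;

// Repeated from the article so that the sketch is self-contained.
internal interface IArchiveExtractor
{
    void Extract(
        FileInfo filePath,
        DirectoryInfo folderPathToExtractInto );
}

// Hypothetical extractor based on System.IO.Compression.ZipFile
// instead of SharpZipLib.
internal sealed class ZipArchiveExtractorSketch : IArchiveExtractor
{
    public void Extract(
        FileInfo filePath,
        DirectoryInfo folderPathToExtractInto )
    {
        // Create the target folder if it does not exist yet.
        folderPathToExtractInto.Create();

        // Extract all entries of the archive into the target folder.
        ZipFile.ExtractToDirectory(
            filePath.FullName,
            folderPathToExtractInto.FullName );
    }
}
```

The target folder should be a fresh, empty temporary folder, because ExtractToDirectory throws an exception when a file to be extracted already exists there.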

Test application

There is no test application in the download. Instead the following code snippet is the complete Main function of my own testing console application.

/// <summary>
/// The main function.
/// </summary>
private static void Main()
{
    // Instantiate the engine.
    ContentDetectorEngine engine = new ContentDetectorEngine();

    // --
    // Testing discrete files.

    // Collect some files to test.
    FileInfo[] filePaths = new FileInfo[]
    {
        new FileInfo( @"c:\AnotherFolder\112431940.mp3" ),
        new FileInfo( @"c:\AnotherFolder\247293565.txt" ),
        new FileInfo( @"c:\AnotherFolder\008284502.zip" ),
        new FileInfo( @"c:\AnotherFolder\190243241.mdb" ),
        new FileInfo( @"c:\AnotherFolder\182944456.zip" ),
    };

    // Iterate over the files.
    foreach ( FileInfo filePath in filePaths )
    {
        bool contains =
            engine.ContainsFileProhibitedContent( filePath );
        Console.WriteLine(
            @"Contains '{0}': {1}.",
            filePath.Name,
            contains );
    }

    // --
    // Testing a complete folder.

    // Find all files in the given folder.
    FileInfo[] prohibitedPaths =
        engine.ContainsFolderProhibitedContent(
        new DirectoryInfo(
        @"C:\SomeFolder" ) );

    Console.WriteLine( @"Folder contains {0} prohibited files.",
        prohibitedPaths.Length );

    foreach ( FileInfo prohibitedPath in prohibitedPaths )
    {
        Console.WriteLine(
            "\tProhibited file: '{0}'.", prohibitedPath );
    }
}

Simply copy it into your own console application and you are done.

Conclusion

In this article, I've shown you a library that detects file types based on their content. Although this is only a first version and some of its approaches are probably somewhat naive, I'm sure the code is useful and can be extended in the future to become even more usable.

If you have feedback, questions or comments, simply post them in the comments section below. I'm looking forward to your messages!

References

  1. HeaderSig.txt - Several signatures for file types
  2. Magic number (programming) - Wikipedia article

History

  • 2007-05-01: Initial release of the library

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Uwe Keim
Chief Technology Officer, Zeta Producer Desktop CMS
Germany
Uwe has been programming since 1989, with experience in Assembler, C++, MFC and lots of web and database stuff, and now uses ASP.NET and C# extensively, too. He has also taught programming to students at the local university.

In his free time, he enjoys climbing, running and mountain biking. Recently he became the father of a cute boy.
 

Article Copyright 2007 by Uwe Keim