Robust MD5 Digest Extended Stored Procedure for Microsoft SQL Server 2000

David A. Gray

Rate me:

3.25/5 (4 votes)

26 Oct 2008CPOL11 min read

28.6K

388

MD5 Digest XSP with error trapping and full Unicode support, in and out.

Introduction

In the course of developing a small application, which I expect to expand in the next few years, I needed to compute and store some MD5 digest strings in a SQL Server 2000 database. When I went in search of working code, so that I wouldn't need to reinvent a wheel that I felt confident existed, I found a tutorial on Extended Stored Procedures for MS SQL Server v7.0, which almost met my needs. Having made the effort to upgrade the code to work with SQL Server 2000, and to support Unicode text in and out, I decided that I should return the favor by publishing the update here.

Background

Although you could put the MD5 digest code in one of the higher layers, or even (perish the thought!) implement the algorithm as a conventional User Defined Function, a really robust implementation, capable of efficiently processing long strings, binary objects such as images, and even whole files, is very CPU intensive. Such applications beg for the speed of a well written C program. Since I already had a stable, proven implementation, written in straight ANSI C, I thought it would be wasteful to attempt to port it to T-SQL. Even if it could be done, the result would be slow and ugly, and would severely limit the scalability of any application that used it.

Extended Stored Procedures were intended to solve exactly this kind of problem, because they let you run CPU bound algorithms at the lowest architectural layer of the database - tables and views - by integrating them into SQL Server, itself.

Extended Stored Procedures communicate with SQL Server through the Open Data Services API, which uses a series of callbacks into the running SQL Server service to identify and fetch parameters required as inputs to the function, to return results, and to report errors such as missing or invalid parameters.

The ODS interface is very flexible. You can pass just about anything into or out of your function, including entire tables.

Using the code

Extended Stored Procedures are implemented as standard Windows DLLs, which SQL Server loads as needed, via LoadLibrary. This DLL exports two functions, xp_md5 and xp_md5W. Another function, __GetXpVersion, is recommended to be incorporated, to be used by the SQL Server service to retrieve the library version for inclusion in error messages.

xp_md5 takes a text, or binary variable, or column value as input, and returns its MD5 digest as a 32 character hexadecimal string. This function treats text as ANSI, and returns the digest as a string of ANSI characters.
xp_md5W takes a text, or binary variable, or column value as input, and returns its MD5 digest as a 32 character hexadecimal string. This function treats text as Unicode, and returns the digest as a string of Unicode characters.

Other than native Unicode support, as opposed to ANSI, the two functions are identical. Both take one required argument, and two optional arguments, as detailed in the following table:

Name	Type	In or Out	Required	Notes
Plaintext	`CHAR`, `VARCHAR`, `TEXT`, or `BINARY`	In	Yes	1
Plaintext Length	Long Integer	In	No	2
Digest	`CHAR (32)` or `NCHAR (32)`	Out	Yes	3

Notes

Essentially, any text, including fixed and variable length character strings, image, and binary formats are permitted. If it can fit in a SQL Server variable, you can digest it. Plaintext is the technical term for the text fed to any cryptographic algorithm.
Regardless of what type of data you pass into the function, the ODS library reports the length in bytes. This makes processing binary objects and Unicode text trivially easy, since the length can be passed to the core MD5 algorithm. Since the MD5 digest algorithm processes bytes, it is unaffected by embedded nulls in binary data and Unicode strings composed entirely of ASCII characters. This also means that you can pass a length of -1, which tells the Extended Stored Procedure to use the length reported by the library for the first argument.
MD5 digests are always 16 bytes in length, and they always convert to a 32 character hexadecimal string. They can be stored in CHAR (32) columns (for ANSI) or NCHAR (32) columns (for Unicode). If you omit the third argument, the function returns a table of one row and one column, containing the MD5 digest.

Before you can use an Extended Stored Procedure, it must be installed into SQL Server, and registered in the Master database.

Copy the library, xp_md5.dll, into your Microsoft SQL Server binaries directory. For a default installation of Microsoft SQL Server, this will be C:\Program Files\Microsoft SQL Server\MSSQL\Binn\.
Adapt the T-SQL script xm_md5.sql, which is included in the \Reference directory of the download, open it into SQL Server Query Analyzer, ensure that the current database is Master, and execute it.
Adapt the T-SQL scripts ww_md5.sql and ww_md5W.sql, which are also in the \Reference directory, and install them into either the Master database, or into the individual databases that need the MD5 digest functions.
Use the functions [dbo].[ww_md5] and dbo].[ww_md5W] just as you would any built-in function or regular User Defined Function. You can use them in computed columns, views, queries, and other T-SQL scripts.

The following simple T-SQL script computes the MD5 digest of the Unicode string "Hello, World!", and returns the result as a single celled table.

SQL

declare @data nchar (32)
declare @text nvarchar (255)
SELECT [MyPlaceMassage].[dbo].[ww_md5W]( 'Hello, world!' )

The above sample is in Unicode_Hello.sql, which is included in the \Reference directory. Connect to your database in Query Analyzer, open the file, and press F5.

A companion script, ANSI_Hello.sql, computes the MD5 digest of the same string, rendered as ANSI text. If you run both, you will see that they produce different strings.

ANSI string 'Hello, world!'	6cd3556deb0da54bca060b4c39479839
Unicode string 'Hello, world!'	d227f77fc6c5c3969a0af57ce1789144

Points of interest

Unlike most C functions, the prototype of an Extended Stored Procedure, as it might appear in a C header file, is not especially useful to a working programmer. This means that you must find another way to document the arguments. I chose to export the T-SQL scripts that I used to install the companion User Defined Functions that make these two Extended SPs available to the database. For a more complex procedure, these UDFs would be an ideal place to house the documentation, since they are readily accessible from both the Query Analyzer and the Enterprise Manager.

I dispensed with a header altogether, and intended to dispense with a .LIB file, since you would never link to this DLL from a regular C or C++ program. If this library had one, it would list identical signatures for every exported function, because these are dictated by the ODS API. This is because the ODS library, opends60.dll, calls all Extended Stored Procedures on behalf of user code.

Your function calls back into opends60.dll, using methods similar to the ones used by COM and other interfaces that support late binding. As is usually the case with such interfaces, you must copy data supplied by the interface into your own working storage before you can pass it around. Hence, the outermost layer of any Extended Stored Procedure, and the only aspect of it that differs from ordinary C code, is the code that fetches parameters and returns data to the interface.

Just because a caller is supposed to give us at least one, and up to three, parameters, doesn't mean that we can assume that he did. Therefore, the first task is to get an argument count from the interface and test it.

C++

int nArgs = srv_rpcparams ( pSrvProc ) ;
if ( nArgs > 0 )

Since all data binding is late, and both procedures accept arbitrarily large chunks of data, the next task is to call the library with null pointers, to get the size and type of data to be processed.

C++

srv_paraminfo (
    pSrvProc ,
    PARAM_ORDINAL_1 ,
    &cType ,
    &uMaxLen ,
    &uLen ,
    NULL ,
    &fNull ) ;
    // Get info about param 1 (type & size), but not data.

if ( IsValidPlaintextType ( cType ) == FALSE )
{
    srv_sendmsg (
        pSrvProc ,
        SRV_MSG_ERROR ,
        XP_MD5_ARG1_INV_TYPE ,
        SRV_INFO ,
        XP_STATE_1 ,
        NULL ,
        0 ,
        ( DBUSMALLINT ) __LINE__ ,
        "Extended Stored Procedure xp_md5: Parameter 1 " 
        "must be a Character, Unicode Character, or binary type." ,
        SRV_NULLTERM ) ;
    srv_senddone (
        pSrvProc ,
        SRV_DONE_ERROR ,
        0 ,
        0 ) ;
    return FAIL ;
}    // End of if ( IsValidPlaintextType ( cType ) == FALSE )

The function IsValidPlaintextType, a conventional C function, evaluates the type of data being offered by the interface.

C++

/*******************************************************************************
Name:      IsValidPlaintextType

Synopsis:  This function is used internally to evaluate the type of the
           paramter that is being offered as plaintext.

Arguments: BYTE pbytType = Parameter type returned by Extended Stored
           Procedure API function srv_paraminfo.

Revision History

Date Version Author Synopsis 
---------- ---------- ------ -----------------------------------------------
2008/10/26 1, 0, 0, 7 DAG/WW This is the initial version of this function.
*******************************************************************************/

BOOL _cdecl IsValidPlaintextType ( BYTE pbytType )
{
    if ( pbytType == SRVVARCHAR )
        return TRUE ;

    if ( pbytType == SRVCHAR )
        return TRUE ;

    if ( pbytType == SRVTEXT )
        return TRUE ;

    if ( pbytType == SRVNTEXT )
        return TRUE ;

    if ( pbytType == SRVBIGCHAR )
        return TRUE ;

    if ( pbytType == SRVNVARCHAR )
        return TRUE ;

    if ( pbytType == SRVNCHAR )
        return TRUE ;

    if ( pbytType == SRVBINARY )
        return TRUE ;

    if ( pbytType == SRVIMAGE )
        return TRUE ;

    if ( pbytType == SRVBIGVARBINARY )
        return TRUE ;

    if ( pbytType == SRVBIGVARCHAR )
        return TRUE ;

    if ( pbytType == SRVBIGBINARY )
        return TRUE ;

    return FALSE ;
}    // End of function IsValidPlaintextType

The symbolic constants used by this function are defined in srv.h. If you build this project from scratch, be sure that you move C:\Program Files\Microsoft SQL Server\80\Tools\DevTools\Include to someplace near the top of your list of include paths. Otherwise, you may get the older version of srv.h that comes with Visual Studio 6, which is appropriate for SQL Server 7.0, but lacks many of the newer interfaces and constants used in this procedure, which is intended for use with SQL Server 2000.

I copied the constants into an Excel workbook, srv_constants.xls, which I included in the \Reference directory, because it contains the constant values expressed as hexadecimal (from the header), decimal, and even binary format. The intent for the binary version is to identify bit patterns that I might be able to use to simplify IsValidPlaintextType and make it a tad faster. Alas, I didn't see any.

Having identified the type of input data, and found it acceptable, the next step is to allocate a buffer for the data. This is accomplished by calling malloc, followed by SecureZeroMemory, as follows:

C++

psize    = ( uLen + TRAILING_NULL_ALLOWANCE ) ;  // Compute requird size.
pData    = ( BYTE* ) malloc ( psize ) ;  // Allocate buffer from heap.
SecureZeroMemory ( pData , psize ) ;     // Zero the input buffer before use.

Alternatively, you could call HeapAlloc and specify HEAP_ZERO_MEMORY for its dwFlags argument. However, since the original version from which this was derived, and all of my other MD5 wrapper functions perform their allocations in two steps, as shown here, I elected to be consistent with them. Note the use of SecureZeroMemory, rather than ZeroMemory or memset, because, unlike the other two, SecureZeroMemory is written in a way that prevents the optimizer from optimizing it away. This is important from a security perspective, because all memory used by cryptographic functions must be cleared before and after each use, to prevent accidental information disclosure.

Now that we have a buffer big enough to hold the data, we call srv_paraminfo again, passing a pointer to our nice, empty buffer, and its size, in bytes.

C++

srv_paraminfo (
    pSrvProc ,
    PARAM_ORDINAL_1 ,
    &cType ,
    &uMaxLen ,
    &uLen ,
    pData ,
    &fNull ) ;

Next, we call again, expecting an integer, which we ignore unless it is greater than zero.

C++

if ( nArgs > 1 )
{
    LONG nInputLen ;
    srv_paraminfo ( 
        pSrvProc ,
        PARAM_ORDINAL_2 ,
        &cType ,
        &uMaxLen ,
        &uLen ,
        ( BYTE* ) &nInputLen ,
        &fNull ) ;

    if ( IsValidPlaintextLen ( cType ) )
    {
        if ( nInputLen >= 0 )
        {
            uDataLen = ( ULONG ) nInputLen ;
        }
    }
    else

IsValidPlaintextLen is similar to IsValidPlaintextType, but it's looking for integers.

At last, it's time to digest the text.

C++

MD5String ( pData , uDataLen , szHash ) ;

szHash is an array of 33 characters (32 for the digest, plus the obligatory trailing null, to make it a valid C string). It gets returned to the caller.

Since we are finished with pData, and its contents are sensitive, we call SecureZeroMemory to clear it, then free to discard it.

C++

SecureZeroMemory ( pData , psize ) ;
free ( pData ) ;

Contrary to the acquisition of pData, there are no shortcuts to clearing and freeing it, because HeapFree doesn't have an option to tell Windows to clear the memory before it is released.

That being the case, and since we got it from malloc, we may as well call free.

Regardless, we'll be using HeapFree, either directly or indirectly, because free does so under the covers.

If the argument count is 3, symbolized by FILL_CALLERS_BUFFER, we call srv_paramsetoutput to return a pointer to the hash.

C++

if ( nArgs > FILL_CALLERS_BUFFER )
{    
    // Caller provided us with a buffer.
     if ( srv_paramsetoutput (
         pSrvProc ,
         PARAM_ORDINAL_3 ,
         ( BYTE* ) szHash ,
         HEX_CHAR_HASH_SIZE ,
         FALSE ) == FAIL )

Note that szHash is cast to a pointer to BYTE, even though it is actually an array of type unsigned char. It doesn't matter what type of data you return, the pointer is always a pointer to bytes, and the length is always specified in bytes.

If the caller doesn't pass a pointer for an out parameter, we must call srv_describe once, followed by a call to srv_sendrow, to return our digest, as a table of one row and one column.

C++

{    // Caller expects a row.
     // Send the first, and only, column.
     if ( srv_describe (
         pSrvProc ,
         1 ,
         "MD5" ,
         SRV_NULLTERM ,
         SRVNCHAR ,
         HEX_CHAR_HASH_SIZE ,
         SRVNCHAR ,
         HEX_CHAR_HASH_SIZE ,
         ( void* ) szHash ) == SRV_DESCRIBE_ERROR )
     {
         srv_sendmsg (
             pSrvProc ,
             SRV_MSG_ERROR ,
             XP_MD5_SENDCOL_ERROR ,
             SRV_INFO ,
             XP_STATE_1 ,
             NULL ,
             0 ,
             ( DBUSMALLINT ) __LINE__ ,
             "Extended Stored Procedure xp_md5W: An error was encountered " 
             "while returning the message digest as column 1 of row 1." ,
             SRV_NULLTERM ) ;
         srv_senddone (
             pSrvProc ,
             SRV_DONE_ERROR ,
             0 ,
             0 ) ;
         return FAIL ;
     }

     // Send the first, and only, row.
     if ( srv_sendrow ( pSrvProc ) == FAIL )

Significantly, both srv_paramsetoutput and srv_describe expect a value of 32, the number of characters in the digest represented as HEX_CHAR_HASH_SIZE, for the size of the data block. Thus, it seems that it might be possible, though unwise, to allocate room for only 32 characters for the message digest, and dispense with the trailing null.

Unicode

There is almost no difference between the ANSI and Unicode versions of this stored procedure. However, there are two, and both are critical.

The digest array, szHash, is of type wchar_t.
The data lengths, which must be computed twice, are HEX_CHAR_HASH_SIZE * sizeof ( wchar_t ).

The data length issue tripped me up. The first couple of times I ran the Unicode function, I got back only half of the digest. Oops!

More often than not, the Visual C++ compiler does an amazingly good job of covering for you, but when it fails, the result is almost certain to be a buffer overflow. Not only did I leave half of my digest on the heap, but I left the heap in a corrupted state. Fortunately, the debugger brought that to my attention with a very clear error message. I was lucky; there have been plenty of instances in which I got no such warning, even in debug mode.

History

Monday, 27 October 2008 is the first publication on The Code Project.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

David A. Gray

Software Developer (Senior)

United States

I deliver robust, clean, adaptable, future-ready applications that are properly documented for users and maintainers. I have deep knowledge in multiple technologies and broad familiarity with computer and software technologies of yesterday, today, and tomorrow.

While it isn't perceived as sexy, my focus has always been the back end of the application stack, where data arrives from a multitude of sources, and is converted into reports that express my interpretation of The Fundamental Principle of Tabular Reporting, and are the most visible aspect of the system to senior executives who approve the projects and sign the checks.

While I can design a front end, I prefer to work at the back end, getting data into the system from outside sources, such as other computers, electronic sensors, and so forth, and getting it out of the system, as reports to IDENTIFY and SOLVE problems.

When presented with a problem, I focus on identifying and solving the root problem for the long term.

Specialties: Design: Relational data base design, focusing on reporting; organization and presentation of large document collections such as MSDS libraries

Development: Powerful, imaginative utility programs and scripts for automated systems management and maintenance

Industries: Property management, Employee Health and Safety, Services

Languages: C#, C++, C, Python, VBA, Visual Basic, Perl, WinBatch, SQL, XML, HTML, Javascript

Outside Interests: Great music (mostly, but by no means limited to, classical), viewing and photographing sunsets and clouds, traveling by car on small country roads, attending museum exhibits (fine art, history, science, technology), long walks, especially where there is little or no motor traffic, reading, especially nonfiction and thoughtfully written, thought provoking science fiction

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.