Introduction

The code for this article allows enumerating the entries in message tables of arbitrary Win32 PE files.

Background

I recently introduced a more or less generalized scheme for error codes in Win32 native applications at the company I currently work for. It is based on the same principles that Microsoft uses for its own error codes, most of which can be found in the winerror.h header file and have a string representation embedded as a message table resource in kernel32.dll and a few other DLLs. This way, an error code can be resolved into a meaningful, human-readable error description with the FormatMessage API. While working on this topic, I became curious about which error strings reside in the DLLs of a typical Windows installation. I wanted to enumerate all DLLs in a certain directory (say, the c:\windows\system32 directory) and enumerate the message table entries for any given language ID in each DLL.

However, unlike for various other resource types, the Win32 API offers no function to enumerate the entries of a message table resource. Hence, my first approach was brute force: call FormatMessage for every possible message ID between 0 and 0xFFFFFFFF. If FormatMessage fails, the ID is simply skipped. If it succeeds, there must be a message table entry for the given ID, and FormatMessage returns its string to the caller. As you can imagine, this naive approach not only burned CPU cycles needlessly, it also took hours to execute for just a handful of DLLs. After the first tests, I estimated that it would take some three or four years to scan my entire system32 directory. On top of that, this approach requires a priori knowledge of the language ID to scan for, because one and the same message table can exist in any number of languages, with identical message table entry IDs, within the same binary. Therefore, a different solution had to be found.

Low level APIs to the rescue

The Windows APIs that deal with resource loading come in two flavours: the low-level functions, such as FindResource, LoadResource, and LockResource, and the high-level functions, such as LoadMenu, LoadString, and LoadImage. Unlike the high-level APIs, the low-level APIs are resource-type agnostic, i.e., they don't know anything about the type or binary layout of the resource for which they are used. Needless to say, the high-level APIs internally call the low-level APIs. As a consequence, it is quite straightforward to enumerate and load any resource in any language with the low-level APIs, once you know the binary layout of the raw resource that is acquired this way. Fortunately, the message table binary layout is pretty well documented. A message table starts with a MESSAGE_RESOURCE_DATA structure (defined in winnt.h):

typedef struct _MESSAGE_RESOURCE_DATA {
    DWORD NumberOfBlocks;
    MESSAGE_RESOURCE_BLOCK Blocks[ 1 ];
} MESSAGE_RESOURCE_DATA, *PMESSAGE_RESOURCE_DATA;

Despite what the struct definition suggests, an actual MESSAGE_RESOURCE_DATA block does not contain just one MESSAGE_RESOURCE_BLOCK. The Blocks member is a variable-length array, and the NumberOfBlocks member indicates how many MESSAGE_RESOURCE_BLOCK entries it holds. The MESSAGE_RESOURCE_DATA block itself is obtained through a sequence of calls to the FindResource, LoadResource, and LockResource APIs. The data type MESSAGE_RESOURCE_BLOCK is defined in winnt.h as well, and looks like this:

typedef struct _MESSAGE_RESOURCE_BLOCK {
    DWORD LowId;
    DWORD HighId;
    DWORD OffsetToEntries;
} MESSAGE_RESOURCE_BLOCK, *PMESSAGE_RESOURCE_BLOCK;

Each MESSAGE_RESOURCE_BLOCK represents a run of consecutive message table entries, starting at the ID in the LowId member and ending with the ID in the HighId member (both inclusive). The OffsetToEntries member is a byte offset relative to the start of the MESSAGE_RESOURCE_DATA structure, i.e., relative to the pointer returned by LockResource, and not relative to the MESSAGE_RESOURCE_BLOCK itself. Adding it to that base address yields the address of the message table entry whose ID is LowId. This address points to a MESSAGE_RESOURCE_ENTRY data structure, also defined in winnt.h:

typedef struct _MESSAGE_RESOURCE_ENTRY {
    WORD   Length;
    WORD   Flags;
    BYTE  Text[ 1 ];
} MESSAGE_RESOURCE_ENTRY, *PMESSAGE_RESOURCE_ENTRY;

Each MESSAGE_RESOURCE_ENTRY block represents a single message table string item. As you might already have guessed, the string of the message table item starts at the address of the Text member of this structure and is of variable length. The Length member contains the size, in bytes, of the whole MESSAGE_RESOURCE_ENTRY: the two WORD header fields plus the text, including its terminating zero character and any padding that follows it. But here comes an additional twist: the string itself can either be a codepage-based ANSI string or a UTF-16 Unicode string, and this is what the Flags member of the structure is for: if it is zero, the string is an ANSI string; if it is one, it is a Unicode string. Other values for the Flags member were not defined at the time of writing. The next message table item then starts at the address of the current MESSAGE_RESOURCE_ENTRY plus the number of bytes given in its Length member.
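The traversal described above can be sketched in portable C. This is not the code from the article's download, only a minimal illustration under stated assumptions: MsgBlock and MsgEntryHeader are hypothetical stand-ins that mirror the winnt.h structures with fixed-width types, OffsetToEntries is taken relative to the start of the resource data, Length is taken to cover the whole entry including its 4-byte header, and the host is assumed to be little-endian like the PE format. The function name walk_message_table and the MsgVisitor callback type are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins mirroring the winnt.h layouts with fixed-width types. */
typedef struct { uint32_t LowId, HighId, OffsetToEntries; } MsgBlock;
typedef struct { uint16_t Length, Flags; } MsgEntryHeader; /* Text follows */

/* Visitor: return 0 to stop the walk, nonzero to continue. */
typedef int (*MsgVisitor)(uint32_t id, uint16_t flags,
                          const uint8_t *text, void *ctx);

/* Walks all entries of a raw message table of `size` bytes.
 * Returns the number of entries visited, or -1 if the data is malformed.
 * Fields are read in host byte order (little-endian host assumed). */
long walk_message_table(const uint8_t *data, size_t size,
                        MsgVisitor visit, void *ctx)
{
    uint32_t nblocks;
    long count = 0;

    assert(sizeof(MsgBlock) == 12 && sizeof(MsgEntryHeader) == 4);
    if (size < 4) return -1;
    memcpy(&nblocks, data, 4);                       /* NumberOfBlocks */
    if (nblocks > (size - 4) / sizeof(MsgBlock)) return -1;

    for (uint32_t b = 0; b < nblocks; ++b) {
        MsgBlock blk;
        memcpy(&blk, data + 4 + (size_t)b * sizeof blk, sizeof blk);
        if (blk.HighId < blk.LowId) return -1;

        size_t off = blk.OffsetToEntries;            /* from start of data */
        for (uint32_t id = blk.LowId; ; ++id) {
            MsgEntryHeader hdr;
            if (off > size || size - off < sizeof hdr) return -1;
            memcpy(&hdr, data + off, sizeof hdr);
            if (hdr.Length < sizeof hdr || hdr.Length > size - off) return -1;

            ++count;
            if (visit && !visit(id, hdr.Flags, data + off + sizeof hdr, ctx))
                return count;                        /* visitor asked to stop */

            off += hdr.Length;                       /* next entry */
            if (id == blk.HighId) break;             /* safe at 0xFFFFFFFF */
        }
    }
    return count;
}
```

On Windows, the data pointer would typically come from LockResource and the size from SizeofResource. The bounds checks are there because the sketch treats the resource as untrusted input; a walker that only ever reads DLLs it trusts could drop some of them.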

Using the code

The function that I wrote for this article in order to enumerate message tables in a PE binary has the following prototype:

BOOL EnumMessageTableStrings(HMODULE hMod, LPCTSTR lpName, 
                             ENUM_MESSAGES enfn, LONG_PTR lParam);

The first parameter, hMod, is a module/instance handle to the DLL or EXE file whose message table is to be enumerated. The second parameter, lpName, is the name or ID of the resource. For message tables whose strings can be loaded via FormatMessage, this is always MAKEINTRESOURCE(1). While it is theoretically possible to have a message table resource with a different numerical ID or a string ID, it doesn't happen in practice, because this would require manual manipulation of the message compiler's output during the build process.

The third parameter, enfn, is an enumeration callback function that the caller has to supply, and which is invoked once for each message table entry per language while EnumMessageTableStrings executes. The last parameter, lParam, is a user-defined parameter that is always passed through to the callback function from within EnumMessageTableStrings. You can pass anything you want as this parameter, e.g., a pointer to an object whose member functions will be invoked in your callback function, or whatever strikes your fancy.

The function returns a nonzero value if it succeeds, and FALSE if it either fails or if the enumeration callback returned FALSE to discontinue enumeration. To distinguish the two cases in which the function returns FALSE, extended information is provided through GetLastError. If GetLastError returns ERROR_SUCCESS, the enumeration was aborted by the callback returning FALSE. If an error occurred during enumeration, GetLastError returns a nonzero error code defined in winerror.h.

The prototype for the ENUM_MESSAGES callback function looks like this:

typedef BOOL (CALLBACK * ENUM_MESSAGES)(LPVOID lpMsg, DWORD dwMsgId, 
              WORD wFlags, WORD wIDLanguage, LONG_PTR lParam);

The first parameter, lpMsg, is the string of the enumerated message table entry. It is prototyped as LPVOID because it is either a Unicode (UTF-16) string or an ANSI string, so it should be cast to either an LPCWSTR or an LPCSTR. The second parameter, dwMsgId, specifies the message ID of the enumerated message. The third parameter, wFlags, determines whether lpMsg is to be interpreted as an ANSI string (wFlags=0) or as a Unicode string (wFlags=1). Magic numbers in code drive me crazy, so I defined the macros EMT_MSG_IS_ANSI (0) and EMT_MSG_IS_UNICODE (1) in the header file that contains the EnumMessageTableStrings prototype and the definition of the ENUM_MESSAGES callback. The fourth parameter, wIDLanguage, is the Win32 language ID for which the message table entry was found. The last parameter is the custom value that the caller passed to EnumMessageTableStrings as lParam. To abort the enumeration, the callback function should return FALSE. To continue the enumeration, it should return a nonzero value.
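A callback following this prototype might look like the sketch below. It is not the callback used by the demo application. MsgStats, CountMessages, and utf16_len are hypothetical names, and the typedefs in the #else branch are stand-ins that only exist so the sketch also compiles outside Windows. The EMT_ macros are redefined with the values given above in case the article's header is not included. The callback counts ANSI and Unicode entries, sums up their lengths, and returns FALSE once a caller-supplied limit has been reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins for the Win32 types, only for building off Windows. */
typedef int       BOOL;
typedef void     *LPVOID;
typedef uint32_t  DWORD;
typedef uint16_t  WORD;
typedef intptr_t  LONG_PTR;
#define CALLBACK
#endif

#ifndef EMT_MSG_IS_ANSI
#define EMT_MSG_IS_ANSI    0
#define EMT_MSG_IS_UNICODE 1
#endif

/* Hypothetical per-enumeration state handed in through lParam. */
typedef struct {
    unsigned ansi;     /* number of ANSI entries seen              */
    unsigned unicode;  /* number of Unicode entries seen           */
    size_t   chars;    /* total characters / UTF-16 code units     */
    unsigned limit;    /* stop after this many entries             */
} MsgStats;

/* Length of a zero-terminated UTF-16 string in code units. */
static size_t utf16_len(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

BOOL CALLBACK CountMessages(LPVOID lpMsg, DWORD dwMsgId, WORD wFlags,
                            WORD wIDLanguage, LONG_PTR lParam)
{
    MsgStats *stats = (MsgStats *)lParam;
    (void)dwMsgId;      /* unused in this sketch */
    (void)wIDLanguage;  /* unused in this sketch */
    assert(stats != NULL);

    if (wFlags == EMT_MSG_IS_UNICODE) {
        stats->unicode++;
        stats->chars += utf16_len((const uint16_t *)lpMsg);
    } else {
        stats->ansi++;
        stats->chars += strlen((const char *)lpMsg);
    }
    /* Returning zero (FALSE) aborts the enumeration. */
    return (stats->ansi + stats->unicode) < stats->limit;
}
```

Such a callback would be passed as the enfn argument, with the address of a MsgStats instance cast to LONG_PTR as lParam. The string is read as uint16_t rather than wchar_t because wchar_t is not 16 bits wide on every platform.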

ANSI and Unicode message table entries

As explained above, each MESSAGE_RESOURCE_ENTRY block represents a single message table string entry and comes in one of two flavours: with the Flags WORD set to 0, the Text member has to be interpreted as an ANSI string, and with Flags set to 1, it represents a Unicode (UTF-16) string. Note that, for message table resources in a PE file, this normally doesn't change between individual MESSAGE_RESOURCE_ENTRY blocks. A message table is typically created with the MS message compiler (mc.exe), which, by default, creates a message table with ANSI strings, and which creates Unicode message table entries when given the "-U" command line parameter, at the cost of a slightly larger binary.

The demo application

The application that comes with the source code of this article, msgdump.exe, simply enumerates all DLLs in the current working directory and prints their message table entries to stdout. To show whether a particular message table entry is a Unicode string or an ANSI string, each line printed to stdout starts with the lowercase letters "id" for ANSI strings and with the uppercase letters "ID" for Unicode strings. An interesting experiment is to build the application, copy it into a directory listed in the %PATH% environment variable, open a console (cmd.exe), change to the Windows system32 directory, and run msgdump there. This dumps all message tables from the system DLLs.

Other applications for this code

In the company that I currently work for, we always ship binaries that are localized into German and English. Traditionally, it has been a problem to keep both resource variants in sync. It frequently happened that for a given resource, the German variant was there but not the English one, or vice versa. It also sometimes occurred that resources containing format strings suitable for sprintf or FormatMessage did not have the placeholders (such as %d, %s, %1, %2) in the correct order, or in the same number, in both language variants, which sometimes led to "very interesting behaviour" (read: "crashes") of the software, depending on the user's language preferences. I therefore wrote a tool named compres, which I will enhance in the near future to scan message tables as well, using the functionality outlined in this article. In a nutshell, the compres tool is designed to run as part of an automated build process over all the EXE and DLL files produced by that build. If it finds resources in one language but not in the other, or if it finds different format strings in the two languages it compares, it prints an error message with a description of the error to stdout. Other possible applications for the code are localization tools, resource editors, or simply academic curiosity.
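The placeholder check described above can be illustrated with a small, self-contained sketch. This is not the code of the compres tool, whose implementation is not shown in this article. The function names next_placeholder and same_placeholders are made up, and the parser is deliberately simplistic: it recognizes "%%" as a literal percent sign, and otherwise treats a '%' followed by optional flag, width, or digit characters and at most one letter as a placeholder. It then compares the placeholder sequences of two strings in order.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies the next placeholder at or after *cursor into out (capacity cap,
 * at least 3) and advances *cursor past it. Returns 1 if one was found,
 * 0 otherwise. Simplified: "%%" is skipped as a literal percent sign. */
static int next_placeholder(const char **cursor, char *out, size_t cap)
{
    const char *p = *cursor;
    while ((p = strchr(p, '%')) != NULL) {
        size_t n = 0;
        if (p[1] == '%') {          /* literal percent sign */
            p += 2;
            continue;
        }
        out[n++] = *p++;            /* the '%' itself */
        /* flags, width, precision, length modifiers, or insert number */
        while (*p && strchr("-+#0123456789.lh", *p) && n + 2 < cap)
            out[n++] = *p++;
        /* optional conversion letter such as d, s, u, x */
        if (isalpha((unsigned char)*p) && n + 1 < cap)
            out[n++] = *p++;
        out[n] = '\0';
        *cursor = p;
        return 1;
    }
    return 0;
}

/* Returns 1 if both strings contain the same placeholders in the same
 * order, 0 otherwise. */
int same_placeholders(const char *a, const char *b)
{
    char ta[16], tb[16];
    assert(a != NULL && b != NULL);
    for (;;) {
        int ha = next_placeholder(&a, ta, sizeof ta);
        int hb = next_placeholder(&b, tb, sizeof tb);
        if (!ha || !hb)
            return ha == hb;        /* equal only if both ran out together */
        if (strcmp(ta, tb) != 0)
            return 0;
    }
}
```

For example, same_placeholders("Copied %d of %s", "%d von %s kopiert") reports a match, while a pair in which one variant is missing a placeholder does not. A real tool would probably treat numbered FormatMessage inserts such as %1 and %2 more leniently, since their order may legitimately differ between languages; the strict in-order comparison here is just the simplest variant.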

Points of interest

Using the demo application for this article, I looked at various operating system versions in order to see if there are any peculiarities in their usage of message tables. As expected, a Windows 95 installation has all its message table entries encoded as ANSI strings, in order to save both hard disk and memory space. A typical Windows XP installation, as of today, has the majority of its message table entries encoded as Unicode strings. Another interesting point is the fact that using the demo application, which is built as a native x86 application, it is also possible to enumerate message tables of native x64 DLLs on Windows XP/2003 x64 editions. This demonstrates that the Win32 PE ("portable executable") format really deserves the "P" in its name.

History

06/10/2006 - Initial version of article and code.