Download demo executable - 50.5 KB

Visit the Ultimate TCP-IP main page for an overview and configuration guide to the Ultimate Toolbox library.

Overview

The enabling of Unicode compilation in Ultimate TCP-IP version 4.2 allows for improved decoding and representation of multi-language character sets in RFC-822 based messages.

Introduction

The Ultimate TCP-IP library includes classes for sending and receiving Mail and News messages. At the core of these is the CUT_Msg class, which in turn makes use of various encoding classes to allow for multi-part MIME, attachment, and header encoding/decoding. The MsgMapper sample is essentially a testing tool for CUT_Msg operations in general, and encoding/decoding in particular.

A Brief History of Messages

Encoding and Multi-Part Messages

Had the Internet evolved with 8 bit data transfers as standard, there might be quite a bit less fussing to be done in dealing with news and mail messages. Because many routers, etc. in the early days were not quite '8-bit ready', various encoding schemes were developed to enable binary transfers over 7-bit channels. BinHex, Quoted Printable, UUEncode, and Base64 to name a few. These encoding schemes are still active in various client implementations of internet RFC-822 (Mail) and RFC-1036 (News) based messages.

Along with these various encoding schemes, e-mail messages can optionally incorporate MIME extensions, allowing for multiple representations of the content (text or HTML) and attachments. MIME (Multipurpose Internet Mail Extensions - RFC-2045) is not in itself an encoding scheme, but allows for the specification of encoding schemes for the various parts of a multi-part message.

Leaving aside attachments and HTML message contents, proper representation (or desired representation) of the text content of a modern day RFC-822 based message will need to focus on decoding and displaying the message headers and body.

Decoding

We'll start at the level of the message object. The CUT_Msg class can load a block of ascii text corresponding to an RFC-822 or 1036 message and isolate the message headers and attachments (if any). The MsgMapper sample can be used to open *.eml and *.nws files saved by Outlook Express, for example.

Once a message has been successfully loaded, we can iterate through looking for standard and 'custom' headers - the following enumeration is currently used to identify message headers:

// Header fields IDs enumeration


enum HeaderFieldID {
    UTM_ALL_FIELDS,        // All types of fields


    UTM_MESSAGE_ID,        // Message Id ......................(RFC 822, 1036)


    UTM_TO,                // To ..............................(RFC 822, 1123)


    UTM_CC,                // CC ..............................(RFC 822, 1123)


    UTM_BCC,               // BCC .............................(RFC 822, 1123)


    UTM_FROM,        // From ............................(RFC 822, 1036, 1123)


    UTM_SUBJECT,           // Subject .........................(RFC 822, 1036)


    UTM_RECEIVED,          // Trace of MTAs ...................(RFC 822, 1123)


    UTM_IN_REPLY_TO,       // Reference to replied to message .(RFC 822)


    UTM_KEYWORDS,          // Search keys for DB retrieval ....(RFC 822, 1036)


    UTM_COMMENTS,          // Comments on this message ........(RFC 822)


    UTM_DATE,        // Date ............................(RFC 822, 1036, 1123)


    UTM_REPLY_TO,          // Reply-To ........................(RFC 822, 1036)


    UTM_REFERENCES,        // References ......................(RFC 822, 1036)


    // v4.2 could add RFC Resent-xxx headers.


    UTM_NEWSGROUPS,        // NewsGroups ......................(RFC 1036)


    UTM_XREF,              // XRef ............................(RFC 1036)


    UTM_XMAILER,           // X-Mailer     (non-standard)


    UTM_XNEWS_READER,      // X-Newsreader (non-standard)


    UTM_MIME_VERSION,      // MIME version ....................(RFC 1521)


    UTM_THREAD_TOPIC,      // Proprietary - included for charset decoding


    UTM_CONTENT_TYPE,      // Content-Type .......(RFC 1049, 1123, 1521, 1766)



    UTM_CUSTOM_FIELD,      // Custom field


    UTM_MAX_FIELD
};

Here is the code from the sample that populates the tree view in the left pane with decoded representations of the header contents:

    // get standard headers


    int count = 0;
    HTREEITEM hTemp;
    _TCHAR strName[MAX_PATH+1];
    size_t size;
    int i = (int)(HeaderFieldID)UTM_MESSAGE_ID;
    for ( i; i < (int)(HeaderFieldID)UTM_CUSTOM_FIELD; ++i) {
        count = doc->m_message.GetHeaderCount((HeaderFieldID)i);
        if(count > 0) {
            doc->m_message.GetFieldName(strName, MAX_PATH, 
                (HeaderFieldID)i, &size);
            hTemp = tree.InsertItem(strName, hHeaders, TVI_SORT);
            for(int index = 0; index < count; ++index) {
                char test[76];
                doc->m_message.DecodeHeader((HeaderFieldID)i, test);
                doc->m_message.GetHeaderByType(strName, MAX_PATH, 
                    (HeaderFieldID)i, index, &size);
                tree.InsertItem(strName, hTemp);
            }
            tree.Expand(hTemp, TVE_EXPAND);
        }        
    }

Note that not all headers need to be decoded, but the calls to CUT_Msg::DecodeHeader handle the checks to see if the header is decoded as well as which encoding scheme (Base64 or Quoted Printable) is to be used, so while calling DecodeHeader for all headers is perhaps less efficient than just targeting certain ones (e.g. UTM_SUBJECT and UTM_FROM) it simplifies coding.

The call to DecodeHeader examines the content of the header string to see if either of the built in encoding schemes can decode all or part of the text. In the string

From: =?Utf-8?B?YV9zaA==?= <xxx@discussions.microsoft.com>

the text =?Utf-8?B?YV9zaA==?= is intended to be represented with a UTF-8 (Unicode Transfer Format) character set and was encoded using Base64. In this next one

Subject: =?koi8-r?B?Qmx1ZXRvb3RoIMHEwdDUxdI=?=

The =?koi8-r?B?Qmx1ZXRvb3RoIMHEwdDUxdI=?= portion is a koi8-r (Russian/Cyrillic) string again encoded with Base64.

Once DecodeHeader is called, the string has been decoded in place in the message class's list of header fields, but may still appear as garbage if the current code page of the client machine is not set to that of the sender - much like the type of garbage received when a mail program uses extended code page characters without specifying a content type charset - these are badly formed messages but may work for transfers between clients set to the same language code page.

Note: A code page for a single byte character set is essentially a block of 256 character glyphs. Those for characters 0 to 127 are usually the standard ASCII most western speakers would be familiar with, while the extended 128 - 255 codes reference characters that are not available in the standard western European character set. In this way multiple languages can be supported with 8 bit ascii.

The MsgMapper application relies on Unicode translation of message header and text to accommodate multiple languages. The way it does this is quite transparent, but worth a quick description.

Display

In the above code the call to GetHeaderByType hides the conversion process from decoded 8-bit characters to a Unicode representation of the string.

When the Ultimate TCP-IP classes were to be enhanced to support Unicode compilation, it became apparent that there would be a conflict with many of the existing protocol classes which were ASCII based protocols. We chose to overload many calls to allow for conversion from and to Unicode so that Unicode based apps could interact with char based function calls without the need for conversion. Many of these overloads are simple wrappers that call their char based counterparts.

In certain cases where the call to get a string could benefit from charset translation, a code page lookup is done before the call to MultiByteToWideChar, and this allows us to compactly return a proper string to display. A set of CUT_CharSetStruct items is maintained in the CUT_Charset class to enable the proper charset to be retrieved.

        ...
        utCharSets[index].strCharSet = "koi8-r";
        utCharSets[index].nCodePage = 20866;
        utCharSets[index].nCharSet = RUSSIAN_CHARSET;            // Russian


        ++index;

        utCharSets[index].strCharSet = "koi8-u";
        utCharSets[index].nCodePage = 21866;
        utCharSets[index].nCharSet = EASTEUROPE_CHARSET;        // Ukrainian


        ++index;
        ...
        // etc

And presto, our overloaded GetHeaderByType grabs the charset recorded when the header was decoded, looks up the code page, and translates to Unicode before retuning the string.

In general, a message that makes use of a non-ascii character set specification will be consistent in its use. A charset can be gleaned from an individual encoded header, a Content-Type header, or a text or HTML section of a multi-part message. The message class maintains a default charset string, a header charset string, and for multi-part messages a separate string for text and html messages.

Additional Considerations

Note that there is a bit more system overhead to this in that you may need to install various language packs for this to work, and again this presupposes that the application is compiled for Unicode.

There is also the question of right-to-left ordering of characters for languages like Hebrew and Arabic which can be tricky - currently, the code takes the characters in the order they appear in the message, which may or may not be correct depending on the originating mail client and it's settings.

The CUT_Charset class is still not exhaustive in its catalogue, and there may be some improvements to be made, but does seem to accommodate the most common charsets found in mail messages. Microsoft maintains a complete list of code page identifiers that has provided a useful reference.

Summary

The MsgMapper sample has proven useful as a test harness for troubleshooting message format issues in general. When the right pane of the current view is active, three toolbar buttons allow for switching between the raw message source, the message text, and the html portion of a multi-part MIME alternative. This is handy for checking to see if a message is well formed with respect to character sets, encodings, headers etc.

If you use an email client that can save news and email messages in their RFC-822 format, you'll find that it's not hard to accumulate a collection of badly formed messages - a lot of spam falls into this category, as well as many non-Western European mails which like to embed text directly. MsgMapper is a real help in identifying these, and trouble-shooting the message classes themselves.

History

Initial CodeProject release August 2007.