
MIME Compliant Parser

8 Jul 2008
An attempt to separate MIME parsing from mail protocol.

Introduction

This article and its code sample aim to disconnect MIME parsing functionality from any mail protocol, i.e. it aims to implement RFC2045 without coupling it too tightly with either the POP3 or IMAP protocol.

Background

I originally wrote this code because I needed to automate mail downloads. I started by looking for free or open source alternatives, but the projects I found either lacked full support for attachments or were not modular enough. I therefore decided to write my own POP3 client. After struggling for a while with a quick hack, I soon realised that I had to read the relevant RFCs. I then realised that messages retrieved over POP3 (RFC 1939) in turn rely on MIME (RFC 2045, 2046 etc.) for attachments. That gave me the idea of writing a parser which could be used with both IMAP and POP3.

As stated above, the main feature is that the code separates MIME functionality from any mail transfer protocol.

It is also an attempt to parse MIME messages on the fly, i.e. it reads a portion of the stream and then parses it. This behaviour should keep memory consumption down. As one might notice, the code uses a StringBuilder to accumulate the whole message source, which works against that goal. However, the StringBuilder could easily be removed if one does not need access to the whole message source.

The library is also written with the aim of keeping it as "pluggable" as possible i.e. I have tried to keep the library and its classes as loosely coupled as possible. To achieve this I have tried to publish all functionality as Interfaces and used dependency injection as often as possible.

MIME

When I first started out with this project, I read many articles. Among them was this one written by Peter Huber SG here at Code Project. It covers much of the topic of MIME. However, I found it too tightly coupled with the POP3 protocol to fit my needs. Nevertheless, Peter explains the MIME concept in detail, which helped me a lot in starting to grasp it. Other excellent sources of information are sites such as this and this.

Using the Code

When reading the RFC 2045 specification, one soon recognizes that everything revolves around a concept called the entity. Since the entity is so central to MIME, I have tried to model a class hierarchy which depicts concepts such as "Message", "Entity", "Body part" and "Body" as they are described in the RFC 2045 specification.

Screenshot - RFC2045.gif

The main entry point for the library is MIMER.RFC2045.MailReader, which implements MIMER.IMailReader. The IMailReader interface contains only one method signature, "Read".

IMailMessage Read(ref System.IO.Stream dataStream, IEndCriteriaStrategy
                        endOfMessageCriteria); 

The Read function requires a System.IO.Stream and a MIMER.IEndCriteriaStrategy. The IEndCriteriaStrategy should reference an object with a method which can determine when the stream has reached the end of a mail message. Hence it should (even if not implemented yet) be fairly easy to extend this MIME parser to work with IMAP as well. In theory, adding IMAP support would only require writing a class which implements the MIMER.IEndCriteriaStrategy interface and passing an instance of it to the Read method. In the worst case, one would have to write a new IMailReader. Even then, much of the functionality spread among the supporting classes could probably be reused.
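As a rough illustration of this idea, the sketch below shows what a pluggable end-of-message check might look like. The article does not list the members of MIMER.IEndCriteriaStrategy, so the interface IEndCriteriaSketch and its IsEndReached method are hypothetical stand-ins rather than the library's API. The POP3 variant relies on RFC 1939, where a multi-line response ends with a line containing only a period.

```csharp
using System;

// Hypothetical stand-in for MIMER.IEndCriteriaStrategy; the real
// interface's members are not shown in the article.
public interface IEndCriteriaSketch
{
    // Returns true when the text read so far ends the message.
    bool IsEndReached(string textReadSoFar);
}

// POP3 (RFC 1939): a multi-line response ends with CRLF "." CRLF.
public class Pop3EndCriteriaSketch : IEndCriteriaSketch
{
    private const string Terminator = "\r\n.\r\n";

    public bool IsEndReached(string textReadSoFar)
    {
        if (textReadSoFar == null)
            return false;
        // A byte-stuffed line such as ".." does not match, because the
        // terminator requires the period to be alone on its line.
        return textReadSoFar.EndsWith(Terminator, StringComparison.Ordinal);
    }
}
```

An IMAP variant would presumably implement the same interface with a different test, leaving the reader itself unchanged.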

The IMailReader interface is the most general (RFC822) definition of a mail reader. Since the RFC822 specification predates the MIME specifications (RFC2045 etc.), this interface's Read method returns an IMailMessage, which does not support attachments.

public interface IMailMessage
{
    MailAddress From { get; set; }
    MailAddressCollection To { get; set; }
    MailAddressCollection CarbonCopy { get; set; }
    MailAddressCollection BlindCarbonCopy { get; set; }
    string Subject { get; set; }
    string Source { get; set; }
    string TextMessage { get; set; }
    bool IsNull();
}

However, the MIMER.RFC2045.MailReader also has a ReadMimeMessage method which returns an IMimeMailMessage which is a specialization of the IMailMessage interface, and this interface supports attachments.

IMimeMailMessage ReadMimeMessage(ref System.IO.Stream dataStream,
                IEndCriteriaStrategy endOfMessageCriteria);

public interface IMimeMailMessage : IMailMessage
{
    // Accessors were elided in the original listing; get-only is assumed here.
    IDictionary<string, string> Body { get; }
    IList<IAttachment> Attachments { get; }
    IList<IMimeMailMessage> Messages { get; } // Added in version 0.4
    IList<AlternateView> Views { get; }
    System.Net.Mail.MailMessage ToMailMessage();
}

Decoders

The library implements decoders for the base64 and quoted-printable encodings. The IDecoder interface publishes the signatures expected by MIMER.RFC2045.MailReader, which can therefore easily be extended with more decoders to support more encodings. The decoders are handed to the reader through its constructor.

public interface IDecoder
{
    bool CanDecode(string encoding);
    byte[] Decode(ref System.IO.Stream dataStream);
    byte[] Decode(ref string data);
}

public MailReader(IList<IDecoder> decoders)
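
To make the extension point more concrete, here is a minimal sketch of a decoder written against the IDecoder signatures listed above. The interface is repeated so the sketch compiles on its own, and Base64DecoderSketch is a hypothetical class for illustration, not the decoder shipped with the library. It reads the remainder of the stream in one go, which is simpler than the on-the-fly approach but uses more memory.

```csharp
using System;
using System.IO;
using System.Text;

// Copy of the IDecoder signatures from the article, repeated here
// so that the sketch is self-contained.
public interface IDecoder
{
    bool CanDecode(string encoding);
    byte[] Decode(ref Stream dataStream);
    byte[] Decode(ref string data);
}

// Hypothetical example decoder; not the library's own base64 class.
public class Base64DecoderSketch : IDecoder
{
    public bool CanDecode(string encoding)
    {
        // Content-Transfer-Encoding values are case-insensitive.
        return string.Equals(encoding, "base64",
            StringComparison.OrdinalIgnoreCase);
    }

    public byte[] Decode(ref Stream dataStream)
    {
        // Reads the rest of the stream as ASCII text, then decodes it.
        StreamReader reader = new StreamReader(dataStream, Encoding.ASCII);
        string text = reader.ReadToEnd();
        return Decode(ref text);
    }

    public byte[] Decode(ref string data)
    {
        // Convert.FromBase64String ignores white space such as the
        // CRLF line breaks found in encoded MIME bodies.
        return Convert.FromBase64String(data);
    }
}
```

An instance of such a class could then be added to the IList<IDecoder> that is passed to the MailReader constructor.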

Header Fields

Much of the work in parsing mail messages is done by reading and parsing Fields. The most basic field is defined in the RFC 822 specification. Conceptually it contains a "name" and a "body". This definition is implemented in the MIMER.RFC822.Field.

From the RFC822 specification:

field = field-name ":" [ field-body ] CRLF
field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":">
field-body = field-body-contents [CRLF LWSP-char field-body]

public class Field
{
    // Accessor bodies are omitted in the article; get/set is assumed here.
    public string Name { get; set; }
    public string Body { get; set; }
}
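
The grammar above can be exercised with a small sketch that unfolds a raw header field and splits it into name and body. This is a simplified illustration and not the MIMER.RFC822 parser: the FieldSketch class and its Parse method are hypothetical, and the body is simply trimmed rather than parsed further.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative only: a simplified name/body split, not the MIMER parser.
public static class FieldSketch
{
    // A folded line break is CRLF followed by a space or tab (LWSP-char).
    // Unfolding removes the CRLF and keeps the white space.
    private static readonly Regex Fold = new Regex("\r\n(?=[ \t])");

    // field-name: one or more chars excluding CTLs, SPACE and ":".
    private static readonly Regex FieldLine = new Regex(
        "^(?<name>[^\\x00-\\x20\\x7F:]+):(?<body>.*)$",
        RegexOptions.Singleline);

    // Returns { name, body }, or null when the text is not a field.
    public static string[] Parse(string rawField)
    {
        string unfolded = Fold.Replace(rawField, string.Empty);
        Match m = FieldLine.Match(unfolded.TrimEnd('\r', '\n'));
        if (!m.Success)
            return null;
        return new[] { m.Groups["name"].Value, m.Groups["body"].Value.Trim() };
    }
}
```

Because the name pattern stops at the first colon, a field such as "Subject: VB:" keeps the trailing colon as part of its body.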

The RFC2045 specification does, however, extend the RFC822 field definition with fields such as Content-Type and Content-Transfer-Encoding. These definitions are implemented in MIMER.RFC2045.ContentTypeField and MIMER.RFC2045.ContentTransferEncodingField.

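// Accessor bodies are omitted in the article; get/set is assumed here.
public class ContentTypeField : MIMER.RFC822.Field
{
    public string Type { get; set; }
    public string SubType { get; set; }
    public StringDictionary Parameters { get; set; }
}

public class ContentTransferEncodingField : MIMER.RFC822.Field
{
    public string Encoding { get; set; }
}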

FieldParser

The logic of the field parsing is divided among the FieldParser classes all of which implement the IFieldParser interface.

public interface IFieldParser
{
    void Parse(ref IList<RFC822.Field> fields, ref string fieldString);
}

The parsing is implemented with regular expressions as far as possible. The patterns imitate the definitions found in the RFCs as closely as possible.

public class FieldParser:IFieldParser
{
    protected readonly string m_QuotedPairPattern = "\x5C\x5C[\x00-" +
        "\x7F]";
    protected readonly string m_DtextPattern =
        "[^]\x0D\x5B\x5C\x5C\x80-\xFF]";
    protected readonly string m_AtomPattern = "[^][()<>@,;:." +
        "\x5C\x5C\x22\x00-\x20\x7F]+";
    protected readonly string m_UnfoldPattern = "\x0D\x0A\x5Cs";
    protected readonly string m_FieldPattern = "[^\x00-\x20\x7F:]{1,}:{1,1}.+";
    protected readonly string m_FieldNamePattern = "[^\x00-\x20\x7F:]{1,}(?=:)";
    protected readonly string m_QuotedStringPattern = "\x22(?:(?:(?:\x5C\x5C" +
        "{2})+|\x5C\x5C[^\x5C\x5C]|[^\x5C\x5C\x22])*)\x22";
    protected readonly string m_CtextPattern = "[^()\x5C\x5C]+";
...

Since the RFC2045 specification leaves room for future media subtypes, the parsing functionality needed some easy way to be extended. This I have attempted to resolve by defining a virtual CompilePattern() method.

public class FieldParser : IFieldParser
{
    public virtual void CompilePattern(){}
    ...
public class ContentTypeFieldParser : RFC822.FieldParser, IFieldParser
{
    protected IList<string> m_ApplicationSubtypes;

    public override void CompilePattern()
    {
        m_ApplicationSubtypes.Add("octet-stream");
        m_ApplicationSubtypes.Add("PostScript");
        m_ApplicationSubtypes.Add("pdf");
        …
        m_SubType = new Regex(
            "((?<=multipart/)" + m_MultipartSubtypesBuilder.ToString() + "|" +
            "(?<=text/)" + m_TextSubtypesBuilder.ToString() + "|" +
            "(?<=image/)" + m_ImageSubtypesBuilder.ToString() + "|" +
            "(?<=application/)" + m_ApplicationSubtypesBuilder.ToString() + "|" +
            "(?<=message/)" + m_MessageSubtypesBuilder.ToString() + "|" +
            "(?<=audio/)" + m_AudioSubtypesBuilder.ToString() + ")",
            RegexOptions.Compiled);
        // Call the base implementation when adding functionality in
        // this method, so that base still builds and compiles the patterns.
        base.CompilePattern();
    }

Because the IList<string> m_ApplicationSubtypes field is declared protected, child classes can access it and add new application subtypes without rewriting the whole parsing logic. A child implementation might then look something like this:

public class ExtendedContentTypeFieldParser : RFC2045.ContentTypeFieldParser
{
    public override void CompilePattern()
    {
        // Register the extra subtype, then let the base class build the patterns.
        m_ApplicationSubtypes.Add("ms-word");
        base.CompilePattern();
    }
}

Additions

Since I first wrote this article, a few issues with the code have surfaced. Among them was the one pointed out by fellow coder "Lex1985". It turns out that I had, embarrassingly enough, forgotten to implement support for embedded messages (message/rfc822).

Embedded messages are essentially messages within a message, i.e. there can be any number of messages nested within another message. This is truly recursive behaviour. Since the code treats an embedded message (message/rfc822) as a type of multipart entity, it looked for a boundary when parsing it from the stream. However, the 'Content-Type' header field of such an entity does not have to carry a boundary parameter. It was this assumption that made the code throw an exception stating that it "could not find the mandatory delimiter in multi part entity". Aside from this, parsing an embedded message differs from parsing other multipart entities. Like all other entities, an embedded message has descriptive 'Content-' headers, but it also has its own message headers.

   // These are the descriptive content headers of the entity
        
    ------_=_NextPart_006_01C7E5C1.06454400
    Content-Type: message/rfc822
    Content-Transfer-Encoding:7bit       

   // These are the message headers
    
    Received: by server.smithimage.com
        id 01C7E35C.64B60917@server.smithimage.com; 
                    Mon, 20 Aug 2007 21:00:36 +0200
    Content-class: urn:content-classes:message
    Subject: VB:
    Date: Mon, 20 Aug 2007 21:00:28 +0200
    MIME-Version: 1.0
    Content-Type: multipart/mixed;
        boundary="----_=_NextPart_004_01C7E35C.64B60917"
    Message-ID: <13176CE1A8A2C4428E514E5E603A56C0039BC7@
                        server.smithimage.com>
    Thread-Index: AcfjWkdexeNWdEWXRm6O87G7fcacpwAAhhFE
    References: <13176CE1A8A2C4428E514E5E603A56C06802@
                        server.smithimage.com>
    From: "client" <client@smithimage.com>
    To: client@smithimage.com

    This is a multi-part message in MIME format.

    ------_=_NextPart_004_01C7E35C.64B60917

This forces the flow of parsing an embedded message to be a bit different from the parsing of a 'normal' multipart entity. When the parser finds a multipart entity with content-type defined as "Content-Type: message/rfc822" it must create a new message and recursively call upon itself. The call-trace of the parsing is as follows:

public IMimeMailMessage ReadMimeMessage(ref System.IO.Stream dataStream, 
                IEndCriteriaStrategy endOfMessageCriteria)

calls:

private string ParseMessage(ref Stream dataStream, 
            ref Message message, IList<RFC822.Field> fields)

calls:

private string CreateEntity(ref Stream dataStream, 
            ref IMultipartEntity parent, out IEntity entity)

It is within the CreateEntity method we recursively have to call upon ourselves if we come upon a message/rfc822 entity.

private string CreateEntity(ref Stream dataStream,
            ref IMultipartEntity parent, out IEntity entity)
{
    entity = null;
    IList<RFC822.Field> fields;
    int cause = ParseFields(ref dataStream, out fields);
    if (cause > 0)
    {
        foreach (RFC822.Field contentField in fields)
        {
            if (contentField is ContentTypeField)
            {
                ContentTypeField contentTypeField =
                    contentField as ContentTypeField;

                if (m_FieldParser.CompositeType.IsMatch(contentTypeField.Type))
                {
                    MultipartEntity mEntity = new MultipartEntity();
                    mEntity.Fields = fields;
                    entity = mEntity;
                    entity.Parent = parent;
                    parent.BodyParts.Add(entity);

                    // It is here we must call upon ourselves when
                    // finding a multipart entity of type message/rfc822
                    if (Regex.IsMatch(contentTypeField.Type, "(?i)message") &&
                        Regex.IsMatch(contentTypeField.SubType, "(?i)rfc822"))
                    {
                        Message message = new Message();
                        IList<RFC822.Field> messageFields;
                        cause = ParseFields(ref dataStream, out messageFields);
                        message.Fields = messageFields;
                        mEntity.BodyParts.Add(message);
                        message.Parent = mEntity;
                        if (cause > 0)
                            return ParseMessage(ref dataStream,
                                ref message, messageFields);
                        // Reached only when cause <= 0: leave the loop.
                        break;
                    }
                    else
                    {
                        mEntity.Delimiter = ReadDelimiter(ref contentTypeField);
                        return parent.Delimiter;
                    }
                }
                else if (m_FieldParser.DescriteType.IsMatch(contentTypeField.Type))
                {
                    entity = new Entity();
                    entity.Fields = fields;
                    entity.Parent = parent;
                    parent.BodyParts.Add(entity);
                    return parent.Delimiter;
                }
            }
        }
    }
    return string.Empty;
}

It is this recursive call that has been added. However, some changes were also needed in the RFC2045.IMimeMailMessage definition to support embedded messages. The RFC2045.IMimeMailMessage interface now looks like this:

public interface IMimeMailMessage : IMailMessage
{
    // Accessors were elided in the original listing; get-only is assumed here.
    IDictionary<string, string> Body { get; }
    IList<IAttachment> Attachments { get; }
    IList<IMimeMailMessage> Messages { get; }
    IList<AlternateView> Views { get; }
    System.Net.Mail.MailMessage ToMailMessage();
}

This design makes it possible to read any number of recursively embedded messages. For example, a recursive MIME message structure like the one below can be accessed through code, as the example after it shows.

1 Message
-1:1 Message
--1:1:1 Message
---1:1:1:1 Message
----etc.

// ReadMimeMessage returns an IMimeMailMessage, which exposes the Messages list.
IMimeMailMessage m = ReadMimeMessage(ref s, endCriteria);
string subject = m.Messages[0].Messages[0].Messages[0].Subject;
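
When the nesting depth is not known in advance, indexing by hand becomes awkward, and a recursive walk is a natural alternative. The sketch below illustrates the idea under stated assumptions: MessageNodeSketch is a hypothetical minimal stand-in for IMimeMailMessage that carries only a Subject and a Messages list, and MessageTreeSketch is likewise not part of the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal stand-in for IMimeMailMessage: only the two
// members this sketch needs.
public class MessageNodeSketch
{
    public string Subject;
    public List<MessageNodeSketch> Messages = new List<MessageNodeSketch>();
}

public static class MessageTreeSketch
{
    // Collects the subject of a message and of every embedded message,
    // depth first, prefixing each subject with one '-' per nesting level.
    public static List<string> CollectSubjects(MessageNodeSketch root)
    {
        List<string> result = new List<string>();
        Collect(root, 0, result);
        return result;
    }

    private static void Collect(MessageNodeSketch node, int depth,
        List<string> result)
    {
        result.Add(new string('-', depth) + node.Subject);
        foreach (MessageNodeSketch child in node.Messages)
            Collect(child, depth + 1, result);
    }
}
```

The same traversal should carry over to the real interface by replacing the stand-in type with IMimeMailMessage.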

To Sum Up

This article and its code have aimed to explain my attempt at implementing a MIME compliant parser which is not too tightly coupled with the POP3 mail protocol. Hopefully you can use the source, completely or partially, in your own coding. The code is not yet stable or thoroughly tested, and it can most definitely be improved extensively with regard to both architecture and performance. I do, however, think its overall architecture and idea are worth studying. Also look out for my next article, which will describe the use of this library in a POP3 compliant client library.

History

  • 2007-07-27: Article created
  • 2007-08-10: Zip file updated
  • 2007-09-03: Article and Zip file updated
  • 2008-06-13: Zip file updated
  • 2008-07-08: Zip file updated

License

This article, along with any associated source code and files, is licensed under The BSD License

About the Author

smithimage
Web Developer
Sweden

Article Copyright 2007 by smithimage