Introduction
This is part 2 of my articles about receiving email with POP3 and processing MIME. My first article, POP3 Email Client (.NET 2.0), covered the reliable downloading of emails from POP3 servers, which left us with a pure ASCII representation of the email body. That was the easier part. In this article I provide the code to split the raw ASCII email into body, attachments, alternate views, etc. This was much harder to do: while the POP3 specification is simple and laid out in a single RFC, there are several MIME-related RFCs, which offer a multitude of ways to send even simple things like an email's actual text. The MIME specification allows for great flexibility, but Microsoft, being Microsoft, of course supports only a subset (for example, no recursion of MIME parts within MIME parts). The provided code supports both worlds completely and gives the programmer the flexibility to access information about the received email as needed.
If you wonder why I wrote this article despite the fact that there are various articles on CodeProject for MIME support, here are some of the shortcomings encountered:
My code is based on the following work:
Background
Structure of a simple email
A simple email in pure ASCII might look like this:
Date: Sat, 2 Sep 2006 17:25:15 +0200
From: Sender@NoSpam.com
To: Receiver@NoSpam.com
Subject: simple plain text mail

Just a plain text email
.
The first four lines are called the header of the email, and they are separated from the body by an empty line. The end of the email is marked with a line containing just one "." (a period). A real email has many more header lines, some defined by RFC standards and others not, like this one from GMail:
X-Gmail-Received: f105c784e77f8b689759558db72ccd07f60387ba
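The layout just described (header lines, an empty line, the body, and a closing "." line) can be split with a few lines of code. The following is a minimal Python sketch for illustration only; it is not the C# code of this article, and the function name split_simple_email is hypothetical. It assumes a single-part email that has already been downloaded as text.

```python
def split_simple_email(raw: str):
    """Split a raw single-part email into (header_fields, body_lines)."""
    lines = raw.splitlines()

    # A line holding only "." marks the end of the email.
    if "." in lines:
        lines = lines[:lines.index(".")]

    # The first empty line separates the header from the body.
    if "" in lines:
        split_at = lines.index("")
        header_lines, body_lines = lines[:split_at], lines[split_at + 1:]
    else:
        header_lines, body_lines = lines, []

    fields = []
    for line in header_lines:
        if line[:1] in (" ", "\t") and fields:
            # A line starting with whitespace continues the previous field.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
        else:
            # field-name ":" field-body
            name, _, value = line.partition(":")
            fields.append((name.strip(), value.strip()))
    return fields, body_lines
```

The header is returned as a list of pairs rather than a dictionary because the same field name may occur more than once in a real email.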
Introduction of MIME
In the beginning there were just plain ASCII emails as defined in RFC 822 and its successor RFC 2822. Plain ASCII was soon not sufficient, though, and the Multipurpose Internet Mail Extensions specification (MIME) was created to support non-US-ASCII texts, multi-part message bodies, rich text (HTML), images, sounds and attachments. The specification tried to offer great flexibility and to cater to all kinds of possibilities. The result was numerous RFCs (2045, 2046, 2047, 2049, 2231, 2387, 4288, 4289, ... ). As often happens in big groups, the whole thing became rather complicated and, even worse, left it to the implementer how precisely body text, etc. are implemented.
In order to help you with the extraction of information from MIME-based emails, I'm going to explain the basic MIME principles. First let's have a look at a complete MIME email. It might be a bit confusing, but it gives a good overview of the various MIME elements, which I will explain one by one. This email has one email header, followed by the email body text and a .GIF picture.
Date: Sat, 2 Sep 2006 17:25:15 +0200
From: Sender@NoSpam.com
To: Receiver@NoSpam.com
Subject: simple gmail mail
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="0-494165446-1157210079=:74253"
Content-Transfer-Encoding: 8bit

--0-494165446-1157210079=:74253
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Content-Disposition: inline

This is the email body
This email has a smallPic.gif attachment
--0-494165446-1157210079=:74253
Content-Type: image/gif; name="SmallPic.GIF"
Content-Transfer-Encoding: base64
Content-Description: 437081412-SmallPic.GIF
Content-Disposition: inline; filename="SmallPic.GIF"

R0lGODlhQQBBAPcAAAAAAIAAAACAAICAAAAAgIAAgACAgICAgMDAwP8AAAD/
NZWZfpnCck/OeTUXvUdXxdi9/SbDPFS4t+/fwIMLH068uPHjyJMrX868ufPn
0KNLn069uvWVAQEAOw==
--0-494165446-1157210079=:74253--
.
Structure of an email header field
An email header field as defined in RFC 2822 has the following structure:
field-name ":" [ field-body ] CRLF
Example:
MIME-Version: 1.0
"MIME-Version" is the field-name and "1.0" is the field-body.
Content-Type
The most powerful MIME header field is Content-Type. Some examples:
Content-Type: text/plain;
Content-Type: text/plain; charset=ISO-8859-1
Content-Type: text/plain; charset=us-ascii
Content-Type: text/plain; charset=utf-8
Content-Type: text/html;
Content-Type: text/html; charset=ISO-8859-1
Content-Type: text/css
Content-Type: image/gif; name=image004.gif
Content-Type: image/jpeg; name="image005.jpg"
Content-Type: message/delivery-status
Content-Type: message/rfc822
Content-Type: audio/x-mpeg
Content-Type: video/mpeg-2
Content-Type: application/msword
Content-Type: application/mspowerpoint
Content-Type: application/zip
Content-Type: multipart/mixed;
    boundary="----=_Part_3431_12384933.1139387792352"
Content-Type: multipart/alternative;
    boundary="----=_Part_4088_29304219.1115463798628"
Content-Type: multipart/related;
    boundary="----=_Part_2067_9241611.1139322711488"
Content-Type: multipart/digest;
    boundary="----=Next message 15543233913938263541"
Content-Type: multipart/report; report-type=delivery-status;
    boundary="k04G6HJ9025016.1136391237/carbon.singnet.com.sg"
Content-Type: multipart/parallel
The value consists of a media type and a subtype, separated by "/". Each media type defines its own set of subtypes, which might be followed by a set of parameters, each specified as an attribute=value pair. For example:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
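To make the three pieces of a Content-Type value concrete (media type, subtype, parameters), here is a small Python sketch that takes such a value apart. It leans on Python's standard email package purely as a stand-in; it is not the parser from this article, and the helper name describe_content_type is hypothetical.

```python
from email.message import Message


def describe_content_type(value: str):
    """Return (media type, subtype, parameters) for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() lists the type/subtype first, then each attribute=value pair.
    params = dict(msg.get_params()[1:])
    return msg.get_content_maintype(), msg.get_content_subtype(), params


media_type, subtype, params = describe_content_type(
    "text/plain; charset=ISO-8859-1; format=flowed"
)
print(media_type, subtype, params)
```

For the example line this separates the media type text, the subtype plain, and the two parameters charset and format. Quotes around a parameter value, as in name="image005.jpg", are removed by the library.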
The Content-Type Multipart
The following skeleton shows the multipart structure of the email from above:
Headerlines
Content-Type: multipart/mixed; boundary="0-494165446-1157210079=:74253"
Headerlines

--0-494165446-1157210079=:74253
Content-Type: text/plain; charset=iso-8859-1
Other MIME part header lines

The plain text email body
--0-494165446-1157210079=:74253
Content-Type: image/gif; name="SmallPic.GIF"
Other MIME part header lines

The attachment coded in Base64
--0-494165446-1157210079=:74253--
.
The first 3 lines are part of the email header. The end of the header is marked by an empty line. All other lines are part of the email body, which ends with the line having only a "." (period).
Each MIME entity has an entity-header and an entity-body separated by an empty line. Since emails and MIME entities use the same structure and the same kind of header lines, a whole email can itself become a MIME entity, which is useful for mail systems.
Content-Type: multipart/mixed
Often the topmost multipart subtype is "mixed".
Content-Type: multipart/alternative
The subtype "alternative" offers the same content in more than one format, for example as plain text and as HTML:
Some header lines
MIME-Version: 1.0
Content-Type: multipart/mixed;
    boundary="----_=_NextPart_001_01C6CEA2.EF9BECF8"

------_=_NextPart_001_01C6CEA2.EF9BECF8
Content-Type: multipart/alternative;
    boundary="----_=_NextPart_002_01C6CEA2.EF9BECF8"

------_=_NextPart_002_01C6CEA2.EF9BECF8
Content-Type: text/plain; charset="iso-8859-1"

HTML sample email with bold text and attachment.
------_=_NextPart_002_01C6CEA2.EF9BECF8
Content-Type: text/html; charset="iso-8859-1"

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<STYLE>
DIV { FONT-SIZE: 10pt;
FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif }
</STYLE>
</HEAD>
<BODY>
<DIV>
HTML sample email with <STRONG>bold</STRONG> text and attachment.
</DIV>
</BODY>
</HTML>
------_=_NextPart_002_01C6CEA2.EF9BECF8--
------_=_NextPart_001_01C6CEA2.EF9BECF8
Content-Type: image/gif; name="SmallPic.GIF"

R0lGODlhQQBBAPcAAAAAAIAAAACAAICAAAAAgIAAgACAgICAgMDAwP8AAAD/
NZWZfpnCck/OeTUXvUdXxdi9/SbDPFS4t+/fwIMLH068uPHjyJMrX868ufPn
0KNLn069uvWVAQEAOw==
------_=_NextPart_001_01C6CEA2.EF9BECF8--
.
The structure of this email is:
multipart/mixed
| multipart/alternative
| | text/plain; format=flowed; charset=ISO-8859-1
| | text/html; charset=ISO-8859-1
| image/gif; name=SmallPic.GIF
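An outline like the one above can be produced by walking the MIME entities recursively. The sketch below uses Python's standard email package as a stand-in for a MIME parser; it is not the library presented in this article. RAW is a shortened copy of the sample email, with the boundaries simplified to "outer" and "inner" for readability (both are assumptions made for this example), and outline is a hypothetical helper name.

```python
from email import message_from_string

# Shortened copy of the sample email; boundaries simplified for the example.
RAW = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="outer"',
    "",
    "--outer",
    'Content-Type: multipart/alternative; boundary="inner"',
    "",
    "--inner",
    'Content-Type: text/plain; charset="iso-8859-1"',
    "",
    "HTML sample email with bold text and attachment.",
    "--inner",
    'Content-Type: text/html; charset="iso-8859-1"',
    "",
    "<HTML><BODY>HTML sample email with <STRONG>bold</STRONG> text.</BODY></HTML>",
    "--inner--",
    "--outer",
    'Content-Type: image/gif; name="SmallPic.GIF"',
    "Content-Transfer-Encoding: base64",
    "",
    "R0lGODlhQQBBAPcAAAAAAIAAAACAAICAAAAAgIAAgACAgICAgMDAwP8AAAD/",
    "--outer--",
    "",
])


def outline(part, depth=0):
    """Return one line per MIME entity, indented by nesting depth."""
    lines = ["| " * depth + part.get_content_type()]
    if part.is_multipart():
        # For a multipart entity the payload is the list of child entities.
        for child in part.get_payload():
            lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(message_from_string(RAW))))
```

The recursion is the important point: a multipart entity may contain further multipart entities, so a parser that only looks at the first level would miss the text/plain and text/html parts here.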
Notice that the picture is part of multipart/mixed, not of multipart/alternative.
Content-Type: multipart/related
Content-Transfer-Encoding
POP3 defines that the body of an email is 7-bit US-ASCII. Since the text displayed to the user can be any Unicode text and file attachments are usually arrays of bytes, the email sender must encode this content as ASCII and we, the receiver of the email, need to decode it. If the value is "7bit", no encoding was used. "8bit" and "binary" have the same meaning, but are not supported by the .NET framework. I treat "8bit" like "7bit", i.e. take the content as it is, whereas "binary" is illegal in POP3, because some character sequences like CRLF "." CRLF have a special meaning in POP3, but might occur in random binary data.
Content-Transfer-Encoding: quoted-printable
If a MIME entity consists mostly of US-ASCII characters, it is enough to encode just some special characters and all bytes not covered by the US-ASCII character set. "quoted-printable" does this by sending a "=" followed by the two-digit hexadecimal value of the byte.
Content-Transfer-Encoding: base64
R      0      l      G
010001 110100 100101 000110
Resulting 3 bytes (the ASCII characters "GIF"):
01000111 01001001 01000110
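The regrouping of four 6-bit values into three bytes can be written out by hand, and the same goes for quoted-printable, where RFC 2045 represents a byte as "=" followed by two hexadecimal digits. The Python sketch below is for illustration only and is not the decoder from this article; the function names are hypothetical, padding with "=" is not handled in the Base64 part, and the quoted-printable input is assumed to be pure ASCII text.

```python
import string

# Base64 alphabet: the index of a character is its 6-bit value.
BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def decode_base64_quad(quad: str) -> bytes:
    """Decode exactly 4 Base64 characters (no '=' padding) into 3 bytes."""
    if len(quad) != 4:
        raise ValueError("expected exactly 4 characters")
    bits = 0
    for ch in quad:
        # Append the 6 bits of this character to the right.
        bits = (bits << 6) | BASE64_ALPHABET.index(ch)
    # 4 x 6 bits = 24 bits = 3 bytes.
    return bits.to_bytes(3, "big")


def decode_quoted_printable(text: str) -> bytes:
    """Decode quoted-printable text into the original bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != "=":
            out.append(ord(text[i]))  # literal ASCII character
            i += 1
        elif text[i + 1:i + 3] == "\r\n":
            i += 3  # "=" at the end of a line: soft line break, emit nothing
        elif text[i + 1:i + 2] == "\n":
            i += 2  # same, for input with bare LF line ends
        else:
            out.append(int(text[i + 1:i + 3], 16))  # "=XX" is one byte
            i += 3
    return bytes(out)


print(decode_base64_quad("R0lG"))
print(decode_quoted_printable("Gr=FC=DFe").decode("iso-8859-1"))
```

The bytes produced by either decoder still have to be turned into text with the charset named in the Content-Type header, which is why the last line decodes with iso-8859-1.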
Details can be found in RFC 1421, 4.3.2.4 Step 4: Printable Encoding.
Using the code
The best way to get an understanding of a library is to use it.
Mapping MIME to System.Net.Mail.MailMessage
History