Parsing and Decoding Values of Some Email Message Fields
Regex-based solution for decoding the values of email address fields ("From", "To") and of the "Subject" field
Introduction
Assume you need to process mail messages automatically.
You use a POP3 client to retrieve the messages, so what you have is the raw source of each message.
The raw source of a mail message is plain text that normally consists of ASCII characters only and includes all the headers of the message as well as its body and its attachments, if any.
According to RFC 5322, header fields may contain ASCII characters only. Non-ASCII text in a header must be encoded as RFC 2047 encoded-words using either the Base64 or the Quoted-Printable algorithm, so the raw source of a mail message may contain encoded strings like these:
From: =?gb2312?B?p6+n2qfcp9qn5KfRIKezp+Sn0afip+Cn1aflp9Kn6KfWp9M=?=
To: =?gb2312?B?J0NvbWl0qKYgTm9ydWVnbyBkZWwgTm9iZWwn?=
Subject: =?gb2312?B?TGEgaW5jcmWoqmJsZSB5IHRyaXN0ZSBoaXN0b3JpYSBkZSBsYSBj?=
 =?gb2312?B?qKJuZGlkYSBFcqimbmRpcmEgeSBkZSBzdSBhYnVlbGEgZGVzYWxtYWRh?=
or:
From: =?UTF-8?Q?Garc=C3=ADa_M=C3=A1rquez?=
To: =?UTF-8?Q?Comit=C3=A9_Noruego_del_Nobel?=
Subject: La =?UTF-8?Q?incre=C3=ADble=20y=20triste=20historia=20de=20la=20c=C3=A1nd?=
 =?UTF-8?Q?ida=20Er=C3=A9ndira=20y=20de=20su=20abuela=20desalmada?=
However, any mail program will show the user this:
From: García Márquez
To: Comité Noruego del Nobel
Subject: La increíble y triste historia de la cándida Eréndira y de su abuela desalmada
How can you get the same result in C# code?
The method named "decodeMailPropertyValue" is suggested for this.
It finds the value of an email field by the field name and decodes the value found.
It is useful for extracting field values from the raw source of a message in other situations too, not only right after retrieving the message with a POP3 client.
For example, you may need to process emails stored in a database or in the file system on a local machine.
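Each encoded string shown above is an RFC 2047 "encoded-word" of the form =?charset?encoding?text?=, where the encoding is B (Base64) or Q (a Quoted-Printable variant in which "_" stands for a space). As a minimal sketch of what decoding one such word involves, the following hypothetical helper decodes a single encoded-word by hand. The names EncodedWordSketch and DecodeEncodedWord are illustrative only and are not part of any library or of the method presented in this article. The sketch assumes that the charset name is known to Encoding.GetEncoding; on .NET Core and later, legacy charsets such as gb2312 are only available after the code pages encoding provider has been registered.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodedWordSketch
{
    // Decodes one RFC 2047 encoded-word such as "=?UTF-8?Q?Garc=C3=ADa?=".
    // Returns the input unchanged if it does not have the expected shape.
    public static string DecodeEncodedWord(string word)
    {
        if (word.Length < 6 ||
            !word.StartsWith("=?", StringComparison.Ordinal) ||
            !word.EndsWith("?=", StringComparison.Ordinal))
            return word;

        // Split "charset?encoding?text" into its three parts.
        string[] parts = word.Substring(2, word.Length - 4).Split('?');
        if (parts.Length != 3)
            return word;

        // Assumption: the charset name is one that this runtime supports.
        Encoding charset = Encoding.GetEncoding(parts[0]);
        byte[] bytes;
        switch (parts[1].ToUpperInvariant())
        {
            case "B":
                bytes = Convert.FromBase64String(parts[2]);
                break;
            case "Q":
                bytes = DecodeQ(parts[2]);
                break;
            default:
                return word;
        }
        return charset.GetString(bytes);
    }

    // Turns the Q-encoded text into raw bytes: "_" is a space,
    // "=XX" is the byte with hexadecimal value XX, anything else is literal.
    // Convert.ToByte throws FormatException if XX is not valid hexadecimal.
    private static byte[] DecodeQ(string text)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '_')
            {
                bytes.Add((byte)' ');
            }
            else if (c == '=' && i + 2 < text.Length)
            {
                bytes.Add(Convert.ToByte(text.Substring(i + 1, 2), 16));
                i += 2;
            }
            else
            {
                bytes.Add((byte)c); // the text is assumed to be ASCII here
            }
        }
        return bytes.ToArray();
    }
}
```

With this sketch, EncodedWordSketch.DecodeEncodedWord("=?UTF-8?Q?Garc=C3=ADa_M=C3=A1rquez?=") yields "García Márquez", the "From" value of the second example above.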
Using the Code
Assume you are using a certain POP3 client and have created an instance of its PopHandler class:
PopHandler MyHandler = new PopHandler(server, port, user, password, false);
Get a PopMail object:
PopMail mail = MyHandler.GetMail(i);
where i is the index of a mail message in the list returned by MyHandler.GetList().
Now you can call the "decodeMailPropertyValue" method like this:
string From = decodeMailPropertyValue(mail.Source, "FROM");
string To = decodeMailPropertyValue(mail.Source, "TO");
string Subject = decodeMailPropertyValue(mail.Source, "SUBJECT", false);
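And here is the method itself. It needs the System, System.Linq, System.Net.Mail, System.Text and System.Text.RegularExpressions namespaces: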
/// <summary>
/// Decodes the value of an email field ("From", "To" or "Subject")
/// </summary>
/// <param name="mailSource">Raw source of the email message (string)</param>
/// <param name="fieldName">Case-insensitive email field name ("From", "To" or "Subject")</param>
/// <param name="addressField">"true" for address fields ("From" or "To"),
/// "false" for other fields ("Subject"). Default is true.</param>
/// <returns>Decoded value of the email field (string)</returns>
private string decodeMailPropertyValue(string mailSource, string fieldName, bool addressField = true)
{
    //looking for the line "fieldName: value" and for its folded continuation
    //lines (lines that start with a space or a tab)
    Match temp = Regex.Match(
        mailSource,
        @"(?:(?:\A|\r?\n)Field name: ([^\r\n]+)\r?\n){1}(?:([ \t][^\r\n]*)\r?\n)*"
            //escape the name so that it is matched literally
            .Replace("Field name", Regex.Escape(fieldName)),
        RegexOptions.IgnoreCase);
    string tempStr = string.Empty;
    //joins the first line of the value with its continuation lines
    string fieldValue = string.Join(
        null,
        new string[1] { temp.Groups[1].Value }
            .Concat(temp.Groups[2].Captures.OfType<Capture>().Select(x => x.Value))
            .ToArray());
    //if field "fieldName" has value
    if (!string.IsNullOrEmpty(fieldValue.Trim()))
    {
        //only for address fields
        if (addressField)
        {
            MatchCollection bracketed = Regex.Matches(
                fieldValue,
                @"(?:\A|\s+)(<([\w-\.]+@(?:(?:\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|" +
                @"(?:(?:[\w-]+\.)+))(?:[a-zA-Z]{2,4}|[0-9]{1,3})(?:\]?))>)",
                RegexOptions.IgnoreCase);
            //removes brackets from email addresses that look like "<email>"
            foreach (Match x in bracketed)
                fieldValue = fieldValue.Replace(x.Groups[1].Value, x.Groups[2].Value);
        }
        tempStr = fieldValue;
        //looking for MIME Quoted-Printable encoding(s) in the value of field "fieldName"
        MatchCollection temp2 = Regex.Matches(
            tempStr,
            @"(?:(=\?[\w\s-_]+\?Q\?[^\?]+\?=\s*)+\S*)+",
            RegexOptions.IgnoreCase);
        if (temp2.Count > 0)
        {
            foreach (Match x in temp2)
            {
                CaptureCollection captures = x.Groups[1].Captures;
                for (int index = 0; index < captures.Count; index++)
                {
                    Capture y = captures[index];
                    //splits the encoded word into charset and encoded text
                    temp = Regex.Match(
                        y.Value,
                        @"=\?([\w\s-_]+)\?Q\?(?:""([^\?]+)""|([^\?]+))\?=\s*",
                        RegexOptions.IgnoreCase);
                    //Character "_" in Quoted-Printable replaces spaces.
                    //It remains the same after decoding (only for .NET version < 4.0),
                    //so replace "_" with space before decoding.
                    string encodedText =
                        (temp.Groups[2].Success ? temp.Groups[2] : temp.Groups[3])
                            .Value
                            .Replace('_', ' ');
                    //lets Attachment decode the encoded word through its Name property
                    string decodedStr = Attachment
                        .CreateAttachmentFromString(
                            string.Empty,
                            string.Format(
                                "=?{0}?Q?{1}?=",
                                temp.Groups[1].Value,
                                Regex.Unescape(encodedText)))
                        .Name;
                    //for address fields, adds one space after the last encoded word
                    tempStr = tempStr.Replace(
                        y.Value,
                        decodedStr.PadRight(
                            addressField && index == captures.Count - 1 ?
                                decodedStr.Length + 1 : 0));
                }
            }
            fieldValue = tempStr;
        }
        //looking for Base64 encoding(s) in the value of field "fieldName"
        else
        {
            temp2 = Regex.Matches(
                tempStr,
                @"(?:(=\?[\w\s-_]+\?B\?[^\?]+\?=\s*)+\S*)+",
                RegexOptions.IgnoreCase);
            if (temp2.Count > 0)
            {
                foreach (Match x in temp2)
                {
                    CaptureCollection captures = x.Groups[1].Captures;
                    for (int index = 0; index < captures.Count; index++)
                    {
                        Capture y = captures[index];
                        //splits the encoded word into charset and Base64 text
                        temp = Regex.Match(
                            y.Value,
                            @"=\?([\w\s-_]+)\?B\?([^\?]+)\?=\s*",
                            RegexOptions.IgnoreCase);
                        string decodedStr = Encoding.GetEncoding(temp.Groups[1].Value)
                            .GetString(Convert.FromBase64String(temp.Groups[2].Value));
                        //for address fields, adds one space after the last encoded word
                        tempStr = tempStr.Replace(
                            y.Value,
                            decodedStr.PadRight(
                                addressField && index == captures.Count - 1 ?
                                    decodedStr.Length + 1 : 0));
                    }
                }
                fieldValue = tempStr;
            }
            //only for address fields:
            //looking for non-encoded strings that can contain sender/recipient names
            //and their addresses
            else if (addressField)
            {
                temp = Regex.Match(
                    tempStr,
                    @"(?:(""[^\r\n]+"")*\s*\S+)+",
                    RegexOptions.IgnoreCase);
                if (temp.Success)
                {
                    //removes unescaped double quotes, then unescapes the rest
                    fieldValue = Regex.Unescape(
                        Regex.Replace(
                            fieldValue,
                            @"(?<!\\)""",
                            string.Empty,
                            RegexOptions.IgnoreCase));
                }
            }
        }
    }
    return fieldValue;
}
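The first step of the method, locating a field and collecting its folded continuation lines, can also be illustrated without a regular expression. The sketch below is a simplified, hypothetical helper: HeaderSketch and GetRawHeaderValue are illustrative names, not part of the method above or of any library. It assumes that the header section ends at the first empty line and that continuation lines begin with a space or a tab, as RFC 5322 describes. It returns the value still encoded, so decoding remains a separate step.

```csharp
using System;
using System.Text;

public static class HeaderSketch
{
    // Returns the raw (still encoded) value of the first header named fieldName,
    // with its folded continuation lines appended, or null if the field is absent.
    public static string GetRawHeaderValue(string mailSource, string fieldName)
    {
        string prefix = fieldName + ":";
        StringBuilder value = null;

        foreach (string line in mailSource.Replace("\r\n", "\n").Split('\n'))
        {
            if (line.Length == 0)
                break; // an empty line ends the header section

            bool isContinuation = line[0] == ' ' || line[0] == '\t';
            if (value != null)
            {
                if (!isContinuation)
                    break; // the next field starts here
                value.Append(line); // keeps the folding whitespace
            }
            else if (!isContinuation &&
                     line.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                value = new StringBuilder(line.Substring(prefix.Length).TrimStart());
            }
        }
        return value == null ? null : value.ToString();
    }
}
```

For a raw source of "From: a@example.com\r\nSubject: first\r\n second\r\nTo: b@example.com\r\n\r\nbody", calling HeaderSketch.GetRawHeaderValue(raw, "subject") returns "first second", and asking for a field that is not present, such as "Cc", returns null.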