Introduction
In some cases you need to know which code page (encoding) is best for transferring text over the internet or storing it in a text file. One could argue that Unicode always does the trick, but I needed the most efficient (byte-saving) way to transfer data.
Detecting a code page from text is a very tricky task. Luckily, Microsoft provides the MLang API, in which the IMultiLanguage3 interface can determine the best code page for outgoing text. Similarly, the IMultiLanguage2 interface can detect the code page of incoming text.
Background
The problem
I started this along with another component that constructs MIME-conformant emails. The body of the email is passed as a string, and I wondered whether it is possible to detect the best encoding from the given text.
The dirty hack attempt
My first attempt was a simple brute-force attack:
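The original listing is not part of this copy of the article. As a rough, hypothetical illustration only (not the author's code), such an attempt could encode the text with each candidate encoding and keep the smallest result. The names BruteForceSketch and PickSmallest are made up for this sketch:

```csharp
using System;
using System.Text;

public static class BruteForceSketch
{
    // Hypothetical helper: returns the first candidate with the smallest encoded size.
    public static Encoding PickSmallest(string text, params Encoding[] candidates)
    {
        Encoding best = null;
        int bestSize = int.MaxValue;
        foreach (Encoding candidate in candidates)
        {
            int size = candidate.GetByteCount(text);
            if (size < bestSize)
            {
                best = candidate;
                bestSize = size;
            }
        }
        return best;
    }

    public static void Main()
    {
        string text = "Gr\u00FC\u00DFe"; // "Grüße"
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        Console.WriteLine(Encoding.ASCII.GetByteCount(text)); // 5
        Console.WriteLine(latin1.GetByteCount(text));         // 5
        Console.WriteLine(Encoding.UTF8.GetByteCount(text));  // 7

        // ASCII "wins" on size although it cannot represent the umlauts.
        Encoding picked = PickSmallest(text, Encoding.ASCII, latin1, Encoding.UTF8);
        Console.WriteLine(picked.WebName);                             // us-ascii
        Console.WriteLine(picked.GetString(picked.GetBytes(text)));    // Gr??e
    }
}
```

For the sample text, ASCII and ISO-8859-1 both report 5 bytes, and the ASCII result has silently lost the umlauts, so the byte count alone says nothing about which single-byte code page fits.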
This is not only ugly, it does not even work properly. Every single-byte encoding produces exactly one byte per character (characters it cannot represent are silently replaced, typically with '?'), so the encoded sizes are identical. The code page only determines which character each byte value maps to for display. This method can therefore only distinguish between ASCII (7-bit), single-byte (8-bit) and the different Unicode flavors (UTF-7, UTF-8, Unicode, etc.).
Finding something better
Then I remembered that Internet Explorer 5.5 introduced a new interface exported from the MLang DLL: IMultiLanguage3. Wow! This sounded more than promising! The interface has only two methods:
DetectOutboundCodePage
DetectOutboundCodePageInIStream
I chose to use the first one.
Using MLang
MLang.dll is in the Windows\system32 directory. Along with some exported functions, it provides some COM classes but does not contain a type library, so the easy way (Add Reference in Visual Studio) did not work. MLang.idl is part of the Platform SDK and can be found in its include directory. Compiling it with midl produces a type library, which tlbimp then turns into an interop assembly:
C:\temp>midl MLang.idl > null
Microsoft (R) 32b/64b MIDL Compiler Version 6.00.0366
Copyright (c) Microsoft Corporation 1991-2002. All rights reserved.
MLang.idl
unknwn.idl
wtypes.idl
basetsd.h
guiddef.h
oaidl.idl
objidl.idl
oaidl.acf
C:\temp>tlbimp mlang.tlb /silent
The result of those two commands is a brand new assembly named MultiLanguage.dll. Using Lutz Roeder's Reflector, I had a look at the signature:
[MethodImpl(MethodImplOptions.InternalCall,
    MethodCodeType=MethodCodeType.Runtime)]
void DetectOutboundCodePage([In] uint dwFlags,
    [In, MarshalAs(UnmanagedType.LPWStr)] string lpWideCharStr,
    [In] uint cchWideChar,
    [In] ref uint puiPreferredCodePages,
    [In] uint nPreferredCodePages,
    [In] ref uint puiDetectedCodePages,
    [In, Out] ref uint pnDetectedCodePages,
    [In] ref ushort lpSpecialChar);
I was not so happy with the generated parameter types, in particular the code page arrays being passed as ref uint. So I first exported the generated assembly to C# source code and then changed it a little:
[Flags]
public enum MLCPF
{
// Not currently supported.
MLDETECTF_MAILNEWS = 0x0001,
// Not currently supported.
MLDETECTF_BROWSER = 0x0002,
// Detection result must be valid for conversion and text rendering.
MLDETECTF_VALID = 0x0004,
// Detection result must be valid for conversion.
MLDETECTF_VALID_NLS = 0x0008,
// Preserve preferred code page order.
// This is meaningful only if you have set the puiPreferredCodePages
// parameter
// in IMultiLanguage3::DetectOutboundCodePage
// or IMultiLanguage3::DetectOutboundCodePageInIStream.
MLDETECTF_PRESERVE_ORDER = 0x0010,
// Only return one of the preferred code pages as the detection result.
// This is meaningful only if you have set the puiPreferredCodePages
// parameter
// in IMultiLanguage3::DetectOutboundCodePage
// or IMultiLanguage3::DetectOutboundCodePageInIStream.
MLDETECTF_PREFERRED_ONLY = 0x0020,
// Filter out graphical symbols and punctuation.
MLDETECTF_FILTER_SPECIALCHAR = 0x0040,
// Return only Unicode codepages if the euro character is detected.
MLDETECTF_EURO_UTF8 = 0x0080
}
[MethodImpl(MethodImplOptions.InternalCall,
MethodCodeType=MethodCodeType.Runtime)]
void DetectOutboundCodePage([In] MLCPF dwFlags,
[In, MarshalAs(UnmanagedType.LPWStr)] string lpWideCharStr,
[In] uint cchWideChar,
[In] IntPtr puiPreferredCodePages,
[In] uint nPreferredCodePages,
[In] IntPtr puiDetectedCodePages,
[In, Out] ref uint pnDetectedCodePages,
[In] ref ushort lpSpecialChar);
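Because MLCPF is declared with [Flags], several options can be combined with the bitwise OR operator before being passed as dwFlags. The following is a minimal sketch; it repeats three of the enum values so that it compiles on its own, and the helper BuildOptions is hypothetical:

```csharp
using System;

// Local copy of a few MLCPF values from the listing above.
[Flags]
public enum MLCPF
{
    MLDETECTF_VALID_NLS = 0x0008,
    MLDETECTF_PRESERVE_ORDER = 0x0010,
    MLDETECTF_PREFERRED_ONLY = 0x0020
}

public static class FlagsSketch
{
    // Hypothetical helper: builds the option set for a call that passes preferred code pages.
    public static MLCPF BuildOptions(bool preserveOrder)
    {
        MLCPF options = MLCPF.MLDETECTF_VALID_NLS;
        if (preserveOrder)
            options |= MLCPF.MLDETECTF_PRESERVE_ORDER;
        return options;
    }

    public static void Main()
    {
        MLCPF options = BuildOptions(true);
        Console.WriteLine((int)options);                                     // 24 (0x0008 | 0x0010)
        Console.WriteLine(options.HasFlag(MLCPF.MLDETECTF_PRESERVE_ORDER));  // True
        Console.WriteLine(options.HasFlag(MLCPF.MLDETECTF_PREFERRED_ONLY));  // False
    }
}
```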
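Then I added the source files to my project (no more MultiLanguage.dll assembly required).
Using IMultiLanguage3::DetectOutboundCodePage
Getting an instance of the COM class implementing IMultiLanguage3 is straightforward:
// get the IMultiLanguage3 interface
MultiLanguage.IMultiLanguage3 multilang3 = new MultiLanguage.CMultiLanguageClass();
if (multilang3 == null)
    throw new System.Runtime.InteropServices.COMException(
        "Failed to get IMultilang3");
The next thing is to fill the parameters. The first parameter, dwFlags, takes a combination of the MLCPF values shown above. The next two parameters (lpWideCharStr and cchWideChar) are the text to analyze and its length in characters. With the next two parameters (puiPreferredCodePages and nPreferredCodePages) you can pass an array of preferred code pages and its length. The last three parameters contain the result of detection after the method has completed successfully. So the actual call looks like this:
uint[] preferedEncodings; // array of uint passed as parameter to the function
int[] resultCodePages = new int[preferedEncodings.Length]; // result array
uint detectedCodepages = (uint)resultCodePages.Length; // in: capacity, out: count
ushort specialChar = (ushort)'?'; // passed by reference as lpSpecialChar (value assumed here)
// copy the preferred code pages to unmanaged memory (Marshal.Copy has no uint[] overload)
IntPtr pPrefEncs = Marshal.AllocCoTaskMem(sizeof(uint) * preferedEncodings.Length);
IntPtr pDetectedEncs = Marshal.AllocCoTaskMem(sizeof(uint) * resultCodePages.Length);
try
{
    Marshal.Copy(Array.ConvertAll(preferedEncodings, cp => (int)cp), 0,
        pPrefEncs, preferedEncodings.Length);
    // ... call the function
    multilang3.DetectOutboundCodePage(options, input, (uint)input.Length,
        pPrefEncs, (uint)preferedEncodings.Length,
        pDetectedEncs, ref detectedCodepages, ref specialChar);
    // evaluate the result
    Marshal.Copy(pDetectedEncs, resultCodePages, 0, (int)detectedCodepages);
    for (int i = 0; i < detectedCodepages; i++)
        result.Add(Encoding.GetEncoding(resultCodePages[i]));
}
finally
{
    Marshal.FreeCoTaskMem(pPrefEncs);
    Marshal.FreeCoTaskMem(pDetectedEncs);
}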
Finally, the COM object should be released:
Marshal.FinalReleaseComObject(multilang3);
Using IMultiLanguage2::DetectInputCodepage
After being able to choose the best encoding to send text over the internet or save it to a stream, the next task was to detect the best encoding for incoming text if the sender (or storer) did not choose the best encoding. If you open a text file containing text created with a code page that is different from the current UI code page, the text may be decoded incorrectly. This is where IMultiLanguage2::DetectInputCodepage comes in.
In the demo application you can double-click an encoding to test which method has the better result (see "Testing the DetectInputCodepage performance" below).
The other practical use is to detect the encoding of emails from badly implemented MIME mailers. Some weird mailers send emails in 8-bit encoding without specifying any character set in the header. In this case DetectInputCodepage can make an educated guess.
As with the outbound method, I adjusted the generated declarations:
public enum MLDETECTCP
{
// Default setting will be used.
MLDETECTCP_NONE = 0,
// Input stream consists of 7-bit data.
MLDETECTCP_7BIT = 1,
// Input stream consists of 8-bit data.
MLDETECTCP_8BIT = 2,
// Input stream consists of double-byte data.
MLDETECTCP_DBCS = 4,
// Input stream is an HTML page.
MLDETECTCP_HTML = 8,
//Not currently supported.
MLDETECTCP_NUMBER = 16
}
[MethodImpl(MethodImplOptions.InternalCall,
MethodCodeType=MethodCodeType.Runtime)]
void DetectInputCodepage([In] MLDETECTCP flags, [In] uint dwPrefWinCodePage,
[In] ref byte pSrcStr, [In, Out] ref int pcSrcSize,
[In, Out] ref DetectEncodingInfo lpEncoding,
[In, Out] ref int pnScores);
The usage of the function is almost identical to that of DetectOutboundCodePage:
int maxEncodings; // parameter specifying how many encodings to return
// result array
MultiLanguage.DetectEncodingInfo[] detectedEncodings =
    new MultiLanguage.DetectEncodingInfo[maxEncodings];
int srcLen = input.Length; // length of the input
int scores = detectedEncodings.Length; // in: capacity, out: number of detected scores
// setup options (none)
MultiLanguage.MLDETECTCP options = MultiLanguage.MLDETECTCP.MLDETECTCP_NONE;
// finally... call to DetectInputCodepage
multilang2.DetectInputCodepage(options, 0, ref input[0], ref srcLen,
    ref detectedEncodings[0], ref scores);
// get result
if (scores > 0)
{
    for (int i = 0; i < scores; i++)
    {
        // add the result
        result.Add(Encoding.GetEncoding((int)detectedEncodings[i].nCodePage));
    }
}
My first tests were not that promising. As a workaround, input shorter than 256 bytes is repeated until it fills 256 bytes:
// expand the string to be at least 256 bytes
// (empty input is skipped to avoid a division by zero)
if (input.Length > 0 && input.Length < 256)
{
    byte[] newInput = new byte[256];
    int steps = 256 / input.Length;
    for (int i = 0; i < steps; i++)
        Array.Copy(input, 0, newInput, input.Length * i, input.Length);
    int rest = 256 % input.Length;
    if (rest > 0)
        Array.Copy(input, 0, newInput, steps * input.Length, rest);
    input = newInput;
}
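The same padding step can be written as a small standalone helper, which makes it easy to try out in isolation. This is a sketch only; the names PaddingSketch and RepeatToLength are not part of the article's code:

```csharp
using System;

public static class PaddingSketch
{
    // Returns input unchanged if it is empty or already long enough;
    // otherwise returns exactly minLength bytes filled with repeated copies of input.
    public static byte[] RepeatToLength(byte[] input, int minLength)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (input.Length == 0 || input.Length >= minLength)
            return input;

        byte[] result = new byte[minLength];
        for (int offset = 0; offset < minLength; offset += input.Length)
        {
            // The last copy may be partial.
            int count = Math.Min(input.Length, minLength - offset);
            Array.Copy(input, 0, result, offset, count);
        }
        return result;
    }

    public static void Main()
    {
        byte[] padded = RepeatToLength(new byte[] { 1, 2, 3 }, 8);
        Console.WriteLine(string.Join(",", padded)); // 1,2,3,1,2,3,1,2
    }
}
```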
Wrapping it all up
I decided to create a static class, EncodingTools, to provide access to the MLang functionality. It exposes six high-level methods that should cover most of the usage scenarios.
It also has three public static arrays of predefined code page sets.
These arrays contain the code pages in the order that returns the best result, not in the natural sort order.
Testing the DetectInputCodepage performance
[Screenshot: comparison of detection results]
All the samples were detected correctly.
Using the EncodingTools class
The following code snippets show how to use the EncodingTools class.
Outgoing Encoding
Detect best encoding for a Stream
// save the given text using the optimal encoding
private void SaveToStream(string text, string path)
{
    // this is all... detect the encoding
    Encoding enc = EncodingTools.GetMostEfficientEncodingForStream(text);
    // then save
    using (StreamWriter sw = new StreamWriter(path, false, enc))
        sw.Write(text);
}
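Since a detected encoding is only a suggestion, it can be worth verifying that the text survives a round trip before writing it. The following is a minimal sketch using only System.Text; the helper CanRoundTrip is hypothetical and not part of EncodingTools:

```csharp
using System;
using System.Text;

public static class RoundTripSketch
{
    // Returns true if text can be encoded and decoded again without loss.
    public static bool CanRoundTrip(string text, Encoding encoding)
    {
        // Re-create the encoding with fallbacks that throw instead of substituting '?'.
        Encoding strict = Encoding.GetEncoding(encoding.CodePage,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        try
        {
            return strict.GetString(strict.GetBytes(text)) == text;
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no representation in this encoding.
            return false;
        }
    }

    public static void Main()
    {
        string text = "Gr\u00FC\u00DFe"; // "Grüße"
        Console.WriteLine(CanRoundTrip(text, Encoding.ASCII));                     // False
        Console.WriteLine(CanRoundTrip(text, Encoding.GetEncoding("iso-8859-1"))); // True
        Console.WriteLine(CanRoundTrip(text, Encoding.UTF8));                      // True
    }
}
```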
Detect best encoding for an email body
// save the given text as an email using the optimal encoding
private void SaveAsEmail(string text, string path)
{
    // this is all... detect the encoding
    Encoding enc = EncodingTools.GetMostEfficientEncoding(text);
    // then save
    using (StreamWriter sw = new StreamWriter(path, false, Encoding.ASCII))
    {
        sw.WriteLine("Subject: test");
        sw.WriteLine("MIME-Version: 1.0");
        sw.WriteLine(
            "Content-Type: text/plain;\r\n\tcharset=\"{0}\"",
            enc.BodyName);
        sw.WriteLine("Content-Transfer-Encoding: base64"); // should be QP
        sw.WriteLine();
        sw.Write(Convert.ToBase64String(enc.GetBytes(text),
            Base64FormattingOptions.InsertLineBreaks));
    }
}
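The comment in the listing above notes that the body should really use quoted-printable (QP) rather than base64. A simplified QP encoder along the lines of RFC 2045 might look like the sketch below. It is an illustration, not part of EncodingTools: the class name is made up, and it only treats CRLF pairs as hard line breaks.

```csharp
using System;
using System.Text;

public static class QuotedPrintableSketch
{
    public static string Encode(byte[] data)
    {
        var sb = new StringBuilder();
        int lineLength = 0;
        for (int i = 0; i < data.Length; i++)
        {
            byte b = data[i];
            // A CRLF pair in the input is passed through as a hard line break.
            if (b == 13 && i + 1 < data.Length && data[i + 1] == 10)
            {
                sb.Append("\r\n");
                lineLength = 0;
                i++;
                continue;
            }
            // Whitespace directly before a line break or the end of data must be encoded.
            bool atLineEnd = i + 1 == data.Length ||
                (data[i + 1] == 13 && i + 2 < data.Length && data[i + 2] == 10);
            bool literal = (b >= 33 && b <= 126 && b != (byte)'=') ||
                ((b == 32 || b == 9) && !atLineEnd);
            string token = literal ? ((char)b).ToString() : "=" + b.ToString("X2");
            // Keep each line at 76 characters or fewer, including the soft break '='.
            if (lineLength + token.Length > 75)
            {
                sb.Append("=\r\n");
                lineLength = 0;
            }
            sb.Append(token);
            lineLength += token.Length;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(Encode(latin1.GetBytes("Gr\u00FC\u00DFe"))); // Gr=FC=DFe
        Console.WriteLine(Encode(Encoding.ASCII.GetBytes("a=b ")));    // a=3Db=20
    }
}
```

For mostly ASCII text, QP keeps the body readable and is usually smaller than base64, which always grows the data by about a third.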
Incoming Encoding
Open a Text File
private void OpenTextFileTest()
{
    // read the complete file into a string
    string content = EncodingTools.ReadTextFile(@"C:\test\txt");
    // create a StreamReader with the guessed best encoding
    using (StreamReader sr = EncodingTools.OpenTextFile(@"C:\test\txt"))
    {
        string fileContent = sr.ReadToEnd();
    }
}
Reading from a Stream
private void ReadStreamTest()
{
    // create a StreamReader from a stream
    using (MemoryStream ms = new MemoryStream(
        Encoding.GetEncoding("windows-1252").GetBytes("Some umlauts: öäüß")))
    {
        using (StreamReader sr = EncodingTools.OpenTextStream(ms))
        {
            string fileContent = sr.ReadToEnd();
        }
    }
}
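To see why guessing the code page matters for a stream like the one in ReadStreamTest, consider what a plain StreamReader does with such bytes. Its built-in detection only recognizes Unicode byte order marks, so text in a single-byte code page is decoded with whatever encoding the caller assumed. The following sketch uses ISO-8859-1 instead of windows-1252 so that it runs without extra encoding providers; the class and method names are hypothetical:

```csharp
using System;
using System.IO;
using System.Text;

public static class WrongDecoderSketch
{
    // Decodes bytes through a StreamReader that assumes the given encoding
    // unless the stream starts with a Unicode byte order mark.
    public static string Read(byte[] bytes, Encoding assumed)
    {
        using (var ms = new MemoryStream(bytes))
        using (var sr = new StreamReader(ms, assumed, true))
            return sr.ReadToEnd();
    }

    public static void Main()
    {
        string original = "\u00F6\u00E4\u00FC\u00DF"; // "öäüß"
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] bytes = latin1.GetBytes(original);

        Console.WriteLine(Read(bytes, latin1) == original);               // True
        Console.WriteLine(Read(bytes, Encoding.UTF8) == original);        // False
        Console.WriteLine(Read(bytes, Encoding.UTF8).Contains("\uFFFD")); // True
    }
}
```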