
Detect Encoding for In- and Outgoing Text

By Carsten Zeumer, 27 Oct 2009
 

Sample Image - DetectEncoding.gif

Introduction

In some cases, you need to know what the best codepage (encoding) is to either transfer text over the internet or store it in a text file. One could argue that Unicode always does the trick but I needed the most efficient (byte saving) way to transfer data.

Detecting a code page from text is a very tricky task. Luckily, Microsoft provides the MLang API, whose IMultiLanguage3 interface offers outbound encoding detection.

Similarly, the IMultiLanguage2 interface has a function to detect the encoding of an incoming byte array. This is very handy for codepage detection of text stored in files or for text that needs to be sent over the internet.

The EncodingTools class offers some easy-to-use functions to determine the best encoding for different scenarios.

Background

The Problem

I started this along with another component that constructs MIME-conformant emails. The body of the email is passed as a String, and the user had to provide the charset to use by hand. This is fine as long as you know the target character set or always assume Unicode. But it is definitely not a good solution if you have an end-user GUI application (most users do not even know what an "encoding" is).

I wondered whether it was possible to detect the best encoding from the given text.

The Dirty Hack Attempt

My first attempt was a simple brute-force attack:

  • Build a list of suitable encodings (only ISO codepages and Unicode)
  • Iterate over all considered encodings
  • Encode the text using this encoding
  • Decode it back to Unicode
  • Compare the result with the original text to detect errors
  • If there are no errors, remember the encoding that produced the fewest bytes

This is not only ugly, it does not even work properly. All single-byte encodings produce the same number of bytes (one per character), so the size of the result says nothing about which of them fits best. The codepage is only used to map single bytes to the correct character for display.

So this method can only distinguish between ASCII (7-bit), single-byte (8-bit) and the different Unicode flavors (UTF-7, UTF-8, UTF-16, etc.).
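
For illustration, the brute-force idea can be sketched as follows. This is a minimal sketch and not the code of the original attempt: the class name, the method name and the candidate list are assumptions made for this example, and only encodings that are built into .NET are used. The exception fallbacks make the encoder throw on characters it cannot represent, so lossy candidates are skipped.

```csharp
using System;
using System.Text;

public static class BruteForceEncodingSketch
{
    // Hypothetical candidate list; all four names are built into .NET.
    private static readonly string[] Candidates =
        { "us-ascii", "iso-8859-1", "utf-8", "utf-16" };

    // Returns the first candidate that round-trips the text with the
    // fewest bytes, or null if no candidate can represent the text.
    public static Encoding SmallestLosslessEncoding(string text)
    {
        Encoding best = null;
        int bestSize = int.MaxValue;

        foreach (string name in Candidates)
        {
            // Throw on unencodable characters instead of writing '?'.
            Encoding enc = Encoding.GetEncoding(name,
                EncoderFallback.ExceptionFallback,
                DecoderFallback.ExceptionFallback);
            try
            {
                byte[] bytes = enc.GetBytes(text);
                bool roundTrips = enc.GetString(bytes) == text;
                if (roundTrips && bytes.Length < bestSize)
                {
                    best = enc;
                    bestSize = bytes.Length;
                }
            }
            catch (EncoderFallbackException)
            {
                // This encoding cannot represent the text; try the next one.
            }
        }
        return best;
    }
}
```

Because ties are resolved by list order, two single-byte codepages that can both represent the text are indistinguishable here, which is the limitation described above.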

Finding Something Better

Then I remembered the IMultiLanguage2.DetectInputCodepage method that was introduced along with Internet Explorer 5.0. This method detects the encoding used in a text (Internet Explorer uses it for automatic codepage detection if the header is missing from a page). It works on incoming bytes, though, so it was not suitable for my problem, and I wondered whether there had been further development since version 5.0. A wrapper function for DetectInputCodepage is provided in the EncodingTools class.

Internet Explorer 5.5 introduced a new interface exported from the MLang DLL: IMultiLanguage3. This is what MSDN says about this interface:
This interface extends IMultiLanguage2 by adding outbound text detection functionality to it.

Wow! This sounded more than promising! The interface has only two methods:

  • DetectOutboundCodePage (for strings)
  • DetectOutboundCodePageInIStream (for streams)

I chose to use the first one.

Using MLang

MLang.dll is in the Windows\system32 directory. Along with some exported functions, it provides some COM classes, but it does not contain a type library. So the easy way (Add Reference in Visual Studio) did not work.

The MLang.idl is part of the Platform SDK and can be found in the include directory.
To create an assembly from the IDL file, use the following commands from the Visual Studio Command Prompt:

C:\temp>midl MLang.idl
Microsoft (R) 32b/64b MIDL Compiler Version 6.00.0366
Copyright (c) Microsoft Corporation 1991-2002. All rights reserved.
MLang.idl
unknwn.idl
wtypes.idl
basetsd.h
guiddef.h
oaidl.idl
objidl.idl
oaidl.acf

C:\temp>tlbimp mlang.tlb /silent

The result of those two commands is a brand new Assembly named MultiLanguage.dll.

Using Lutz Roeder's Reflector, I had a look at the signature:

[MethodImpl(MethodImplOptions.InternalCall, 
    MethodCodeType=MethodCodeType.Runtime)]
void DetectOutboundCodePage([In] uint dwFlags, 
    [In, MarshalAs(UnmanagedType.LPWStr)] string lpWideCharStr, 
    [In] uint cchWideChar, 
    [In] ref uint puiPreferredCodePages, 
    [In] uint nPreferredCodePages, 
    [In] ref uint puiDetectedCodePages, 
    [In, Out] ref uint pnDetectedCodePages, 
    [In] ref ushort lpSpecialChar);

I was not so happy with the ref uint for the puiPreferredCodePages and puiDetectedCodePages parameters. Also, a typed enum for the dwFlags was missing.

So I first exported the generated assembly to C# source code and then changed it a little:

[Flags]
public enum MLCPF
{
    // Not currently supported.
    MLDETECTF_MAILNEWS = 0x0001,

    // Not currently supported.
    MLDETECTF_BROWSER = 0x0002,
    
    // Detection result must be valid for conversion and text rendering.
    MLDETECTF_VALID = 0x0004,
    
    // Detection result must be valid for conversion.
    MLDETECTF_VALID_NLS = 0x0008,

    // Preserve preferred code page order. 
    // This is meaningful only if you have set the puiPreferredCodePages 
    // parameter
    // in IMultiLanguage3::DetectOutboundCodePage 
    // or IMultiLanguage3::DetectOutboundCodePageInIStream.
    MLDETECTF_PRESERVE_ORDER = 0x0010,

    // Only return one of the preferred code pages as the detection result. 
    // This is meaningful only if you have set the puiPreferredCodePages 
    // parameter 
    // in IMultiLanguage3::DetectOutboundCodePage 
    // or IMultiLanguage3::DetectOutboundCodePageInIStream.
    MLDETECTF_PREFERRED_ONLY = 0x0020,

    // Filter out graphical symbols and punctuation.
    MLDETECTF_FILTER_SPECIALCHAR = 0x0040,
    
    // Return only Unicode codepages if the euro character is detected. 
    MLDETECTF_EURO_UTF8 = 0x0080
}             
        
[MethodImpl(MethodImplOptions.InternalCall, 
    MethodCodeType=MethodCodeType.Runtime)]
void DetectOutboundCodePage([In] MLCPF dwFlags, 
[In, MarshalAs(UnmanagedType.LPWStr)] string lpWideCharStr, 
[In] uint cchWideChar,
[In] IntPtr puiPreferredCodePages, 
[In] uint nPreferredCodePages, 
[In] IntPtr puiDetectedCodePages, 
[In, Out] ref uint pnDetectedCodePages, 
[In] ref ushort lpSpecialChar);

Then I added the source files to my project (no more MultiLanguage.dll assembly required).

Using IMultiLanguage3::DetectOutboundCodePage

Getting an instance of the COM class implementing IMultiLanguage3 is straightforward:

// get the IMultiLanguage3 interface
MultiLanguage.IMultiLanguage3 multilang3 = new 
    MultiLanguage.CMultiLanguageClass();
if (multilang3 == null)
    throw new System.Runtime.InteropServices.COMException(
        "Failed to get IMultilang3");

The next thing is to fill the parameters.

The first parameter, dwFlags, is a combination of the tagMLCPF flags. I chose always to set the MLDETECTF_VALID_NLS because the result will be used for conversion.

The MLDETECTF_PRESERVE_ORDER and MLDETECTF_PREFERRED_ONLY are used depending on the parameters passed to my detection method.
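
How the flags might be combined can be sketched like this. To keep the example self-contained, it declares its own small enum that mirrors three of the MLCPF values listed above; the enum name, the class name and the two boolean parameters are assumptions made for illustration and are not taken from the EncodingTools source.

```csharp
using System;

// Hypothetical local copy of three MLCPF values, for illustration only.
[Flags]
public enum DetectFlags
{
    ValidNls      = 0x0008, // MLDETECTF_VALID_NLS
    PreserveOrder = 0x0010, // MLDETECTF_PRESERVE_ORDER
    PreferredOnly = 0x0020  // MLDETECTF_PREFERRED_ONLY
}

public static class DetectFlagsSketch
{
    public static DetectFlags Build(bool preserveOrder, bool preferredOnly)
    {
        // Always required, because the result is used for conversion.
        DetectFlags flags = DetectFlags.ValidNls;

        // Optional flags depend on what the caller asked for.
        if (preserveOrder)
            flags |= DetectFlags.PreserveOrder;
        if (preferredOnly)
            flags |= DetectFlags.PreferredOnly;

        return flags;
    }
}
```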

The next two parameters (lpWideCharStr and cchWideChar) are simply the string passed for detection and its length.

With the next two parameters (puiPreferredCodePages and nPreferredCodePages), the detection can be limited to a subset of all codepages. This is very useful if only certain codepages are acceptable as a result.

The last three parameters contain the result of detection after the method has completed successfully.

So the actual call looks like this:
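int[] preferedEncodings; // code pages passed as parameter to the function
int[] resultCodePages = new int[preferedEncodings.Length]; // result array
uint detectedCodepages = (uint)resultCodePages.Length;
ushort specialChar = (ushort)'?';

// the modified signature takes IntPtr, so use unmanaged copies of the arrays
IntPtr pPrefEncs = Marshal.AllocCoTaskMem(sizeof(int) * preferedEncodings.Length);
IntPtr pDetectedEncs = Marshal.AllocCoTaskMem(sizeof(int) * resultCodePages.Length);
try
{
    Marshal.Copy(preferedEncodings, 0, pPrefEncs, preferedEncodings.Length);

    // ... call the function
    multilang3.DetectOutboundCodePage(options, input, (uint)input.Length,
        pPrefEncs, (uint)preferedEncodings.Length,
        pDetectedEncs, ref detectedCodepages, ref specialChar);

    // evaluate the result
    if (detectedCodepages > 0)
    {
        Marshal.Copy(pDetectedEncs, resultCodePages, 0, (int)detectedCodepages);
        result = Encoding.GetEncoding(resultCodePages[0]);
    }
}
finally
{
    Marshal.FreeCoTaskMem(pPrefEncs);
    Marshal.FreeCoTaskMem(pDetectedEncs);
}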

Finally the COM object should be freed.

Marshal.FinalReleaseComObject(multilang3);

Using IMultiLanguage2::DetectInputCodepage

After being able to choose the best encoding to send text over the internet, or save it to a stream, the next task was to detect the best encoding for incoming text if the sender (or storer) did not choose the best encoding.

DetectInputCodepage has (at least) two practical uses. By default, Windows stores text files in the current default (UI) encoding. For example, on my system this is "Windows-1252". A user from Russia will write text using "Windows-1251". Both codepages are single-byte and do not have any preamble. So a text file will not contain any information about the codepage used.

So if you open a text file created with a codepage that differs from the current UI codepage, a StreamReader will read the text as if it was stored in the UI's current codepage. (The encoding detection of the StreamReader is mostly a preamble check, so it will fail for almost all non-Unicode files and for Unicode files without a BOM.)
Most characters outside of the common ASCII charset will be displayed incorrectly.
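
To make clear why a pure preamble check cannot help with single-byte codepages, here is a minimal sketch of such a check. It is not the StreamReader implementation; the class and method names are made up for this example. It only recognizes the byte order marks of UTF-8, UTF-16 and little-endian UTF-32 and returns null for everything else.

```csharp
using System.Text;

public static class PreambleSketch
{
    // Returns the encoding announced by a byte order mark,
    // or null if the data does not start with a known BOM.
    public static Encoding FromPreamble(byte[] data)
    {
        if (data == null) return null;
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return Encoding.UTF8;
        // UTF-32 LE must be tested before UTF-16 LE,
        // because its BOM also starts with FF FE.
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Encoding.UTF32;
        if (StartsWith(data, 0xFF, 0xFE)) return Encoding.Unicode;
        if (StartsWith(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode;
        return null;
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A file written in Windows-1252 or Windows-1251 starts directly with text bytes, so a check like this returns null and the reader has to fall back to a default.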

This is where the DetectInputCodepage comes in handy. Its accuracy is not 100% but it is definitely better than the one from the StreamReader.

In the demo application, you can double click on an encoding to test which method has the better result (see "Testing the DetectInputCodepage performance" below).

The other practical use is to detect the encoding of emails from badly implemented MIME mailers. Some weird mailers send emails in 8-bit encoding without specifying any character set in the header. In this case, DetectInputCodepage can help a lot.

As for the DetectOutboundCodePage method, I changed the method signature a little and added the MLDETECTCP enumeration. The resulting code looks like this:

public enum MLDETECTCP {
    // Default setting will be used. 
    MLDETECTCP_NONE = 0,

    // Input stream consists of 7-bit data. 
    MLDETECTCP_7BIT = 1,

    // Input stream consists of 8-bit data. 
    MLDETECTCP_8BIT = 2,

    // Input stream consists of double-byte data. 
    MLDETECTCP_DBCS = 4,

    // Input stream is an HTML page. 
    MLDETECTCP_HTML = 8,

    //Not currently supported. 
    MLDETECTCP_NUMBER = 16
}

[MethodImpl(MethodImplOptions.InternalCall, 
    MethodCodeType=MethodCodeType.Runtime)]
void DetectInputCodepage([In] MLDETECTCP flags, [In] uint dwPrefWinCodePage,
    [In] ref byte pSrcStr, [In, Out] ref int pcSrcSize, 
    [In, Out] ref DetectEncodingInfo lpEncoding, 
    [In, Out] ref int pnScores);
 

The usage of the function is almost identical to the DetectOutboundCodePage described earlier.

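int maxEncodings; // parameter specifying how many encodings to return

// array that receives the detection results
MultiLanguage.DetectEncodingInfo[] detectedEncdings =
    new MultiLanguage.DetectEncodingInfo[maxEncodings];

int srcLen = input.Length;             // length of the input
int scores = detectedEncdings.Length;  // in: array capacity, out: number of results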

// setup options (none)
MultiLanguage.MLDETECTCP options = MultiLanguage.MLDETECTCP.MLDETECTCP_NONE; 

// finally... call to DetectInputCodepage 
multilang2.DetectInputCodepage(options,0, ref input[0], ref srcLen,
    ref detectedEncdings[0], ref scores);

// get result
if (scores > 0)
{
    for (int i = 0; i < scores; i++)
    {
        // add the result
        result.Add(Encoding.GetEncoding((int)detectedEncdings[i].nCodePage));
    }
}

My first tests were not that promising. I always had a COMException with E_FAIL thrown when I tried to detect a codepage.

DetectInputCodepage can fail on texts that are too short or that do not have a BOM (Byte Order Mark / encoding preamble). There are two kinds of failures. If the input data is very short (less than 60 bytes), there is a good chance that the wrong codepage will be detected. Below 200 bytes, there is a good chance that DetectInputCodepage will return E_FAIL, because it could not decide which codepage to use. For the latter problem, I implemented a nasty workaround: I simply repeat the input data until it fills 256 bytes. This seems to return reasonable results even for short strings.

// expand the string to be at least 256 bytes
if (input.Length < 256)
{
    byte[] newInput = new byte[256];
    int steps = 256 / input.Length;
    for (int i = 0; i < steps; i++)
        Array.Copy(input, 0, newInput, input.Length * i, input.Length);

    int rest = 256 % input.Length;
    if (rest > 0)
        Array.Copy(input, 0, newInput, steps * input.Length, rest);
    input = newInput;
}
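
The same padding can also be written as a small helper with a guard for empty input, which would otherwise cause a division by zero in the snippet above. This is only a sketch; the class name, the method name and the minLength parameter are not part of EncodingTools.

```csharp
public static class PaddingSketch
{
    // Repeats the input cyclically until the buffer has minLength bytes.
    // Null, empty or already long enough input is returned unchanged.
    public static byte[] RepeatToLength(byte[] input, int minLength)
    {
        if (input == null || input.Length == 0 || input.Length >= minLength)
            return input;

        byte[] padded = new byte[minLength];
        for (int i = 0; i < padded.Length; i++)
        {
            // Wrap around to the start of the input when its end is reached.
            padded[i] = input[i % input.Length];
        }
        return padded;
    }
}
```

A call such as PaddingSketch.RepeatToLength(input, 256) would then replace the block above.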

Wrapping It All Up

I decided to create a static class to provide access to the DetectOutboundCodePage and DetectInputCodepage methods. It has some public methods that offer different levels of abstraction.

Here are the six high-level methods that should cover most of the usage scenarios:

  • GetMostEfficientEncoding
  • GetMostEfficientEncodingForStream
  • DetectInputCodepage
  • ReadTextFile
  • OpenTextFile
  • OpenTextStream

It also has three public static arrays of predefined codepage sets:

  • PreferedEncodings
  • PreferedEncodingsForStream
  • AllEncodings

These arrays contain the codepages in the order that returns the best results, not in their natural sort order.

Testing the DetectInputCodepage Performance

The screenshot below shows a comparison of the StreamReader encoding detection and the EncodingTools detection. The sample texts come from Unicode.org.

Detection Performance

All the samples were detected correctly.

Using the EncodingTools Class

The following code snippets show how to use the EncodingTools class.

Outgoing Encoding

Detect best encoding for a Stream

// save the given text using the optimal encoding
private void SaveToStream(string text, string path)
{
    // this is all... detect the encoding
    Encoding enc = EncodingTools.GetMostEfficientEncodingForStream(text);
    // then save
    using (StreamWriter sw = new StreamWriter(path, false, enc))
        sw.Write(text);
}

Detect best encoding for an email body

// save the given text using the optimal encoding
private void SaveToAsEmail(string text, string path)
{
    // this is all... detect the encoding
    Encoding enc = EncodingTools.GetMostEfficientEncoding(text);
    // then save
    using (StreamWriter sw = new StreamWriter(path, false, Encoding.ASCII))
    {
        sw.WriteLine("Subject: test");
        sw.WriteLine("Transfer-Encoding: 7bit");
        sw.WriteLine(
            "Content-Type: text/plain;\r\n\tcharset=\"{0}\"", 
            enc.BodyName);
        sw.WriteLine("Content-Transfer-Encoding: base64"); // should be QP
        sw.WriteLine();
        sw.Write(Convert.ToBase64String(enc.GetBytes(text),
            Base64FormattingOptions.InsertLineBreaks));
    }
}

Incoming Encoding

Open a Text File

private void OpenTextFileTest()
{
    // read the complete file into a string
    string content = EncodingTools.ReadTextFile(@"C:\test\txt");

    // create a StreamReader with the guessed best encoding
    using (StreamReader sr = EncodingTools.OpenTextFile(@"C:\test\txt"))
    {
        string fileContent = sr.ReadToEnd();
    }
}

Reading from a Stream

private void ReadStreamTest()
{
    // create a streamReader from a stream
    using (MemoryStream ms = new MemoryStream(
        Encoding.GetEncoding("windows-1252").GetBytes("Some umlauts: öäüß")))
    {
        using (StreamReader sr = EncodingTools.OpenTextStream(ms))
        {
            string fileContent = sr.ReadToEnd();
        }
    }
}

References

  • MLang documentation on MSDN

History

  • 17/01/2007: Initial release
  • 22/01/2007: Fixed code to compile without warnings
  • 27/10/2009: Updated source and demo project

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication

About the Author

Carsten Zeumer
Software Developer (Senior)
Germany
Carsten started programming Basic and Assembler back in the 80’s when he got his first C64. After switching to a x86 based system he started programming in Pascal and C. He started Windows programming with the arrival of Windows 3.0. After working for various internet companies developing a linguistic text analysis and classification software for 25hours communications he is now working as a contractor.
 
Carsten lives in Hamburg, Germany with his wife and five children.


Comments and Discussions

 
Bug: Sporadic System.AccessViolationException when using EncodingTools.DetectInputCodepages (Eric Popivker, 25-Apr-13 23:07)
Hi,
 
First, thank you very much for making this code/demo available. It saved me many many hours of work.
 
Recently I started noticing sporadic exceptions in my code:
 
public static Encoding DetectEncodingUsingMLang(string filePath)
        {
            int length = 10240;
 
            using (FileStream fileStream = File.Open(filePath, FileMode.Open, FileAccess.Read))
            {
                var buf = new byte[length];
                fileStream.Read(buf, 0, length);
 
                Encoding[] detected = EncodingTools.DetectInputCodepages(buf, 1);       //Exception occurs here
                if (detected.Length > 0)
                {
                    return detected[0];
                }
 
                return Encoding.Default;
            }
        }
 

Full Exception text is:
System.AccessViolationException was unhandled
Message=Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Source=EncodingTools
StackTrace:
at MultiLanguage.CMultiLanguageClass.DetectInputCodepage(MLDETECTCP flags, UInt32 dwPrefWinCodePage, Byte& pSrcStr, Int32& pcSrcSize, DetectEncodingInfo& lpEncoding, Int32& pnScores)
at href.Utils.EncodingTools.DetectInputCodepages(Byte[] input, Int32 maxEncodings) in D:\Code\SandBox\DetectEncoding_src\EncodingTools\EncodingTools.cs:line 436
at FindAndReplace.Utils.DetectEncodingUsingMLang(Stream fileStream) in D:\Code\FindAndReplace\Stable\src\FindAndReplace\Utils.cs:line 238
at FindAndReplace.Utils.DetectFileEncoding(String filePath) in D:\Code\FindAndReplace\Stable\src\FindAndReplace\Utils.cs:line 207
at FindAndReplace.Finder.Find() in D:\Code\FindAndReplace\Stable\src\FindAndReplace\Finder.cs:line 98
at FindAndReplace.App.MainForm.DoFindWork() in D:\Code\FindAndReplace\Stable\src\FindAndReplace.App\MainForm.cs:line 245
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
InnerException:
 

The error sometimes occurs on first call to this function. Sometimes on fifth, and sometimes it just works 20 times in a row.
 
Do you know what could be causing this error?
Question: C# alternative (Nick Snels, 26-Feb-13 9:01)
Great article. I tried using your code in my project, but eventually I switched to an online language detection web service that was easy to integrate with my C# project.
Question: A big THANK YOU (Member 7920440, 9-Feb-13 9:35)
Very useful indeed, you rock!
Question: Why can't I determine the encoding for this simple file ? (pschmidt99, 17-Dec-12 18:53)
Carsten, thank you for posting this code. I am a complete novice regarding encoding. I was thrown here against my will by circumstances beyond my control.
 
I think I ran into an encoding problem. When I read this one-line text file (Visual Studio 2010/VB), it shows up in VB as having embedded spaces between each character. Nothing I have tried seems to be able to detect the encoding, including your program here (assuming that is the problem). I am running Windows 7 - 64 bit.
 
The one line should look like this:
"xxx-xx-xxxx|Conversion|Starting PTD|2009-07|2009-07".
 
But in my streamreader.readline, I get this:
"x x x - x x - x x x x | C o n v e r s i o n | S t a r t i n g P T D | 2 0 0 9 - 0 7 | 2 0 0 9 - 0 7".
 
I'm not even sure how I created this stink'en one line of text, but I figure if I could do it, maybe someone else will too. I might have copied this from the first line of a file using Texpad, but I am not sure.
 
In case it help, here is a "Octal Dump" of the contents:
____________________________________________________________________________
0000000000 x \0 x \0 x \0 - \0 x \0 x \0 - \0 x \0
0000000020 x \0 x \0 x \0 | \0 C \0 o \0 n \0 v \0
0000000040 e \0 r \0 s \0 i \0 o \0 n \0 | \0 S \0
0000000060 t \0 a \0 r \0 t \0 i \0 n \0 g \0 \0
0000000100 P \0 T \0 D \0 | \0 2 \0 0 \0 0 \0 9 \0
0000000120 - \0 0 \0 7 \0 | \0 2 \0 0 \0 0 \0 9 \0
0000000140 - \0 0 \0 7 \0 \r \0 \n \0
0000000152
 
I installed your program and it didn't solve the problem (but it really looked promising).
 
Any thoughts ? What have I done wrong ?
 
Thanks,
 
Peter Schmidt
peter@prstech.com
Answer: Re: Why can't I determine the encoding for this simple file ? (Carsten Zeumer, 17-Dec-12 19:20)
Hi Peter,
 
have you tried to open the file as UTF-16? It looks like this is a UTF-16 encoded file without BOM at the beginning of the file. Without the BOM UTF-16 will most certainly not be detected.
/cadi
 
24 hours is not enough

Question: Detects Japanese for Cyrilic text (SolarCell, 7-Dec-12 6:31)
It does not work for Cyrillic text: it detects Japanese (codepage 50220). Whatever the length of the Cyrillic sentence, it is always detected as Japanese. Tested on Windows 7/8 x64.
Answer: Re: Detects Japanese for Cyrilic text (Carsten Zeumer, 7-Dec-12 7:01)
Hi SolarCell,
 
as far as I remember there was an issue that the iso-2022-jp codepage contains the Cyrillic symbols.
You might try to combine the Codepage detection with Language detection (Detect a written text's language[^]) to disambiguate?
 
Best regards
Carsten
/cadi
 
24 hours is not enough

General: Re: Detects Japanese for Cyrilic text [modified] (SolarCell, 7-Dec-12 7:09)
Thanks for idea, it looks perfect! I will try and be back with result.
 
Just applied this approach successfully with Cyrillic. By the way, as I do not have much experience with language symbols, I have an additional question:
are there any widely used languages whose symbols are also contained in another (larger) code table? If so, I could create an internal lookup table of such large code pages, and whenever detection returns one of them, I could combine the result with language detection.
The reason for my question is that we must determine the code page from the supplied text, and this text can be in a wide variety of languages.
 
Thanks in advance.

modified 7-Dec-12 15:37pm.

Question: x86 or x64 version? [modified] (Sameers.ME, 23-Oct-12 21:28)
Hi,
 
I was testing this thing and it seems to be working fine in a test project. When I embedded that into my real project, it didn't detect things correctly. I wondered why, and then realized that my real project is built for the x86 platform. I changed the test project to the x86 platform and BANG! It stopped working in the test project too. When I changed that back to x64 or Any CPU, it worked again.
 
I wonder what it makes difference to change the target platform in project settings? I am using it in VB.NET, .NET framework 3.5, Visual Studio 2008
 
Any help will be appreciated.
 
thanks,
Sameers
 
BTW, I tried to rebuild the EncodingTools using the x86 platform and also Any CPU, but whenever the calling application's platform is x86, it doesn't read the file with the correct encoding.


modified 24-Oct-12 3:46am.

Answer: Re: x86 or x64 version? (Carsten Zeumer, 23-Oct-12 21:47)
Hi Sameers,
 
it actually can make a difference as soon as you bind to native libraries.
Do you get any error message? Something like "Invalid Binary Format" (I forgot the exact error message).
/carsten
 
24 hours is not enough

General: Re: x86 or x64 version? (Sameers.ME, 23-Oct-12 21:52)
No, there is no error message, just the characters are not read properly.
It does read file, just that the file content are not read properly as they should be. Like
"Casque tour de cou - ?tanche - Rouge" is read instead of
"Casque tour de cou - étanche - Rouge"
 
Notice the ? sign
 
thanks for the quick reply.
Sameers

General: My vote of 5 (Manivannan Ponnusamy, 11-Oct-12 3:08)
very useful
Question: DLL? (twinbee, 8-Oct-12 6:47)
Any chance of making a single compact DLL out of this?
Answer: Re: DLL? (Carsten Zeumer, 23-Oct-12 21:50)
Hi twinbee,
 
I don't get your point. The EncodingTools.dll in the Demo-Project is a single DLL.
Or am I missing something?
/carsten
 
24 hours is not enough

Question: IMultiLanguage2::DetectInputCodepage() is broke. (KevinSW, 19-Aug-12 4:24)
Note: IMultiLanguage2::DetectInputCodepage() is broke, at least on XP SP3 32bit, not tested anywhere else.
 
It will not detect UTF-16 input regardless of the use of the MLDETECTCP_DBCS flag or any other.
Will always return "US-ASCII"
Verified from my own testing, plus various places on the web.
Answer: Re: IMultiLanguage2::DetectInputCodepage() is broke. (Carsten Zeumer, 19-Aug-12 5:01)
Hi Kevin,
 
to be honest, I have never tested the code on Windows XP. According to the MSDN documentation, XP is supposed to have MLANG implemented. XP is supposed to support Unicode from SP2 on.
Maybe you have to enable Unicode support (or install at least one Unicode font) in the first place?
 
I found this page http://www.higopi.com/ucedit/unicodeenable.html[^]
 
Not sure if it helps, but probably worth a try.
 
Best Regards
Carsten
/cadi
 
24 hours is not enough

Question: Traditional Chinese characters aren't being read from network stream (Member 8648508, 21-Mar-12 19:57)
Hi Guys,
 
I am facing this unique situation in my Application. I am having a Windows Service installed and running 24/7 on my server. This server basically listens to a server port offsite. The data received will be a mix of English characters and Traditional Chinese. I write the data received from the Offsite Server to a db table and file.
 
The issue is that the traditional Chinese characters aren’t being read as they have to be read. English characters are read perfectly.
 
I am using the following code
 
TcpClient clientSocket
NetworkStream networkStream = clientSocket.GetStream();
 
byte[] bytes = new byte[clientSocket.ReceiveBufferSize + 1];
networkStream.Read(bytes, 0, clientSocket.ReceiveBufferSize);
 
string clientdata = Encoding.GetEncoding(1252).GetString(bytes);
 
This clientdata contains the received string sent from the Offsite Server. This data is not as I expected for Chinese characters.
 
I have tried using both Big5 and 1252 encoding.
 
Any help would be good.
 
Thanks
Balaji V
Question: Error referencing in a VS 2010 project (assembly does not have a strong name) (Barbara Post, 17-Feb-12 2:13)
Hello,
 
I downloaded the sources and recompiled after solution conversion for VS 2010. Then I couldn't reference EncodingTools.dll in my project, because I have the following error: Assembly generation failed -- Referenced assembly 'EncodingTools' does not have a strong name. I changed the target framework to 4 and have the same error. I don't fully understand how to follow the MS instructions below, due to the way you embedded "Multilang".
 
http://support.microsoft.com/kb/313666/en-us?fr=1[^]
 
Thank you for some help.
 
Barbara
 
PS : I work with a 64-bit Windows 7 machine.
Answer: Re: Error referencing in a VS 2010 project (assembly does not have a strong name) (Carsten Zeumer, 17-Feb-12 3:37)
Hi Barbara,
 
any assembly referenced from a signed assembly must be signed too.
I have not added any signature to the source project, so you have to do it yourself.
 
All you have to do is follow the instructions from Microsoft:
 
http://msdn.microsoft.com/en-us/library/ms247123.aspx
/carste
 
24 hours is not enough

Question: Source Files not able to download (Member 8648508, 15-Feb-12 17:15)
Source Files not being downloaded. Please update link. Thanks.
Answer: Re: Source Files not able to download (Carsten Zeumer, 15-Feb-12 21:44)
Hi Member,
 
I was not able to reproduce any problems downloading the attached files.
Please validate your setup.
/cadi
 
24 hours is not enough

Question: Cannot detect unicode 16 without a BOM (BrianK, 9-Jan-12 15:26)
For files that are Unicode 16 without a BOM, this reports the encoding as ASCII. Does this not look at the data and guess that it's Unicode vs. ASCII?
Thanks, Brian
Question: Thanks (leomicheloni, 25-Nov-11 3:58)
Thanks a lot!! great job!
Question: Thanks alot for you (cresol, 27-Sep-11 2:39)
I owe you for this article.
Thank you
General: My vote of 5 (cresol, 27-Sep-11 2:37)
wonderful article
General: My vote of 5 (Filip D'haene, 19-May-11 5:47)
Thanks for sharing! ;)
General: Great thanks (mohamedenew, 9-May-11 5:59)
Hi, you actually made me happy by solving my problem with a German file.
 
thanks,thanks.
General: My vote of 5 (Julien Berube, 24-Mar-11 9:48)
That made my day, today! Good work!
Question: can you provide an interface with DetectOutboundCodePageInIStream (for streams) (Member 6764146, 18-Oct-10 23:31)
can you provide an interface with DetectOutboundCodePageInIStream (for streams)
General: Put the source code to codeplex. (San Chen, 27-May-10 20:33)
I've put the source code to the mlangnet.codeplex.com and leave your name as the copyright owner. This will make the code more convenient to view & download, hope you wouldn't mind.
General: Detection on Win7 fails [modified] (MarkQuestion, 18-May-10 4:08)
Hi,
 
the detection of the encoding fails on Win7. Often only utf-8 or utf-16 is returned, but not the correct encoding (e.g. shift-jis).
The same project on Win-XP works fine. I'm using the "OpenTextFile" function.
 
Any idea?
 
thanks!

modified on Tuesday, May 18, 2010 10:53 AM

Question: No detection of UTF-8 with Byte Order Markings? (jimiscott, 11-May-10 22:21)
We have an issue where a file with an UTF-8 BOM, is detected with a Windows-1251 encoding vs. UTF-8.
 
If we use the DetectInputCodepages function, we get both encodings, but I would have expected the UTF-8 to take precedence over Windows-1251 in this instance.
 
Otherwise, this looks great!
General: Encoding (chezduck, 27-Apr-10 23:05)
Thanks Carsten Zeumer, this program really helped me out with finding the right encoding page :)
General: StackOverflowException (Uwe Keim, MVP, 11-Mar-10 19:13)
Currently I am investigating a StackOverflowException that occurs in file "EncodingTools.cs", line 436 in the function
 
public static Encoding[] DetectInputCodepages(byte[] input, int maxEncodings)
 
when calling
 
// finally... call to DetectInputCodepage
multilang2.DetectInputCodepage(options,0,
    ref input[0], ref srcLen, ref detectedEncdings[0], ref scores);
 
I have a German Windows 7 Ultimate, 64-bit.
 
Can someone help me on resolving this? I currently seem to be unable to step into the DetectInputCodepage method.
 
Thanks
Uwe

General: Re: StackOverflowException (Carsten Zeumer, 11-Mar-10 22:21)
hi uwe,
 
this is a little weird. The multilang2.DetectInputCodepage function is provided by Microsoft (it is actually part of the Internet Explorer distribution, which uses this function to detect the encoding of a web page if none is specified).
 
Some questions (hints) to get a little closer to the bug:
- Does this error occur for any input given?
- Have you tried to compile for x86 (32bit)
- Can you enable native code debugging, disable "Just my Code" and download the debug symbols? The stack should point to some native method when the error occurs.
- can you check the same input on a different system (not 64bit, not windows 7 or not IE 8?)
 
hope that helps?
/carsten
 
24 hours is not enough

GeneralRe: StackOverflowExceptionmvpUwe Keim11-Mar-10 22:36 
Thanks, Carsten :)
 
I'll try to investigate and post my updates here.

GeneralRe: StackOverflowExceptionmvpUwe Keim11-Mar-10 22:43 
Just tried. Compiling to x86 works; no more stack overflow exception. Arrgh!
 
This is my code snippet to test your function:
 
private static byte[] readFileBinary(string filePath)
{
	using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
	using (var r = new BinaryReader(fs))
	{
		return r.ReadBytes((int)fs.Length);
	}
}
 
private void button2_Click(object sender, EventArgs e)
{
	var f = @"myfilename.ext";
	var content = readFileBinary(f);
	EncodingTools.DetectInputCodepages(content, 20);
}
 
The file I work with ("myfilename.ext") is the resource file contained in this ZIP [^].
 
I will try the next steps, just as you suggested.

GeneralRe: StackOverflowExceptionmvpUwe Keim11-Mar-10 22:51 
Just tried native debugging.
 
Actually, native debugging is no help here, since the debugger raises the exception too late, when the thread has already ended and the call stack is empty.
 
Here is a picture of my IDE [^].
 
The exception is:
 
System.StackOverflowException was unhandled.
Message: An unhandled exception of type "System.StackOverflowException" occurred in Unknown Module.

 
When I got StackOverflowExceptions in the past (in managed code), I always ended up with the debugger pausing at the erroneous location and the call stack being correctly filled.
 
It seems that native exceptions are a bit more difficult (or impossible) to catch?

GeneralRe: StackOverflowExceptionmemberCarsten Zeumer12-Mar-10 1:05 
It's a 64-bit issue (not the same as the one in "Re: Oops! Intermittent System.AccessViolationException..."). I can reproduce it (and fix it) on my system.
 
If you change the target from "Any CPU" to "x86" it should work (a very nasty workaround, but as long as you do not need any 64-bit specific features it should be feasible).
 
I probably should try to generate the wrapper classes on my 64-bit system and check whether it works properly then... but as always: so much to do - so little time ;(
/cadi
 

GeneralRe: StackOverflowExceptionmvpUwe Keim12-Mar-10 1:09 
Thank you!
 
If you say you can fix it, can you tell me roughly the steps, so I can try to fix it by myself?
 
Thanks
Uwe

GeneralRe: StackOverflowExceptionmemberCarsten Zeumer12-Mar-10 1:21 
Simply set the target from "Any CPU" (what is that called in German?) to "x86" (in the toolbar next to the "Debug/Release" target, or via the menu "Build" -> "Configuration Manager").
You must set both the library and the main executable assembly to x86 (other assemblies that do not use EncodingTools can stay whatever they are).
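
To notice early that an assembly was still built as "Any CPU" and is running as a 64-bit process, a guard can be called before the MLang wrapper is used. This is only a sketch under that assumption; the names ProcessBitness, Is64BitProcess and EnsureX86 are hypothetical and not part of EncodingTools:

```csharp
using System;

// Hypothetical guard, not part of the article's code.
public static class ProcessBitness
{
    // Pointers are 8 bytes wide in a 64-bit process and 4 bytes in a 32-bit one.
    public static bool Is64BitProcess()
    {
        return IntPtr.Size == 8;
    }

    // Throws a descriptive exception instead of letting the native call
    // fail later in a way that is hard to debug.
    public static void EnsureX86(string caller)
    {
        if (Is64BitProcess())
        {
            throw new PlatformNotSupportedException(
                caller + " must run in a 32-bit process; set the platform target to x86.");
        }
    }
}
```

Calling ProcessBitness.EnsureX86("DetectInputCodepages") at the start of the detection code turns the hard-to-trace crash into a managed exception with a clear message.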
 
hope this helps
/cadi
 

QuestionLicense?memberMagnus_Beije6-Oct-09 21:38 
Hi,
 
It seems like you're not referring to any license for your code.
Is this correct?
 
Br
/Magnus
AnswerRe: License?memberCarsten Zeumer7-Oct-09 22:52 
Hi Magnus,
 
back when I joined CP there were no licence options. Everything was supposed to be "free".
 
I applied the most open licence I was able to find.
 
In essence: use my code in any way you like, but do so at your own risk! If you improve it, it would be nice if you shared it.
 
/cadi
 

QuestionCan't detect Japanese encodingmemberbadtoto19-Aug-09 16:45 
I prepared 3 files encoded as shift-jis, euc-jp and utf-8.
 
Detection only works for utf-8.
 
It reports
shift-jis as a Western European encoding
and
euc-jp as a Turkish encoding.
 
Any suggestions?
AnswerRe: Can't detect Japanese encodingmemberCarsten Zeumer7-Oct-09 23:02 
Hi badtoto,
 
First of all: what are you trying to determine? The incoming encoding or the best outgoing one?
 
Suggestions for Incoming:
- use longer texts
- limit the number of detectable encodings
- if you cannot limit the detectable encodings, change the order so that the most probable ones for your scenario come first
 
Suggestions for Outgoing:
- utf-8 is always correct
- use GetMostEfficientEncoding(string input, int[] preferedEncodings) and omit "utf-8" from the list of preferred encodings
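
The idea behind a preferred list for outgoing text can be illustrated without MLang. The sketch below is not how GetMostEfficientEncoding is implemented; it uses only System.Text, the names OutgoingEncodingSketch and PickFirstLossless are hypothetical, and it simply returns the first preferred codepage that can encode the text without loss, falling back to UTF-8:

```csharp
using System;
using System.Text;

// Hypothetical illustration, not the EncodingTools implementation.
public static class OutgoingEncodingSketch
{
    public static Encoding PickFirstLossless(string input, int[] preferredCodePages)
    {
        foreach (int codePage in preferredCodePages)
        {
            // A strict instance throws instead of silently substituting '?'.
            Encoding strict = Encoding.GetEncoding(
                codePage,
                EncoderFallback.ExceptionFallback,
                DecoderFallback.ExceptionFallback);
            try
            {
                strict.GetBytes(input);
                return Encoding.GetEncoding(codePage);
            }
            catch (EncoderFallbackException)
            {
                // At least one character is not representable; try the next codepage.
            }
        }
        // UTF-8 can represent any well-formed Unicode text.
        return Encoding.UTF8;
    }
}
```

The order of preferredCodePages decides the result, which is why putting the most probable codepages first matters.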
 
/cadi
 

GeneralRe: Can't detect Japanese encodingmemberbdjc14-Oct-10 18:23 
IMultiLanguage2::DetectInputCodepage is kind of broken; I don't think it was designed to solve the problem badtoto (and I) faced.
 
The problem (if I understand it correctly) is this:
1) Someone used a machine running Japanese Windows (native code page is shift-jis, i.e. 932) to create a plain text file.
2) This file is uploaded to a website.
3) Another person running English Windows (where the native code page is Western: 1252) downloads this text file but cannot read the contents, so they want to determine which code page it was originally created in. In this case, the user knows the file uses just one encoding; they just don't know which. The input here is simply the byte stream of the file.
 
IMultiLanguage2::DetectInputCodepage simply doesn't cut it in this situation. I've tried a few text files with Chinese encoding, and the result always says it's encoded in shift-jis; completely wrong.
 
In my opinion, the suggestions given above (use longer texts, limit the number of detectable encodings etc) are plain wrong.
 
Maybe IMultiLanguage2::DetectInputCodepage was meant to solve a different problem: a web page that contains several languages (say English, Japanese and French), where it has to determine which language is predominant.
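
For the single-encoding scenario described above, candidates can at least be ruled out by strict decoding. This sketch is not what MLang does and cannot rank the remaining candidates; the names CandidateFilter and PlausibleCodePages are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration: keeps only the codepages whose decoder
// accepts every byte sequence in the input.
public static class CandidateFilter
{
    public static List<int> PlausibleCodePages(byte[] data, int[] candidateCodePages)
    {
        List<int> plausible = new List<int>();
        foreach (int codePage in candidateCodePages)
        {
            try
            {
                Encoding strict = Encoding.GetEncoding(
                    codePage,
                    EncoderFallback.ExceptionFallback,
                    DecoderFallback.ExceptionFallback);
                strict.GetString(data); // throws on byte sequences invalid in this codepage
                plausible.Add(codePage);
            }
            catch (DecoderFallbackException)
            {
                // The data cannot have been written in this codepage.
            }
            catch (ArgumentException)
            {
                // Unknown codepage number; skip it.
            }
            catch (NotSupportedException)
            {
                // Codepage not available on this platform; skip it.
            }
        }
        return plausible;
    }
}
```

Single-byte codepages such as 28591 accept any byte value, so they always remain plausible; this kind of filter mainly helps to exclude multi-byte encodings such as UTF-8.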
Questioncan you provide the source code in C?membergsan_bot23-May-09 3:41 
can you provide the source code in C? Need it urgently. Thanks
AnswerRe: can you provide the source code in C?memberCarsten Zeumer24-May-09 0:28 
hi gsan_bot,
 
Sorry, but I currently have no intention of porting the code to C. I have not done much C lately and do not plan to in the near future...
 
But if you know how to handle COM in C, it should be easy to implement the detection in plain C. The real work is done by Microsoft's COM classes anyway.
 
/cadi
 

AnswerRe: can you provide the source code in C?memberfrugalmail24-Oct-09 19:32 
Ridiculous....
 
Hire somebody to do it, or at least extend him an offer.
 
Thanks for writing the article, and doing it in whatever language you wanted to.
GeneralOops! Intermittent System.AccessViolationException in x64 environment.memberDarrell19833-Apr-09 7:09 
First of all, great stuff. This has been most useful in the project I have been working on.
 
During testing, we noticed that while detecting the codepage of a byte[] using DetectInputCodepage of IMultiLanguage2, we received a System.AccessViolationException.
 
This is consistent on two 64bit machines we have tried running it on. One was Vista (x64) and the other was Windows 2008 Server (x64).
 
Any ideas? I'm suspecting COM :P


Last Updated 27 Oct 2009
Article Copyright 2007 by Carsten Zeumer
Everything else Copyright © CodeProject, 1999-2013