C# automatically detect text file encoding (code page)

Question

Hello everyone. I am new to C#.
I would like to use the code from the link below in my app.
This class, found on GitHub, automatically detects the code page of a text file.
But I don't know how to call this class from my application or how to use it.
For example, if my file is located at C:\TestFolder\Test.txt, how can I detect the code page of Test.txt?
I would appreciate your help if that is possible; if not, I apologize. Thanks in advance.

Link of GitHub class:
https://gist.github.com/TaoK/945127

What I have tried:

Something like

C#

private void Button1_Click(object sender, EventArgs e)
{
    string path = @"C:\TestFolder\Test.txt";
    // The method's return value is not stored or used here.
    TextFileEncodingDetector.DetectTextFileEncoding(path);
}

Posted 23-Oct-22 2:17am

Member 12673286

Updated 23-Oct-22 5:52am

PIEBALDconsult

v2

Comments

PIEBALDconsult 23-Oct-22 10:00am

You may be confusing two different things.
As far as I know, you can't detect the code page, and you shouldn't need to.
You may be able to detect the Unicode encoding (UTF-8, UTF-16, etc.), but you still shouldn't need to.

What is it you are actually trying to do? And why? Very likely you are simply causing yourself trouble for no reason.

Dave Kreskowiak 23-Oct-22 12:52pm

Why would you do this? There is no way to reliably detect the encoding of a text file. If you think this is a sure-fire way of detecting the encoding, you would be wrong. This point is even covered in the code comments:

This class does NOT try to detect arbitrary codepages/charsets, it really only
aims to differentiate between some of the most common variants of Unicode
encoding, and a "default" (western / ascii-based) encoding alternative provided
by the caller.

As there is no "Reliable" way to distinguish between UTF-8 (without BOM) and
Windows-1252 (in .Net, also incorrectly called "ASCII") encodings, we use a
heuristic - so the more of the file we can sample the better the guess. If you
are going to read the whole file into memory at some point, then best to pass
in the whole byte array directly. Otherwise, decide how to trade off
reliability against performance / memory usage.

The UTF-8 detection heuristic only works for western text, as it relies on
the presence of UTF-8 encoded accented and other characters found in the upper
ranges of the Latin-1 and (particularly) Windows-1252 codepages.

If you don't know what "heuristic" means, it's basically an algorithm that makes a "best guess".
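To make the "best guess" idea concrete, here is a minimal sketch of such a heuristic. It is not the gist's algorithm: the class name `SimpleEncodingGuesser` and its `Guess` method are hypothetical, and the approach (check for a byte order mark, then test whether the bytes are strictly valid UTF-8, otherwise return a caller-supplied default) is a simplified stand-in that only uses standard `System.Text` types.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the linked gist.
public static class SimpleEncodingGuesser
{
    // Returns the encoding indicated by a byte order mark (BOM) if present,
    // otherwise UTF-8 when the bytes are strictly valid UTF-8,
    // otherwise the fallback supplied by the caller.
    public static Encoding Guess(byte[] data, Encoding fallback)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // 1. A BOM is the only unambiguous signal.
        //    UTF-32 LE is tested before UTF-16 LE because both start with FF FE.
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return new UTF32Encoding(false, true);
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true);
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return new UTF8Encoding(true);
        if (StartsWith(data, 0xFF, 0xFE)) return Encoding.Unicode;          // UTF-16 LE
        if (StartsWith(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode; // UTF-16 BE

        // 2. No BOM: a strict decoder throws on any invalid UTF-8 sequence.
        //    Plain ASCII also passes this check, since ASCII is valid UTF-8.
        try
        {
            new UTF8Encoding(false, true).GetString(data);
            return new UTF8Encoding(false);
        }
        catch (DecoderFallbackException)
        {
            // 3. Not valid UTF-8: this is only a guess, so defer to the caller.
            return fallback;
        }
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

For the file in the question, the bytes could come from `File.ReadAllBytes(@"C:\TestFolder\Test.txt")`, with something like `Encoding.GetEncoding("iso-8859-1")` as the fallback. Note that the fallback is still a guess: nothing in the bytes distinguishes one single-byte code page from another, which is the point made in the comments above.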

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Answer 1 · 2022-10-23T03:15:00

Solution 1

Go back to where you got the code from, and ask there: we aren't a tech support service for random packages!

Posted 23-Oct-22 3:15am

OriginalGriff

Richard MacCutchan · Answer 2 · 2022-10-23T05:52:00

Solution 2

It returns an Encoding value, the same type that the Encoding.GetEncoding method (System.Text) returns; see the Microsoft Learn documentation for that method.
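Building on that answer, the missing step in the question's code is to store the returned value and do something with it. The sketch below is a small self-contained helper (the name `EncodingInfoDemo.Describe` is made up for illustration) that formats any `Encoding`. The commented lines show how it might be wired into the button handler, assuming the gist's `TextFileEncodingDetector` class has been added to the project and that `DetectTextFileEncoding(string)` returns a `System.Text.Encoding`. I believe it can return null when nothing is detected, but treat that as an assumption and check the gist.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the linked gist.
public static class EncodingInfoDemo
{
    // Formats the name and code page of an Encoding.
    // A null argument is treated as "nothing was detected".
    public static string Describe(Encoding encoding)
    {
        if (encoding == null) return "Encoding could not be determined";
        return encoding.WebName + " (code page " + encoding.CodePage + ")";
    }
}

// Possible use in the handler from the question (assumes the gist's class
// is in the project and returns System.Text.Encoding, possibly null):
//
//   string path = @"C:\TestFolder\Test.txt";
//   Encoding detected = TextFileEncodingDetector.DetectTextFileEncoding(path);
//   MessageBox.Show(EncodingInfoDemo.Describe(detected));
//   string text = File.ReadAllText(path, detected ?? Encoding.Default);
```

Once you have the `Encoding`, passing it to `File.ReadAllText` or a `StreamReader` is usually the reason for detecting it in the first place.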

Posted 23-Oct-22 5:52am

Richard MacCutchan