Licence CPOL
First Posted 13 Apr 2007

CodePage File Converter

By | 13 Apr 2007 | Article
A utility that converts text files from one code page to another

Introduction

There are some people on planet Earth who still don't understand the difference between a byte and a character. So, let's start with the basics.

A byte is a unit of information storage. 1 byte = 8 bits. That is it.

A character is any written symbol: an English letter, a Chinese character, a special character, or a scientific or mathematical symbol. For the computer to store characters, they must be encoded, and there are many encodings to choose from. In an 8-bit (single-byte) encoding, every character is stored in one byte. Other encodings use 7 bits per character (like US-ASCII) or 16-bit code units (like UTF-16, which Windows simply calls "Unicode").
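To make the byte/character distinction concrete, here is a minimal Python sketch. Python is used only for illustration, since the tool in this article is a .NET program. The helper name byte_lengths and the sample characters are assumptions chosen for the example, not part of the original software.

```python
# Illustrative sketch: the same character can occupy a different number
# of bytes depending on the encoding used to store it.

SAMPLES = {
    "Latin capital A": "A",
    "Latin e with acute": "\u00e9",
    "Arabic letter ain": "\u0639",
    "CJK ideograph": "\u4e2d",
}


def byte_lengths(ch, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Return {encoding name: number of bytes needed to store ch}."""
    return {enc: len(ch.encode(enc)) for enc in encodings}


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(label, byte_lengths(ch))
```

In a single-byte code page such as Windows-1256, the Arabic letter above takes exactly one byte, while in UTF-8 it takes two. One character is therefore not the same thing as one byte.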

A Code Page identifies the encoding mechanism that is used to encode characters into a bit-stream. Here are some of the most common Code Pages:

Code Page   Name           Description
1200        utf-16         Unicode
1250        windows-1250   Central European (Windows)
1251        windows-1251   Cyrillic (Windows)
1252        windows-1252   Western European (Windows)
1253        windows-1253   Greek (Windows)
1254        windows-1254   Turkish (Windows)
1255        windows-1255   Hebrew (Windows)
1256        windows-1256   Arabic (Windows)
12000       utf-32         Unicode (UTF-32)
20127       us-ascii       US-ASCII
20936       x-cp20936      Chinese Simplified (GB2312-80)
20949       x-cp20949      Korean Wansung
28591       iso-8859-1     Western European (ISO)
65001       utf-8          Unicode (UTF-8)
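Outside .NET, the same encodings usually go by slightly different names. The sketch below maps a few of the Windows code page numbers listed above to the codec names used by Python's standard library. The dictionary CODE_PAGE_TO_CODEC and the function codec_for are hypothetical names made up for this illustration, and the mapping covers only the entries that have an exact Python equivalent.

```python
import codecs

# Assumed mapping from Windows code page numbers to Python codec names.
# Only entries with an exact standard-library equivalent are included.
CODE_PAGE_TO_CODEC = {
    1200: "utf-16-le",
    1250: "cp1250",
    1251: "cp1251",
    1252: "cp1252",
    1253: "cp1253",
    1254: "cp1254",
    1255: "cp1255",
    1256: "cp1256",
    20127: "ascii",
    28591: "iso-8859-1",
    65001: "utf-8",
}


def codec_for(code_page):
    """Look up the Python codec for a code page number.

    Raises KeyError if the code page is not in the table above.
    """
    return codecs.lookup(CODE_PAGE_TO_CODEC[code_page])
```

For example, codec_for(1256).name gives the codec name that can be passed to str.encode or bytes.decode when handling Arabic (Windows) text.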

Background

Files are streams of bytes. If the file is a text file, this stream of bytes represents characters encoded in one Code Page, as described above. However, the file does not normally record which Code Page was used to do the encoding (although we can write algorithms that try to find a best guess). So, if the file is written in a Code Page other than the one expected by the system that interprets it, a conversion is needed to re-encode the file in the expected Code Page.
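As an illustration of such a best-guess algorithm, here is a small Python sketch. It goes slightly beyond the article by assuming that some Unicode files begin with a byte order mark (BOM), checks for one, then tests whether the bytes are valid UTF-8, and otherwise falls back to a caller-supplied single-byte code page. The function name guess_encoding and the default fallback are assumptions made for this example. This is not how the tool in this article works, and a real detector would need far more heuristics.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data, fallback="cp1252"):
    """Return a best guess for the encoding of the bytes in data."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        # Text in a legacy single-byte code page that contains non-ASCII
        # characters is rarely valid UTF-8, so a clean decode is a strong hint.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Nothing in the bytes tells us which single-byte code page was
        # used, so the caller has to supply the most likely one.
        return fallback
```

Note that the fallback is the weak point: the bytes alone cannot distinguish Windows-1256 from Windows-1252, which is exactly why a converter has to ask the user for the input Code Page.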

For example, as in the screenshot below, if the Regional and Language Options in Windows XP set the language for non-Unicode programs to Arabic (Egypt), then the encoding used for TXT files will be Windows-1256.

If the text file is later opened on a system that has a different setting in Regional and Language Options (like English (United States)), the file will be interpreted incorrectly.
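The effect is easy to reproduce. In the Python sketch below (an illustration only; the sample word and the variable names are assumptions), an Arabic word is encoded with Windows-1256 and then decoded with Windows-1252, the code page a system set to English (United States) would use:

```python
# The Arabic word "salam", written with escapes so that the example does
# not depend on the encoding of this source file.
text = "\u0633\u0644\u0627\u0645"

# What a system set to Arabic (Egypt) writes to disk: one byte per letter.
raw = text.encode("cp1256")

# What a system set to English (United States) shows for the same bytes.
wrong = raw.decode("cp1252")

# What the reader sees when the correct Code Page is used.
right = raw.decode("cp1256")

if __name__ == "__main__":
    print(len(raw), "bytes on disk")
    print("decoded correctly:", right == text)
    print("decoded with the wrong code page:", wrong == text)
```

The bytes on disk never change. Only their interpretation does, so the original text can be recovered as long as the correct Code Page is known.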

Screenshot - screen_shot_1.jpg

Using the Code

The software requires .NET 2.0 to run. First, provide the path of the input file and its Code Page. Then provide the path of the output file and the target Code Page.

Screenshot - screen_shot_2.jpg
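The conversion itself amounts to decoding the input bytes with the input Code Page and encoding the resulting characters with the target Code Page. The following Python sketch shows the idea. It is not the source code of the .NET tool described here, and the function name convert_file and its parameters are assumptions made for this illustration.

```python
def convert_file(src_path, src_encoding, dst_path, dst_encoding,
                 errors="strict", chunk_size=64 * 1024):
    """Re-encode a text file from src_encoding to dst_encoding.

    With errors="strict", a character that cannot be represented in the
    target encoding raises UnicodeEncodeError; pass errors="replace" to
    substitute a placeholder character instead.
    """
    # newline="" disables newline translation, so the file's original
    # line endings are written back unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, errors=errors,
              newline="") as dst:
        # Read in chunks so that large files are not loaded into memory
        # at once; read() returns "" at end of file.
        for chunk in iter(lambda: src.read(chunk_size), ""):
            dst.write(chunk)
```

A call such as convert_file("in.txt", "cp1256", "out.txt", "utf-8") would re-encode an Arabic (Windows) text file as UTF-8; the file names here are placeholders.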

History

  • 13th April, 2007: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

emad_awad



Egypt

Last Updated 13 Apr 2007
Article Copyright 2007 by emad_awad