Hello,

I've been struggling for a while trying to figure out something that I think should be simple. I'm not very familiar with C# (yet) though, so that could be part of the problem.

Basically, I just need to convert this C code to work in C#:

C
char chr;
FILE *in, *out;

// fopen blah blah (just omitting this part, not important)

while((chr = getc(in)) != EOF)
{
  if(chr != '\r' && chr != '\n')
    chr = ~chr;

  putc(chr, out);
}

// fclose blah blah 


I've tried something like this, but it doesn't work properly -- some of the characters aren't changed or are outputted incorrectly.

C#
StreamReader inStream = new StreamReader(inputFile, Encoding.GetEncoding(1252));
StreamWriter outStream = new StreamWriter(outputFile, true, Encoding.GetEncoding(1252));


while (!inStream.EndOfStream)
{
    char[] chr = new char[1];

    inStream.Read(chr, 0, 1);

    if (chr[0] != '\r' && chr[0] != '\n')
        chr[0] = (char)(byte)(~(int)(byte)chr[0]);

    outStream.Write(chr[0]);
}

outStream.Close();
inStream.Close();


I also think all the typecasts are a bit silly, but I'm not sure how I'm 'supposed' to do it.
I have a feeling it might have something to do with the encoding of the file (it's "western european", hence the encoding I had to use on streamreader) -- but is there a way I can do this without even worrying about the encoding? And to mimic the C code exactly?

Thanks for any help.
Comments
Sergey Alexandrovich Kryukov 25-Jun-12 19:10pm    
What characters, exactly? Are you sure the input file is really 1252? Strictly speaking, anything not written in Unicode is potentially incorrect. The Unicode encoding that gives results identical to ASCII when all characters fit in the ASCII range is UTF-8.
--SA

It is not a matter of whether your result is correct or not; rather, the whole idea of this "translation" between languages makes no sense. You could make it meaningful if you explained the ultimate purpose of your character calculations (which look strange). Without some application context, the question does not make any sense.

Here is why: you are performing seemingly similar operations on very different objects.

Your C characters are 8-bit objects. Moreover, you use signed characters, but as long as you do only bitwise operations, that does not matter. You also use the complement operator '~'. The idea of a complement makes no sense without specifying "complement to what". You could typecast an object to a wider type, complement it against that type's all-bits-set value, and get a different result. With the C char type, the complement means a bitwise complement to the value 0xFF. For example, if your character is a blank space (char source = ' ';), the complement has the value -33, which corresponds to 0xDF in unsigned char form.

In .NET, a character is a Unicode character. In memory, it is represented using the UTF-16LE encoding, which uses one 16-bit word to express a character in the Basic Multilingual Plane (BMP) and a pair of such words to express a character outside the BMP. When you calculate the complement of that very same blank space, you get the "character" 0xFFDF, which is not standardized as a character:
http://www.unicode.org/charts/PDF/UFF00.pdf.

Please see:
http://www.unicode.org.
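
To make the difference between the two complements concrete, here is a minimal sketch. The class and method names (ComplementDemo, Complement8, Complement16, AsSigned) are made up for illustration and are not part of any library; the sketch only assumes the usual C# integer promotion rules.

```csharp
using System;

static class ComplementDemo
{
    // 8-bit complement: what ~ amounts to once a C char is stored back in 8 bits.
    public static byte Complement8(byte value) => unchecked((byte)~value);

    // 16-bit complement: what (char)~c does to a .NET char.
    public static char Complement16(char value) => unchecked((char)~value);

    // The same 8-bit pattern viewed as a signed byte, like a signed C char.
    public static sbyte AsSigned(byte value) => unchecked((sbyte)value);

    public static void Print()
    {
        byte flipped = Complement8((byte)' ');
        // Prints "DF -33 FFDF" for the blank space discussed above.
        Console.WriteLine("{0:X2} {1} {2:X4}", flipped, AsSigned(flipped), (int)Complement16(' '));
    }
}
```

The unchecked keyword is there so that the narrowing casts keep only the low bits even if the project is compiled with overflow checking turned on.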

Now, if you wrap all intermediate results to a byte, you get the identical result: 0xDF. So, up to this point everything is "correct" (if this is really what you want to get), and the problem could be somewhere else. What if your input file is not actually all "Western European"? Or perhaps you interpret it incorrectly. So, to go further, let's see exactly which characters are "wrong". You could easily run this code under the debugger to watch the calculation on some specific characters. Please see my comment to the question and answer my question there.

As to your idea to "do this without even worrying about the encoding", it strongly resembles the thinking of Monsieur Jourdain, a character in Molière's play Le Bourgeois gentilhomme. He was proud of the fact that he could express himself in prose, once his teacher had explained it to him. :-)
Please see:
http://en.wikipedia.org/wiki/Prose,
http://en.wikipedia.org/wiki/Le_Bourgeois_gentilhomme.

[EDIT]

Anyway, I decided to try it out. First of all, let me rewrite the code in a more literate way (which does not mean it should work correctly):
C#
using System.IO;
using System.Text;

class Program {

    const string fileName = "input.txt";
    const string outfileName = "output.txt";

    static void Main(string[] args) {

        using (StreamReader reader = new StreamReader(fileName, Encoding.GetEncoding(1252))) {
            using (StreamWriter writer = new StreamWriter(outfileName, false, Encoding.GetEncoding(1252))) {
                while(true) {
                    int value = reader.Read();
                    if (value < 0) break;
                    char character = (char)value;
                    if (character != '\r' && character != '\n')
                        character = (char)(byte)(~(int)(byte)character);
                    writer.Write(character);
                } //loop
            }
        } //using

    } //Main

} //class Program

Your text sample is "converted" like this:
??????? ??ßÂßÝ????????Ý
??????? ??ßÂßÝÎÏÏÝß

where each question mark really is a question mark (code point 0x003F). The reason is that working through characters and encodings is the wrong approach here in principle. In this case, your complement function produces an image of the source character that falls outside the range of valid codes for the encoding, so it is replaced by a question mark.
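
The paragraph above covers the encoding direction. A related effect may explain the decoding attempt shown in the comments, where only some characters come back correctly. This is one reading of the sample output, not something established in the thread: Windows-1252 maps bytes 0xA0 to 0xFF to the code points with the same values, but maps most bytes in 0x80 to 0x9F to code points above 0xFF, and the (byte) cast in the question then keeps only the low 8 bits. The sketch below uses hard-coded code points from the Windows-1252 table instead of calling a decoder; the names TruncationDemo and FlipAsInQuestion are hypothetical.

```csharp
using System;

static class TruncationDemo
{
    // The cast chain from the question, applied to an already decoded char.
    public static char FlipAsInQuestion(char decoded) =>
        unchecked((char)(byte)(~(int)(byte)decoded));

    public static void Print()
    {
        // Code points that a Windows-1252 decoder produces for the given bytes.
        char fromByte8B = '\u2039'; // byte 0x8B, the complement of 't' (0x74)
        char fromByteDF = '\u00DF'; // byte 0xDF, the complement of ' ' (0x20)

        // 0x2039 is truncated to 0x39, so the result is 0xC6 ('Æ') instead of 't'.
        Console.WriteLine("{0:X4}", (int)FlipAsInQuestion(fromByte8B));
        // 0x00DF fits in a byte, so the result is 0x20 (a space) as intended.
        Console.WriteLine("{0:X4}", (int)FlipAsInQuestion(fromByteDF));
    }
}
```

In the other direction, FlipAsInQuestion('t') yields U+008B, which is a control code and not a Windows-1252 character, so it would be consistent with the question marks in the output above.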

Here is the background: C characters are not really characters; they are bytes (signed ones in your case) and are processed in a bitwise manner, ignoring their cultural meaning. .NET, on the other hand, follows the Unicode standard.

Let me tell you that your "1252", like the whole idea of a "code page", does not exist anymore, in a way. Code pages survive only as Microsoft legacy. Look at the result of System.Text.Encoding.GetEncoding: that is the real encoding object. Also, all non-Unicode encodings are only good for legacy data (such as ASCII, which is a subset of Unicode). If you use any encoding other than one of the Unicode UTFs on arbitrary text, a correct result is not guaranteed.

Now, to reproduce the effect of your C code, you need to work with binary bytes, as suggested in Solution 4. This is the only way.
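
For a version that stays close to the getc/putc loop of the C code, a byte-by-byte sketch over streams could look like this. ByteFlipper and Flip are hypothetical names; Stream.ReadByte and Stream.WriteByte are the standard System.IO methods.

```csharp
using System.IO;

static class ByteFlipper
{
    // Copies input to output, complementing every byte except CR and LF.
    public static void Flip(Stream input, Stream output)
    {
        int value;
        // ReadByte returns -1 at end of stream, much like getc returning EOF.
        while ((value = input.ReadByte()) != -1)
        {
            byte b = (byte)value;
            if (b != (byte)'\r' && b != (byte)'\n')
                b = (byte)(b ^ 0xFF); // flip all eight bits
            output.WriteByte(b);
        }
    }
}
```

It could be called with File.OpenRead("input.txt") and File.Create("output.txt"), each wrapped in a using statement; the file names are placeholders. No Encoding object appears anywhere, which is the point.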

Then again, this is a kind of "obfuscation" which makes no sense whatsoever. If you need encryption, use encryption (but again, why?).

--SA
 
Comments
Espen Harlinn 25-Jun-12 19:32pm    
Eloquently put :-D
Sergey Alexandrovich Kryukov 25-Jun-12 19:34pm    
Thank you very much, Espen.
--SA
Espen Harlinn 25-Jun-12 19:38pm    
Time to crawl to bed - it's around 01:30 here ...
gboost 25-Jun-12 20:51pm    
Well, this is a much more interesting reply than I imagined I'd get.
(Sorry if I don't reply to your satisfaction!)

I'd have to agree with you about it not making (much) sense.

I thought the problem would be something someone would spot just with those snippets as an 'oops, you did something wrong (language-wise)', and didn't think more details were needed at the time. But I suppose it wouldn't have hurt to include them anyway -- and I probably should have.
Also hadn't considered the difference between the object types, which may be what the issue is altogether. (?)

To be a little more specific:
The strange character calculations really weren't my idea to start with. Another application is doing this to a configuration/settings file, I suppose to make it unreadable (e.g. in a text editor) to avoid issues with a user going in and changing things without being certain of what they're doing.
I'm actually just making a tool to tweak the settings (in a safe way), but need to be able to "decode"/"encode" the way said-app does -- which I was able to mimic in C, but obviously not C# (which I am trying to learn / adapt to)
Otherwise I probably wouldn't even want to do such a thing! :)

I posted an example of the text/characters to another answer.
I'm honestly not 100% sure if the encoding is 1252, but the results came out even worse unless I specified 1252 (which is what a text editor told me the encoding was, IF it is.)
Don't you love how certain .. I mean, confused, I am?

Anyway (again), an example is:

"normal"/decoded text (characters):
testing_one = "something"
testing_two = "100"

"encoded" text (by other app, or my C code):
‹šŒ‹–‘˜ ‘šßÂßÝŒ’š‹—–‘˜Ýß
‹šŒ‹–‘˜ ‹ˆßÂßÝÎÏÏÝ

attempt to "decode" the "encoded" text with C# code above
Æ?­Æìç#_oç? = "­oæ?Æëìç#"
Æ?­Æìç#_Æ9o = "100"
(only some characters are "decoded" correctly -- it should show the 'normal'/decoded text as above)

I'm not sure if that helps any or not.

I am going to try to analyze what you wrote better and check out the links though, I am interested in learning more and understanding what's actually going on here. I appreciate it.

Hmm, not sure how to take that last statement about prose. In a way it could be insulting, I suppose. But maybe not! Because? (Kidding...)
Sergey Alexandrovich Kryukov 25-Jun-12 21:20pm    
Not helping much. You need to analyze what comes out in binary for a single character, or two. Is there a BOM in the C# result (there should not be)?
Why do you read an array of chars in a loop? You need to read a single char. It is not clear why '"', '0', '1' and '=' were not mangled.
I still don't know the purpose, and that makes the question not quite correct.
--SA
Why not just read the file as bytes and write it back out the same?


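C#
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        // Read the whole file as raw bytes, so no text encoding is involved.
        byte[] buf = File.ReadAllBytes(@"C:\Test\InFile.txt");

        for (int i = 0; i < buf.Length; i++)
        {
            // Leave CR and LF untouched, as the C code does.
            if ((buf[i] != '\r') && (buf[i] != '\n'))
            {
                // ~ yields an int; unchecked keeps only the low 8 bits.
                buf[i] = unchecked((byte)~buf[i]);
            }
        }

        File.WriteAllBytes(@"C:\Test\OutFile.txt", buf);
    }
}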
 
 
Comments
Sergey Alexandrovich Kryukov 26-Jun-12 12:49pm    
Right idea, my 5. (Even though the purpose is not worth it.)
--SA
gboost 26-Jun-12 18:37pm    
Does exactly what I needed, thank you very much :)
Sergey Alexandrovich Kryukov 26-Jun-12 19:02pm    
Accepting it formally was the right thing. As I said, it should do what you expected.
--SA
I don't see why
C#
chr[0] = (char)~chr[0];

shouldn't work.

However, note that in C# characters are 16 bits wide, whereas to my knowledge they're usually 8 bits in C. That may or may not make a difference in your output.
 
 
Comments
Sergey Alexandrovich Kryukov 25-Jun-12 19:30pm    
It would work, but as a complement to a 16-bit word, which is not the same as in C, where it is a complement to an 8-bit word. Please see my answer.
--SA
You probably will not like it, but here it is:

C#
var charValue = 'c';
var intValue = Convert.ToByte(charValue);
var invertedValue = intValue ^ 255;
var newCharValue = Convert.ToChar(invertedValue);


or just:
var intValue2 = (char) ((byte)charValue ^ 255);
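
The two forms above behave the same for chars up to U+00FF but differ beyond that, which matters once a decoder has produced code points such as U+2039. A small sketch of the difference follows; CastDemo, ConvertThrows and Truncate are names invented for this illustration, while Convert.ToByte(char) is the standard method and is documented to throw OverflowException for values above Byte.MaxValue.

```csharp
using System;

static class CastDemo
{
    // Convert.ToByte range-checks: it throws for any char above U+00FF.
    public static bool ConvertThrows(char c)
    {
        try { Convert.ToByte(c); return false; }
        catch (OverflowException) { return true; }
    }

    // An unchecked cast keeps only the low 8 bits instead of throwing.
    public static byte Truncate(char c) => unchecked((byte)c);
}
```

Neither behavior recovers the original file byte, so both forms only give the C result when the input chars are already in the 0 to 255 range.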
 
Comments
gboost 25-Jun-12 19:28pm    
This seems to work fine, except it's still not converting it as I expected. I wonder if it does have something to do with what lewax said (8 bit vs 16 bit)

For example, this "encoded" text:
‹šŒ‹–‘˜ ‘šßÂßÝŒ’š‹—–‘˜Ýß
‹šŒ‹–‘˜ ‹ˆßÂßÝÎÏÏÝ

Comes out as this (in C# -- looks like some of it converted ok):
Æ?­Æìç#_oç? = "­oæ?Æëìç#"
Æ?­Æìç#_Æ9o = "100"

But it's supposed to be (and comes out as this with C):
testing_one = "something"
testing_two = "100"
Clifford Nelson 25-Jun-12 20:23pm    
I would look at exactly what you are getting back from the C program. That way you can see the exact bit array out vs. the bit array in. That way you can determine exactly what it is doing.
Sergey Alexandrovich Kryukov 25-Jun-12 21:24pm    
Sure, but you can see clearly: "100" and "=" come out unmangled, which is weird.
Anyway, the whole purpose of it is ridiculous -- please see the discussion in the comments to my answer.
--SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


