File.Copy and character encoding

0

I noticed some strange behaviour of File.Copy() in .NET 3.5 SP1. I don't know if it's a bug or a feature, but I know it's driving me crazy. We use File.Copy() in a custom build step, and it seems to screw up the character encoding.

When I copy an ANSI-encoded text file over a UTF-8 encoded text file, the destination file is still UTF-8 encoded: it has the content of the new file plus the 3-byte UTF-8 prefix (the byte order mark). That's fine for ASCII characters, but incorrect for the remaining characters (128-255) of the ANSI code page.

Here's the code to reproduce it. I first copy a UTF-8 file to the destination, then I copy an ANSI file to the same destination. Notice the second line of console output: Content of copy.txt : this is ANSI encoded: / Encoding: utf-8

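// Needs: using System; using System.IO; using System.Text;

// Encoding.GetEncoding(0) returns the system default (ANSI) code page; no BOM is written.
File.WriteAllText("ANSI.txt", "this is ANSI encoded: é", Encoding.GetEncoding(0));
// Encoding.UTF8 writes the 3-byte UTF-8 byte order mark before the text.
File.WriteAllText("UTF8.txt", "this is UTF8 encoded: é", Encoding.UTF8);

// First copy: UTF-8 file to the destination (overwrite = true).
File.Copy("UTF8.txt", "copy.txt", true);

// The second constructor argument enables encoding detection from byte order marks.
using (StreamReader reader = new StreamReader("copy.txt", true))
{
    Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
                reader.CurrentEncoding.BodyName);
}

// Second copy: ANSI file over the same destination.
File.Copy("ANSI.txt", "copy.txt", true);

using (StreamReader reader = new StreamReader("copy.txt", true))
{
    Console.WriteLine("Content of copy.txt : " + reader.ReadToEnd() + " / Encoding: " +
                reader.CurrentEncoding.BodyName);
}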

Any ideas why this happens? Is there a mistake in my code? Any ideas how to fix this? (My current idea is to delete the destination file first if it exists.)

EDIT: corrected the ANSI/ASCII confusion

2009-06-16 08:45
by chris166


1

Where are you writing ASCII.txt? You're writing ANSI.txt in the first line, but that's certainly not ASCII as ASCII doesn't contain any accented characters. The ANSI file won't contain any preamble indicating that it's ANSI rather than ASCII or UTF-8.

Basically, you seem to have changed your mind between ASCII and ANSI halfway through writing the example.

I'd expect any ASCII file to be "detected" as UTF-8, but the encoding detection relies on the file having a byte order mark for it to be anything other than UTF-8. It's not like it reads the whole file and then guesses at what the encoding is.

From the docs for StreamReader:

This constructor initializes the encoding to UTF8Encoding, the BaseStream property using the stream parameter, and the internal buffer to the default size.

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.
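To see what "starts with the appropriate byte order marks" means in practice, here is a minimal sketch that compares the first bytes of a file with Encoding.UTF8.GetPreamble(). The class name BomCheck, the helper StartsWithUtf8Bom and the file names are made up for illustration; this is only a rough picture of the idea, not how StreamReader is implemented.

```csharp
using System;
using System.IO;
using System.Text;

static class BomCheck
{
    // Hypothetical helper: true if the file begins with the UTF-8 preamble (EF BB BF).
    public static bool StartsWithUtf8Bom(string path)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        byte[] all = File.ReadAllBytes(path);
        if (all.Length < preamble.Length) return false;

        for (int i = 0; i < preamble.Length; i++)
        {
            if (all[i] != preamble[i]) return false;
        }
        return true;
    }

    static void Main()
    {
        // "\u00E9" is é, escaped so the source file's own encoding does not matter.
        // Encoding.UTF8 writes a BOM; new UTF8Encoding(false) does not.
        File.WriteAllText("with-bom.txt", "\u00E9", Encoding.UTF8);
        File.WriteAllText("no-bom.txt", "\u00E9", new UTF8Encoding(false));

        Console.WriteLine(StartsWithUtf8Bom("with-bom.txt")); // True
        Console.WriteLine(StartsWithUtf8Bom("no-bom.txt"));   // False
    }
}
```

A file without a BOM gives the reader nothing to detect, so it simply stays on whatever encoding it was constructed with.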

Now File.Copy is just copying the raw bytes from place to place - it shouldn't change anything in terms of character encodings, because it doesn't try to interpret the file as a text file in the first place.
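If you want to check that claim without involving StreamReader at all, something along these lines should do. It is a sketch under a couple of assumptions: the names CopyCheck and SameBytes are invented here, and ISO-8859-1 stands in for "ANSI" so the example behaves the same on every machine.

```csharp
using System;
using System.IO;
using System.Text;

static class CopyCheck
{
    // Hypothetical helper: compares two files byte by byte.
    public static bool SameBytes(string pathA, string pathB)
    {
        byte[] a = File.ReadAllBytes(pathA);
        byte[] b = File.ReadAllBytes(pathB);
        if (a.Length != b.Length) return false;

        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    static void Main()
    {
        // Assumption: ISO-8859-1 is used in place of the machine's ANSI code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        File.WriteAllText("ANSI.txt", "this is ANSI encoded: \u00E9", latin1);
        File.WriteAllText("UTF8.txt", "this is UTF8 encoded: \u00E9", Encoding.UTF8);

        File.Copy("UTF8.txt", "copy.txt", true);
        Console.WriteLine(SameBytes("UTF8.txt", "copy.txt")); // True

        // Overwriting replaces the whole file, including the old BOM.
        File.Copy("ANSI.txt", "copy.txt", true);
        Console.WriteLine(SameBytes("ANSI.txt", "copy.txt")); // True
    }
}
```

If the second comparison ever came back False, that would point at the copy step rather than at encoding detection.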

It's not quite clear to me where you see a problem (partly due to the ANSI/ASCII part). I suggest you separate out the issues of "does File.Copy change things?" and "what character encoding is detected by StreamReader?" in both your mind and your question. The answers should be:

  • File.Copy shouldn't change the contents of the file at all
  • StreamReader can only detect UTF-8 and UTF-16; if you need to read a file encoded with any other encoding, you should state that explicitly in the constructor. (I would personally recommend using Encoding.Default instead of Encoding.GetEncoding(0) by the way. I think it's clearer.)
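As a rough illustration of the second point, this sketch reads the same BOM-less file twice: once with UTF-8 and once with the encoding it was actually written in. The names ExplicitEncoding and ReadAll are invented for the example, and ISO-8859-1 is an assumption standing in for the ANSI code page.

```csharp
using System;
using System.IO;
using System.Text;

static class ExplicitEncoding
{
    // Hypothetical helper: reads the whole file with the given encoding.
    public static string ReadAll(string path, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Assumption: ISO-8859-1 is used in place of the machine's ANSI code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        File.WriteAllText("ANSI.txt", "\u00E9", latin1); // a single byte, 0xE9, no BOM

        string asUtf8 = ReadAll("ANSI.txt", Encoding.UTF8);
        string asLatin1 = ReadAll("ANSI.txt", latin1);

        // 0xE9 on its own is not valid UTF-8, so it decodes to the replacement character.
        Console.WriteLine(asUtf8.IndexOf('\uFFFD') >= 0); // True
        Console.WriteLine(asLatin1 == "\u00E9");          // True
    }
}
```

The bytes on disk are the same in both reads; only the interpretation changes.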
2009-06-16 08:54
by Jon Skeet
The problem is not StreamReader. I only used it to create a short piece of code that can reproduce the problem. (and I screwed up since I confused ASCII and ANSI while playing around with it). I noticed it first in a hex editor, and to my understanding the resulting file is incorrect, since it has the UTF-8 byte order mark (3 bytes at the beginning) and a wrong character code for the accented characte - chris166 2009-06-16 09:20
Something is weird. I'm not able to reproduce it anymore. So something was outdated (my hex editor, the code in VS or whatever). Anyway, thanks for looking into the problem and spending so much time on it - chris166 2009-06-16 09:26
My pleasure - although really this didn't take much more time than it took to just type the answer. Other questions have occasionally soaked up much more effort : - Jon Skeet 2009-06-16 09:29


0

I doubt this has anything to do with File.Copy. I think what you're seeing is that StreamReader uses UTF-8 by default to decode, and since UTF-8 is backwards compatible with ASCII and the ANSI file has no byte order mark, StreamReader never has any reason to stop using UTF-8 to read the ANSI-encoded file.

If you open ANSI.txt and copy.txt in a hex editor, are they identical?
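If no hex editor is at hand, a few lines of code can print the leading bytes instead. This is only a sketch: HexPeek and HeadAsHex are made-up names, and ISO-8859-1 is an assumption standing in for the ANSI code page.

```csharp
using System;
using System.IO;
using System.Text;

static class HexPeek
{
    // Hypothetical helper: the first `count` bytes of a file as hex, e.g. "EF-BB-BF".
    public static string HeadAsHex(string path, int count)
    {
        byte[] all = File.ReadAllBytes(path);
        int n = Math.Min(count, all.Length);
        return BitConverter.ToString(all, 0, n);
    }

    static void Main()
    {
        // Assumption: ISO-8859-1 is used in place of the machine's ANSI code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        File.WriteAllText("ANSI.txt", "\u00E9", latin1);
        File.WriteAllText("UTF8.txt", "\u00E9", Encoding.UTF8);

        Console.WriteLine(HeadAsHex("UTF8.txt", 8)); // EF-BB-BF-C3-A9
        Console.WriteLine(HeadAsHex("ANSI.txt", 8)); // E9
    }
}
```

Running the same helper on copy.txt after each File.Copy call shows directly whether a BOM is really present.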

2009-06-16 08:55
by Josh
No, the encoding detection of StreamReader works fine. The copy.txt has the UTF-8 byte order mark at the beginning and the wrong character for the umlaut cha - chris166 2009-06-16 09:16