ByteCount fails to count surrogate characters properly. #2088
Comments
Are you setting |
Encoding set doesn't matter here. After all - even if no encoding is specified, |
Let me know what you find out. |
@JoshClose I've created a draft on the idea I had. Let me know what you think. |
@kikaragyozov could you share a small example showing what you are trying to do here? I do not think you can rely on
byte[] b = { 0x17, 0xDE };
string s = Encoding.Unicode.GetString(b);
Console.WriteLine(Convert.ToHexString(b)); // 17DE
Console.WriteLine(s); // �
Console.WriteLine(Convert.ToHexString(Encoding.Unicode.GetBytes(s))); // FDFF
You have observed this because the surrogate characters are (on their own) invalid and
Regardless, it does sound to me like your approach is a bit backwards. I would think you start with a
byte[] bytes = Encoding.UTF8.GetBytes("Hello\ud83d\ude17emoji");
CsvConfiguration config = new(CultureInfo.InvariantCulture)
{
Delimiter = "\ud83d\ude17",
CountBytes = true
};
using MemoryStream stream = new(bytes);
using StreamReader reader = new(stream);
using CsvParser parser = new(reader, config);
Console.WriteLine(parser.Read()); // True
Console.WriteLine(bytes.Length); // 14
Console.WriteLine(stream.Position); // 14
Console.WriteLine(parser.ByteCount); // 16 (wrong)
// But the parsing still works nicely
Console.WriteLine(parser.Record.Length); // 2
Console.WriteLine(parser.Record[0]); // Hello
Console.WriteLine(parser.Record[1]); // emoji |
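The 16 reported above is consistent with counting the bytes one char at a time. The following is a minimal sketch of that arithmetic, assuming the count is accumulated per char (an assumption here, not checked against the CsvHelper source): each lone surrogate is replaced by U+FFFD, which takes 3 bytes in UTF-8, so the pair contributes 6 bytes instead of 4.

```csharp
using System;
using System.Linq;
using System.Text;

string text = "Hello\ud83d\ude17emoji";

// Whole-string count: the surrogate pair is encoded as one 4-byte sequence.
int whole = Encoding.UTF8.GetByteCount(text);

// Per-char count: each lone surrogate is replaced by U+FFFD (3 bytes in UTF-8).
int perChar = text.Sum(c => Encoding.UTF8.GetByteCount(new[] { c }));

Console.WriteLine(whole);   // 14
Console.WriteLine(perChar); // 16
```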
Hello @Rob-Hague! The idea here is to make But to answer some of your other questions - a perfectly sensible scenario to use That means, that once you've started reading the data from the That is the reason why a |
Ok I think I understand your scenario, but (IMO) it's probably difficult to achieve 100% correctness given that the byte-level/encoding functionality is done at the In any case I think the suggestion to use an |
I tried using an
Try it out here. We expect to see 4 as the result, but using the encoder by utilizing its state yields something different. You can verify it's the expected number by just calling |
Yeah it's a bit tricky because
For our purposes we would probably need to use the return value of
var encoding = Encoding.UTF8;
var encoder = encoding.GetEncoder();
byte[] byteBuffer = new byte[encoding.GetMaxByteCount(1)];
var firstPairBytes = encoder.GetBytes(new char[] { '\ud83d' }, byteBuffer, flush: false);
var secondPairBytes = encoder.GetBytes(new char[] { '\ude17' }, byteBuffer, flush: true);
var surrogatePairBytes = encoding.GetByteCount(new char[] { '\ud83d', '\ude17' });
Console.WriteLine(firstPairBytes + secondPairBytes); // 4
Console.WriteLine(surrogatePairBytes); // 4
Console.WriteLine(surrogatePairBytes == firstPairBytes + secondPairBytes); // True |
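The same idea extends to input that arrives in chunks. Below is a rough sketch, not the code from any pull request: a single Encoder instance is reused, and the return value of GetBytes is summed with flush set to false, so a high surrogate at the end of one chunk is held back until its low surrogate arrives. The names Count and total are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text;

var encoding = Encoding.UTF8;
var encoder = encoding.GetEncoder();
long total = 0;

// Hypothetical helper: adds the number of bytes the encoder emits for this chunk.
void Count(char[] chunk)
{
    // Scratch buffer sized for the worst case; its contents are discarded.
    byte[] scratch = new byte[encoding.GetMaxByteCount(chunk.Length)];
    total += encoder.GetBytes(chunk, 0, chunk.Length, scratch, 0, flush: false);
}

Count(new[] { 'a', '\ud83d' }); // chunk ends on a high surrogate
Count(new[] { '\ude17', 'b' }); // chunk starts with the matching low surrogate

Console.WriteLine(total); // 6
```

At the end of the input, a final call with flush set to true would be needed to account for a dangling surrogate that never received its partner.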
Nice! I'll modify my pull request with your suggestion to simplify the code. If you'd like to contribute it instead - let me know so I make the appropriate changes for you to be able to edit it. 😃 |
Hi @kikaragyozov I gave it a go in #2106, could you please check it solves your problems? |
Hi @Rob-Hague. Your pull request is exactly what I need! I'm working on a project which reads csv files which have a footer with a different set of columns to the rest of the file. Do you think this could be merged? |
@Avrohom613 glad to hear it 🙂
I'm afraid I am not a maintainer so it is not up to me |
Thanks @Rob-Hague , I understand. |
@JoshClose @Rob-Hague What's the progress of this issue now? When do you think your PR for this fix could be merged?
Describe the bug
The following symbol - 😗, is composed of two characters: \ud83d and \ude17. Calling Encoding.UTF8.GetByteCount(new char[] { <one of the two characters> }) on each character separately, instead of on an array containing both, yields an incorrect byte count.
To Reproduce
Create a single-line CSV file or in-memory StringStream that has an emoticon inside it. If you rely on ByteCount to correctly count the bytes and move the stream's position forward by that many bytes, there's a chance you won't be able to process the next line, for example if an opening quote was skipped due to the incorrect ByteCount property.
Expected behavior
ByteCount to correctly count such symbols.
Additional context
I'm not really in the deep end of things with these emoticons, so I find it really weird that the incorrect byte count happens. I'm aware that if we instantiate a UTF8Encoding with throwOnInvalidBytes set to true, passing either of the two characters by itself will throw an exception, but passing both together works.
EDIT: It appears those are called surrogate characters. https://www.ibm.com/docs/en/i/7.3?topic=renamed-surrogate-characters
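The strict-encoding behaviour mentioned above can be checked with a short sketch. It assumes only the standard UTF8Encoding constructor; the variable names strict and threw are illustrative.

```csharp
using System;
using System.Text;

// throwOnInvalidBytes: true makes the encoding throw instead of substituting U+FFFD.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

// A complete surrogate pair is valid and encodes to 4 bytes.
int pairBytes = strict.GetByteCount(new[] { '\ud83d', '\ude17' });
Console.WriteLine(pairBytes); // 4

// A lone high surrogate is invalid on its own and is rejected.
bool threw = false;
try
{
    strict.GetByteCount(new[] { '\ud83d' });
}
catch (EncoderFallbackException)
{
    threw = true;
}
Console.WriteLine(threw); // True
```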