Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ByteCount fails to count surrogate characters properly. #2088

Open
kikaragyozov opened this issue Nov 21, 2022 · 15 comments · May be fixed by #2106 or #2090
Open

ByteCount fails to count surrogate characters properly. #2088

kikaragyozov opened this issue Nov 21, 2022 · 15 comments · May be fixed by #2106 or #2090
Labels

Comments

@kikaragyozov
Copy link

kikaragyozov commented Nov 21, 2022

Describe the bug
The following symbol - 😗, is composed of the following two characters: \ud83d and \ude17. Calling Encoding.UTF8.GetByteCount(new char[] { <one of the two characters> } instead of an array with both symbols in it, will yield in an incorrect byte counting.

To Reproduce
Create a single line CSV file or in-memory StringStream that has an emoticon inside it. If you rely on ByteCount to correctly count the bytes, and move the stream's position forward by that many bytes. There's a chance you won't be able to process the next line, if for example an opening quote was skipped due the incorrect ByteCount property.

Expected behavior
ByteCount to correctly count such symbols.

Additional context
I'm not really in the deep end of things with these emoticons, so I find it really weird why the incorrect byte count happens. I'm aware that if we instantiate an UTF8Encoding with throwOnInvalidBytes set to true, passing either of the two characters by themselves will throw an exception, but passed together it will work.

EDIT: It appears those are called surrogate characters. https://www.ibm.com/docs/en/i/7.3?topic=renamed-surrogate-characters

@kikaragyozov kikaragyozov changed the title ByteCount fails to count emoticons properly. ByteCount fails to count emoticons/symbols properly. Nov 21, 2022
@kikaragyozov kikaragyozov changed the title ByteCount fails to count emoticons/symbols properly. ByteCount fails to count surrogate characters properly. Nov 21, 2022
@JoshClose
Copy link
Owner

Are you setting Encoding? That is required for CountBytes to work. I should probably add that to the configuration validation.

@kikaragyozov
Copy link
Author

kikaragyozov commented Nov 21, 2022

Are you setting Encoding? That is required for CountBytes to work. I should probably add that to the configuration validation.

Encoding set doesn't matter here. After all - even if no encoding is specified, CsvConfiguration uses Encoding.UTF8 by default. I'll try to make a PR fixing this, but I'll have to first learn how the code is structured so that may take some time. I'll let you know how things are going! I should have lots of free time in the weekends to tackle this.

@JoshClose
Copy link
Owner

Let me know what you find out.

@kikaragyozov kikaragyozov linked a pull request Nov 23, 2022 that will close this issue
@kikaragyozov
Copy link
Author

@JoshClose I've created a draft on the idea I had. Let me know what you think.

@Rob-Hague
Copy link
Contributor

Rob-Hague commented Dec 13, 2022

If you rely on ByteCount to correctly count the bytes, and move the stream's position forward by that many bytes.

@kikaragyozov could you share a small example showing what you are trying to do here? I do not think you can rely on ByteCount to correctly count the bytes, i.e a given encoding is not necessarily a bijection bytes -> chars (in particular for invalid bytes), so if you go bytes -> chars -> bytes you cannot assume the output matches the input:

byte[] b = { 0x17, 0xDE };
string s = Encoding.Unicode.GetString(b);

Console.WriteLine(Convert.ToHexString(b)); // 17DE
Console.WriteLine(s); // �
Console.WriteLine(Convert.ToHexString(Encoding.Unicode.GetBytes(s))); // FDFF

You have observed this because the surrogate characters are (on their own) invalid and ByteCount is accumulated by encoding each individual (decoded) character, which in this case is replaced by the fallback character \uFFFD. A slightly better thing to do here would be to use an Encoder via encoding.GetEncoder which remembers state across calls rather than the Encoding itself which is stateless. But it does not solve the underlying fact that input may not equal output.

Regardless, it does sound to me like your approach is a bit backwards. I would think you start with a Stream of bytes, wrap it with a StreamReader which correctly handles the decoding bytes -> chars and advancing of the stream, and which is passed to the CsvParser. So things should just work. To then encode the decoded characters to provide feedback to the stream position does not make much sense to me.

byte[] bytes = Encoding.UTF8.GetBytes("Hello\ud83d\ude17emoji");
CsvConfiguration config = new(CultureInfo.InvariantCulture)
{
	Delimiter = "\ud83d\ude17",
	CountBytes = true
};
using MemoryStream stream = new(bytes);
using StreamReader reader = new(stream);
using CsvParser parser = new(reader, config);

Console.WriteLine(parser.Read()); // True

Console.WriteLine(bytes.Length); // 14
Console.WriteLine(stream.Position); // 14
Console.WriteLine(parser.ByteCount); // 16 (wrong)

// But the parsing still works nicely
Console.WriteLine(parser.Record.Length); // 2
Console.WriteLine(parser.Record[0]); // Hello
Console.WriteLine(parser.Record[1]); // emoji

@kikaragyozov
Copy link
Author

kikaragyozov commented Dec 13, 2022

If you rely on ByteCount to correctly count the bytes, and move the stream's position forward by that many bytes.

@kikaragyozov could you share a small example showing what you are trying to do here? I do not think you can rely on ByteCount to correctly count the bytes, i.e a given encoding is not necessarily a bijection bytes -> chars (in particular for invalid bytes), so if you go bytes -> chars -> bytes you cannot assume the output matches the input:

byte[] b = { 0x17, 0xDE };
string s = Encoding.Unicode.GetString(b);

Console.WriteLine(Convert.ToHexString(b)); // 17DE
Console.WriteLine(s); // �
Console.WriteLine(Convert.ToHexString(Encoding.Unicode.GetBytes(s))); // FDFF

You have observed this because the surrogate characters are (on their own) invalid and ByteCount is accumulated by encoding each individual (decoded) character, which in this case have been decoded to the fallback character \uFFFD. A slightly better thing to do here would be to use an Encoder via encoding.GetEncoder which remembers state across calls rather than the Encoding itself which is stateless. But it does not solve the underlying fact that input may not equal output.

Regardless, it does sound to me like your approach is a bit backwards. I would think you start with a Stream of bytes, wrap it with a StreamReader which correctly handles the decoding bytes -> chars and advancing of the stream, and which is passed to the CsvParser. So things should just work. To then encode the decoded characters to provide feedback to the stream position does not make much sense to me.

byte[] bytes = Encoding.UTF8.GetBytes("Hello\ud83d\ude17emoji");
CsvConfiguration config = new(CultureInfo.InvariantCulture)
{
	Delimiter = "\ud83d\ude17",
	CountBytes = true
};
using MemoryStream stream = new(bytes);
using StreamReader reader = new(stream);
using CsvParser parser = new(reader, config);

Console.WriteLine(parser.Read()); // True

Console.WriteLine(bytes.Length); // 14
Console.WriteLine(stream.Position); // 14
Console.WriteLine(parser.ByteCount); // 16 (wrong)

// But the parsing still works nicely
Console.WriteLine(parser.Record.Length); // 2
Console.WriteLine(parser.Record[0]); // Hello
Console.WriteLine(parser.Record[1]); // emoji

Hello @Rob-Hague!

The idea here is to make CsvReader.ByteCount property count surrogate characters correctly. That is the sole idea of what we're trying to accomplish here.

But to answer some of your other questions - a perfectly sensible scenario to use ByteCount as the specifier for Stream.Position is when you actually have a large enough input that a given StreamReader's buffer (By default 4096 bytes) cannot hold all of the data in it.

That means, that once you've started reading the data from the StreamReader, your underlying stream's position would have advanced with 4096 bytes, but that would not effectively be the position at which you've actually read/consumed the data. I could have read 5 of the 20 lines out of those 4096 bytes (by using the CsvReader.Read() method for example) and I'd like to resume reading at the 6th line.

That is the reason why a ByteCount property exists for the CSV parser in question, at least to my understanding. It's there to show you the effectively read bytes, and not just the total number of read bytes, like those in a buffer. Currently though, that number is incorrect if a surrogate pair is involved in the read payload.

@Rob-Hague
Copy link
Contributor

Ok I think I understand your scenario, but (IMO) it's probably difficult to achieve 100% correctness given that the byte-level/encoding functionality is done at the StreamReader and the CsvParser sits "above" that, so we are always going backwards in trying to invert a function which is not injective (i.e to get the input bytes from the output chars). But I agree with the apparent purpose of exposing that information at the parser level.

In any case I think the suggestion to use an Encoder in the CsvParser might help you here.

@kikaragyozov
Copy link
Author

kikaragyozov commented Dec 13, 2022

Ok I think I understand your scenario, but (IMO) it's probably difficult to achieve 100% correctness given that the byte-level/encoding functionality is done at the StreamReader and the CsvParser sits "above" that, so we are always going backwards in trying to invert a function which is not injective (i.e to get the input bytes from the output chars). But I agree with the apparent purpose of exposing that information at the parser level.

In any case I think the suggestion to use an Encoder in the CsvParser might help you here.

I tried using an Encoder, but it seems the count is still incorrect? Here's a short example:

var encoder = Encoding.UTF8.GetEncoder();
var firstPairBytes = encoder.GetByteCount(new char[] { '\ud83d' }, false);
var secondPairBytes = encoder.GetByteCount(new char[] { '\ude17' }, true);
var surrogatePairBytes = encoder.GetByteCount(new char[] { '\ud83d', '\ude17' }, true);
Console.WriteLine(firstPairBytes + secondPairBytes);
Console.WriteLine(surrogatePairBytes);
Console.WriteLine(surrogatePairBytes == firstPairBytes + secondPairBytes);

Try it out here. We expect to see 4 as the result, but using the encoder by utilizing its state yields something different. You can verify it's the expected number by just calling Encoding.UTF8.GetByteCount("😗") and seeing the output.

@Rob-Hague
Copy link
Contributor

Yeah it's a bit tricky because encoder.GetByteCount does not alter the state. From the docs:

This method does not affect the state of the encoder.

To calculate the exact array size that GetBytes requires to store the resulting bytes, the application should use GetByteCount.

If GetBytes is called with flush set to false, the encoder stores trailing characters at the end of the data block in an internal buffer and uses them in the next encoding operation. The application should call GetByteCount on a block of data immediately before calling GetBytes on the same block, so that any trailing characters from the previous block are included in the calculation.

For our purposes we would probably need to use the return value of GetBytes called with a throwaway buffer or stream:

var encoding = Encoding.UTF8;
var encoder = encoding.GetEncoder();

byte[] byteBuffer = new byte[encoding.GetMaxByteCount(1)];

var firstPairBytes = encoder.GetBytes(new char[] { '\ud83d' }, byteBuffer, flush: false);
var secondPairBytes = encoder.GetBytes(new char[] { '\ude17' }, byteBuffer, flush: true);

var surrogatePairBytes = encoding.GetByteCount(new char[] { '\ud83d', '\ude17' });

Console.WriteLine(firstPairBytes + secondPairBytes); // 4
Console.WriteLine(surrogatePairBytes); // 4
Console.WriteLine(surrogatePairBytes == firstPairBytes + secondPairBytes); // True

@kikaragyozov
Copy link
Author

kikaragyozov commented Dec 13, 2022

Nice! I'll modify my pull request with your suggestion to simplify the code. If you'd like to contribute it instead - let me know so I make the appropriate changes for you to be able to edit it. 😃

@Rob-Hague Rob-Hague linked a pull request Dec 26, 2022 that will close this issue
@Rob-Hague
Copy link
Contributor

Hi @kikaragyozov I gave it a go in #2106, could you please check it solves your problems?

@Avrohom613
Copy link

Hi @Rob-Hague. Your pull request is exactly what I need!

I'm working on a project which reads csv files which have a footer with a different set of columns to the rest of the file.
To read the footer, after reading all the rows in the body of the file we reset the position of the underlying Stream to the beginning of the footer using the ByteCount.
Recently we were getting the wrong position, also due to surrogate characters.
I cloned your fork of the project, and tested both your encoder branch and master and found that your fix gets me the correct byte count.

Do you think this could be merged?

@Rob-Hague
Copy link
Contributor

@Avrohom613 glad to hear it 🙂

Do you think this could be merged?

I'm afraid I am not a maintainer so it is not up to me

@Avrohom613
Copy link

Thanks @Rob-Hague , I understand.
@JoshClose do you think @Rob-Hague 's PR #2106 could be merged?

@vicentezqh
Copy link

@JoshClose @Rob-Hague What's the progress of this issue now? When do you think your PR for this fix could be fixed.
We are working on a project to deal with reading csv files of large scale, each executor will have read/process time with 5 minutes timeout, then we mark the ByteCount and return, then next thread will pick up from this ByteCount and continue processing for another 5 minutes, and repeat the process until finished. That is our overall workflow, but obviously, if input csv hash emoji it will make the byteCount incorrect. We just need this change to be available ASAP, could you give us any updates on it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants