ByteCount fails to count surrogate characters properly. #2088

kikaragyozov · 2022-11-21T11:36:16Z

Describe the bug
The following symbol - 😗, is composed of the following two characters: \ud83d and \ude17. Calling Encoding.UTF8.GetByteCount(new char[] { <one of the two characters> } instead of an array with both symbols in it, will yield in an incorrect byte counting.

To Reproduce
Create a single line CSV file or in-memory StringStream that has an emoticon inside it. If you rely on ByteCount to correctly count the bytes, and move the stream's position forward by that many bytes. There's a chance you won't be able to process the next line, if for example an opening quote was skipped due the incorrect ByteCount property.

Expected behavior
ByteCount to correctly count such symbols.

Additional context
I'm not really in the deep end of things with these emoticons, so I find it really weird why the incorrect byte count happens. I'm aware that if we instantiate an UTF8Encoding with throwOnInvalidBytes set to true, passing either of the two characters by themselves will throw an exception, but passed together it will work.

EDIT: It appears those are called surrogate characters. https://www.ibm.com/docs/en/i/7.3?topic=renamed-surrogate-characters

The text was updated successfully, but these errors were encountered:

JoshClose · 2022-11-21T18:23:35Z

Are you setting Encoding? That is required for CountBytes to work. I should probably add that to the configuration validation.

kikaragyozov · 2022-11-21T20:35:26Z

Are you setting Encoding? That is required for CountBytes to work. I should probably add that to the configuration validation.

Encoding set doesn't matter here. After all - even if no encoding is specified, CsvConfiguration uses Encoding.UTF8 by default. I'll try to make a PR fixing this, but I'll have to first learn how the code is structured so that may take some time. I'll let you know how things are going! I should have lots of free time in the weekends to tackle this.

JoshClose · 2022-11-21T20:41:39Z

Let me know what you find out.

kikaragyozov · 2022-11-23T12:46:10Z

@JoshClose I've created a draft on the idea I had. Let me know what you think.

Rob-Hague · 2022-12-13T10:20:50Z

If you rely on ByteCount to correctly count the bytes, and move the stream's position forward by that many bytes.

@kikaragyozov could you share a small example showing what you are trying to do here? I do not think you can rely on ByteCount to correctly count the bytes, i.e a given encoding is not necessarily a bijection bytes -> chars (in particular for invalid bytes), so if you go bytes -> chars -> bytes you cannot assume the output matches the input:

byte[] b = { 0x17, 0xDE };
string s = Encoding.Unicode.GetString(b);

Console.WriteLine(Convert.ToHexString(b)); // 17DE
Console.WriteLine(s); // �
Console.WriteLine(Convert.ToHexString(Encoding.Unicode.GetBytes(s))); // FDFF

You have observed this because the surrogate characters are (on their own) invalid and ByteCount is accumulated by encoding each individual (decoded) character, which in this case is replaced by the fallback character \uFFFD. A slightly better thing to do here would be to use an Encoder via encoding.GetEncoder which remembers state across calls rather than the Encoding itself which is stateless. But it does not solve the underlying fact that input may not equal output.

Regardless, it does sound to me like your approach is a bit backwards. I would think you start with a Stream of bytes, wrap it with a StreamReader which correctly handles the decoding bytes -> chars and advancing of the stream, and which is passed to the CsvParser. So things should just work. To then encode the decoded characters to provide feedback to the stream position does not make much sense to me.

byte[] bytes = Encoding.UTF8.GetBytes("Hello\ud83d\ude17emoji");
CsvConfiguration config = new(CultureInfo.InvariantCulture)
{
	Delimiter = "\ud83d\ude17",
	CountBytes = true
};
using MemoryStream stream = new(bytes);
using StreamReader reader = new(stream);
using CsvParser parser = new(reader, config);

Console.WriteLine(parser.Read()); // True

Console.WriteLine(bytes.Length); // 14
Console.WriteLine(stream.Position); // 14
Console.WriteLine(parser.ByteCount); // 16 (wrong)

// But the parsing still works nicely
Console.WriteLine(parser.Record.Length); // 2
Console.WriteLine(parser.Record[0]); // Hello
Console.WriteLine(parser.Record[1]); // emoji

kikaragyozov · 2022-12-13T10:55:01Z

If you rely on ByteCount to correctly count the bytes, and move the stream's position forward by that many bytes.

@kikaragyozov could you share a small example showing what you are trying to do here? I do not think you can rely on ByteCount to correctly count the bytes, i.e a given encoding is not necessarily a bijection bytes -> chars (in particular for invalid bytes), so if you go bytes -> chars -> bytes you cannot assume the output matches the input:
byte[] b = { 0x17, 0xDE };
string s = Encoding.Unicode.GetString(b);

Console.WriteLine(Convert.ToHexString(b)); // 17DE
Console.WriteLine(s); // �
Console.WriteLine(Convert.ToHexString(Encoding.Unicode.GetBytes(s))); // FDFF
You have observed this because the surrogate characters are (on their own) invalid and ByteCount is accumulated by encoding each individual (decoded) character, which in this case have been decoded to the fallback character \uFFFD. A slightly better thing to do here would be to use an Encoder via encoding.GetEncoder which remembers state across calls rather than the Encoding itself which is stateless. But it does not solve the underlying fact that input may not equal output.

Regardless, it does sound to me like your approach is a bit backwards. I would think you start with a Stream of bytes, wrap it with a StreamReader which correctly handles the decoding bytes -> chars and advancing of the stream, and which is passed to the CsvParser. So things should just work. To then encode the decoded characters to provide feedback to the stream position does not make much sense to me.
byte[] bytes = Encoding.UTF8.GetBytes("Hello\ud83d\ude17emoji");
CsvConfiguration config = new(CultureInfo.InvariantCulture)
{
	Delimiter = "\ud83d\ude17",
	CountBytes = true
};
using MemoryStream stream = new(bytes);
using StreamReader reader = new(stream);
using CsvParser parser = new(reader, config);

Console.WriteLine(parser.Read()); // True

Console.WriteLine(bytes.Length); // 14
Console.WriteLine(stream.Position); // 14
Console.WriteLine(parser.ByteCount); // 16 (wrong)

// But the parsing still works nicely
Console.WriteLine(parser.Record.Length); // 2
Console.WriteLine(parser.Record[0]); // Hello
Console.WriteLine(parser.Record[1]); // emoji

Hello @Rob-Hague!

The idea here is to make CsvReader.ByteCount property count surrogate characters correctly. That is the sole idea of what we're trying to accomplish here.

But to answer some of your other questions - a perfectly sensible scenario to use ByteCount as the specifier for Stream.Position is when you actually have a large enough input that a given StreamReader's buffer (By default 4096 bytes) cannot hold all of the data in it.

That means, that once you've started reading the data from the StreamReader, your underlying stream's position would have advanced with 4096 bytes, but that would not effectively be the position at which you've actually read/consumed the data. I could have read 5 of the 20 lines out of those 4096 bytes (by using the CsvReader.Read() method for example) and I'd like to resume reading at the 6th line.

That is the reason why a ByteCount property exists for the CSV parser in question, at least to my understanding. It's there to show you the effectively read bytes, and not just the total number of read bytes, like those in a buffer. Currently though, that number is incorrect if a surrogate pair is involved in the read payload.

Rob-Hague · 2022-12-13T12:37:45Z

Ok I think I understand your scenario, but (IMO) it's probably difficult to achieve 100% correctness given that the byte-level/encoding functionality is done at the StreamReader and the CsvParser sits "above" that, so we are always going backwards in trying to invert a function which is not injective (i.e to get the input bytes from the output chars). But I agree with the apparent purpose of exposing that information at the parser level.

In any case I think the suggestion to use an Encoder in the CsvParser might help you here.

kikaragyozov · 2022-12-13T13:09:20Z

Ok I think I understand your scenario, but (IMO) it's probably difficult to achieve 100% correctness given that the byte-level/encoding functionality is done at the StreamReader and the CsvParser sits "above" that, so we are always going backwards in trying to invert a function which is not injective (i.e to get the input bytes from the output chars). But I agree with the apparent purpose of exposing that information at the parser level.

In any case I think the suggestion to use an Encoder in the CsvParser might help you here.

I tried using an Encoder, but it seems the count is still incorrect? Here's a short example:

var encoder = Encoding.UTF8.GetEncoder();
var firstPairBytes = encoder.GetByteCount(new char[] { '\ud83d' }, false);
var secondPairBytes = encoder.GetByteCount(new char[] { '\ude17' }, true);
var surrogatePairBytes = encoder.GetByteCount(new char[] { '\ud83d', '\ude17' }, true);
Console.WriteLine(firstPairBytes + secondPairBytes);
Console.WriteLine(surrogatePairBytes);
Console.WriteLine(surrogatePairBytes == firstPairBytes + secondPairBytes);

Try it out here. We expect to see 4 as the result, but using the encoder by utilizing its state yields something different. You can verify it's the expected number by just calling Encoding.UTF8.GetByteCount("😗") and seeing the output.

Rob-Hague · 2022-12-13T13:46:28Z

Yeah it's a bit tricky because encoder.GetByteCount does not alter the state. From the docs:

This method does not affect the state of the encoder.

To calculate the exact array size that GetBytes requires to store the resulting bytes, the application should use GetByteCount.

If GetBytes is called with flush set to false, the encoder stores trailing characters at the end of the data block in an internal buffer and uses them in the next encoding operation. The application should call GetByteCount on a block of data immediately before calling GetBytes on the same block, so that any trailing characters from the previous block are included in the calculation.

For our purposes we would probably need to use the return value of GetBytes called with a throwaway buffer or stream:

var encoding = Encoding.UTF8;
var encoder = encoding.GetEncoder();

byte[] byteBuffer = new byte[encoding.GetMaxByteCount(1)];

var firstPairBytes = encoder.GetBytes(new char[] { '\ud83d' }, byteBuffer, flush: false);
var secondPairBytes = encoder.GetBytes(new char[] { '\ude17' }, byteBuffer, flush: true);

var surrogatePairBytes = encoding.GetByteCount(new char[] { '\ud83d', '\ude17' });

Console.WriteLine(firstPairBytes + secondPairBytes); // 4
Console.WriteLine(surrogatePairBytes); // 4
Console.WriteLine(surrogatePairBytes == firstPairBytes + secondPairBytes); // True

kikaragyozov · 2022-12-13T15:00:02Z

Nice! I'll modify my pull request with your suggestion to simplify the code. If you'd like to contribute it instead - let me know so I make the appropriate changes for you to be able to edit it. 😃

Rob-Hague · 2022-12-26T18:58:56Z

Hi @kikaragyozov I gave it a go in #2106, could you please check it solves your problems?

Avrohom613 · 2023-10-12T13:54:50Z

Hi @Rob-Hague. Your pull request is exactly what I need!

I'm working on a project which reads csv files which have a footer with a different set of columns to the rest of the file.
To read the footer, after reading all the rows in the body of the file we reset the position of the underlying Stream to the beginning of the footer using the ByteCount.
Recently we were getting the wrong position, also due to surrogate characters.
I cloned your fork of the project, and tested both your encoder branch and master and found that your fix gets me the correct byte count.

Do you think this could be merged?

Rob-Hague · 2023-10-12T16:39:58Z

@Avrohom613 glad to hear it 🙂

Do you think this could be merged?

I'm afraid I am not a maintainer so it is not up to me

Avrohom613 · 2023-10-13T07:30:06Z

Thanks @Rob-Hague , I understand.
@JoshClose do you think @Rob-Hague 's PR #2106 could be merged?

vicentezqh · 2024-08-10T00:08:42Z

@JoshClose @Rob-Hague What's the progress of this issue now? When do you think your PR for this fix could be fixed.
We are working on a project to deal with reading csv files of large scale, each executor will have read/process time with 5 minutes timeout, then we mark the ByteCount and return, then next thread will pick up from this ByteCount and continue processing for another 5 minutes, and repeat the process until finished. That is our overall workflow, but obviously, if input csv hash emoji it will make the byteCount incorrect. We just need this change to be available ASAP, could you give us any updates on it?

kikaragyozov added the bug label Nov 21, 2022

kikaragyozov changed the title ~~ByteCount fails to count emoticons properly.~~ ByteCount fails to count emoticons/symbols properly. Nov 21, 2022

kikaragyozov changed the title ~~ByteCount fails to count emoticons/symbols properly.~~ ByteCount fails to count surrogate characters properly. Nov 21, 2022

kikaragyozov linked a pull request Nov 23, 2022 that will close this issue

Count surrogate pairs. #2090

Draft

Rob-Hague linked a pull request Dec 26, 2022 that will close this issue

Use Encoder instead of Encoding in CsvParser #2106

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ByteCount fails to count surrogate characters properly. #2088

ByteCount fails to count surrogate characters properly. #2088

kikaragyozov commented Nov 21, 2022 •

edited

Loading

JoshClose commented Nov 21, 2022

kikaragyozov commented Nov 21, 2022 •

edited

Loading

JoshClose commented Nov 21, 2022

kikaragyozov commented Nov 23, 2022

Rob-Hague commented Dec 13, 2022 •

edited

Loading

kikaragyozov commented Dec 13, 2022 •

edited

Loading

Rob-Hague commented Dec 13, 2022

kikaragyozov commented Dec 13, 2022 •

edited

Loading

Rob-Hague commented Dec 13, 2022

kikaragyozov commented Dec 13, 2022 •

edited

Loading

Rob-Hague commented Dec 26, 2022

Avrohom613 commented Oct 12, 2023

Rob-Hague commented Oct 12, 2023

Avrohom613 commented Oct 13, 2023

vicentezqh commented Aug 10, 2024

ByteCount fails to count surrogate characters properly. #2088

ByteCount fails to count surrogate characters properly. #2088

Comments

kikaragyozov commented Nov 21, 2022 • edited Loading

JoshClose commented Nov 21, 2022

kikaragyozov commented Nov 21, 2022 • edited Loading

JoshClose commented Nov 21, 2022

kikaragyozov commented Nov 23, 2022

Rob-Hague commented Dec 13, 2022 • edited Loading

kikaragyozov commented Dec 13, 2022 • edited Loading

Rob-Hague commented Dec 13, 2022

kikaragyozov commented Dec 13, 2022 • edited Loading

Rob-Hague commented Dec 13, 2022

kikaragyozov commented Dec 13, 2022 • edited Loading

Rob-Hague commented Dec 26, 2022

Avrohom613 commented Oct 12, 2023

Rob-Hague commented Oct 12, 2023

Avrohom613 commented Oct 13, 2023

vicentezqh commented Aug 10, 2024

kikaragyozov commented Nov 21, 2022 •

edited

Loading

kikaragyozov commented Nov 21, 2022 •

edited

Loading

Rob-Hague commented Dec 13, 2022 •

edited

Loading

kikaragyozov commented Dec 13, 2022 •

edited

Loading

kikaragyozov commented Dec 13, 2022 •

edited

Loading

kikaragyozov commented Dec 13, 2022 •

edited

Loading