Docs for utf8 decoding #2979

Merged
merged 1 commit into from
Sep 19, 2022

Conversation

TimWSpence
Member

Clarify a potentially confusing aspect of the decoding API

@TimWSpence
Member Author

CC @CremboC

Comment on lines +46 to +49
* Note that the output stream is ''not'' a singleton stream but rather a stream
* of strings where each string is the result of UTF8 decoding a chunk of the
* underlying byte stream.
*/
Member

@armanbilge armanbilge Sep 16, 2022

Oh right, I can see how this was confusing 😅 thanks!

If we really get nitty, IIUC it's technically not one-string-per-Chunk, since some multi-byte characters could be split across Chunks. Not sure if there's a good way to say that though, or if it really matters.

Member Author

Oh haha ouch! I suspect attempting to explain that would cause more confusion rather than less but I'm very happy if someone can suggest a simple explanation for it!

Contributor

It is even more complicated, since a chunk from the input could just be the middle bytes of a multi-byte character, so not a single character would be built off that chunk.

The result is a stream of strings. Every chunk in the result contains exactly one string, and each string carries all characters that could be fully decoded from the input.

But I think it may be easier to just say

For the most part, each string in the output is the result of decoding a chunk of bytes from the input; however, this may not be accurate when the bytes of a multi-byte character are split across two or more input chunks.
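
To make the split-character case concrete, here is a minimal sketch. It assumes fs2 3.x on the classpath and uses `text.utf8.decode`. The object name `Utf8DecodeSketch` and the sample string are made up for illustration. The two-byte encoding of `é` is deliberately split across two input chunks.

```scala
import java.nio.charset.StandardCharsets
import fs2.{Chunk, Pure, Stream, text}

// Hypothetical example object, not part of fs2.
object Utf8DecodeSketch {
  // "h\u00e9llo" is 6 bytes in UTF-8: 'h', then 0xC3 0xA9 for 'é', then "llo".
  val bytes: Array[Byte] = "h\u00e9llo".getBytes(StandardCharsets.UTF_8)

  // Split between the two bytes of 'é', so neither input chunk contains it whole.
  val (first, second) = bytes.splitAt(2)

  // Two separate input chunks: ['h', 0xC3] and [0xA9, 'l', 'l', 'o'].
  val input: Stream[Pure, Byte] =
    Stream.chunk(Chunk.array(first)) ++ Stream.chunk(Chunk.array(second))

  // The result is a list of decoded pieces, not necessarily a single string.
  val decoded: List[String] = input.through(text.utf8.decode).toList

  def main(args: Array[String]): Unit = {
    println(decoded)          // the individual decoded pieces
    println(decoded.mkString) // the pieces joined back into one string
  }
}
```

How the pieces are divided depends on how the input was chunked, so code that needs the whole text should join the output (for example with `mkString` here, or `compile.string` on an effectful stream) rather than take only the first element.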

@mpilquist mpilquist merged commit b1bf982 into typelevel:main Sep 19, 2022
4 participants