Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deserialize lone surrogates into byte bufs #828

Merged
merged 3 commits into from
Nov 25, 2021

Conversation

lucacasonato
Copy link
Contributor

@lucacasonato lucacasonato commented Nov 24, 2021

This commit deserializes lone surrogates in strings that are encoded in
escape sequences instead of erroring on them.

As per #827 (comment)


Note the implementation here is ugly & unoptimized. I'll go fix this after
an initial review of the functionality.

This commit deserializes lone surrogates in strings that are encoded in
escape sequences instead of erroring on them.
Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

src/de.rs Outdated Show resolved Hide resolved
src/read.rs Outdated Show resolved Hide resolved
src/read.rs Show resolved Hide resolved
src/read.rs Show resolved Hide resolved
Copy link
Contributor Author

@lucacasonato lucacasonato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed your review comments, and have added some test cases for the "non \u escape code after lone surrogate" cases.

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good — thanks.

@dtolnay dtolnay merged commit 691466c into serde-rs:master Nov 25, 2021
@lucacasonato lucacasonato deleted the lone_surrogate branch November 25, 2021 11:39
lucacasonato added a commit to lucacasonato/json that referenced this pull request Apr 12, 2022
Previously serde-rs#828 added support for deserializing lone leading and
trailing surrogates into WTF-8 encoded bytes when deserializing a string
as bytes. This commit extends this to cover the case of a leading
surrogate followed by code units that are not trailing surrogates. This
allows for deserialization of "\ud83c\ud83c" (two leading surrogates),
or  "\ud83c\u0061" (a leading surrogate followed by "a").

The docs also now make it clear that we are serializing the invalid code
points as WTF-8. This reference to WTF-8 signals to the user that they
can use a WTF-8 parser on the bytes to construct a valid UTF-8 string.
lucacasonato added a commit to lucacasonato/json that referenced this pull request Apr 12, 2022
Previously serde-rs#828 added support for deserializing lone leading and
trailing surrogates into WTF-8 encoded bytes when deserializing a string
as bytes. This commit extends this to cover the case of a leading
surrogate followed by code units that are not trailing surrogates. This
allows for deserialization of "\ud83c\ud83c" (two leading surrogates),
or  "\ud83c\u0061" (a leading surrogate followed by "a").

The docs also now make it clear that we are serializing the invalid code
points as WTF-8. This reference to WTF-8 signals to the user that they
can use a WTF-8 parser on the bytes to construct a valid UTF-8 string.
lucacasonato added a commit to lucacasonato/json that referenced this pull request May 18, 2022
Previously serde-rs#828 added support for deserializing lone leading and
trailing surrogates into WTF-8 encoded bytes when deserializing a string
as bytes. This commit extends this to cover the case of a leading
surrogate followed by code units that are not trailing surrogates. This
allows for deserialization of "\ud83c\ud83c" (two leading surrogates),
or  "\ud83c\u0061" (a leading surrogate followed by "a").

The docs also now make it clear that we are serializing the invalid code
points as WTF-8. This reference to WTF-8 signals to the user that they
can use a WTF-8 parser on the bytes to construct a valid UTF-8 string.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants