YAML parser rejects Unicode surrogates. #2206

Open
lhmathies opened this issue Nov 29, 2024 · 3 comments

lhmathies commented Nov 29, 2024

Describe the bug

The YAML parser rejects Unicode surrogates, but the JSON parser accepts them. This breaks the expectation that you can parse JSON as if it were YAML.

I have a real-world example where the JSON response body from an HTTP call to an API endpoint contains surrogates (in string values), but I'll illustrate it with a single non-BMP character (U+10336 Gothic Letter Iuja, 𐌶):

$ yq --version
yq (https://github.com/mikefarah/yq/) version v4.44.5
$ echo -n '"\ud800\udf36"' | yq -py
Error: bad file '-': yaml: found invalid Unicode character escape code
$ echo -n '"\ud800\udf36"' | yq -pj
𐌶
$ echo -n '"\ud800\udf36"' | yq -pj | tr -d \\n | iconv -t utf-32le | od -t x4
0000000 00010336
0000004
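
For reference, the pair `\ud800\udf36` used above follows from the standard UTF-16 surrogate arithmetic. A minimal Python sketch; the helper names `to_surrogates` and `from_surrogates` are made up for illustration and are not part of yq or go-yaml:

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10336)
print(f"\\u{high:04x}\\u{low:04x}")         # \ud800\udf36
print(f"U+{from_surrogates(high, low):X}")  # U+10336
```

A parser that wants to accept such escapes would need the equivalent of `from_surrogates` whenever a high-surrogate escape is immediately followed by a low-surrogate escape.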

So arguably the author of the JSON should have used \u{10336}, but that only works for ECMAScript strings. (Tested in the Firefox console; yq -pj doesn't grok it. Firefox also accepts surrogates.)

YAML supports \U00010336, but that only works with yq -py. FWIW, the YAML 1.2 spec doesn't mention surrogates, but you can argue that they aren't "characters". I just need them to work...

(This sort of proves that YAML doesn't have JSON as a subset if you use the full ECMAScript string definition; the JSON spec only has \uxxxx, so that part is fine, but you do need surrogates to reach outside the BMP.)
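
To illustrate why JSON needs surrogates outside the BMP, here is a small sketch using Python's standard `json` module. It is a stand-in for any JSON implementation and says nothing about how yq's own JSON parser is written:

```python
import json

# Decoding: the two \uXXXX escapes are joined into one code point.
decoded = json.loads('"\\ud800\\udf36"')
print(len(decoded), hex(ord(decoded)))  # 1 0x10336

# Encoding with ASCII-only output (the default): \uXXXX is the only escape
# JSON offers, so the encoder has to emit a surrogate pair.
encoded = json.dumps("\U00010336")
print(encoded)  # "\ud800\udf36"
```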

Version of yq: 4.44.5
Operating system: linux amd64
Installed via: binary release

lhmathies commented Dec 2, 2024

YAML 1.2 specification, first line of section 5.2:

 All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair.
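
The "four bytes, using a surrogate pair" statement can be checked directly. A short Python sketch that prints the UTF-16 and UTF-8 encodings of the example character:

```python
ch = "\U00010336"  # GOTHIC LETTER IUJA

utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
utf8 = ch.encode("utf-8")

print(utf16.hex(" "))  # d8 00 df 36
print(utf8.hex(" "))   # f0 90 8c b6
```

The UTF-16 bytes are exactly the surrogate pair D800 DF36, while the UTF-8 form encodes the code point itself and contains no surrogates.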

I wanted to test whether UTF-16 behaves differently from UTF-8, but I can't seem to get yq to accept UTF-16 input.

I can get yq -py to accept UTF-16 with a byte order mark (iconv -t utf16), but then it replaces the surrogate characters with \ufffd 'REPLACEMENT CHARACTER'. This is not what the YAML spec says. And then it outputs "\U00010336" if printing a whole YAML document, and the UTF-8 representation of the code point if printing a raw string. That all complies with the YAML 1.2 spec.

My conclusion to all this is that surrogates are invalid in UTF-8, but the YAML parser ought to do something sensible when given surrogate code points written as \uD800-\uDFFF escapes. Throwing an error is not sensible, especially when it works fine for UTF-16 with a byte order mark. (And I'd claim that it ought to work without the BOM too, since all the other characters come out fine, so we know it detected the byte order correctly.)
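
The claim that surrogates are invalid in UTF-8 can be demonstrated with Python's strict UTF-8 codec. This is a sketch of the encoding rule only; it does not reproduce what go-yaml does internally:

```python
lone = "\ud800"  # Python allows a lone surrogate inside a str

try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

try:
    # ED A0 80 is what a naive three-byte encoding of U+D800 would look like.
    b"\xed\xa0\x80".decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(encodable, decodable)  # False False
```

So a surrogate can only reach a UTF-8 document as an escape sequence, which is exactly the case the parser rejects.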

lhmathies commented Dec 3, 2024

To be more explicit: The only spec-compliant way to represent 𐌶 in JSON is as "\ud800\udf36", unless your input is UTF-8. (So why the API I have to deal with doesn't just do that is a good question. The content is a user upload; all non-ASCII chars are converted to \uxxxx escapes, but the JSON is served with an explicit content-type: text/plain; charset=utf-8(!). My guess is that the server stores some type of UTF-16 and whoever coded the read endpoint was unclear on the concepts; handing the same string to JS code [which is the site's main way of presenting the content] does, after all, work.)

So regardless of what the YAML spec says or doesn't say about surrogates in \udxxx representation, it does say that it's intended to be a superset of JSON; it's surprising that yq -py doesn't accept it.
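
Until the YAML parser accepts such escapes, one possible workaround is to normalize the JSON before it reaches a YAML parser, so that surrogate-pair escapes become literal characters. A minimal Python sketch; `normalize_json_for_yaml` is a hypothetical helper name, and the approach assumes the input really is valid JSON:

```python
import json


def normalize_json_for_yaml(text: str) -> str:
    """Re-serialize JSON so non-ASCII characters appear literally, not as \\uXXXX escapes."""
    data = json.loads(text)  # joins surrogate-pair escapes into code points
    return json.dumps(data, ensure_ascii=False)


raw = '{"name": "\\ud800\\udf36"}'
normalized = normalize_json_for_yaml(raw)
print(normalized.encode("utf-8"))  # b'{"name": "\xf0\x90\x8c\xb6"}'
```

The result contains the character as plain UTF-8, which a YAML parser can read without any escape handling. An unpaired surrogate escape in the input would survive `json.loads` as a lone surrogate and then fail at the UTF-8 encoding step, so that case still needs a decision of its own. When the input is known to be JSON, `yq -pj` as shown in the original report avoids the problem entirely.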

mikefarah (Owner) commented

Yeah this is a known issue with the underlying go-yaml library, which hasn't received much love recently :(
