parse error: Invalid \uXXXX\uXXXX surrogate pair escape at line 276, column 28101 #2543
Comments
Hi, please clean up your question formatting and figure out what's on line 276; that would make it much easier to help.
The only surrogate escape I see is:
$ jq -n '"\ufffd\ud8ec\ufffd"'
jq: error: Invalid \uXXXX\uXXXX surrogate pair escape at line 1, column 20 (while parsing '"\ufffd\ud8ec\ufffd"') at <top-level>, line 1:
"\ufffd\ud8ec\ufffd"
jq: 1 compile error
Ok, I guess it boils down to the passage on unpaired surrogates in the JSON spec, https://www.rfc-editor.org/rfc/rfc8259 (section 8.2), and to the fact that different implementations behave differently in this "unpredictable" case.
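As one data point on how implementations differ, here is a minimal sketch using Python's standard json module (Python is only my choice for illustration; it is not used anywhere in this thread). CPython's parser accepts the same escape sequence that jq 1.6 rejects above, and the problem only surfaces later, when the resulting string is encoded as strict UTF-8.

```python
import json

# The escape sequence from this thread: a lone high surrogate (\ud8ec)
# sitting between two U+FFFD replacement characters.
text = r'"\ufffd\ud8ec\ufffd"'

# CPython's json module does not reject the unpaired surrogate.
value = json.loads(text)
middle = value[1]  # a lone surrogate code point, U+D8EC

# The string cannot be encoded as strict UTF-8, so the failure is
# deferred to whoever tries to write the value out.
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(len(value), hex(ord(middle)), encodable)
```

So the same input is a parse error in one tool and a latent encoding error in another, which is the kind of divergence the RFC warns about.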
I have removed some comments I left after realizing my "straightforward" suggestion was based on a misunderstanding of the problem, my apologies. My advice to anyone who wants to avoid the same mistake as me: remember that the
It's a tricky problem for sure. It would be nice to handle this, as it's technically valid JSON, but neither the ECMA nor the IETF spec appears to give any guidance on how to do so.
If we accept the assumption the RFC seems to make in the passage quoted by @wader above (that most or all instances of an unmatched surrogate come from incorrectly truncating UTF-16), then I guess the pragmatic approach would be to consume just the unmatched surrogate. If the assumption is false, this results in a single garbage character in the output. That is far preferable to the alternative: if the assumption is true and we consume both escapes, we scramble all immediately subsequent "healthy" surrogate pairs.
One thing I haven't fully explored the implications of: if you feed random bytes into
Update:
Looks like jq handles unmatched UTF-8 surrogates by replacing just the surrogate and leaving the following character intact, matching the "pragmatic" option above. (No idea how it handles partial 3- and 4-byte sequences missing a surrogate, but that's less relevant as precedent for how to deal with UTF-16.) Given that jq does handle unmatched surrogates in some cases, I think the handling of escape sequences can be considered inconsistent behavior, and possibly a bug, if anyone is interested in re-opening this.
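To make the "pragmatic" option concrete, here is a rough Python sketch of what I mean. This is not jq's code (jq is written in C) and not a proposed patch; decode_u_escapes is a hypothetical helper, and it only understands \uXXXX escapes, so it assumes the input contains no other backslash escapes such as \\ or \". A high surrogate is combined with the next escape only when that escape is a low surrogate. Otherwise just the unmatched surrogate becomes U+FFFD and the following escape is decoded on its own.

```python
import re

REPLACEMENT = "\ufffd"
# Simplification: only \uXXXX escapes are recognized. Other JSON escapes
# (for example an escaped backslash) are NOT handled by this sketch.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text: str) -> str:
    """Decode \\uXXXX escapes, replacing only unpaired surrogates with U+FFFD."""
    out = []
    pos = 0
    while True:
        m = _ESCAPE.search(text, pos)
        if m is None:
            out.append(text[pos:])  # trailing literal text
            break
        out.append(text[pos:m.start()])  # literal text before the escape
        code = int(m.group(1), 16)
        pos = m.end()
        if 0xD800 <= code <= 0xDBFF:
            # High surrogate: peek at the escape that follows immediately.
            nxt = _ESCAPE.match(text, pos)
            low = int(nxt.group(1), 16) if nxt else None
            if low is not None and 0xDC00 <= low <= 0xDFFF:
                # Valid pair: combine into one supplementary code point.
                out.append(chr(0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00)))
                pos = nxt.end()
            else:
                # Unmatched: consume only this escape, leave the next one alone.
                out.append(REPLACEMENT)
        elif 0xDC00 <= code <= 0xDFFF:
            out.append(REPLACEMENT)  # low surrogate with no preceding high one
        else:
            out.append(chr(code))
    return "".join(out)


print(ascii(decode_u_escapes(r"\ufffd\ud8ec\ufffd")))
print(ascii(decode_u_escapes(r"\ud8ec\ud83d\ude00")))
```

With this policy the second input keeps its healthy pair: the stray \ud8ec turns into one replacement character and \ud83d\ude00 still decodes to a single code point, instead of being scrambled by a decoder that swallows two escapes at once.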
jq - commandline JSON processor [version 1.6]
at line 276, column 28101: