fix JSON decoder error checking for UTF16 / surrogate parsing panic #7721
match (low, high) {
    (0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
        let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
        char::from_u32(n)
            .ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
    }
    _ => Err(ArrowError::JsonError(format!(
        "Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
    ))),
}
Would a pair of checked_sub calls work?
    char_from_surrogate_pair_opt(low, high).ok_or_else(|| {
        ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair: {low:#02X}, {high:#02X}"))
    })
}

fn char_from_surrogate_pair_opt(low: u16, high: u16) -> Option<char> {
    // `checked_sub` returns None on underflow instead of panicking.
    let high = high.checked_sub(0xD800)? as u32;
    let low = low.checked_sub(0xDC00)? as u32;
    char::from_u32((high << 10) | (low + 0x1_0000))
}
I did consider this, and it does have the benefit of being more succinct. I opted for the range check because checked_sub would not catch an argument that is too high, and I wasn't able to convince myself that only valid inputs would allow char::from_u32 to succeed.
I could dig more into the calling layers to check if that's prevented there, but it still felt like more of a foot-gun than the simple range check.
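To make the concern concrete, here is a minimal sketch comparing the two approaches on an input whose low value is above the low-surrogate range. Both functions return `Option<char>` so the sketch stays free of `ArrowError`; the name `char_from_surrogate_pair_ranged` is made up for illustration and is not from the PR.

```rust
/// The `checked_sub` variant from the suggestion above.
fn char_from_surrogate_pair_opt(low: u16, high: u16) -> Option<char> {
    let high = high.checked_sub(0xD800)? as u32;
    let low = low.checked_sub(0xDC00)? as u32;
    char::from_u32((high << 10) | (low + 0x1_0000))
}

/// Range-checked variant (hypothetical name), using the same expression as the diff.
fn char_from_surrogate_pair_ranged(low: u16, high: u16) -> Option<char> {
    match (low, high) {
        (0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
            let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
            char::from_u32(n)
        }
        // Anything outside the surrogate ranges is rejected up front.
        _ => None,
    }
}

fn main() {
    // 0xE000 is just past the low-surrogate range 0xDC00..=0xDFFF,
    // so (0xD800, 0xE000) is not a valid surrogate pair.
    let (low, high) = (0xE000u16, 0xD800u16);

    // No underflow occurs, so `checked_sub` lets the value through.
    println!("checked_sub: {:?}", char_from_surrogate_pair_opt(low, high));
    // The range check rejects it.
    println!("range check: {:?}", char_from_surrogate_pair_ranged(low, high));
}
```

In this sketch the `checked_sub` variant still guards against values that are too low, but it accepts the too-high low value because the result happens to be a valid scalar value.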
For what it is worth, I think the current way this is encoded is quite understandable and will generate clear error messages. I double checked and it conforms to my reading of UTF-16 on https://en.wikipedia.org/wiki/UTF-16
🤖: Benchmark completed
}

#[test]
fn test_invalid_surrogates() {
I verified this test panics on main:
attempt to subtract with overflow
thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:708:49:
attempt to subtract with overflow
And the test actually fails to detect an error in release mode:
---- reader::tape::tests::test_invalid_surrogates stdout ----
thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:959:9:
assertion failed: res.is_err()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
If anything, this seems faster than what is on main, so it has an added bonus.
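The debug-versus-release difference above comes from Rust's integer overflow rules: with overflow checks on (the debug default) an underflowing subtraction panics, while with them off (the release default) it wraps. A rough sketch of the release-mode behaviour, using `wrapping_sub` to mimic the wrap explicitly; `decode_wrapping` is a hypothetical helper, not code from arrow-json:

```rust
/// Mimics what the unchecked subtraction does when overflow checks are off:
/// the subtraction wraps around instead of panicking.
fn decode_wrapping(low: u16, high: u16) -> Option<char> {
    let high = high.wrapping_sub(0xD800) as u32;
    let low = low.wrapping_sub(0xDC00) as u32;
    char::from_u32((high << 10) | (low + 0x1_0000))
}

fn main() {
    // 0x0041 ('A') is not a low surrogate, so `low - 0xDC00` underflows.
    let (low, high) = (0x0041u16, 0xD800u16);

    // The wrapped value is still a valid scalar value, so decoding
    // "succeeds" with a meaningless character instead of reporting an error.
    println!("{:?}", decode_wrapping(low, high));
}
```

This is why a test asserting `res.is_err()` can panic in a debug build and fail its assertion in a release build for the same input.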
Which issue does this PR close?
Rationale for this change
The decoder shouldn't panic, especially in a fallible function.
What changes are included in this PR?
Validate that the high and low surrogates are in the expected range, which guarantees that the subtractions won't overflow.
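A self-contained sketch of the validate-then-subtract approach, for readers without the arrow-json source at hand. `SurrogateError` and `decode_pair` are hypothetical stand-ins for `ArrowError::JsonError` and the real function, and the arithmetic is written in the textbook additive form of UTF-16 decoding rather than copied from the diff.

```rust
/// Hypothetical stand-in for `ArrowError::JsonError`.
#[derive(Debug, PartialEq)]
struct SurrogateError(String);

/// Decodes a UTF-16 surrogate pair, rejecting out-of-range halves before
/// any subtraction so the arithmetic cannot underflow.
fn decode_pair(low: u16, high: u16) -> Result<char, SurrogateError> {
    match (low, high) {
        (0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
            // Both subtractions are safe: the ranges were just checked.
            let n = 0x1_0000 + (((high - 0xD800) as u32) << 10) + (low - 0xDC00) as u32;
            char::from_u32(n)
                .ok_or_else(|| SurrogateError(format!("Invalid UTF-16 surrogate pair {n}")))
        }
        _ => Err(SurrogateError(format!(
            "Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
        ))),
    }
}

fn main() {
    // U+1F600 is encoded as the pair (high 0xD83D, low 0xDE00).
    println!("{:?}", decode_pair(0xDE00, 0xD83D));
    // Swapping the halves is an error, not a panic.
    println!("{:?}", decode_pair(0xD83D, 0xDE00));
}
```

Note that the additive form and the `|` grouping in the diff agree only while bit 16 of the shifted high part is clear, so this sketch should not be read as the PR's exact expression.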
Are there any user-facing changes?
No (well, things that used to panic now won't, but I don't think that counts)