Skip to content

Conversation

nicklan
Copy link
Contributor

@nicklan nicklan commented Jun 20, 2025

Which issue does this PR close?

Rationale for this change

Shouldn't panic, especially in a fallible function.

What changes are included in this PR?

Validate that the high and low surrogates are in the expected range, which guarantees that the subtractions won't overflow.

Are there any user-facing changes?

No (well, things that used to panic now won't, but I don't think that counts)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 20, 2025
Comment on lines +708 to +717
match (low, high) {
(0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
}
_ => Err(ArrowError::JsonError(format!(
"Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
))),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a pair of checked_sub work?

Suggested change
match (low, high) {
(0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
}
_ => Err(ArrowError::JsonError(format!(
"Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
))),
}
char_from_surrogate_pair_opt(low, high).ok_or_else(|| {
ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair: {lo:#02X}, {high:#02X}"))
})
}
fn char_from_surrogate_pair_opt(low: u16, high: u16) -> Option<char> {
let high = high.checked_sub(0xD800)? as u32;
let low = low.checked_sub(0xDC00)? as u32;
char::from_u32((high << 10) | (low + 0x1_0000))

Copy link
Contributor Author

@nicklan nicklan Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider this, and it does have the benefit of being more succinct. I opted for the range check because if one of the arguments was too high the checked sub would not catch it and I wasn't able to convince myself that only valid inputs would allow char::from_u32 to succeed.

I could dig more into the calling layers to check if that's prevented there, but it still felt like more of a foot-gun than the simple range check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it is worth, I think the current way this is encoded is quite understandable and will generate clear error messages. I double checked and it conforms to my reading of UTF-16 on https://en.wikipedia.org/wiki/UTF-16

@alamb
Copy link
Contributor

alamb commented Jun 21, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fix-json-surrogate-parsing-panic (30170ce) to fbaf7ce diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench json_reader
BENCH_FILTER=
BENCH_BRANCH_NAME=fix-json-surrogate-parsing-panic
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 21, 2025

Thank you @nicklan and @scovich -- I started the CI and some benchmarks on this PR

@alamb
Copy link
Contributor

alamb commented Jun 21, 2025

🤖: Benchmark completed

Details

group                                  fix-json-surrogate-parsing-panic       main
-----                                  --------------------------------       ----
large_bench_primitive                  1.00      2.3±0.01ms        ? ?/sec    1.04      2.4±0.00ms        ? ?/sec
small_bench_list                       1.00     12.9±0.03µs        ? ?/sec    1.03     13.2±0.07µs        ? ?/sec
small_bench_primitive                  1.01      6.6±0.04µs        ? ?/sec    1.00      6.5±0.04µs        ? ?/sec
small_bench_primitive_with_utf8view    1.00      6.2±0.04µs        ? ?/sec    1.04      6.5±0.09µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jun 21, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fix-json-surrogate-parsing-panic (30170ce) to fbaf7ce diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench json_reader
BENCH_FILTER=
BENCH_BRANCH_NAME=fix-json-surrogate-parsing-panic
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jun 21, 2025

🤖: Benchmark completed

Details

group                                  fix-json-surrogate-parsing-panic       main
-----                                  --------------------------------       ----
large_bench_primitive                  1.00      2.3±0.01ms        ? ?/sec    1.05      2.4±0.01ms        ? ?/sec
small_bench_list                       1.00     13.1±0.05µs        ? ?/sec    1.01     13.2±0.04µs        ? ?/sec
small_bench_primitive                  1.00      6.5±0.03µs        ? ?/sec    1.02      6.6±0.03µs        ? ?/sec
small_bench_primitive_with_utf8view    1.00      6.2±0.02µs        ? ?/sec    1.05      6.5±0.04µs        ? ?/sec

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nicklan and @scovich - this looks like a great change to me

Comment on lines +708 to +717
match (low, high) {
(0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
}
_ => Err(ArrowError::JsonError(format!(
"Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
))),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it is worth, I think the current way this is encoded is quite understandable and will generate clear error messages. I double checked and it conforms to my reading of UTF-16 on https://en.wikipedia.org/wiki/UTF-16

}

#[test]
fn test_invalid_surrogates() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this test panic's on main:

attempt to subtract with overflow
thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:708:49:
attempt to subtract with overflow

And the test actually fails to detect an error in release mode


---- reader::tape::tests::test_invalid_surrogates stdout ----

thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:959:9:
assertion failed: res.is_err()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@alamb alamb changed the title fix surrogate parsing panic fix JSON decoder error checking for surrogate parsing panic Jun 21, 2025
@alamb alamb changed the title fix JSON decoder error checking for surrogate parsing panic fix JSON decoder error checking for UTF16 / surrogate parsing panic Jun 21, 2025
@alamb
Copy link
Contributor

alamb commented Jun 22, 2025

If anything this seems faster than what is on main so it has a added bonus

@alamb alamb merged commit 2788762 into apache:main Jun 22, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 22, 2025

Thanks again @nicklan and @scovich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In arrow_json, Decoder::decode can panic if it encounters two high surrogates in a row.
3 participants