-
Notifications
You must be signed in to change notification settings - Fork 1.7k
minor: check size overflow before string repeat build #14575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @wForget I feel it is great to avoid fallible scenario.
Please create a bench like for other string functions to see if we get a performance downside introducing new branch instructions like IF
@comphead Thank you, I have tried to add a benchmark case. Before this fix, string repeat overflow would result in a panic error which interrupts the benchmark, so it is difficult for me to compare whether the new benchmark has an improvement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wForget and @comphead
I think this looks like an improvement to me -- I also ran the benchmarks and saw no significant changes in my mind:
group dev main
----- --- ----
repeat 1073741824 times/repeat_string overflow [size=1024, repeat_times=1073741824] 1.00 455.0±0.63ns ? ?/sec
repeat 1073741824 times/repeat_string overflow [size=4096, repeat_times=1073741824] 1.00 418.3±0.16ns ? ?/sec
repeat 3 times/repeat_large_string [size=1024, repeat_times=3] 1.00 37.0±0.16µs ? ?/sec 1.02 37.9±0.20µs ? ?/sec
repeat 3 times/repeat_large_string [size=4096, repeat_times=3] 1.00 148.1±0.32µs ? ?/sec 1.03 153.0±0.64µs ? ?/sec
repeat 3 times/repeat_string [size=1024, repeat_times=3] 1.04 39.8±0.25µs ? ?/sec 1.00 38.3±0.22µs ? ?/sec
repeat 3 times/repeat_string [size=4096, repeat_times=3] 1.05 160.1±0.46µs ? ?/sec 1.00 152.1±0.40µs ? ?/sec
repeat 3 times/repeat_string_view [size=1024, repeat_times=3] 1.00 37.5±0.12µs ? ?/sec 1.04 38.8±0.14µs ? ?/sec
repeat 3 times/repeat_string_view [size=4096, repeat_times=3] 1.00 153.3±0.17µs ? ?/sec 1.02 156.8±0.23µs ? ?/sec
repeat 30 times/repeat_large_string [size=1024, repeat_times=30] 1.01 113.0±0.90µs ? ?/sec 1.00 112.1±0.26µs ? ?/sec
repeat 30 times/repeat_large_string [size=4096, repeat_times=30] 1.01 570.7±7.25µs ? ?/sec 1.00 565.9±3.24µs ? ?/sec
repeat 30 times/repeat_string [size=1024, repeat_times=30] 1.00 113.0±0.96µs ? ?/sec 1.00 113.3±1.08µs ? ?/sec
repeat 30 times/repeat_string [size=4096, repeat_times=30] 1.03 577.2±4.93µs ? ?/sec 1.00 561.6±2.89µs ? ?/sec
repeat 30 times/repeat_string_view [size=1024, repeat_times=30] 1.00 111.7±0.75µs ? ?/sec 1.04 115.9±0.68µs ? ?/sec
repeat 30 times/repeat_string_view [size=4096, repeat_times=30] 1.00 3.2±0.04ms ? ?/sec 1.00 3.2±0.02ms ? ?/sec
++ popd
number as usize * string.len() | ||
); | ||
} else { | ||
builder.append_value(string.repeat(number as usize)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code checks for overflow on each individual element which is an improvement.
We could also potentially ensure that the total length of the builder's data after pushing the string would be iess than max length as well
Given it calls String::repeat which allocates and copies the string (before append_value
copies it again) I doubt checking for max sizes is going to make a difference
I am running some benchmarks to be sure, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also filed a ticket to optimize this path some more
(BTW I run benchmarks using https://github.com/alamb/datafusion-benchmarking/blob/main/compare_branch_bench.sh) |
Which issue does this PR close?
minor fix
Rationale for this change
Check string size overflow before string repeat build to fail fast and save memory.
What changes are included in this PR?
check max size in string repeat function
Are these changes tested?
added unit test
Are there any user-facing changes?
More friendly error message