-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of trim
for string view (10%)
#12395
Conversation
FYI @Rachelint that #12383 is modifying |
Thanks! I will push forward this until #12383 merged. |
f6c83bf
to
325fac6
Compare
325fac6
to
48cb4db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Rachelint -- this looks really nice and quite close 🙏
I left some comments, but I don't think they are required to merge this.
I do think we should have benchmark numbers showing this makes things faster in order to merge it. Could you please make a StringView based benchmark for trim -- perhaps in
// regarding copyright ownership. The ASF licenses this file |
Then we can run that benchmark and show that this PR improves the performance.
Thanks again!
@@ -82,7 +82,11 @@ impl ScalarUDFImpl for BTrimFunc { | |||
} | |||
|
|||
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | |||
utf8_to_str_type(&arg_types[0], "btrim") | |||
if arg_types[0] == DataType::Utf8View { | |||
Ok(DataType::Utf8View) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Also eventually it would also be possible to return Utf8View
when the input was Utf8
and save a copy as well
use datafusion_common::cast::{as_generic_string_array, as_string_view_array}; | ||
use datafusion_common::Result; | ||
use datafusion_common::{exec_err, ScalarValue}; | ||
use datafusion_expr::ColumnarValue; | ||
|
||
/// Make a `u128` based on the given substr, start(offset to view.offset), and | ||
/// push into to the given buffers | ||
pub(crate) fn make_and_append_view( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I wonder if we should (as a follow on PR) propose adding this upstream to arrow-rs as it seems valuable for any trim related kernels on stringview
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds great! and #12383 (comment) can be solved if it is function in arrow-rs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like what would be really useful is a StringViewBuilder
that could be modified perhaps 🤔
I started to write a ticket in arrow-rs but I didn't know exactly what API to suggest. I think we would have to try it out
I think maybe we should place the LTrim/RTrim/BTrim into a same place(like trim.rs)? |
For benchmarking, I would recommend this PR #12111. for what it's worth |
Thanks, it is really helpful! |
4e092d4
to
dbd0f25
Compare
Run benchmark introduced in #12513, about 10~20% improvement for the long string(64 bytes). Highlights, as we expected, the string view trim mainly reduces copyings when the trimmed result > 12:
The detailed sorted out benchmark result:
|
I merged this PR up to main and am running another round of benchmarks. Thank you @Rachelint |
Looks like a reasonable improvement to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Rachelint -- I went thought this PR again and it looks good
Since I had this PR checked out locally for review, I went ahead and remove the unsafe
pointer calculation to try and move this PR along (I know it has been outstanding for too long)
Thanks again!
let views_buf = ScalarBuffer::from(views_buf); | ||
let nulls_buf = null_builder.finish(); | ||
|
||
// Safety: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related discussion: apache/arrow-rs#6430
// Safety: | ||
// `trim_str` is computed from `str::trim_xxx_matches`, | ||
// and its addr is ensured to be >= `origin_str`'s | ||
let start = unsafe { trim_str.as_ptr().offset_from(src_str.as_ptr()) as u32 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran this diff:
diff --git a/datafusion/functions/src/string/common.rs b/datafusion/functions/src/string/common.rs
index 4f70374b7..f796d10c2 100644
--- a/datafusion/functions/src/string/common.rs
+++ b/datafusion/functions/src/string/common.rs
@@ -204,10 +204,7 @@ fn trim_and_append_str<'a>(
if let (Some(src_str), Some(characters)) = (src_str_opt, trim_characters_opt) {
let trim_str = trim_func(src_str, characters);
- // Safety:
- // `trim_str` is computed from `str::trim_xxx_matches`,
- // and its addr is ensured to be >= `origin_str`'s
- let start = unsafe { trim_str.as_ptr().offset_from(src_str.as_ptr()) as u32 };
+ let start = (src_str.as_bytes().len() - trim_str.as_bytes().len()) as u32;
make_and_append_view(views_buf, null_builder, raw, trim_str, start);
} else {
And all tests passed.
How about we merge this PR and then you can continue work on the optimizations as follow on PRs? |
I am checking about #12395 (comment) |
Sure -- no worries -- we can wait too. I just feel bad about how long this PR has been outstanding |
Hmm, some newly added tests seem to be failing
Thanks. I added some case to check it. And unfortunately, I found maybe we can't remove the unsafe codes currently. |
I just can't explain why pointer arithmetic is needed -- I think it is important to fix (or really understand) before merging |
Maybe disscussion in #12387 can help. The logic of
But unforunately, the needed feature
So eventually, we can just through the pointer arithmetic to get the start index currently... Update: |
I have filed an issue #12597 to track the introduced unsafe codes. And added a todo to mention this issue for tracking and explaining why we introduce unsafe codes here. |
@alamb I found we indeed don't need the unsafe pointer arithmetic to get the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Rachelint -- very nice. Thank you for sticking with it
I reran the benchmarks one more time and they looks good to me. Nice work.
I merged up from main and removed some redundant tests and plan to merge this PR when it passes CI.
++ critcmp main string-view-trim
group main string-view-trim
----- ---- ----------------
INPUT LEN <= 12/large_string [size=1024, len_before=12, len_after=8] 1.03 42.9±0.05µs ? ?/sec 1.00 41.7±0.06µs ? ?/sec
INPUT LEN <= 12/large_string [size=4096, len_before=12, len_after=8] 1.02 165.5±0.56µs ? ?/sec 1.00 161.5±0.06µs ? ?/sec
INPUT LEN <= 12/large_string [size=8192, len_before=12, len_after=8] 1.02 327.8±0.19µs ? ?/sec 1.00 320.2±0.19µs ? ?/sec
INPUT LEN <= 12/string [size=1024, len_before=12, len_after=8] 1.01 41.6±0.03µs ? ?/sec 1.00 41.2±0.12µs ? ?/sec
INPUT LEN <= 12/string [size=4096, len_before=12, len_after=8] 1.01 160.3±0.11µs ? ?/sec 1.00 159.4±0.09µs ? ?/sec
INPUT LEN <= 12/string [size=8192, len_before=12, len_after=8] 1.01 318.5±1.89µs ? ?/sec 1.00 316.4±0.48µs ? ?/sec
INPUT LEN <= 12/string_view [size=1024, len_before=12, len_after=8] 1.06 41.6±0.01µs ? ?/sec 1.00 39.4±0.02µs ? ?/sec
INPUT LEN <= 12/string_view [size=4096, len_before=12, len_after=8] 1.03 160.5±0.07µs ? ?/sec 1.00 155.1±0.13µs ? ?/sec
INPUT LEN <= 12/string_view [size=8192, len_before=12, len_after=8] 1.03 318.0±0.20µs ? ?/sec 1.00 309.3±1.15µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/large_string [size=1024, len_before=64, len_after=4] 1.10 184.3±0.27µs ? ?/sec 1.00 167.4±0.10µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/large_string [size=4096, len_before=64, len_after=4] 1.09 727.0±0.67µs ? ?/sec 1.00 665.8±0.41µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/large_string [size=8192, len_before=64, len_after=4] 1.09 1448.7±1.76µs ? ?/sec 1.00 1329.6±1.77µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string [size=1024, len_before=64, len_after=4] 1.09 181.8±0.18µs ? ?/sec 1.00 167.4±0.11µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string [size=4096, len_before=64, len_after=4] 1.09 725.8±11.87µs ? ?/sec 1.00 664.9±0.62µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string [size=8192, len_before=64, len_after=4] 1.09 1443.8±2.64µs ? ?/sec 1.00 1324.8±1.25µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string_view [size=1024, len_before=64, len_after=4] 1.11 181.8±0.11µs ? ?/sec 1.00 163.8±0.22µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string_view [size=4096, len_before=64, len_after=4] 1.11 724.6±0.84µs ? ?/sec 1.00 651.6±0.22µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string_view [size=8192, len_before=64, len_after=4] 1.11 1444.8±0.63µs ? ?/sec 1.00 1302.7±0.98µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/large_string [size=1024, len_before=64, len_after=60] 1.01 44.4±0.12µs ? ?/sec 1.00 44.0±0.07µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/large_string [size=4096, len_before=64, len_after=60] 1.01 171.7±0.30µs ? ?/sec 1.00 170.4±0.15µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/large_string [size=8192, len_before=64, len_after=60] 1.01 349.8±0.83µs ? ?/sec 1.00 347.1±0.53µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string [size=1024, len_before=64, len_after=60] 1.00 43.0±0.06µs ? ?/sec 1.00 43.1±0.20µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string [size=4096, len_before=64, len_after=60] 1.00 166.2±0.21µs ? ?/sec 1.01 167.5±0.32µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string [size=8192, len_before=64, len_after=60] 1.00 336.5±1.03µs ? ?/sec 1.01 341.1±0.42µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string_view [size=1024, len_before=64, len_after=60] 1.08 43.8±0.17µs ? ?/sec 1.00 40.6±0.03µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string_view [size=4096, len_before=64, len_after=60] 1.06 169.0±0.55µs ? ?/sec 1.00 159.0±0.21µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string_view [size=8192, len_before=64, len_after=60] 1.08 342.9±0.96µs ? ?/sec 1.00 316.5±0.57µs ? ?/sec
@@ -982,5 +982,93 @@ logical_plan | |||
01)Projection: temp.column2 || temp.column3 | |||
02)--TableScan: temp projection=[column2, column3] | |||
|
|||
################################################ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked and @goldmedal 's recent changes (I think created after this PR) make these tests redundant
include ./string_query.slt.part |
WHich then runs the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string/string_query.slt.part
I removed these tests from this PR to keep things moving
trim
for string viewtrim
for string view (10%)
🚀 |
Which issue does this PR close?
Closes #12387
Rationale for this change
Similar as the string view version substr, we can impl the string view version trim to improve performance.
What changes are included in this PR?
Are these changes tested?
Test by new unit test and exist other tests.
Are there any user-facing changes?
No.