Perf: Optimize substring_index via single-byte fast path and direct indexing#19590
Perf: Optimize substring_index via single-byte fast path and direct indexing#19590alamb merged 5 commits intoapache:mainfrom
substring_index via single-byte fast path and direct indexing#19590Conversation
uzqw
left a comment
There was a problem hiding this comment.
Hi! I noticed some Clippy warnings when running locally:
cargo clippy -p datafusion-functions --all-targets --all-features
| match string.get(string.len().saturating_sub(length)..) { | ||
| Some(substring) => builder.append_value(substring), | ||
| None => builder.append_null(), | ||
| if n > 0 { |
There was a problem hiding this comment.
this else { if .. } block can be collapsed
|
|
||
| for batch_size in batch_sizes { | ||
| group.bench_function( | ||
| &format!("substr_index_{}_single_delimiter", batch_size), |
There was a problem hiding this comment.
variables can be used directly in the format! string
| ); | ||
|
|
||
| group.bench_function( | ||
| &format!("substr_index_{}_long_delimiter", batch_size), |
There was a problem hiding this comment.
variables can be used directly in the format! string
|
|
Hi @Jefffrey , could you please help trigger the workflows for this PR? I'd appreciate a review when you have a moment. Thanks! |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark substr_index |
This comment was marked as outdated.
This comment was marked as outdated.
| match string.get(..length) { | ||
| Some(substring) => builder.append_value(substring), | ||
| None => builder.append_null(), | ||
| let result_idx = if delimiter.len() == 1 { |
There was a problem hiding this comment.
I double checked that this checks for length in bytes
https://doc.rust-lang.org/std/string/struct.String.html#method.len
| T::Native: OffsetSizeTrait, | ||
| { | ||
| let mut builder = StringBuilder::new(); | ||
| let num_rows = string_array.len(); |
There was a problem hiding this comment.
If you wanted to make this really fast, you could also implement special case code for the common cases where delimiter and count were scalar values -- I suspect you could make it quite fast.
There was a problem hiding this comment.
Good point! I’ll try adding a special-case implementation for that. Thanks!
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Onwards 🚀 |
Which issue does this PR close?
Rationale for this change
This PR improves the performance of the
substring_indexfunction by optimizingdelimiter search and substring extraction:
.,,), avoiding UTF-8 pattern matching overhead.match_indices/rmatch_indices.What changes are included in this PR?
delimiter.len() == 1using byte-based search.match_indicesandrmatch_indicesfor more efficient positioning.Benchmarks
Are these changes tested?
Are there any user-facing changes?
No.