speed up substring_by_char
by about 2.5x
#1832
Conversation
Signed-off-by: remzi <13716567376yh@gmail.com>
Codecov Report
@@ Coverage Diff @@
## master #1832 +/- ##
==========================================
- Coverage 83.53% 83.49% -0.05%
==========================================
Files 200 201 +1
Lines 56798 56903 +105
==========================================
+ Hits 47449 47510 +61
- Misses 9349 9393 +44
Continue to review full report at Codecov.
```rust
let length = length.map_or(char_count - start, |length| {
    length.to_usize().unwrap().min(char_count - start)
});
let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
```
This will always allocate more memory than we need, but it's no big deal.
If you wanted to improve this estimate you could compute the difference between the first and last offset. It would still be an over-estimate, but would over-estimate sliced arrays less
Did you want to do this, or should I just get this in?
I think this is good enough. It has O(n) space complexity (where n is the byte length of the value buffer of the input array) and allocates at most n bytes more than we need (when `length` = 0).
Sorry, I wasn't clear. The problem here is that the values buffer may be substantially larger than necessary as a result of slicing.
Say I have an array of 1000 strings and I take a slice of length two: the values buffer is completely unchanged, but the offsets array now has length 3 instead of 1001.
By taking the delta between the first and last offset in the source array, you ensure you are not allocating capacity that you definitely don't need, since technically that capacity isn't even needed by the source array.
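The offset-delta estimate described above can be sketched with plain slices (a minimal illustration only; `capacity_estimate` is a hypothetical helper name, not the crate's API):

```rust
// A sliced array keeps the full values buffer, so values_len over-estimates;
// the bytes this slice can actually reference span last_offset - first_offset.
fn capacity_estimate(offsets: &[i32], values_len: usize) -> usize {
    match (offsets.first(), offsets.last()) {
        (Some(&first), Some(&last)) => (last - first) as usize,
        _ => values_len,
    }
}

fn main() {
    // Full array of 1000 one-byte strings: offsets 0..=1000.
    let offsets: Vec<i32> = (0..=1000).collect();
    // A slice of length 2 keeps the whole values buffer but only 3 offsets.
    let sliced = &offsets[500..503];
    assert_eq!(capacity_estimate(&offsets, 1000), 1000);
    assert_eq!(capacity_estimate(sliced, 1000), 2); // far tighter than 1000
    println!("ok");
}
```

The estimate is still an over-approximation for the unsliced case, but it never exceeds what the slice could possibly reference.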
Thank you for your explanation, which makes perfect sense. I forgot about `offset` and slicing. 😅 I will fix this.
Updated! No impact on the benchmark result.
```rust
        vec![],
    )
};
Ok(GenericStringArray::<OffsetSize>::from(data))
```
The `Ok()` here is just to keep this function consistent with `substring`. It is meaningless because this function never returns an error. (Actually, I don't like `unwrap` or `panic` in a function returning `Result`, but the code would be much less clear if I removed `unwrap`: lots of `ok_or(...)`.)
Panicking if an invariant is violated makes perfect sense, I think; errors only really make sense imo if you expect some higher-level system to catch and handle them.
```rust
if let Some(val) = val {
    let char_count = val.chars().count();
    let start = if start >= 0 {
        start.to_usize().unwrap()
```
I am curious why `to_usize()` returns `Option<usize>` rather than `Result<usize>`. Unfortunately, the Rust std library also chooses `Option`:
```rust
/// Converts the value of `self` to a `usize`. If the value cannot be
/// represented by a `usize`, then `None` is returned.
#[inline]
fn to_usize(&self) -> Option<usize> {
    self.to_u64().as_ref().and_then(ToPrimitive::to_usize)
}
```
But `Result` seems more reasonable to me, because we could tell users the reason for the casting failure:

```rust
#[inline]
fn to_usize(&self) -> Result<usize> {
    match num::ToPrimitive::to_usize(self) {
        Some(val) => Ok(val),
        None => Err(ArrowError::CastError(format!("Cannot cast {} to usize", self))),
    }
}
```
And for unsupported types, we can have a default implementation:

```rust
/// Convert native type to usize.
#[inline]
fn to_usize(&self) -> Result<usize> {
    Err(ArrowError::CastError(format!(
        "Casting {} to usize is not supported.",
        type_name::<Self>()
    )))
}
```
The rationale is likely that there is only one possible error variant, so it is up to the caller to add this context in the manner appropriate to how they have chosen to handle errors, be it panic, anyhow, custom error enumerations, etc.
As for the semantics on ArrowNativeType, I happen to not be a fan of returning None for "unsupported" floating point types; my 2 cents is we should just use num-traits and not roll our own 😅
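The "caller adds context" point can be illustrated with plain std conversions (a sketch only; this does not use the arrow crate's types):

```rust
// With a single failure mode, an Option-style result is enough; the caller
// attaches whatever context its own error-handling strategy wants.
fn main() {
    let x: i64 = -5;
    // Analogue of `to_usize()` returning None: the conversion fails here.
    let converted: Option<usize> = usize::try_from(x).ok();
    // The caller decides how to surface the failure, e.g. as a String error:
    let with_context: Result<usize, String> =
        converted.ok_or_else(|| format!("Cannot cast {} to usize", x));
    assert!(with_context.is_err());
    // A representable value converts fine.
    assert_eq!(usize::try_from(7i64).ok(), Some(7));
    println!("ok");
}
```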
```rust
array
    .data_ref()
    .null_buffer()
    .map(|b| b.bit_slice(array.offset(), array.len())),
```
👌
```rust
offsets.append(OffsetSize::zero());
let length = length.map(|len| len.to_usize().unwrap());

array.iter().for_each(|val| {
```
A further improvement might be to iterate the offsets directly, as we don't actually care about skipping over nulls
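A sketch of what iterating the offsets directly might look like (names and the prefix-taking logic are mine, not the crate's API; ASCII data is used so bytes equal chars):

```rust
// Walk consecutive (start, end) offset pairs instead of a null-aware value
// iterator. Null slots have start == end, so they contribute nothing and
// need no special handling.
fn take_prefix_bytes(values: &[u8], offsets: &[usize], take: usize) -> Vec<u8> {
    let mut out = Vec::new();
    for pair in offsets.windows(2) {
        let (start, end) = (pair[0], pair[1]);
        // Take at most `take` bytes of each value (ASCII: bytes == chars).
        let len = take.min(end - start);
        out.extend_from_slice(&values[start..start + len]);
    }
    out
}

fn main() {
    let values = b"hellorustworld";
    let offsets = [0, 5, 9, 14]; // slots: "hello", "rust", "world"
    let out = take_prefix_bytes(values, &offsets, 3);
    assert_eq!(&out, b"helruswor");
    println!("{}", String::from_utf8(out).unwrap());
}
```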
I am not 100% sure we could get further speed-up from this, because for each value slot we still have to transmute it to a string and scan it (using `char_indices`) to get the `start` and `end` offsets.
https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.html is free and safe in this context FWIW
> https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.html is free and safe in this context FWIW
Our `GenericStringIter` is implemented based on this. That's why I guess `array.iter()` is already close to optimal.
BTW, skipping nulls should not be the hotspot in this function, because it is just a bitwise operation.
> because it is just a bitwise operation
I would not underestimate how remarkably slow bitwise operations can be, especially when used to predicate branches. But yes, for this kernel the bottleneck likely lies elsewhere.
Which issue does this PR close?
Closes #1800.
What changes are included in this PR?
Directly copy the string slice to `BufferBuilder`.
Are there any user-facing changes?
No.
Benchmark: about 2.5x speed-up.