-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
speed up substring_by_char
by about 2.5x
#1832
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -182,24 +182,69 @@ pub fn substring_by_char<OffsetSize: OffsetSizeTrait>( | |
start: i64, | ||
length: Option<u64>, | ||
) -> Result<GenericStringArray<OffsetSize>> { | ||
Ok(array | ||
.iter() | ||
.map(|val| { | ||
val.map(|val| { | ||
let char_count = val.chars().count(); | ||
let start = if start >= 0 { | ||
start.to_usize().unwrap().min(char_count) | ||
} else { | ||
char_count - (-start).to_usize().unwrap().min(char_count) | ||
}; | ||
let length = length.map_or(char_count - start, |length| { | ||
length.to_usize().unwrap().min(char_count - start) | ||
}); | ||
let mut vals = BufferBuilder::<u8>::new({ | ||
let offsets = array.value_offsets(); | ||
(offsets[array.len()] - offsets[0]).to_usize().unwrap() | ||
}); | ||
let mut new_offsets = BufferBuilder::<OffsetSize>::new(array.len() + 1); | ||
new_offsets.append(OffsetSize::zero()); | ||
let length = length.map(|len| len.to_usize().unwrap()); | ||
|
||
array.iter().for_each(|val| { | ||
if let Some(val) = val { | ||
let char_count = val.chars().count(); | ||
let start = if start >= 0 { | ||
start.to_usize().unwrap() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am curious about why /// Converts the value of `self` to a `usize`. If the value cannot be
/// represented by a `usize`, then `None` is returned.
#[inline]
fn to_usize(&self) -> Option<usize> {
self.to_u64().as_ref().and_then(ToPrimitive::to_usize)
} But #[inline]
fn to_usize(&self) -> Result<usize> {
match num::ToPrimitive::to_usize(self) {
Some(val) => Ok(val),
None => Err(ArrowError::CastError(format!("Cannot cast {} to usize"))
}
} And for unsupported types, we can have a default implementation: /// Convert native type to usize.
#[inline]
fn to_usize(&self) -> Result<usize> {
Err(ArrowError::CastError(format!("Casting {} to usize is not supported.", type_name::<Self>())))
} There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The rationale is likely that there is only one possible error variant, and so it is up to the caller to add this context in the manner appropriate to how they have chosen to handle errors - be it panic, anyhow, custom error enumerations, etc... As for the semantics on ArrowNativeType, I happen to not be a fan of returning None for "unsupported" floating point types - my 2 cents is we should just use num-traits and not roll our own 😅 |
||
} else { | ||
char_count - (-start).to_usize().unwrap().min(char_count) | ||
}; | ||
let (start_offset, end_offset) = get_start_end_offset(val, start, length); | ||
vals.append_slice(&val.as_bytes()[start_offset..end_offset]); | ||
} | ||
new_offsets.append(OffsetSize::from_usize(vals.len()).unwrap()); | ||
}); | ||
let data = unsafe { | ||
ArrayData::new_unchecked( | ||
GenericStringArray::<OffsetSize>::get_data_type(), | ||
array.len(), | ||
None, | ||
array | ||
.data_ref() | ||
.null_buffer() | ||
.map(|b| b.bit_slice(array.offset(), array.len())), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👌 |
||
0, | ||
vec![new_offsets.finish(), vals.finish()], | ||
vec![], | ||
) | ||
}; | ||
Ok(GenericStringArray::<OffsetSize>::from(data)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Panicking if an invariant is violated I think makes perfect sense, errors only really make sense imo if you expect some higher level system to catch and handle that error. |
||
} | ||
|
||
val.chars().skip(start).take(length).collect::<String>() | ||
}) | ||
}) | ||
.collect::<GenericStringArray<OffsetSize>>()) | ||
/// * `val` - string | ||
/// * `start` - the start char index of the substring | ||
/// * `length` - the char length of the substring | ||
/// | ||
/// Return the `start` and `end` offset (by byte) of the substring | ||
fn get_start_end_offset( | ||
val: &str, | ||
start: usize, | ||
length: Option<usize>, | ||
) -> (usize, usize) { | ||
let len = val.len(); | ||
let mut offset_char_iter = val.char_indices(); | ||
let start_offset = offset_char_iter | ||
.nth(start) | ||
.map_or(len, |(offset, _)| offset); | ||
let end_offset = length.map_or(len, |length| { | ||
if length > 0 { | ||
offset_char_iter | ||
.nth(length - 1) | ||
.map_or(len, |(offset, _)| offset) | ||
} else { | ||
start_offset | ||
} | ||
}); | ||
(start_offset, end_offset) | ||
} | ||
|
||
fn binary_substring<OffsetSize: OffsetSizeTrait>( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A further improvement might be to iterate the offsets directly, as we don't actually care about skipping over nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure whether we could get further speed-up from this. Because for each value slot, we have to transmute it to string to scan it over (using
char_indices
) to get thestart
andend
offset.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.html is free and safe in this context FWIW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our
GenericStringIter
is implemented based on this. That's why I guessarray.iter()
is optimal enough.BTW, skipping nulls could not be the hotspot in this function, because it is just a bitwise operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not underestimate how remarkably slow bitwise operations can be, especially when used to predicate branches. But yes, likely for this kernel the bottleneck lies elsewhere