Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up substring_by_char by about 2.5x #1832

Merged
merged 2 commits into from
Jun 11, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
79 changes: 62 additions & 17 deletions arrow/src/compute/kernels/substring.rs
Original file line number Diff line number Diff line change
Expand Up @@ -182,24 +182,69 @@ pub fn substring_by_char<OffsetSize: OffsetSizeTrait>(
start: i64,
length: Option<u64>,
) -> Result<GenericStringArray<OffsetSize>> {
Ok(array
.iter()
.map(|val| {
val.map(|val| {
let char_count = val.chars().count();
let start = if start >= 0 {
start.to_usize().unwrap().min(char_count)
} else {
char_count - (-start).to_usize().unwrap().min(char_count)
};
let length = length.map_or(char_count - start, |length| {
length.to_usize().unwrap().min(char_count - start)
});
let mut vals = BufferBuilder::<u8>::new({
let offsets = array.value_offsets();
(offsets[array.len()] - offsets[0]).to_usize().unwrap()
});
let mut new_offsets = BufferBuilder::<OffsetSize>::new(array.len() + 1);
new_offsets.append(OffsetSize::zero());
let length = length.map(|len| len.to_usize().unwrap());

array.iter().for_each(|val| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A further improvement might be to iterate the offsets directly, as we don't actually care about skipping over nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure whether we could get further speed-up from this. Because for each value slot, we have to transmute it to string to scan it over (using char_indices) to get the start and end offset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.html is free and safe in this context FWIW

Our GenericStringIter is implemented based on this. That's why I guess array.iter() is optimal enough.
BTW, skipping nulls could not be the hotspot in this function, because it is just a bitwise operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it is just a bitwise operation

I would not underestimate how remarkably slow bitwise operations can be, especially when used to predicate branches. But yes, likely for this kernel the bottleneck lies elsewhere

if let Some(val) = val {
let char_count = val.chars().count();
let start = if start >= 0 {
start.to_usize().unwrap()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about why to_usize() returns Option<usize> but not Result<usize>.
Unfortunately, the rust std library also chooses Option:

    /// Converts the value of `self` to a `usize`. If the value cannot be
    /// represented by a `usize`, then `None` is returned.
    #[inline]
    fn to_usize(&self) -> Option<usize> {
        self.to_u64().as_ref().and_then(ToPrimitive::to_usize)
    }

But Result is more reasonable for me. Because we could tell users the reason of casting failure :

    #[inline]
    fn to_usize(&self) -> Result<usize> {
        match num::ToPrimitive::to_usize(self) {
            Some(val) => Ok(val),
            None => Err(ArrowError::CastError(format!("Cannot cast {} to usize"))
        }
    }

And for unsupported types, we can have a default implementation:

    /// Convert native type to usize.
    #[inline]
    fn to_usize(&self) -> Result<usize> {
        Err(ArrowError::CastError(format!("Casting {} to usize is not supported.", type_name::<Self>())))
    }

Copy link
Contributor

@tustvold tustvold Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale is likely that there is only one possible error variant, and so it is up to the caller to add this context in the manner appropriate to how they have chosen to handle errors - be it panic, anyhow, custom error enumerations, etc...

As for the semantics on ArrowNativeType, I happen to not be a fan of returning None for "unsupported" floating point types - my 2 cents is we should just use num-traits and not roll our own 😅

} else {
char_count - (-start).to_usize().unwrap().min(char_count)
};
let (start_offset, end_offset) = get_start_end_offset(val, start, length);
vals.append_slice(&val.as_bytes()[start_offset..end_offset]);
}
new_offsets.append(OffsetSize::from_usize(vals.len()).unwrap());
});
let data = unsafe {
ArrayData::new_unchecked(
GenericStringArray::<OffsetSize>::get_data_type(),
array.len(),
None,
array
.data_ref()
.null_buffer()
.map(|b| b.bit_slice(array.offset(), array.len())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

0,
vec![new_offsets.finish(), vals.finish()],
vec![],
)
};
Ok(GenericStringArray::<OffsetSize>::from(data))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Ok() here is just to make this function to be consistent with substring. It is meaningless because this function never return an error (Actually, I don't like unwrap or panic in a function returning Result. But the code would be much unclear is I remove unwrap (lots of ok_or(...))).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panicking if an invariant is violated I think makes perfect sense, errors only really make sense imo if you expect some higher level system to catch and handle that error.

}

val.chars().skip(start).take(length).collect::<String>()
})
})
.collect::<GenericStringArray<OffsetSize>>())
/// * `val` - string
/// * `start` - the start char index of the substring
/// * `length` - the char length of the substring
///
/// Return the `start` and `end` offset (by byte) of the substring
fn get_start_end_offset(
val: &str,
start: usize,
length: Option<usize>,
) -> (usize, usize) {
let len = val.len();
let mut offset_char_iter = val.char_indices();
let start_offset = offset_char_iter
.nth(start)
.map_or(len, |(offset, _)| offset);
let end_offset = length.map_or(len, |length| {
if length > 0 {
offset_char_iter
.nth(length - 1)
.map_or(len, |(offset, _)| offset)
} else {
start_offset
}
});
(start_offset, end_offset)
}

fn binary_substring<OffsetSize: OffsetSizeTrait>(
Expand Down