Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up substring_by_char by about 2.5x #1832

Merged
merged 2 commits into from
Jun 11, 2022

Conversation

HaoYang670
Copy link
Contributor

@HaoYang670 HaoYang670 commented Jun 10, 2022

Signed-off-by: remzi 13716567376yh@gmail.com

Which issue does this PR close?

Closes #1800.

What changes are included in this PR?

Directly copy the string slice to BufferBuilder.

Are there any user-facing changes?

No.

Benchmark

substring utf8 by char  time:   [48.896 ms 48.942 ms 48.990 ms]                                   
                        change: [-58.192% -58.123% -58.060%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)

about 2.5x

Signed-off-by: remzi <13716567376yh@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 10, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 10, 2022

Codecov Report

Merging #1832 (1966e0e) into master (db41b33) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1832      +/-   ##
==========================================
- Coverage   83.53%   83.49%   -0.05%     
==========================================
  Files         200      201       +1     
  Lines       56798    56903     +105     
==========================================
+ Hits        47449    47510      +61     
- Misses       9349     9393      +44     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 99.53% <100.00%> (+0.01%) ⬆️
arrow/src/compute/kernels/temporal.rs 95.77% <0.00%> (-1.36%) ⬇️
parquet/src/encodings/encoding.rs 93.46% <0.00%> (-0.20%) ⬇️
arrow/src/array/mod.rs 100.00% <0.00%> (ø)
parquet/src/bin/parquet-fromcsv.rs 0.00% <0.00%> (ø)
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (+0.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db41b33...1966e0e. Read the comment docs.

let length = length.map_or(char_count - start, |length| {
length.to_usize().unwrap().min(char_count - start)
});
let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will always allocate more memory than we need, but it's no big deal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to improve this estimate you could compute the difference between the first and last offset. It would still be an over-estimate, but would over-estimate sliced arrays less

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to do this, or should I just get this in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good enough. It has O(n) space complexity (where n is the byte length of the value buffer of the input array) and at most allocate n extra memory than we need (when length = 0).

Copy link
Contributor

@tustvold tustvold Jun 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't clear, the problem here is that the values buffer may be substantially larger than necessary as a result of slicing.

Say I have an array of 1000 strings, and I take a slice of length two, the values array is completely unchanged, but the offsets array is now length of 3, instead of 1001.

By taking the delta between the first and last offset in the source array, you ensure you are not allocating capacity that you definitely don't need, as technically that capacity isn't even needed by the source array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your explanation, which makes perfect sense.
Forgot offset and slicing. 😅 I will fix this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated! No impact on the benchmark result

vec![],
)
};
Ok(GenericStringArray::<OffsetSize>::from(data))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Ok() here is just to make this function to be consistent with substring. It is meaningless because this function never return an error (Actually, I don't like unwrap or panic in a function returning Result. But the code would be much unclear is I remove unwrap (lots of ok_or(...))).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panicking if an invariant is violated I think makes perfect sense, errors only really make sense imo if you expect some higher level system to catch and handle that error.

if let Some(val) = val {
let char_count = val.chars().count();
let start = if start >= 0 {
start.to_usize().unwrap()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about why to_usize() returns Option<usize> but not Result<usize>.
Unfortunately, the rust std library also chooses Option:

    /// Converts the value of `self` to a `usize`. If the value cannot be
    /// represented by a `usize`, then `None` is returned.
    #[inline]
    fn to_usize(&self) -> Option<usize> {
        self.to_u64().as_ref().and_then(ToPrimitive::to_usize)
    }

But Result is more reasonable for me. Because we could tell users the reason of casting failure :

    #[inline]
    fn to_usize(&self) -> Result<usize> {
        match num::ToPrimitive::to_usize(self) {
            Some(val) => Ok(val),
            None => Err(ArrowError::CastError(format!("Cannot cast {} to usize"))
        }
    }

And for unsupported types, we can have a default implementation:

    /// Convert native type to usize.
    #[inline]
    fn to_usize(&self) -> Result<usize> {
        Err(ArrowError::CastError(format!("Casting {} to usize is not supported.", type_name::<Self>())))
    }

Copy link
Contributor

@tustvold tustvold Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale is likely that there is only one possible error variant, and so it is up to the caller to add this context in the manner appropriate to how they have chosen to handle errors - be it panic, anyhow, custom error enumerations, etc...

As for the semantics on ArrowNativeType, I happen to not be a fan of returning None for "unsupported" floating point types - my 2 cents is we should just use num-traits and not roll our own 😅

array
.data_ref()
.null_buffer()
.map(|b| b.bit_slice(array.offset(), array.len())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

let length = length.map_or(char_count - start, |length| {
length.to_usize().unwrap().min(char_count - start)
});
let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to improve this estimate you could compute the difference between the first and last offset. It would still be an over-estimate, but would over-estimate sliced arrays less

offsets.append(OffsetSize::zero());
let length = length.map(|len| len.to_usize().unwrap());

array.iter().for_each(|val| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A further improvement might be to iterate the offsets directly, as we don't actually care about skipping over nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure whether we could get further speed-up from this. Because for each value slot, we have to transmute it to string to scan it over (using char_indices) to get the start and end offset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.html is free and safe in this context FWIW

Our GenericStringIter is implemented based on this. That's why I guess array.iter() is optimal enough.
BTW, skipping nulls could not be the hotspot in this function, because it is just a bitwise operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it is just a bitwise operation

I would not underestimate how remarkably slow bitwise operations can be, especially when used to predicate branches. But yes, likely for this kernel the bottleneck lies elsewhere

vec![],
)
};
Ok(GenericStringArray::<OffsetSize>::from(data))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panicking if an invariant is violated I think makes perfect sense, errors only really make sense imo if you expect some higher level system to catch and handle that error.

Signed-off-by: remzi <13716567376yh@gmail.com>
@tustvold tustvold merged commit c08e532 into apache:master Jun 11, 2022
@HaoYang670 HaoYang670 deleted the speed_up_substring_by_char branch June 11, 2022 09:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up substring_by_char kernel
3 participants