Adaptive Row Block Size (#4812) #4818
Conversation
@@ -2221,7 +2227,7 @@ mod tests {
 }

 for r in r2.iter() {
-    assert_eq!(r.data.len(), 34);
+    assert_eq!(r.data.len(), 10);
This is a nice demonstration of how this will reduce the impact of #4811. Whilst 10 is still worse than the 3 for dictionaries, it is a lot better than 34 😅
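(To spell out the arithmetic as I read the format: a 3-byte value previously took 1 non-null sentinel + one 32-byte zero-padded block + 1 length byte = 34 bytes; with 8-byte mini-blocks it takes 1 + 8 + 1 = 10 bytes, against roughly 3 bytes for a dictionary-encoded key.)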
I haven't had a chance to read this PR in detail yet. In terms of the performance regression, is there some way to dynamically pick the number of blocks row by row (based on their contents) rather than hard-coding it to always use a few small blocks up front? That way we could avoid the performance regression for larger strings.
arrow-row/src/lib.rs
Outdated
 /// to the output, followed by `0xFF_u8`. The final block is padded to 32-bytes
 /// with `0_u8` and written to the output, followed by the un-padded length in bytes
-/// of this final block as a `u8`.
+/// of this final block as a `u8`. The first 4 blocks have a length of 8, with subsequent
+/// blocks using a length of 32.
I think it would help to explain the rationale for using smaller blocks up front (to avoid space wastage for smaller strings)
I can't think of a scheme that would allow this whilst also allowing shorter strings to compare correctly against longer, e.g. "abc" < "bcde" < "cde"
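For illustration, a minimal sketch of that ordering property using the public `arrow_row` API (the values here are just the examples above, assuming a recent arrow-row release where `RowConverter::new` returns a `Result`; this is not code from the PR):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, StringArray};
use arrow_row::{RowConverter, SortField};
use arrow_schema::DataType;

fn main() {
    // Convert a string column into the row format
    let converter = RowConverter::new(vec![SortField::new(DataType::Utf8)]).unwrap();
    let col: ArrayRef = Arc::new(StringArray::from(vec!["abc", "bcde", "cde"]));
    let rows = converter.convert_columns(&[col]).unwrap();

    // The encoded rows must compare the same way the strings do,
    // regardless of how the values are split into blocks
    assert!(rows.row(0) < rows.row(1));
    assert!(rows.row(1) < rows.row(2));
}
```

As I understand it, the encoding preserves this because padding bytes (`0x00`) sort before any data byte, and the `0xFF` continuation token sorts after any final-block length byte.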
Thank you @tustvold -- I went over this change carefully
Here is my summary of the expected benefits of this PR. I wonder if you agree with this assessment?

- `GROUP BY` and `ORDER BY` with low-cardinality multi-column string grouping / ordering with strings less than 32 characters: I would expect smaller keys to make comparisons faster when the `Row` values are equal much of the time -- which is what happens for `GROUP BY` on low-cardinality grouping operations in DataFusion. TPCH Q1 is an example (`return_status` is a string)
- Memory usage when doing high-cardinality multi-column grouping or sorting with strings less than 32 characters: since there is less memory overhead per value, the overall memory requirements are lower.

TLDR is I think it is the right tradeoff for IOx but I can imagine workloads where one would rather have the faster conversion performance but larger memory usage.
I wonder if anyone else in the community has an opinion?
@@ -1368,12 +1368,14 @@ pub(crate) mod bytes {
 }

 impl ByteArrayNativeType for [u8] {
+    #[inline]
`#[inline(always)]`? -- at some point I thought that saying 'always' was required to get cross-crate inlining to work
That hasn't been my experience, I believe `#[inline]` makes the data available to be inlined should LLVM think it a good idea, with `#[inline(always)]` only necessary when LLVM doesn't think it a good idea (for whatever reason).
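For reference, a minimal sketch of the distinction (hypothetical functions, not code from this PR):

```rust
// `#[inline]` exports the function body in crate metadata so LLVM *may*
// inline it at call sites in other crates, if its heuristics favour it.
#[inline]
pub fn mix(x: u32) -> u32 {
    x.wrapping_mul(0x9E37_79B9)
}

// `#[inline(always)]` additionally asks LLVM to inline the call even
// where its heuristics would otherwise decline.
#[inline(always)]
pub fn fold(x: u32) -> u32 {
    x ^ (x >> 16)
}
```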
@@ -26,6 +26,12 @@ use arrow_schema::{DataType, SortOptions};
 /// The block size of the variable length encoding
 pub const BLOCK_SIZE: usize = 32;

+/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
Perhaps this would be a good place to add comments about the why for miniblocks.
Some(a) if a <= BLOCK_SIZE => {
    1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
}
Some(a) => MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1),
I don't understand why the padded length gets a `MINI_BLOCK_COUNT` added to it. Is it because each mini-block contains a 1-byte continuation? Maybe a comment would help. A sketch of my reading of the arithmetic is below.
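A self-contained sketch of how I read that arithmetic (this mirrors the non-null arms of the snippet above, omitting the null case; it is illustrative, not the exact library code):

```rust
const BLOCK_SIZE: usize = 32;
const MINI_BLOCK_COUNT: usize = 4;
const MINI_BLOCK_SIZE: usize = BLOCK_SIZE / MINI_BLOCK_COUNT; // 8

fn ceil(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

/// Encoded length of a non-null variable-length value of `a` bytes
fn padded_length(a: usize) -> usize {
    if a <= BLOCK_SIZE {
        // sentinel + mini-blocks, each padded to MINI_BLOCK_SIZE and
        // followed by a 1-byte continuation/terminator token
        1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
    } else {
        // the (BLOCK_SIZE + 1) term charges one token per 32-byte block;
        // splitting the first block into 4 mini-blocks costs 3 extra
        // tokens, and the non-null sentinel costs 1 more, so the
        // constant overhead is 3 + 1 = MINI_BLOCK_COUNT
        MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1)
    }
}

fn main() {
    assert_eq!(padded_length(3), 10); // 1 sentinel + 8 data + 1 length byte
    assert_eq!(padded_length(10), 19); // 1 + 2 * (8 + 1)
    // 40 bytes: 1 sentinel + 4 * (8 + 1) mini-blocks + 1 * (32 + 1) block
    assert_eq!(padded_length(40), 70);
}
```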
@@ -26,6 +26,12 @@ use arrow_schema::{DataType, SortOptions};
 /// The block size of the variable length encoding
 pub const BLOCK_SIZE: usize = 32;

+/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
+pub const MINI_BLOCK_COUNT: usize = 4;
I wonder if using a `MINI_BLOCK_COUNT` of 2 might get most of the memory savings but have lower CPU overhead 🤔
I didn't find this to yield a meaningful performance delta
Thank you for trying
 /// Returns the number of bytes of encoded data
 fn decoded_len(row: &[u8], options: SortOptions) -> usize {

 #[inline]
 fn encode_blocks<const SIZE: usize>(out: &mut [u8], val: &[u8]) -> usize {
Perhaps we could add a comment here explaining what this writes (specifically continuation tokens vs `BLOCK_CONTINUATION`). A sketch of the kind of comment I mean is below.
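For what it's worth, a hedged sketch of the behaviour as I understand it from the doc comment (illustrative, not the PR's implementation):

```rust
/// Token written after every non-final block to signal that more follow
/// (the name mirrors the `BLOCK_CONTINUATION` constant discussed above)
const BLOCK_CONTINUATION: u8 = 0xFF;

/// Write `val` to `out` in SIZE-byte blocks: each full block is followed by
/// `BLOCK_CONTINUATION`; the final block is zero-padded to SIZE bytes and
/// followed by its un-padded length instead (always < SIZE, so it can never
/// be mistaken for the 0xFF continuation token). Returns the bytes written.
fn encode_blocks<const SIZE: usize>(out: &mut [u8], val: &[u8]) -> usize {
    let block_count = val.len() / SIZE + 1;
    let end_offset = block_count * (SIZE + 1);
    let to_write = &mut out[..end_offset];

    let chunks = val.chunks_exact(SIZE);
    let remainder = chunks.remainder();
    for (input, output) in chunks.zip(to_write.chunks_exact_mut(SIZE + 1)) {
        output[..SIZE].copy_from_slice(input);
        output[SIZE] = BLOCK_CONTINUATION; // more blocks follow
    }

    // final (possibly empty) block: zero-pad, then record the true length
    let last = &mut to_write[(block_count - 1) * (SIZE + 1)..];
    last[..remainder.len()].copy_from_slice(remainder);
    last[remainder.len()..SIZE].fill(0);
    last[SIZE] = remainder.len() as u8;

    end_offset
}
```

Presumably this is invoked first with `SIZE = 8` for the mini-blocks and then with `SIZE = 32` for the remainder, per the doc comment above.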
The only regression is converting back from rows to arrays, so for DF I would not expect this to regress performance, as this is only done for grouping, where other overheads will dominate.
Which issue does this PR close?
Closes #4812
Rationale for this change
Reduces the space amplification for small strings, reducing memory usage and potentially yielding faster comparisons
This does noticeably regress the performance of converting rows back into arrow arrays, as a result of having to parse additional blocks. I am inclined to think this isn't a major concern, as I have only seen this functionality used following expensive, reducing operations like grouping, which will dominate execution time.
What changes are included in this PR?
Are there any user-facing changes?