-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make dictionary preservation optional in row encoding #3831
Make dictionary preservation optional in row encoding #3831
Conversation
…ve-dictionaries-row-encoding
arrow-row/src/lib.rs
Outdated
/// By default dictionaries are preserved as described on [`RowConverter`] | ||
/// | ||
/// However, this process requires maintaining and incrementally updating | ||
/// an order-preserving mapping of dictionary values which is relatively expensive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// an order-preserving mapping of dictionary values which is relatively expensive | |
/// an order-preserving mapping of dictionary values which is relatively expensive | |
/// computationally but minimizes memory used. |
arrow-row/src/lib.rs
Outdated
test_string_dictionary_preserve(true); | ||
} | ||
|
||
fn test_string_dictionary_preserve(preserve: bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the repeated preserve
was a little confusing at first -- as I would expect test_string_dictionary_preserve
to mean testing preserve = true.
totally minor nitpick but maybe it could be renamed to test_string_dictionary
actual.data().validate_full().unwrap(); | ||
assert_eq!(actual, expected) | ||
dictionary_eq(preserve, actual, expected) | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these tests seem to verify that the values are the same. I wonder if it would be valuable to cover space efficiency too in a test (like encode using preserved and non preserved dictionaries and show them taking up different sizes 🤔 )
arrow/benches/row_format.rs
Outdated
|
||
let cols = | ||
vec![Arc::new(create_string_dict_array::<Int32Type>(4096, 0.5, 100)) as ArrayRef]; | ||
do_bench(c, "4096 string_dictionary(100, 0.5)", cols); | ||
do_bench(c, "4096 string_dictionary(100, 0.5)", cols.clone(), true); | ||
do_bench(c, "4096 string_dictionary_hydrate(100, 0.5)", cols, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than using hydrate
which I think is a somewhat non standard term, maybe this could be non_preserve
to match the parameter name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, I have a very bad reason why I did this... Formatting 😆 I'll work around it differently
Looks good to me -- nice work |
Benchmark runs are scheduled for baseline = 61c4f12 and contender = 69c04db. 69c04db is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #.
Rationale for this change
As discussed on apache/datafusion#4973 (comment) the dictionary preserving row encoding is particularly expensive to compute and may not be appropriate for all use-cases. This PR therefore adds an option to instead encode the dictionary values directly, instead of constructing an order preserving dictionary.
This can be significantly faster, especially when seeing values for the first time (the unprepared case above), however, does come with the potential downside of a significantly less efficient encoding, which in turn needs more memory and may make comparisons slower.
What changes are included in this PR?
Are there any user-facing changes?