feat: allow round-tripping of dictionary data through the v2 format #2656
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2656      +/-   ##
==========================================
- Coverage   79.33%   79.31%   -0.02%
==========================================
  Files         222      222
  Lines       64584    64849     +265
  Branches    64584    64849     +265
==========================================
+ Hits        51236    51435     +199
- Misses      10360    10422      +62
- Partials     2988     2992       +4

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
// TODO: We could be more efficient here by checking if the dictionaries are the same
// Also, if they aren't, we can possibly do something cheaper than concatenating
let array_refs = arrays.iter().map(|arr| arr.as_ref()).collect::<Vec<_>>();
let array = arrow_select::concat::concat(&array_refs)?;
Nice that there's this util for normalizing the indices (It looks like that is what this is doing?)
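For intuition, pyarrow exposes a similar normalization step: chunks that carry different dictionaries can be rewritten against one merged dictionary. This is only an analogy to the arrow_select::concat::concat call above, a sketch rather than the code path this PR uses:

import pyarrow as pa

# Two dictionary-encoded chunks whose dictionaries differ.
a = pa.array(["red", "blue", "red"]).dictionary_encode()
b = pa.array(["green", "blue"]).dictionary_encode()
chunked = pa.chunked_array([a, b])

# Rewrite every chunk against a single merged dictionary; afterwards only
# the indices differ between chunks.
unified = chunked.unify_dictionaries()
print(unified.chunk(0).dictionary)  # merged dictionary covering red, blue, green
print(unified.chunk(1).dictionary)  # same merged dictionary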
Force-pushed from ec1de1d to ce7a931.
Seems like we are going with the approach that Arrow dictionaries correspond to Lance dictionaries. Are we considering dynamically choosing dictionary encoding when appropriate? Do we have affordances for that?
round_tripped = round_trip(dict_arr)

assert round_tripped.dictionary == dictionary
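For context, a self-contained version of this check might look like the sketch below. The round_trip helper here is hypothetical (the test's real helper is not shown), it uses the dataset-level Lance Python API, and the parameters for selecting the v2 file format are omitted:

import pyarrow as pa
import lance

def round_trip(arr: pa.Array, path: str = "/tmp/dict_roundtrip.lance") -> pa.Array:
    # Hypothetical helper: write a one-column table and read it back.
    lance.write_dataset(pa.table({"col": arr}), path, mode="overwrite")
    return lance.dataset(path).to_table().column("col").chunk(0)

dictionary = pa.array(["red", "green", "blue"])
indices = pa.array([0, 2, 2, 0], type=pa.int8())
dict_arr = pa.DictionaryArray.from_arrays(indices, dictionary)

round_tripped = round_trip(dict_arr)

# The user's dictionary comes back intact, including the unused "green" value.
assert round_tripped.dictionary == dictionary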
Is this the desired behavior? I'm kind of surprised we wouldn't remove unused values (or at least allow that in the future).
If the user is giving us dictionary-encoded data, then the dictionary may be meaningful to them and I didn't want to lose that information. I agree it would be nice to have an option to vacuum dictionaries on write.
Alternatively, we may want to make a distinction between "enum" and "categorical" data types, which is something Polars does. Enum is an "extension type" (with a dictionary storage type) for data where the dictionary is fixed across all batches in the dataset. In the enum case you would store the dictionary at the column level and never vacuum it. In the categorical case you would assume the dictionary is just opportunistic space saving and always vacuum it.
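Roughly, in Polars terms (treat the exact API as an assumption; it is version-dependent):

import polars as pl

# Enum: the category set is declared up front and is part of the dtype,
# so categories that never appear in the data are still retained.
colors = pl.Enum(["red", "green", "blue"])
enum_col = pl.Series("color", ["red", "green", "red"], dtype=colors)

# Categorical: the dictionary is whatever values happen to show up; it is
# an encoding detail, so compacting it loses nothing the user declared.
cat_col = pl.Series("color", ["red", "green", "red"], dtype=pl.Categorical)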
@wjones127 opportunistic dictionary encoding / decoding is already in place for strings. If a page of string values has low cardinality (the threshold today is fewer than 100 distinct values), it will be dictionary-encoded on write and dictionary-decoded on read. This PR adds support for the case where the input from the user is already dictionary-encoded. In that case we want to keep the user's dictionary information and we do not dictionary-decode on read. In the future, with a more sophisticated projection API, we could both:
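To make the opportunistic path concrete (a sketch only; dataset-level API shown, v2-format selection parameters omitted, and the explicitly dictionary-encoded case is sketched earlier in this thread):

import pyarrow as pa
import lance

# Low-cardinality plain strings: the page may be dictionary-encoded on disk,
# but the column reads back as plain (non-dictionary) strings.
plain = pa.table({"col": pa.array(["a", "b", "a", "b"] * 100)})
lance.write_dataset(plain, "/tmp/plain.lance", mode="overwrite")
read_type = lance.dataset("/tmp/plain.lance").to_table()["col"].type
assert not pa.types.is_dictionary(read_type)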
Force-pushed from ce7a931 to 43edd7d.
No description provided.