
feat: allow round-tripping of dictionary data through the v2 format #2656

Merged
merged 4 commits into lancedb:main on Aug 1, 2024

Conversation

westonpace
Contributor

No description provided.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 63.58382% with 126 lines in your changes missing coverage. Please review.

Project coverage is 79.31%. Comparing base (7782eb9) to head (ec1de1d).
Report is 3 commits behind head on main.

Files Patch % Lines
rust/lance-encoding/src/data.rs 22.77% 73 Missing and 5 partials ⚠️
rust/lance-encoding/src/buffer.rs 48.14% 14 Missing ⚠️
rust/lance-encoding/src/decoder.rs 40.90% 12 Missing and 1 partial ⚠️
...ance-encoding/src/encodings/physical/dictionary.rs 88.88% 0 Missing and 12 partials ⚠️
rust/lance-encoding/src/encoder.rs 89.41% 2 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2656      +/-   ##
==========================================
- Coverage   79.33%   79.31%   -0.02%     
==========================================
  Files         222      222              
  Lines       64584    64849     +265     
==========================================
+ Hits        51236    51435     +199     
- Misses      10360    10422      +62     
- Partials     2988     2992       +4     
Flag Coverage Δ
unittests 79.31% <63.58%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


// TODO: We could be more efficient here by checking if the dictionaries are the same
// Also, if they aren't, we can possibly do something cheaper than concatenating
let array_refs = arrays.iter().map(|arr| arr.as_ref()).collect::<Vec<_>>();
let array = arrow_select::concat::concat(&array_refs)?;
Contributor

Nice that there's this util for normalizing the indices (It looks like that is what this is doing?)
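The index normalization the reviewer is pointing at can be illustrated with a small, stdlib-only Python sketch. This is a hypothetical helper, not the actual Arrow `concat` kernel: each chunk carries its own (indices, dictionary) pair, and concatenation remaps every chunk's local indices into one combined, de-duplicated dictionary.

```python
def concat_dictionary_arrays(chunks):
    """Concatenate (indices, values) dictionary chunks into one pair.

    Values shared across chunks are unified so the combined
    dictionary contains each value exactly once.
    """
    combined_values = []
    positions = {}  # value -> slot in the combined dictionary
    out_indices = []
    for indices, values in chunks:
        # Map this chunk's local dictionary slots to combined slots.
        remap = []
        for v in values:
            if v not in positions:
                positions[v] = len(combined_values)
                combined_values.append(v)
            remap.append(positions[v])
        out_indices.extend(remap[i] for i in indices)
    return out_indices, combined_values
```

Decoding the result (`values[i] for i in indices`) yields the same logical sequence as decoding each chunk separately, which is the invariant the real kernel has to preserve.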

westonpace force-pushed the feat/v2-dictionary-data-type branch from ec1de1d to ce7a931 on July 30, 2024, 17:33
@wjones127 (Contributor) left a comment

Seems like we are going with the approach that Arrow dictionaries correspond with Lance dictionaries. Are we considering dynamically choosing dictionary encoding when appropriate? Do we have affordances for that?


round_tripped = round_trip(dict_arr)

assert round_tripped.dictionary == dictionary
Contributor

Is this the desired behavior? I'm kind of surprised we wouldn't remove unused values (or at least allow that in the future).

@westonpace (Contributor, Author) on Jul 30, 2024

If the user is giving us dictionary-encoded data, then the dictionary may be meaningful to them and I didn't want to lose that information. I agree it would be nice to have an option to vacuum dictionaries on write.

Alternatively, we may want to make a distinction between "enum" and "categorical" data types, which is something polars does. Enum is an "extension type" (with storage type dictionary) for data where the dictionary is fixed across all batches in the dataset. In the enum case you would store the dictionary at the column level and never vacuum it. In the categorical case you would assume the dictionary is just opportunistic space saving and always vacuum it.
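Vacuuming a dictionary on write means dropping values no index references and remapping the surviving indices. A minimal stdlib Python sketch of the idea (the helper name is hypothetical, not Lance's API):

```python
def vacuum_dictionary(indices, values):
    """Drop dictionary values that no index references, remap indices.

    The decoded sequence is unchanged; only unused dictionary
    slots disappear.
    """
    used = sorted(set(indices))                       # referenced slots
    remap = {old: new for new, old in enumerate(used)}
    new_values = [values[i] for i in used]
    new_indices = [remap[i] for i in indices]
    return new_indices, new_values
```

Under the enum/categorical split described above, this step would always run for categorical data and never run for enum data, since an enum's dictionary is fixed across the dataset.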

@westonpace (Contributor, Author)

Seems like we are going with the approach that Arrow dictionaries correspond with Lance dictionaries. Are we considering dynamically choosing dictionary encoding when appropriate? Do we have affordances for that?

@wjones127 opportunistic dictionary encoding/decoding is already in place for strings. If a page of string values has low cardinality (fewer than 100 distinct values is the threshold today) then it will be dictionary-encoded on write and dictionary-decoded on read.
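The cardinality check described here can be sketched in a few lines of stdlib Python. This is a toy illustration of the decision, not the actual Lance encoder; the helper name and the early-bail placement are assumptions:

```python
def maybe_dictionary_encode(page, threshold=100):
    """Dictionary-encode a page of strings when its cardinality is
    below `threshold`; otherwise keep the page as plain values.

    Returns ("dictionary", indices, values) or ("plain", page, None).
    """
    slots = {}  # value -> dictionary slot
    for s in page:
        if s not in slots:
            slots[s] = len(slots)
            if len(slots) >= threshold:
                # Too many distinct values: dictionary would not help.
                return ("plain", page, None)
    indices = [slots[s] for s in page]
    values = list(slots)  # dict preserves insertion order
    return ("dictionary", indices, values)
```

Bailing out as soon as the distinct count crosses the threshold avoids building the full dictionary for high-cardinality pages.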

This PR adds support for the case where the input from the user is already dictionary-encoded. In this case we want to maintain the user's dictionary information and not dictionary-decode on read.

In the future, with a more sophisticated projection API we could both:

  • Return encoded dictionary arrays from opportunistically encoded columns (would either require every page in the column was dictionary encoded or that we have different schema for different batches)
  • Return decoded value arrays even if the user's input was dictionary encoded (no changes needed, just API)

westonpace force-pushed the feat/v2-dictionary-data-type branch from ce7a931 to 43edd7d on July 31, 2024, 18:28
westonpace merged commit 526cdd1 into lancedb:main on Aug 1, 2024
19 of 22 checks passed
Labels: enhancement (New feature or request), python