
feat: allow round-tripping of dictionary data through the v2 format #2656

Merged
merged 4 commits into lancedb:main on Aug 1, 2024

Conversation

westonpace
Contributor

No description provided.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 63.58382% with 126 lines in your changes missing coverage. Please review.

Project coverage is 79.31%. Comparing base (7782eb9) to head (ec1de1d).
Report is 3 commits behind head on main.

Files Patch % Lines
rust/lance-encoding/src/data.rs 22.77% 73 Missing and 5 partials ⚠️
rust/lance-encoding/src/buffer.rs 48.14% 14 Missing ⚠️
rust/lance-encoding/src/decoder.rs 40.90% 12 Missing and 1 partial ⚠️
...ance-encoding/src/encodings/physical/dictionary.rs 88.88% 0 Missing and 12 partials ⚠️
rust/lance-encoding/src/encoder.rs 89.41% 2 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2656      +/-   ##
==========================================
- Coverage   79.33%   79.31%   -0.02%     
==========================================
  Files         222      222              
  Lines       64584    64849     +265     
==========================================
+ Hits        51236    51435     +199     
- Misses      10360    10422      +62     
- Partials     2988     2992       +4     
Flag Coverage Δ
unittests 79.31% <63.58%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


// TODO: We could be more efficient here by checking if the dictionaries are the same
// Also, if they aren't, we can possibly do something cheaper than concatenating
let array_refs = arrays.iter().map(|arr| arr.as_ref()).collect::<Vec<_>>();
let array = arrow_select::concat::concat(&array_refs)?;
Contributor

Nice that there's this util for normalizing the indices (It looks like that is what this is doing?)
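The index normalization the reviewer is pointing at can be illustrated with a small, stdlib-only Python sketch. This is a hypothetical helper, not the actual Arrow `concat` kernel: each chunk carries its own (indices, dictionary) pair, and concatenation remaps every chunk's local indices into one combined, de-duplicated dictionary.

```python
def concat_dictionary_arrays(chunks):
    """Concatenate (indices, values) dictionary chunks into one pair.

    Values shared across chunks are unified so the combined
    dictionary contains each value exactly once.
    """
    combined_values = []
    positions = {}  # value -> slot in the combined dictionary
    out_indices = []
    for indices, values in chunks:
        # Map this chunk's local dictionary slots to combined slots.
        remap = []
        for v in values:
            if v not in positions:
                positions[v] = len(combined_values)
                combined_values.append(v)
            remap.append(positions[v])
        out_indices.extend(remap[i] for i in indices)
    return out_indices, combined_values
```

Decoding the result (`values[i] for i in indices`) yields the same logical sequence as decoding each chunk separately, which is the invariant the real kernel has to preserve.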

westonpace force-pushed the feat/v2-dictionary-data-type branch from ec1de1d to ce7a931 on July 30, 2024, 17:33
@wjones127 (Contributor) left a comment

Seems like we are going with the approach that Arrow dictionaries correspond with Lance dictionaries. Are we considering dynamically choosing dictionary encoding when appropriate? Do we have affordances for that?


round_tripped = round_trip(dict_arr)

assert round_tripped.dictionary == dictionary
Contributor

Is this the desired behavior? I'm kind of surprised we wouldn't remove unused values (or at least allow that in the future).

@westonpace (Contributor, Author) on Jul 30, 2024

If the user is giving us dictionary-encoded data, then the dictionary may be meaningful to them and I didn't want to lose that information. I agree it would be nice to have an option to vacuum dictionaries on write.

Alternatively, we may want to make a distinction between "enum" and "categorical" data types, which is something polars does. Enum is an "extension type" (with storage type dictionary) for data where the dictionary is fixed across all batches in the dataset. In the enum case you would store the dictionary at the column level and never vacuum it. In the categorical case you would assume the dictionary is just opportunistic space saving and always vacuum it.
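Vacuuming a dictionary on write means dropping values no index references and remapping the surviving indices. A minimal stdlib Python sketch of the idea (the helper name is hypothetical, not Lance's API):

```python
def vacuum_dictionary(indices, values):
    """Drop dictionary values that no index references, remap indices.

    The decoded sequence is unchanged; only unused dictionary
    slots disappear.
    """
    used = sorted(set(indices))                       # referenced slots
    remap = {old: new for new, old in enumerate(used)}
    new_values = [values[i] for i in used]
    new_indices = [remap[i] for i in indices]
    return new_indices, new_values
```

Under the enum/categorical split described above, this step would always run for categorical data and never run for enum data, since an enum's dictionary is fixed across the dataset.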

@westonpace (Contributor, Author)

Seems like we are going with the approach that Arrow dictionaries correspond with Lance dictionaries. Are we considering dynamically choosing dictionary encoding when appropriate? Do we have affordances for that?

@wjones127 opportunistic dictionary encoding/decoding is already in place for strings. If a page of string values has low cardinality (fewer than 100 distinct values is the threshold today) then it will be dictionary-encoded on write and dictionary-decoded on read.
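The cardinality check described here can be sketched in a few lines of stdlib Python. This is a toy illustration of the decision, not the actual Lance encoder; the helper name and the early-bail placement are assumptions:

```python
def maybe_dictionary_encode(page, threshold=100):
    """Dictionary-encode a page of strings when its cardinality is
    below `threshold`; otherwise keep the page as plain values.

    Returns ("dictionary", indices, values) or ("plain", page, None).
    """
    slots = {}  # value -> dictionary slot
    for s in page:
        if s not in slots:
            slots[s] = len(slots)
            if len(slots) >= threshold:
                # Too many distinct values: dictionary would not help.
                return ("plain", page, None)
    indices = [slots[s] for s in page]
    values = list(slots)  # dict preserves insertion order
    return ("dictionary", indices, values)
```

Bailing out as soon as the distinct count crosses the threshold avoids building the full dictionary for high-cardinality pages.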

This PR adds support for the case where the input from the user is already dictionary-encoded. In this case we want to maintain the user's dictionary information and not dictionary-decode on read.

In the future, with a more sophisticated projection API we could both:

  • Return encoded dictionary arrays from opportunistically encoded columns (would either require every page in the column was dictionary encoded or that we have different schema for different batches)
  • Return decoded value arrays even if the user's input was dictionary encoded (no changes needed, just API)

westonpace force-pushed the feat/v2-dictionary-data-type branch from ce7a931 to 43edd7d on July 31, 2024, 18:28
westonpace merged commit 526cdd1 into lancedb:main on Aug 1, 2024
19 of 22 checks passed
Labels: enhancement (New feature or request), python