This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
jorgecarleitao force-pushed the dict_id branch 5 times, most recently from 7b2cde0 to f2a22b6 (December 26, 2021 19:46)
jorgecarleitao force-pushed the dict_id branch from f2a22b6 to cc7859f (December 26, 2021 19:53)
Codecov Report
@@ Coverage Diff @@
## main #713 +/- ##
==========================================
+ Coverage 70.48% 70.60% +0.11%
==========================================
Files 310 312 +2
Lines 16763 16922 +159
==========================================
+ Hits 11815 11947 +132
- Misses 4948 4975 +27
jorgecarleitao force-pushed the dict_id branch from ccc7c8a to 78faa36 (December 28, 2021 05:46)
cc @alamb, this was another simplification of the Field that allows re-using dictionary encoding efficiently on the Arrow IPC interface.
This removes the `dict_id` from `Field`. It is a major simplification to the crate's design.

Background

The Arrow format supports dictionaries. When writing to IPC, each dictionary must be assigned an id. Dictionaries are written as separate (dictionary) messages and identified in an IPC record batch message via the id stored in the IPC message's schema.

Currently, dictionary ids are stored in `Field`. This is quite limiting for the following reasons:

`dict_id` is not a logical construct

A dictionary id is encoding-related information specific to the IPC format (e.g. the C data interface does not have such a concept). However, it currently resides in `Field`, a logical construct used to track names, logical types and nullability (general schema information).

`dict_id` must be declared at logical planning

Users currently have no mechanism to declare a dictionary to be re-used in the IPC during execution: since the id needs to be set on the `Field`, ids must be assigned during logical planning, before any access to the data. This is a major design limitation because the specifics of the execution can make dictionaries reusable, and this information is often not available during logical planning.

For example, one implementation of single-batch "id assignment" is to look at the dictionary values of every column in a batch and increment the id by 1 iff the pointer of the values is new over all others (thus avoiding serializing the values twice). That two columns are re-using a dictionary is only known once we have access to the dictionaries themselves, during execution.
Likewise, an implementation for multi-batch "id assignment" is to hash the dictionary values and have a hashmap between ids and hashes, and only emit dictionaries to the IPC when we find a new hash (which, again, is only known at execution time).
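
A minimal sketch of the multi-batch variant follows, again under assumptions: `DictTracker` and `track` are hypothetical names, dictionary values are modelled as a slice of strings, and hash collisions are ignored (a real implementation would likely need to compare values when hashes match).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical tracker that remembers, across batches, which
/// dictionary contents have already been emitted and under which id.
#[derive(Default)]
struct DictTracker {
    ids_by_hash: HashMap<u64, i64>,
}

impl DictTracker {
    /// Returns `(id, is_new)`. When `is_new` is true the caller would
    /// emit a dictionary message; otherwise the id alone is enough.
    fn track(&mut self, values: &[String]) -> (i64, bool) {
        let mut hasher = DefaultHasher::new();
        values.hash(&mut hasher);
        let hash = hasher.finish();

        if let Some(&id) = self.ids_by_hash.get(&hash) {
            return (id, false);
        }
        // Assumption: ids are handed out incrementally in order of first sight.
        let id = self.ids_by_hash.len() as i64;
        self.ids_by_hash.insert(hash, id);
        (id, true)
    }
}
```

The tracker has to live for the whole stream rather than for one batch, so the decision to re-use an id can only be made while data is flowing.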
This PR

This PR moves the `dict_id` into a separate struct, `io::ipc::IpcField`, that tracks the dictionary ids of the fields from and to the IPC format. This tracking is only needed when reading from and writing to IPC, and is now part of the "metadata" that we store when reading from IPC.

When writing a batch to IPC, users are now required to provide a `io::ipc::IpcField`, which is used to describe to this crate which dictionaries are being re-used or not. We also offer a minimal implementation to generate these fields by assigning an incremental id per dictionary in the batch, `io::ipc::write::default_ipc_fields`.
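
To make the idea of keeping ids next to, rather than inside, the logical schema concrete, here is a self-contained toy model. It is not the crate's definition of `IpcField` or `default_ipc_fields`: `LogicalType`, `IpcFieldSketch`, `assign` and `sketch_default_ipc_fields` are hypothetical names, and the traversal order is an assumption.

```rust
/// Hypothetical, heavily reduced logical type: it carries no dictionary id.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int32,
    Utf8,
    Dictionary(Box<LogicalType>),
    Struct(Vec<LogicalType>),
}

/// Hypothetical IPC-side companion: mirrors the nesting of the logical
/// type and records a dictionary id only where one is needed.
#[derive(Debug, Clone, PartialEq)]
struct IpcFieldSketch {
    fields: Vec<IpcFieldSketch>,
    dictionary_id: Option<i64>,
}

/// Walks one type depth-first, handing out the next id to each dictionary.
fn assign(data_type: &LogicalType, next_id: &mut i64) -> IpcFieldSketch {
    match data_type {
        LogicalType::Dictionary(values) => {
            let id = *next_id;
            *next_id += 1;
            IpcFieldSketch {
                fields: vec![assign(values, next_id)],
                dictionary_id: Some(id),
            }
        }
        LogicalType::Struct(children) => IpcFieldSketch {
            fields: children.iter().map(|c| assign(c, next_id)).collect(),
            dictionary_id: None,
        },
        // Types without children or dictionaries need no IPC-specific data.
        _ => IpcFieldSketch { fields: vec![], dictionary_id: None },
    }
}

/// Incremental id per dictionary across all columns: no re-use is declared.
fn sketch_default_ipc_fields(types: &[LogicalType]) -> Vec<IpcFieldSketch> {
    let mut next_id = 0i64;
    types.iter().map(|t| assign(t, &mut next_id)).collect()
}
```

In this model, declaring that two columns share a dictionary amounts to giving their `IpcFieldSketch` entries the same `dictionary_id` at write time, without touching the logical types.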