Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Moved dict_id to IPC-specific IO #713

Merged
merged 6 commits into from
Dec 28, 2021
Merged

Moved dict_id to IPC-specific IO #713

merged 6 commits into from
Dec 28, 2021

Conversation

jorgecarleitao
Copy link
Owner

@jorgecarleitao jorgecarleitao commented Dec 26, 2021

This removes the dict_id from Field. It is a major simplification to the crate's design.

Background

The arrow format supports dictionaries. When writing to IPC, each dictionary is required to be assigned an id. Dictionaries are written as separate (dictionary) messages and identified in a IPC recordbatch message via the id stored in the IPC message's schema.

Currently, dictionary ids are stored in Field. This is quite limiting for the following reasons:

dict_id is not a logical construct

A dictionary id is encoding-related information specific for the IPC format (e.g. the C data interface does not have such a concept). However, it currently resides in Field, a logical construct used to track names, logical types and nullability (general schema information).

dict_id must be declared at logical planning

Users currently have no mechanism to declare a dictionary to be re-used in the IPC during execution - since the id needs to be set on the Field, it means that logical planning needs to assign ids during logical planning, even before any access to the data. This is a major design limitation because it can happen that the specifics of the execution can cause the dictionaries to be reusable - this information is often not available during logical planning.

For example, one implementation of single-batch "id assignment" is to look for all dictionary values in a batch and increment by 1 iff the pointer of the values is new over all others (thus avoiding serializing the values twice). That two columns are re-using a dictionary is only known once we have access to the dictionaries themselves, during execution.

Likewise, an implementation for multi-batch "id assignment" is to hash the dictionary values and have a hashmap between ids and hashes, and only emit dictionaries to the IPC when we find a new hash (which, again, is only known at execution time).

This PR

This PR refactors the dict_id to a separate struct, io::ipc::IpcField, that tracks the dictionary ids of the fields from and to the IPC format. This tracking is only needed when reading from and writing to IPC, and is now part of the "metadata" that we store when reading from IPC.

When writing a batch to IPC, users are now required to provide a io::ipc::IpcField, which is used to describe to this crate which dictionaries are being re-used or not. We also offer a minimal implementation to generate these fields by assigning an incremental id per dictionary in the batch, io::ipc::write::default_ipc_fields.

@jorgecarleitao
Copy link
Owner Author

This closes #581 and is a pre-requisite to address #673

@codecov
Copy link

codecov bot commented Dec 26, 2021

Codecov Report

Merging #713 (78faa36) into main (f07cc2c) will increase coverage by 0.11%.
The diff coverage is 72.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #713      +/-   ##
==========================================
+ Coverage   70.48%   70.60%   +0.11%     
==========================================
  Files         310      312       +2     
  Lines       16763    16922     +159     
==========================================
+ Hits        11815    11947     +132     
- Misses       4948     4975      +27     
Impacted Files Coverage Δ
src/datatypes/field.rs 40.27% <ø> (-3.63%) ⬇️
src/io/flight/mod.rs 0.00% <0.00%> (ø)
src/io/ipc/read/deserialize.rs 34.56% <0.00%> (-1.33%) ⬇️
src/io/json_integration/write/array.rs 0.00% <ø> (ø)
src/io/json_integration/write/schema.rs 0.00% <0.00%> (ø)
src/io/parquet/read/schema/metadata.rs 74.19% <ø> (ø)
src/columns.rs 37.50% <37.50%> (ø)
src/io/ipc/write/stream_async.rs 55.26% <45.00%> (-15.58%) ⬇️
src/io/ipc/read/reader.rs 74.28% <62.50%> (+0.37%) ⬆️
src/io/ipc/read/array/dictionary.rs 47.61% <66.66%> (+0.25%) ⬆️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f07cc2c...78faa36. Read the comment docs.

@jorgecarleitao jorgecarleitao marked this pull request as draft December 26, 2021 23:29
@jorgecarleitao jorgecarleitao marked this pull request as ready for review December 27, 2021 07:52
@jorgecarleitao
Copy link
Owner Author

cc @alamb , this was another simplification of the Field that allows re-using dictionary-encoding efficiently on Arrow IPC interface.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant