This repository has been archived by the owner on Oct 23, 2023. It is now read-only.
add partition_columns to StructuredDatasetType #364
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Add partition_columns to StructuredDatasetType
Partially addresses flyteorg/flyte#3219
TL;DR
This PR adds an additional property to the StructuredDatasetType protobuf definition to record which columns in the dataset (some kind of DataFrame object) are used for partitioning the dataset into chunks, for example when a pandas.DataFrame is serialized as a parquet file.
Type
Are all requirements met?
Complete description
This change is required to store additional metadata about which columns are used for partitioning. Currently this only meaningfully affects the serialization/deserialization of parquet files, but in the future we could support the partitioning of other serialization formats.
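To illustrate what this metadata describes, here is a minimal sketch of column-based partitioning in plain Python. It assumes the Hive-style `column=value` directory layout commonly used for partitioned parquet datasets. `DatasetTypeSketch` and `partition_rows` are hypothetical names made up for this example. They are not part of flyteidl, and the sketch does not reflect the exact shape of the protobuf field added in this PR.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetTypeSketch:
    """Hypothetical stand-in for the dataset type metadata (not the real protobuf message)."""

    format: str = "parquet"
    # Names of the columns used to split the dataset into chunks.
    partition_columns: List[str] = field(default_factory=list)


def partition_rows(rows: List[dict], dataset_type: DatasetTypeSketch) -> Dict[str, List[dict]]:
    """Group rows under Hive-style partition keys such as "year=2023"."""
    groups = defaultdict(list)
    for row in rows:
        # Build the directory key from the partition column values, in declared order.
        key = "/".join(f"{col}={row[col]}" for col in dataset_type.partition_columns)
        # Partition values are encoded in the key, so drop them from the stored row.
        rest = {k: v for k, v in row.items() if k not in dataset_type.partition_columns}
        groups[key].append(rest)
    return dict(groups)


rows = [
    {"year": 2022, "city": "NYC", "sales": 10},
    {"year": 2023, "city": "NYC", "sales": 12},
    {"year": 2023, "city": "SF", "sales": 7},
]
chunks = partition_rows(rows, DatasetTypeSketch(partition_columns=["year"]))
for key in sorted(chunks):
    print(key, chunks[key])
```

Without the partition column names stored alongside the dataset type, a reader of the serialized output would have to infer from the directory layout which columns were moved into the path. Recording them in the type is what makes the round trip explicit.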
Tracking Issue
Partly addresses flyteorg/flyte#3219
Follow-up issue
NA