add partition_columns to StructuredDatasetType #364

cosmicBboy · 2023-02-10T16:55:51Z

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

Add `partition_columns` to `StructuredDatasetType`

TL;DR

This PR adds a `partition_columns` property to the `StructuredDatasetType` protobuf definition. The property records which columns of the dataset (some kind of DataFrame object) are used to partition it into chunks, for example when a `pandas.DataFrame` is serialized as a partitioned parquet dataset.

Type

Bug Fix
Feature
Plugin

Are all requirements met?

Complete description

This change is required to store additional metadata about which columns are used for partitioning. Currently this only meaningfully affects the serialization/deserialization of parquet files, but in the future we could support the partitioning of other serialization formats.
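To make the role of partition columns concrete, here is a minimal sketch in plain Python of the Hive-style directory layout that parquet writers commonly produce when given partition columns: each distinct combination of partition values becomes a chain of `column=value` directories. The helper name `partition_paths` and the sample rows are hypothetical and exist only for illustration; this is not flytekit or flyteidl code.

```python
from collections import defaultdict


def partition_paths(rows, partition_columns):
    """Group rows by partition-column values, keyed by a Hive-style directory.

    Hypothetical helper for illustration; not part of flytekit or flyteidl.
    """
    groups = defaultdict(list)
    for row in rows:
        # Build one "column=value" path segment per partition column, in order.
        directory = "/".join(f"{col}={row[col]}" for col in partition_columns)
        # The partition values are encoded in the path, so drop them from the row.
        remaining = {k: v for k, v in row.items() if k not in partition_columns}
        groups[directory].append(remaining)
    return dict(groups)


rows = [
    {"country": "US", "year": 2022, "sales": 10},
    {"country": "US", "year": 2023, "sales": 12},
    {"country": "DE", "year": 2023, "sales": 7},
]
layout = partition_paths(rows, ["country", "year"])
for directory, chunk in layout.items():
    print(directory, chunk)
```

Because the partition values end up in the directory names rather than in the stored rows, a reader needs to know which columns were used for partitioning in order to reconstruct the full dataset.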

Tracking Issue

Partly addresses flyteorg/flyte#3219

Follow-up issue

NA


codecov · 2023-02-10T16:59:55Z

Codecov Report

Merging #364 (dee449a) into master (f3724b4) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #364   +/-   ##
=======================================
  Coverage   73.71%   73.71%           
=======================================
  Files          18       18           
  Lines        1377     1377           
=======================================
  Hits         1015     1015           
  Misses        311      311           
  Partials       51       51

| Flag | Coverage Δ |
|---|---|
| unittests | `73.71% <ø> (ø)` |


hamersaw

@eapolinario please confirm, did we decide this needed to be included in the serialized flyteidl type or could just be read from metadata within flytekit at runtime?

protos/flyteidl/core/types.proto


cosmicBboy · 2023-02-10T18:04:35Z

> @eapolinario please confirm, did we decide this needed to be included in the serialized flyteidl type or could just be read from metadata within flytekit at runtime?

I may be missing something, but we need to include it in the type so that the structured dataset decoder has access to the metadata (unless we want to manually inspect the URI path for multiple directories).
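For comparison, the alternative mentioned in the parenthesis (inspecting the URI path) might look roughly like the sketch below. It assumes Hive-style `column=value` directory names under the dataset root; the function name `infer_partition_columns` and the example URIs are hypothetical, and nothing here is taken from flytekit.

```python
from urllib.parse import urlparse


def infer_partition_columns(base_uri, file_uris):
    """Guess partition columns from "column=value" directory segments.

    Hypothetical helper for illustration; not part of flytekit or flyteidl.
    """
    base_path = urlparse(base_uri).path.rstrip("/")
    columns = []
    for uri in file_uris:
        path = urlparse(uri).path
        if not path.startswith(base_path + "/"):
            continue  # skip files outside the dataset root
        relative = path[len(base_path) + 1:]
        # Every segment except the last one (the file name) is a directory.
        for segment in relative.split("/")[:-1]:
            name, sep, _value = segment.partition("=")
            if sep and name not in columns:
                columns.append(name)
    return columns


files = [
    "s3://bucket/data/country=US/year=2022/part-0.parquet",
    "s3://bucket/data/country=DE/year=2023/part-0.parquet",
]
inferred = infer_partition_columns("s3://bucket/data", files)
print(inferred)
```

This approach requires listing the files under the dataset root and depends on a naming convention, which is one reason carrying the column names in the type itself may be simpler for the decoder.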


add partition_columns to StructuredDatasetType

71f69e6

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

cosmicBboy requested review from eapolinario and hamersaw February 10, 2023 16:56

hamersaw reviewed Feb 10, 2023

protos/flyteidl/core/types.proto (outdated, resolved)

fix typos in comments

046de10

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>

fix more typos

dee449a

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
