[Core feature] Default parquet-to-pandas encoder/decoder should support iterable read. #3219

cosmicBboy · 2023-01-09T20:51:03Z

Motivation: Why do you think this is important?

The purpose of this issue is to support the use case where I can load StructuredDatasets iteratively as in:

structured_dataset.open(pd.DataFrame).iter()

Goal: What should the final outcome look like, ideally?

The end user should be able to specify the partition column when they output a structured dataset:

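import random
import string

import pandas as pd
from flytekit import StructuredDataset, task

@task
def make_df() -> StructuredDataset:
    # Build 1000 rows spread evenly across 10 partition values (1 through 10).
    df = pd.DataFrame.from_records([
        {
            "id": i,
            "partition": (i % 10) + 1,
            "name": "".join(
                random.choices(string.ascii_uppercase + string.digits, k=10)
            )
        }
        for i in range(1000)
    ])
    # partition_cols is the argument proposed in this issue; it could also be
    # a list of several columns, e.g. ["partition1", "partition2"].
    return StructuredDataset(dataframe=df, partition_cols=["partition"])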

And then consume it like so:

@task
def use_df(dataset: StructuredDataset) -> pd.DataFrame:
    output = []
    for dd in dataset.open(pd.DataFrame).iter():
        print("This is a partial dataframe")
        print(dd.head(3))
        output.append(dd)
    return pd.concat(output)

Describe alternatives you've considered

The user needs to implement their own encoder/decoder for this use case.

Propose: Link/Inline OR Additional context

There is a working implementation of this here: https://github.com/flyteorg/flyte-demos/blob/main/flyte_demo/workflows/data_iter.py#L101

Steps

Add a partition_columns field to the StructuredDatasetType in flyteidl.
Modify the StructuredDataset type in flytekit to use this field in the encoder/decoder handler.
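
As a rough illustration of how the two steps fit together, the sketch below models the proposed field with a plain dataclass. `StructuredDatasetTypeSketch` and `encoder_write_kwargs` are hypothetical stand-ins, not the actual flyteidl message or flytekit handler code, and the sketch assumes the field is a list of column names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredDatasetTypeSketch:
    """Hypothetical stand-in for StructuredDatasetType (not the flyteidl message)."""

    format: str = "parquet"
    # Step 1: the proposed field, empty when the dataset is not partitioned.
    partition_columns: List[str] = field(default_factory=list)


def encoder_write_kwargs(sd_type: StructuredDatasetTypeSketch) -> Dict[str, List[str]]:
    """Step 2: keyword arguments an encoder could forward to DataFrame.to_parquet."""
    if sd_type.partition_columns:
        return {"partition_cols": list(sd_type.partition_columns)}
    return {}
```

With this shape, an unpartitioned dataset produces no extra arguments, so the existing single-file write path would stay unchanged.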

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes

natewarr · 2023-01-11T17:14:34Z

partition_col: str or partition_cols: List[str]?

I am swimming in multiple-partition parquet datasets.

cosmicBboy · 2023-01-18T21:51:58Z

thanks @natewarr ! updating the code snippet example

kumare3 · 2023-06-25T23:41:03Z

isn't this supported now? cc @wild-endeavor / @eapolinario

github-actions · 2024-03-22T00:06:21Z

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

cosmicBboy added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Jan 9, 2023

cosmicBboy mentioned this issue Jan 10, 2023

[Core feature] map_task should be able to handle a partitioned StructuredDataset #3226

Open

2 tasks

cosmicBboy added this to the 1.4.0 milestone Jan 29, 2023

cosmicBboy self-assigned this Jan 30, 2023

cosmicBboy mentioned this issue Feb 10, 2023

add partition_columns to StructuredDatasetType flyteorg/flyteidl#364

Open

8 tasks

cosmicBboy assigned wild-endeavor and unassigned cosmicBboy Feb 13, 2023

cosmicBboy removed the untriaged This issues has not yet been looked at by the Maintainers label Mar 3, 2023

cosmicBboy modified the milestones: 1.4.0, 1.5.0 Mar 6, 2023

cosmicBboy modified the milestones: 1.5.0, 1.6.0 Apr 20, 2023

github-actions bot added the stale label Mar 22, 2024
