Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core feature] Default parquet-to-pandas encoder/decoder should support iterable read. #3219

Open
2 of 4 tasks
cosmicBboy opened this issue Jan 9, 2023 · 4 comments
Open
2 of 4 tasks
Assignees
Labels
enhancement New feature or request stale
Milestone

Comments

@cosmicBboy
Copy link
Contributor

cosmicBboy commented Jan 9, 2023

Motivation: Why do you think this is important?

The purpose of this issue is to support the use case where I can load StructuredDatasets iteratively as in:

structured_dataset.open(pd.Dataframe).iter()

Goal: What should the final outcome look like, ideally?

The end user should be able to specify the partition column when they output a structured dataset:

@task
def make_df() -> StructuredDataset:
    df = pd.DataFrame.from_records([
        {
            "id": i,
            "partition": (i % 10) + 1,
            "name": "".join(
                random.choices(string.ascii_uppercase + string.digits, k=10)
            )
        }
        for i in range(1000)
    ])
    return StructuredDataset(dataframe=df, partition_cols=["partition"])  # or ["partition1", "partition2"]

And then consume it like so:

@task
def use_df(dataset: StructuredDataset) -> pd.DataFrame:
    output = []
    for dd in dataset.open(pd.DataFrame).iter():
        print(f"This is a partial dataframe")
        print(dd.head(3))
        output.append(dd)
    return pd.concat(output)

Describe alternatives you've considered

The user needs to implement their own encoder/decoder for this use case.

Propose: Link/Inline OR Additional context

There is a working implementation of this here: https://github.com/flyteorg/flyte-demos/blob/main/flyte_demo/workflows/data_iter.py#L101

Steps

  • add partition_columns field to the StructuredDatasetType in flyteidl
  • modify the StructuredDataset type in flytekit to use this field in the encoder/decoder handler

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@cosmicBboy cosmicBboy added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Jan 9, 2023
@natewarr
Copy link

natewarr commented Jan 11, 2023

partition_col: str or partition_cols: List[str]?

I am swimming in multiple-partition parquet datasets.

@cosmicBboy
Copy link
Contributor Author

thanks @natewarr ! updating the code snippet example

@cosmicBboy cosmicBboy added this to the 1.4.0 milestone Jan 29, 2023
@cosmicBboy cosmicBboy self-assigned this Jan 30, 2023
@cosmicBboy cosmicBboy removed the untriaged This issues has not yet been looked at by the Maintainers label Mar 3, 2023
@cosmicBboy cosmicBboy modified the milestones: 1.4.0, 1.5.0 Mar 6, 2023
@cosmicBboy cosmicBboy modified the milestones: 1.5.0, 1.6.0 Apr 20, 2023
@kumare3
Copy link
Contributor

kumare3 commented Jun 25, 2023

isnt this supported now - cc @wild-endeavor / @eapolinario

Copy link

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stale
Projects
None yet
Development

No branches or pull requests

4 participants