[Data] Filter operation changes schema of dataset #51217

Open
tkbartel opened this issue Mar 10, 2025 · 0 comments
Labels: bug, data, triage


What happened + What you expected to happen

I have a dataset read from parquet with mixed column types. Some columns are strings and some are dictionary<values=string, indices=int32, ordered=0>.

If I apply a filter to the dataset (even a trivial one that returns True for all rows), the columns that were dictionary type become type string.

I expected the schema to stay the same after applying a filter to a dataset.

In the example below, the two schemas that are printed out do not match.

Versions / Dependencies

Ray 2.23.0
Python 3.9.21
Ubuntu 20.04.6

Reproduction script

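import ray

# Placeholder: a Parquet dataset with both string and dictionary-encoded string columns.
dataset_path = "<dataset path>"

dataset = ray.data.read_parquet(dataset_path)
print(dataset.schema())  # schema as read from Parquet
print(dataset.filter(lambda row: True).schema())  # schema after a no-op filter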

Issue Severity

Medium: It is a significant difficulty but I can work around it.

tkbartel added the bug and triage labels on Mar 10, 2025
jcotant1 added the data label on Mar 10, 2025