[Data] Filter operation changes schema of dataset #51217

tkbartel · 2025-03-10T16:48:59Z

What happened + What you expected to happen

I have a dataset read from parquet with mixed column types. Some columns are strings and some are dictionary<values=string, indices=int32, ordered=0>.

If I apply a filter to the dataset (even a trivial one that returns True for every row), the columns that were dictionary type come back as plain string columns.

I expected the schema to stay the same after applying a filter to a dataset.

In the example below, the two schemas that are printed out do not match.

Versions / Dependencies

Ray 2.23.0
Python 3.9.21
Ubuntu 20.04.6

Reproduction script

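import ray

# "<dataset path>" is a placeholder for a parquet dataset with dictionary-encoded string columns.
dataset = ray.data.read_parquet("<dataset path>")
print(dataset.schema())  # schema as read from parquet
print(dataset.filter(lambda row: True).schema())  # schema after a filter that keeps every row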

Issue Severity

Medium: It is a significant difficulty but I can work around it.

tkbartel added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels on Mar 10, 2025

jcotant1 added the data (Ray Data-related issues) label on Mar 10, 2025
