What happened + What you expected to happen
I have a dataset read from parquet with mixed column types. Some columns are strings and some are dictionary<values=string, indices=int32, ordered=0>.
If I apply a filter to the dataset (even a trivial one that returns True for all rows), the columns that were dictionary type become type string.
I expected the schema to stay the same after applying a filter to a dataset.
In the example below, the two schemas that are printed out do not match.
Versions / Dependencies
Ray 2.23.0
Python 3.9.21
Ubuntu 20.04.6
Reproduction script
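import ray
# Placeholder: substitute the path to a Parquet dataset that has dictionary-encoded columns.
dataset = ray.data.read_parquet("<dataset path>")
# Schema as read from Parquet (dictionary columns present).
print(dataset.schema())
# Schema after a filter that keeps every row (dictionary columns are reported as string).
print(dataset.filter(lambda row: True).schema())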
Issue Severity
Medium: It is a significant difficulty but I can work around it.
tkbartel added the bug and triage labels on Mar 10, 2025