Description
What happened + What you expected to happen
Issue:
- Columns renamed with rename_columns keep their original names when the Dataset is iterated with iter_batches
Steps to reproduce:
- Read Parquet data
- Rename columns with rename_columns
- Iterate through the Dataset with iter_batches/iter_torch_batches
Current workaround
- Pin ray to a version below 2.50.0
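If pinning is not an option, another possible workaround is to re-apply the rename to each batch after it is yielded. The sketch below assumes batches arrive as a dict mapping column name to values, as in the output shown in this report. `rename_batch_keys` is a hypothetical helper, not part of Ray, and the sample batch uses plain lists in place of NumPy arrays.

```python
from typing import Any, Dict, Mapping


def rename_batch_keys(batch: Mapping[str, Any], renames: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of `batch` with keys renamed according to `renames`.

    Hypothetical helper, not part of Ray. Keys absent from `renames`
    (including keys that are already renamed) pass through unchanged.
    """
    return {renames.get(name, name): values for name, values in batch.items()}


# Same mapping that the report passes to rename_columns.
renames = {"sepal.length": "SEPAL.LENGTH_MODIFIED", "variety": "VARIETY_MODIFIED"}

# Stand-in for one affected batch (assumption: a dict of column name -> values).
stale_batch = {
    "sepal.length": [5.1, 4.9],
    "sepal.width": [3.5, 3.0],
    "variety": ["Setosa", "Setosa"],
}

fixed_batch = rename_batch_keys(stale_batch, renames)
print(list(fixed_batch))
```

Inside the iteration loop this would be applied as `b = rename_batch_keys(b, renames)`. Because already-renamed keys pass through unchanged, the call stays harmless on versions that do not show the problem.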
Versions / Dependencies
ray==2.50.0
Reproduction script
import ray

ds_iris_renamed = ray.data.read_parquet(
    "s3://anonymous@ray-example-data/iris.parquet"
).rename_columns({"sepal.length": "SEPAL.LENGTH_MODIFIED", "variety": "VARIETY_MODIFIED"})
The schema is correct, and take/take_batch work:
Column Type
SEPAL.LENGTH_MODIFIED double
sepal.width double
petal.length double
petal.width double
VARIETY_MODIFIED string
for b in ds_iris_renamed.iter_batches(batch_size=10):
    print(b)
    break
{'sepal.length': array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9]), 'sepal.width': array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]), 'petal.length': array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]), 'petal.width': array([0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1]), 'variety': array(['Setosa', 'Setosa', 'Setosa', 'Setosa', 'Setosa', 'Setosa',
'Setosa', 'Setosa', 'Setosa', 'Setosa'], dtype=object)}
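To make the mismatch easy to assert on in a regression test, the batch keys can be compared against the rename mapping. This is a minimal sketch: `stale_columns` is a hypothetical helper, not part of Ray, and `observed_keys` is copied by hand from the batch printed above.

```python
from typing import Iterable, List, Mapping


def stale_columns(batch_keys: Iterable[str], renames: Mapping[str, str]) -> List[str]:
    """List original column names that still appear among the batch keys.

    Hypothetical helper, not part of Ray. An empty result means every
    rename in `renames` was applied to the batch.
    """
    keys = set(batch_keys)
    return [old for old in renames if old in keys]


renames = {"sepal.length": "SEPAL.LENGTH_MODIFIED", "variety": "VARIETY_MODIFIED"}

# Keys of the batch printed in this report.
observed_keys = ["sepal.length", "sepal.width", "petal.length", "petal.width", "variety"]

print(stale_columns(observed_keys, renames))  # ['sepal.length', 'variety']
```

An empty list is the expected result once the renames reach iter_batches.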
Issue Severity
Medium: It is a significant difficulty but I can work around it.