[Data] rename_column inconsistent when using iter_batches and reading from parquet data #57700

@mikedavy

What happened + What you expected to happen

Issue:

  • Columns renamed with rename_columns still appear under their original names when iterating through the Dataset with iter_batches

Reproduce issue:

  • Read from parquet data
  • Rename columns with rename_columns
  • Iterate through the Dataset with iter_batches/iter_torch_batches

Current workaround

  • Pin ray to a version below 2.50.0 (ray<2.50.0)
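A minimal sketch of how a script could guard against running on an affected version, assuming only what this report states (2.50.0 shows the problem, versions below it do not). The helper names `parse_release` and `is_below_bound` are hypothetical and not part of Ray. Pre-release suffixes such as `rc0` are ignored by this simple parser.

```python
import re


def parse_release(version, width=3):
    """Parse the leading numeric part of a version string into a tuple.

    Short versions are zero-padded to `width` components.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad with zeros so that '2.50' compares equal to '2.50.0'.
    return parts + (0,) * max(0, width - len(parts))


def is_below_bound(version, bound="2.50.0"):
    """Return True if `version` is strictly below `bound` (the pin above)."""
    return parse_release(version) < parse_release(bound)


print(is_below_bound("2.49.2"))  # True
print(is_below_bound("2.50.0"))  # False
```

With ray installed, the installed version string can be read with `importlib.metadata.version("ray")` and passed to `is_below_bound`.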

Versions / Dependencies

ray==2.50.0

Reproduction script

import ray

ds_iris_renamed = ray.data.read_parquet(
    "s3://anonymous@ray-example-data/iris.parquet"
).rename_columns(
    {"sepal.length": "SEPAL.LENGTH_MODIFIED", "variety": "VARIETY_MODIFIED"}
)

The schema is correct, and take/take_batch work as expected:

Column                 Type
SEPAL.LENGTH_MODIFIED  double
sepal.width            double
petal.length           double
petal.width            double
VARIETY_MODIFIED       string

for b in ds_iris_renamed.iter_batches(batch_size=10):
    print(b)
    break

{'sepal.length': array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9]),
 'sepal.width': array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]),
 'petal.length': array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]),
 'petal.width': array([0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1]),
 'variety': array(['Setosa', 'Setosa', 'Setosa', 'Setosa', 'Setosa', 'Setosa',
       'Setosa', 'Setosa', 'Setosa', 'Setosa'], dtype=object)}
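The mismatch between the reported schema and the batch keys can be checked programmatically. The following is a small sketch, not part of Ray: `missing_columns` is a hypothetical helper, and the column names and batch keys are copied from the output in this report (batch values are left empty for brevity).

```python
def missing_columns(schema_names, batch):
    """Return the schema column names that are absent from the batch's keys."""
    batch_keys = set(batch.keys())
    return [name for name in schema_names if name not in batch_keys]


# Column names from the schema shown in this report.
schema_names = [
    "SEPAL.LENGTH_MODIFIED",
    "sepal.width",
    "petal.length",
    "petal.width",
    "VARIETY_MODIFIED",
]

# Keys of the batch printed by iter_batches in this report (values omitted).
batch = {
    "sepal.length": [],
    "sepal.width": [],
    "petal.length": [],
    "petal.width": [],
    "variety": [],
}

print(missing_columns(schema_names, batch))
# ['SEPAL.LENGTH_MODIFIED', 'VARIETY_MODIFIED']
```

Against a live Dataset, the same check could presumably be fed from `ds_iris_renamed.schema().names` and the first batch returned by `iter_batches`; an empty result would mean the batch keys agree with the schema.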

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Labels

  • P0: Issues that should be fixed in short order
  • bug: Something that is supposed to be working; but isn't
  • data: Ray Data-related issues
