Description
The mode parameter in write_numpy(), write_webdataset(), and write_tfrecords() is currently non-functional.
Root Cause
PR #52900 added the mode parameter (with values APPEND, OVERWRITE, IGNORE, ERROR) to several write methods. However, while the parameter was added to the method signatures of write_numpy(), write_webdataset(), and write_tfrecords(), it is not being passed to the underlying datasink classes.
Working implementations:
- ✅ write_parquet() - passes mode=mode to ParquetDatasink
- ✅ write_json() - passes mode=mode to JSONDatasink
- ✅ write_csv() - passes mode=mode to CSVDatasink
- ✅ write_images() - passes mode=mode to ImageDatasink
Broken implementations:
- ❌ write_tfrecords() - does NOT pass mode to TFRecordDatasink
- ❌ write_webdataset() - does NOT pass mode to WebDatasetDatasink
- ❌ write_numpy() - does NOT pass mode to NumpyDatasink
Expected Behavior
The mode parameter should control how existing files are handled:
- APPEND: Append files if directory exists (default behavior)
- OVERWRITE: Delete existing directory contents before writing
- IGNORE: Skip writing if directory already exists
- ERROR: Raise an error if directory already exists
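The four modes above can be illustrated with a minimal standalone sketch on a local directory. This is not Ray's implementation: the `SaveMode` enum and the `prepare_directory` helper are hypothetical names introduced only for illustration, and the sketch assumes a local filesystem.

```python
import enum
import os
import shutil


class SaveMode(str, enum.Enum):
    # Hypothetical stand-in for the four mode values named in this issue.
    APPEND = "append"
    OVERWRITE = "overwrite"
    IGNORE = "ignore"
    ERROR = "error"


def prepare_directory(path: str, mode: SaveMode) -> bool:
    """Apply `mode` to `path`; return True if the caller should write files."""
    exists = os.path.isdir(path)
    if exists and mode is SaveMode.ERROR:
        # ERROR: refuse to touch an existing directory.
        raise FileExistsError(f"{path} already exists")
    if exists and mode is SaveMode.IGNORE:
        # IGNORE: leave the directory alone and skip the write.
        return False
    if exists and mode is SaveMode.OVERWRITE:
        # OVERWRITE: drop the old contents before writing.
        shutil.rmtree(path)
    # APPEND (or a fresh directory): make sure the target exists.
    os.makedirs(path, exist_ok=True)
    return True
```

With the bug described here, the three affected write methods behave as if `SaveMode.APPEND` were always passed, regardless of what the caller requested.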
Actual Behavior
The mode parameter is silently ignored in write_numpy(), write_webdataset(), and write_tfrecords(), and these methods always use the default APPEND behavior.
Reproduction
import ray
import os
ds = ray.data.from_items([{"value": 1}])
# Create directory with initial data
path = "/tmp/test_tfrecords"
ds.write_tfrecords(path, mode="overwrite")
# Try to overwrite - this should delete old data first, but doesn't
ds2 = ray.data.from_items([{"value": 2}])
ds2.write_tfrecords(path, mode="overwrite")  # mode is ignored, files are appended instead
Fix
The fix is straightforward - pass mode=mode when instantiating the datasink classes in python/ray/data/dataset.py:
- Line ~3900: TFRecordDatasink(..., mode=mode)
- Line ~3970: WebDatasetDatasink(..., mode=mode)
- Line ~4070: NumpyDatasink(..., mode=mode)
The datasink classes already support the mode parameter via **file_datasink_kwargs, so no changes are needed to the datasink implementations themselves.
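The shape of the bug and of the proposed fix can be sketched with toy classes. Everything below is a simplified, hypothetical stand-in (`ToyFileDatasink`, `ToyTFRecordDatasink`, and the two `write_tfrecords_*` functions are not Ray code); it only shows how an argument accepted in a method signature is lost when it is not forwarded through `**file_datasink_kwargs`.

```python
class ToyFileDatasink:
    """Hypothetical base sink that understands `mode`."""

    def __init__(self, path: str, *, mode: str = "append"):
        self.path = path
        self.mode = mode


class ToyTFRecordDatasink(ToyFileDatasink):
    """Hypothetical subclass that forwards extra kwargs to the base sink."""

    def __init__(self, path: str, **file_datasink_kwargs):
        super().__init__(path, **file_datasink_kwargs)


def write_tfrecords_broken(path: str, mode: str = "append") -> ToyFileDatasink:
    # `mode` is accepted but never forwarded, so the sink keeps its default.
    return ToyTFRecordDatasink(path)


def write_tfrecords_fixed(path: str, mode: str = "append") -> ToyFileDatasink:
    # Forwarding `mode=mode` is the whole fix.
    return ToyTFRecordDatasink(path, mode=mode)


broken = write_tfrecords_broken("/tmp/test_tfrecords", mode="overwrite")
fixed = write_tfrecords_fixed("/tmp/test_tfrecords", mode="overwrite")
print(broken.mode, fixed.mode)
```

Because the broken variant raises no error, a regression test that asserts on the resulting directory contents for each mode would be one way to catch this class of bug.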
Related
- Original PR: [Data] Add save modes to file data sinks #52900
- Discovered by: @dujl in [Data] Add save modes to file data sinks #52900 (comment)