[Data] SaveMode parameter non-functional in write_numpy(), write_webdataset(), and write_tfrecords() #57924

@bveeramani

Description

The mode parameter in write_numpy(), write_webdataset(), and write_tfrecords() is currently non-functional.

Root Cause

PR #52900 added the mode parameter (with values APPEND, OVERWRITE, IGNORE, ERROR) to several write methods. However, while the parameter was added to the method signatures of write_numpy(), write_webdataset(), and write_tfrecords(), it is not being passed to the underlying datasink classes.

Working implementations:

  • write_parquet() - passes mode=mode to ParquetDatasink
  • write_json() - passes mode=mode to JSONDatasink
  • write_csv() - passes mode=mode to CSVDatasink
  • write_images() - passes mode=mode to ImageDatasink

Broken implementations:

  • write_tfrecords() - does NOT pass mode to TFRecordDatasink
  • write_webdataset() - does NOT pass mode to WebDatasetDatasink
  • write_numpy() - does NOT pass mode to NumpyDatasink

Expected Behavior

The mode parameter should control how existing files are handled:

  • APPEND: Append files if directory exists (default behavior)
  • OVERWRITE: Delete existing directory contents before writing
  • IGNORE: Skip writing if directory already exists
  • ERROR: Raise an error if directory already exists
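
The four modes above can be sketched against a local directory. This is only an illustration under stated assumptions, not Ray's implementation: the SaveMode enum below is a local stand-in defined here, and prepare_directory, write_file, and demo are hypothetical helpers introduced for the example.

```python
import os
import shutil
import tempfile
from enum import Enum


class SaveMode(str, Enum):
    # Local stand-in for illustration; not imported from Ray.
    APPEND = "append"
    OVERWRITE = "overwrite"
    IGNORE = "ignore"
    ERROR = "error"


def prepare_directory(path: str, mode: SaveMode) -> bool:
    """Hypothetical helper: apply `mode` to `path`.

    Returns True if the caller should go ahead and write files.
    """
    exists = os.path.isdir(path)
    if exists and mode is SaveMode.ERROR:
        raise FileExistsError(f"{path} already exists")
    if exists and mode is SaveMode.IGNORE:
        return False  # skip the write entirely
    if exists and mode is SaveMode.OVERWRITE:
        shutil.rmtree(path)  # delete existing contents first
    os.makedirs(path, exist_ok=True)
    return True  # APPEND, OVERWRITE, or a directory that did not exist yet


def write_file(path: str, name: str, mode: SaveMode) -> None:
    """Write one empty file into `path` if the mode allows it."""
    if prepare_directory(path, mode):
        open(os.path.join(path, name), "w").close()


def demo() -> dict:
    """Apply each non-raising mode to a directory that already holds one file."""
    results = {}
    for mode in (SaveMode.APPEND, SaveMode.OVERWRITE, SaveMode.IGNORE):
        with tempfile.TemporaryDirectory() as root:
            target = os.path.join(root, "out")
            write_file(target, "old.bin", SaveMode.APPEND)  # initial data
            write_file(target, "new.bin", mode)  # second write under `mode`
            results[mode.value] = sorted(os.listdir(target))
    return results
```

In this sketch, demo() leaves both files for "append", only new.bin for "overwrite", and only old.bin for "ignore"; calling prepare_directory on an existing directory with SaveMode.ERROR raises FileExistsError.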

Actual Behavior

The mode parameter is silently ignored in write_numpy(), write_webdataset(), and write_tfrecords(), and these methods always use the default APPEND behavior.

Reproduction

import os

import ray

path = "/tmp/test_tfrecords"

# Create the directory with initial data (default mode is APPEND)
ds = ray.data.from_items([{"value": 1}])
ds.write_tfrecords(path)

# Try to overwrite - this should delete the old data first, but doesn't
ds2 = ray.data.from_items([{"value": 2}])
ds2.write_tfrecords(path, mode="overwrite")  # mode is ignored, files are appended instead

# Files from both writes are still present in the directory
print(sorted(os.listdir(path)))

Fix

The fix is to pass mode=mode when instantiating the datasink classes in python/ray/data/dataset.py:

  1. Line ~3900: TFRecordDatasink(..., mode=mode)
  2. Line ~3970: WebDatasetDatasink(..., mode=mode)
  3. Line ~4070: NumpyDatasink(..., mode=mode)

The datasink classes already support the mode parameter via **file_datasink_kwargs, so no changes are needed to the datasink implementations themselves.
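
The bug and the fix can be modeled with a toy example. This is a sketch of the pattern only: ToyFileDatasink, ToyTFRecordDatasink, and the two write functions are hypothetical stand-ins defined here, not Ray's classes or methods, and the assumption that the base class takes mode as a keyword argument comes from the description above.

```python
class ToyFileDatasink:
    """Toy stand-in for a file-based datasink base class (not Ray's class)."""

    def __init__(self, path: str, *, mode: str = "append"):
        self.path = path
        self.mode = mode


class ToyTFRecordDatasink(ToyFileDatasink):
    """Toy subclass: extra keyword arguments flow through to the base class."""

    def __init__(self, path: str, **file_datasink_kwargs):
        super().__init__(path, **file_datasink_kwargs)


def write_tfrecords_broken(path: str, mode: str = "append") -> ToyTFRecordDatasink:
    # Bug pattern: `mode` is in the signature but never forwarded,
    # so the datasink falls back to its own default.
    return ToyTFRecordDatasink(path)


def write_tfrecords_fixed(path: str, mode: str = "append") -> ToyTFRecordDatasink:
    # Fix pattern: forward the caller's choice to the datasink.
    return ToyTFRecordDatasink(path, mode=mode)


print(write_tfrecords_broken("/tmp/out", mode="overwrite").mode)  # append
print(write_tfrecords_fixed("/tmp/out", mode="overwrite").mode)   # overwrite
```

Because the subclass already forwards unknown keyword arguments, the only change needed in this model is at the call site, which mirrors the one-argument fix proposed above.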

Metadata

Assignees

No one assigned

Labels

  • P1: Issue that should be fixed within a few weeks
  • bug: Something that is supposed to be working; but isn't
  • data: Ray Data-related issues
  • good-first-issue: Great starter issue for someone just starting to contribute to Ray
