[Data] Add save modes to file data sinks #52900

iamjustinhsu · 2025-05-09T16:34:33Z

Why are these changes needed?

In write_parquet, we want to support:

OVERWRITE: if the directory is present, delete it and then write; otherwise, create the directory and then write.

A more detailed description can be found in https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes

This PR was meant to address https://anyscale1.atlassian.net/browse/DATA-946, but since the other save modes weren't much extra work, I also added the following three from Apache Spark:

IGNORE: if the directory is present, silently pass.
ERROR: if the directory is present, throw an error.
APPEND: the current behavior. If the directory is present, we append files; any conflicting file names are overwritten.
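The four modes above can be sketched with the standard library alone. This is a minimal illustration of the described semantics on a local directory, not the code in this PR: the `SaveMode` enum values and the `prepare_output_dir` helper are hypothetical names chosen for the example, and IGNORE is assumed to skip the write entirely, as in Spark.

```python
import enum
import os
import shutil


class SaveMode(enum.Enum):
    # Hypothetical enum for illustration; member names follow the PR description.
    APPEND = "append"
    OVERWRITE = "overwrite"
    IGNORE = "ignore"
    ERROR = "error"


def prepare_output_dir(path: str, mode: SaveMode) -> bool:
    """Apply `mode` to `path`; return True if the caller should write files."""
    exists = os.path.isdir(path)
    if exists and mode is SaveMode.ERROR:
        # ERROR: refuse to touch an existing directory.
        raise FileExistsError(f"Output directory already exists: {path}")
    if exists and mode is SaveMode.IGNORE:
        # IGNORE: leave the existing directory alone and skip the write.
        return False
    if exists and mode is SaveMode.OVERWRITE:
        # OVERWRITE: delete the existing directory before writing.
        shutil.rmtree(path)
    # APPEND, or any mode when the directory is absent: make sure it exists.
    os.makedirs(path, exist_ok=True)
    return True
```

A caller would write its files only when the helper returns True. With APPEND, existing files stay in place, so a new file with the same name replaces the old one.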

Related issue number

attentive requesting this

Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [ ] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Co-authored-by: Balaji Veeramani <balaji@anyscale.com>

Signed-off-by: weiran11 <weiran11@baidu.com>

dujl · 2025-10-17T07:06:11Z

The SaveMode parameter (mode) in write_numpy(), write_webdataset(), and write_tfrecords() is currently non-functional.

These methods already accept mode, but it is never passed to the underlying datasinks such as TFRecordDatasink (and potentially the corresponding datasink classes for NumPy and WebDataset).
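A minimal sketch of the kind of bug described here, using stand-in names rather than Ray's actual classes: `FakeFileDatasink`, `write_buggy`, and `write_fixed` are all hypothetical, and the real datasink constructors may take different arguments.

```python
import enum


class SaveMode(enum.Enum):
    # Hypothetical enum for illustration only.
    APPEND = "append"
    OVERWRITE = "overwrite"


class FakeFileDatasink:
    """Hypothetical stand-in for a file datasink; not Ray's class."""

    def __init__(self, path: str, mode: SaveMode = SaveMode.APPEND):
        self.path = path
        self.mode = mode


def write_buggy(path: str, mode: SaveMode = SaveMode.APPEND) -> FakeFileDatasink:
    # `mode` is accepted but never forwarded, so the sink keeps its default.
    return FakeFileDatasink(path)


def write_fixed(path: str, mode: SaveMode = SaveMode.APPEND) -> FakeFileDatasink:
    # Forward `mode` so the sink sees what the caller asked for.
    return FakeFileDatasink(path, mode=mode)
```

In the buggy variant the caller's choice is silently replaced by the sink's default, which matches the symptom reported above.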

bveeramani · 2025-10-20T19:44:02Z

@dujl Good catch! Here's a GitHub Issue for this: #57924.

Would you want to pick this up?

iamjustinhsu and others added 20 commits May 9, 2025 09:29

Add SaveModes in write_parquet

3de7bea

substitutions

9ccb3b4

lint

708d8d6

fix s3 test

ce778cc

fix s3 test 2

58254ee

Update python/ray/data/dataset.py

13ea238

Co-authored-by: Balaji Veeramani <balaji@anyscale.com>

reduce complexity, some nits

677293f

save modes in each test

9e0e037

simplify enum, fix some tests

7ed6e4d

lint

9203ab2

lint

c8d34cc

lint isort skip_file

a64de1e

log warning on ignore

023a696

log again

1a46ba9

add to filedatasink

a9c75a9

lint

417d9dd

remove 1 test

6d97f6f

make tests simplier

2547352

rename

4838af4

nits and bits

ac4605c

iamjustinhsu requested a review from a team as a code owner May 9, 2025 16:34

iamjustinhsu requested a review from bveeramani May 9, 2025 16:34

bveeramani approved these changes May 9, 2025

bveeramani enabled auto-merge (squash) May 9, 2025 16:36

bveeramani disabled auto-merge May 9, 2025 16:36

bveeramani enabled auto-merge (squash) May 9, 2025 16:36

github-actions bot added the go label (add ONLY when ready to merge, run all tests) May 9, 2025

iamjustinhsu changed the title ~~Add save modes to file data sinks~~ [Data] Add save modes to file data sinks May 9, 2025

api stability

54eaf5d

github-actions bot disabled auto-merge May 9, 2025 17:33

api stability p2

9e6e097

bveeramani merged commit 5324339 into ray-project:master May 12, 2025
4 of 5 checks passed

iamjustinhsu deleted the jhsu/add-modes-to-file-datasinks branch May 12, 2025 16:36

hainesmichaelc added the community-backlog label May 22, 2025

bveeramani mentioned this pull request Oct 20, 2025

[Data] SaveMode parameter non-functional in write_numpy(), write_webdataset(), and write_tfrecords() #57924

Closed
