@iamjustinhsu
Contributor

Why are these changes needed?

In write_parquet, we want to support the following save mode:

  • OVERWRITE: if the directory exists, delete it and then write; otherwise, create the directory and then write.

A more detailed description can be found in https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes

This PR was meant to address https://anyscale1.atlassian.net/browse/DATA-946, but since the other save modes weren't much extra work, I also added the following three from Apache Spark:

  • IGNORE: if the directory exists, silently skip the write.
  • ERROR: if the directory exists, raise an error.
  • APPEND: the current behavior. If the directory exists, new files are added to it, and any files with conflicting names are overwritten.
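To make the four modes concrete, here is a minimal sketch of the directory handling they describe, written against a local filesystem. The `SaveMode` enum and the `prepare_output_dir` helper are hypothetical names used only for illustration; they are not the code in this PR, which may resolve paths and filesystems differently.

```python
import shutil
import tempfile
from enum import Enum
from pathlib import Path


class SaveMode(Enum):
    # Hypothetical enum mirroring the Spark-style save modes listed above.
    APPEND = "append"
    OVERWRITE = "overwrite"
    IGNORE = "ignore"
    ERROR = "error"


def prepare_output_dir(path: Path, mode: SaveMode) -> bool:
    """Hypothetical helper: return True if the caller should write into `path`."""
    if path.exists():
        if mode is SaveMode.ERROR:
            raise FileExistsError(f"{path} already exists")
        if mode is SaveMode.IGNORE:
            return False  # directory present: silently skip the write
        if mode is SaveMode.OVERWRITE:
            shutil.rmtree(path)  # delete first; the directory is recreated below
        # APPEND falls through and keeps whatever files are already there.
    path.mkdir(parents=True, exist_ok=True)
    return True


if __name__ == "__main__":
    out = Path(tempfile.mkdtemp()) / "out"
    print(prepare_output_dir(out, SaveMode.ERROR))   # directory absent, so it is created
    print(prepare_output_dir(out, SaveMode.IGNORE))  # directory present, so the write is skipped
```

Note that in this sketch APPEND does nothing special up front: overwriting of conflicting file names happens later, when a new file is written to a name that already exists.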

Related issue number

attentive requesting this

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@iamjustinhsu iamjustinhsu requested a review from a team as a code owner May 9, 2025 16:34
@iamjustinhsu iamjustinhsu requested a review from bveeramani May 9, 2025 16:34
@bveeramani bveeramani enabled auto-merge (squash) May 9, 2025 16:36
@bveeramani bveeramani disabled auto-merge May 9, 2025 16:36
@bveeramani bveeramani enabled auto-merge (squash) May 9, 2025 16:36
@github-actions github-actions bot added the go label (add ONLY when ready to merge, run all tests) May 9, 2025
@iamjustinhsu iamjustinhsu changed the title Add save modes to file data sinks [Data] Add save modes to file data sinks May 9, 2025
@github-actions github-actions bot disabled auto-merge May 9, 2025 17:33
@bveeramani bveeramani merged commit 5324339 into ray-project:master May 12, 2025
4 of 5 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/add-modes-to-file-datasinks branch May 12, 2025 16:36
ran1995data pushed a commit to ran1995data/ray that referenced this pull request May 13, 2025
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: weiran11 <weiran11@baidu.com>
@dujl

dujl commented Oct 17, 2025

The save mode (mode) parameter in write_numpy(), write_webdataset(), and write_tfrecords() is currently non-functional.

These methods already accept a mode argument, but it is never passed down to the underlying datasinks such as TFRecordDatasink (and potentially the corresponding datasink classes for NumPy and WebDataset), so the requested save mode has no effect.
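The pattern described above can be illustrated with a small, self-contained sketch. Everything here (`SaveMode`, `SketchDatasink`, and both writer functions) is a hypothetical stand-in rather than Ray's actual code; it only shows how a parameter that is accepted but not forwarded silently falls back to the sink's default.

```python
from enum import Enum


class SaveMode(Enum):
    # Hypothetical stand-in for the save mode enum.
    APPEND = "append"
    OVERWRITE = "overwrite"


class SketchDatasink:
    """Hypothetical stand-in for a file datasink; not Ray's class."""

    def __init__(self, path: str, mode: SaveMode = SaveMode.APPEND):
        self.path = path
        self.mode = mode


def write_dropping_mode(path: str, mode: SaveMode = SaveMode.APPEND) -> SketchDatasink:
    # Bug pattern: `mode` is accepted but never forwarded, so the sink
    # always falls back to its own default (APPEND).
    return SketchDatasink(path)


def write_forwarding_mode(path: str, mode: SaveMode = SaveMode.APPEND) -> SketchDatasink:
    # Fix pattern: pass the caller's choice through to the sink.
    return SketchDatasink(path, mode=mode)


if __name__ == "__main__":
    print(write_dropping_mode("out", SaveMode.OVERWRITE).mode)
    print(write_forwarding_mode("out", SaveMode.OVERWRITE).mode)
```

A test that constructs the sink through the public write method and asserts on the resulting mode would catch this class of bug.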

@bveeramani
Member

@dujl Good catch! Here's a GitHub Issue for this: #57924.

Would you want to pick this up?


Labels

community-backlog go add ONLY when ready to merge, run all tests
