[Data] Get AWS credentials with boto #47352

Merged
merged 2 commits into from
Aug 28, 2024

Conversation

scottjlee

@scottjlee scottjlee commented Aug 27, 2024

Why are these changes needed?

To reduce flakiness in the multi_node_train_benchmark.py release tests (e.g. read_images_train_1_gpu_5_cpu.aws) caused by AWS Error ACCESS_DENIED, we obtain AWS credentials with boto3, build a filesystem object from them, and pass that object to read_parquet() instead of letting the filesystem be resolved implicitly. We suspect the AWS error originates in pyarrow.fs, but this still needs to be confirmed. See below for two release test runs without such errors.

Related issue number

Closes #47337

Checks

Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee marked this pull request as ready for review August 27, 2024 18:09
@@ -505,6 +505,21 @@ def split_input_files_per_worker(args):
]


def get_s3fs_with_boto_creds():
Contributor


Can you create a GH issue and point users to this workaround?
Can you also add a comment here explaining why we are doing this, along with the downsides of the workaround?
IIUC, the downsides are:

  • Credentials obtained on the driver may not work on remote nodes.
  • This approach doesn't handle credential expiration.
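
For readers following the thread, here is a minimal sketch of the kind of workaround being discussed. It is not the code merged in this PR, whose function body is not shown on this page. It assumes `boto3` and `pyarrow` are installed; the helper names `s3fs_kwargs_from_credentials` and `build_s3fs_from_boto_credentials`, the `FrozenCredentials` stand-in type, and the bucket path in the usage note are all hypothetical.

```python
from typing import Any, Dict, NamedTuple, Optional


class FrozenCredentials(NamedTuple):
    """Hypothetical stand-in with the same three fields that botocore's
    frozen (read-only) credentials expose."""

    access_key: str
    secret_key: str
    token: Optional[str]


def s3fs_kwargs_from_credentials(
    creds: Any, region: Optional[str] = None
) -> Dict[str, Any]:
    """Map a credentials snapshot to pyarrow.fs.S3FileSystem keyword arguments."""
    kwargs: Dict[str, Any] = {
        "access_key": creds.access_key,
        "secret_key": creds.secret_key,
    }
    # Temporary credentials (e.g. from an assumed role) also carry a session token.
    if creds.token:
        kwargs["session_token"] = creds.token
    if region:
        kwargs["region"] = region
    return kwargs


def build_s3fs_from_boto_credentials(region: Optional[str] = None):
    """Resolve credentials once with boto3 and hand them to pyarrow explicitly.

    Downsides raised in review: this is a one-time snapshot taken on the
    calling process (the driver), so it may not be valid on remote nodes,
    and it is not refreshed when the credentials expire.
    """
    # Imported lazily so the pure helper above can be used without these packages.
    import boto3
    import pyarrow.fs

    credentials = boto3.Session().get_credentials()
    if credentials is None:
        raise RuntimeError("boto3 could not find any AWS credentials")
    frozen = credentials.get_frozen_credentials()
    return pyarrow.fs.S3FileSystem(**s3fs_kwargs_from_credentials(frozen, region))
```

Under these assumptions, the resulting filesystem would be passed to the reader explicitly, for example `ray.data.read_parquet("s3://example-bucket/path", filesystem=build_s3fs_from_boto_credentials())`, where the bucket path is a placeholder.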

Contributor Author


Linked to this existing issue.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee enabled auto-merge (squash) August 28, 2024 16:47
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Aug 28, 2024
@scottjlee scottjlee merged commit e043a03 into ray-project:master Aug 28, 2024
7 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 12, 2024
## Why are these changes needed?

To reduce flakiness of the `multi_node_train_benchmark.py` release tests
(e.g. `read_images_train_1_gpu_5_cpu.aws`) caused by `AWS Error
ACCESS_DENIED`, we get filesystem object using `boto3` and pass it to
`read_parquet()` instead. We suspect the root cause of the AWS error
stems from `pyarrow.fs`, but need to confirm. See below for two release
test runs without such errors.

## Related issue number
Closes ray-project#47337

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under
the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [x] Release tests
     - https://buildkite.com/ray-project/release/builds/21708
     - https://buildkite.com/ray-project/release/builds/21742
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

Release test read_images_train_4_gpu.aws failed
4 participants