Release test read_images_train_4_gpu.aws failed #47337

Closed
can-anyscale opened this issue Aug 26, 2024 · 3 comments · Fixed by #47352
Labels
- bug: Something that is supposed to be working; but isn't
- data: Ray Data-related issues
- jailed-test: Indicate that this test is jailed and will stop running in CI/CD
- P0: Issues that should be fixed in short order
- ray-test-bot: Issues managed by OSS test policy
- release-test: release test
- stability
- triage: Needs triage (eg: priority, bug/not-bug, and owning component)
- weekly-release-blocker: Issues that will be blocking Ray weekly releases

Comments

@can-anyscale
Collaborator

Release test read_images_train_4_gpu.aws failed. See https://buildkite.com/ray-project/release/builds/21637#01918de2-f07c-4b97-83e7-9fc9e7012db5 for more details.

Managed by OSS Test Policy

@can-anyscale added the bug, data, P0, ray-test-bot, release-test, stability, triage, and weekly-release-blocker labels Aug 26, 2024
@can-anyscale
Collaborator Author

Blamed commit: b2319a6 found by bisect job https://buildkite.com/ray-project/release-tests-bisect/builds/1490

@can-anyscale
Collaborator Author

Test has been failing for far too long. Jailing.

@can-anyscale added the jailed-test label Aug 28, 2024
scottjlee added a commit that referenced this issue Aug 28, 2024
## Why are these changes needed?

To reduce flakiness in the `multi_node_train_benchmark.py` release tests
(e.g. `read_images_train_1_gpu_5_cpu.aws`) caused by `AWS Error
ACCESS_DENIED`, we now build the filesystem object using `boto3` and pass
it to `read_parquet()` explicitly. We suspect the AWS error originates in
`pyarrow.fs`, but this still needs to be confirmed. See below for two
release test runs without such errors.
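
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the actual change: the helper names `s3_filesystem_kwargs` and `make_s3_filesystem`, the optional `region` argument, and the bucket path in the usage comment are all assumptions. It resolves credentials through `boto3`, then passes them explicitly to `pyarrow.fs.S3FileSystem` so that the filesystem does not have to resolve credentials itself.

```python
from typing import Any, Dict, Optional


def s3_filesystem_kwargs(
    access_key: str,
    secret_key: str,
    session_token: Optional[str] = None,
    region: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for pyarrow.fs.S3FileSystem, skipping unset values.

    Hypothetical helper, not part of Ray or PyArrow.
    """
    kwargs: Dict[str, Any] = {"access_key": access_key, "secret_key": secret_key}
    if session_token is not None:
        kwargs["session_token"] = session_token
    if region is not None:
        kwargs["region"] = region
    return kwargs


def make_s3_filesystem(region: Optional[str] = None):
    """Resolve credentials through boto3 and hand them to pyarrow explicitly.

    Hypothetical helper, not part of Ray or PyArrow.
    """
    # Imported lazily so the helper above stays usable without these packages.
    import boto3
    from pyarrow import fs

    credentials = boto3.Session().get_credentials()
    if credentials is None:
        raise RuntimeError("boto3 could not find any AWS credentials")
    # Take a snapshot so the key, secret, and token come from the same refresh.
    frozen = credentials.get_frozen_credentials()
    return fs.S3FileSystem(
        **s3_filesystem_kwargs(
            frozen.access_key, frozen.secret_key, frozen.token, region
        )
    )


# Hypothetical usage (the bucket path is a placeholder):
#   import ray
#   ds = ray.data.read_parquet(
#       "s3://example-bucket/dataset/", filesystem=make_s3_filesystem()
#   )
```

One trade-off of this sketch: the credentials are captured once, so temporary credentials will not refresh for the lifetime of the filesystem object. That may be acceptable for a bounded benchmark run but is worth keeping in mind for long-running jobs.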

## Related issue number
Closes #47337

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [x] Release tests
     - https://buildkite.com/ray-project/release/builds/21708
     - https://buildkite.com/ray-project/release/builds/21742
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this issue Oct 12, 2024
---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
2 participants