[data] AWS ACCESS_DENIED errors due to transient network issues #47230
Another option is to retry ACCESS_DENIED for read tasks, but not for metadata fetching tasks.
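One way to sketch that option (with hypothetical helper names, not Ray's actual API): tag each retry site with the kind of task it serves, and only treat ACCESS_DENIED as retryable for read tasks. A minimal sketch in plain Python:

```python
def is_retryable(error_message: str, task_kind: str) -> bool:
    """Decide whether an AWS error should be retried, given the task kind.

    ACCESS_DENIED is ambiguous (it can be a transient network artifact or a
    real auth failure), so this policy only retries it for read tasks and
    never for metadata-fetching tasks.
    """
    # Errors that are safe to retry regardless of task kind.
    always_retryable = ("AWS Error SLOW_DOWN", "AWS Error INTERNAL_FAILURE")
    if any(s in error_message for s in always_retryable):
        return True
    # ACCESS_DENIED: retryable only for read tasks.
    if "AWS Error ACCESS_DENIED" in error_message:
        return task_kind == "read"
    return False

msg = "AWS Error ACCESS_DENIED during HeadObject operation: No response body."
print(is_retryable(msg, "read"))      # True
print(is_retryable(msg, "metadata"))  # False
```

The downside, as noted above, is that a genuine credential failure in a read task would be retried pointlessly until the retry budget is exhausted.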
A more detailed explanation from ChatGPT of why this error can be mistakenly raised
Linking a related issue: #42153
…for multi-node Data+Train benchmarks (#47232)

## Why are these changes needed?

For release tests like `read_images_train_1_gpu_5_cpu`, `read_images_train_4_gpu`, `read_images_train_16_gpu`, and their variants, we observe `AWS ACCESS_DENIED` errors somewhat consistently, but not every time. By default, we do not retry on `ACCESS_DENIED` because it can be raised in multiple situations and does not necessarily stem from authentication failures; hence we cannot distinguish auth errors from other unrelated transient errors. See #47230 for more details on the underlying issue. For the purpose of this release test, we don't foresee authentication issues, so we add `ACCESS_DENIED` as a retryable exception type to avoid failures from transient errors.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [x] Release tests
    - https://buildkite.com/ray-project/release/builds/21397
  - [ ] This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
The current theory behind the root cause is that the original credentials become unavailable in the middle of execution, possibly due to a
Potential downsides for this workaround are:
Sometimes we get ACCESS_DENIED errors when read task concurrency is high or when the network is unstable.
This can happen even when credentials are properly set. For example, if the AWS service doesn't receive enough information to determine the actual error, it may default to an ACCESS_DENIED response.
Currently we don't retry on ACCESS_DENIED errors because we cannot distinguish transient errors from real authentication errors. In both cases we get the same error:

```
OSError: When getting information for key '...' in bucket '...': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
```

When this happens, reducing concurrency may help. If you are sure about your credential setup, another solution is to manually add the `ACCESS_DENIED` error to the retry list. See the comment below for a potential workaround.
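To make "adding ACCESS_DENIED to the retry list" concrete: Ray Data keeps a list of retryable I/O error substrings and retries an operation when the raised error message matches one of them (in recent Ray versions this list is exposed as `DataContext.retried_io_errors`; treat that attribute name as an assumption for your version). The sketch below models that substring-matching retry loop in plain Python, so the opt-in behavior is easy to see:

```python
import time

# Hypothetical stand-in for Ray Data's list of retryable I/O error
# substrings (e.g. DataContext.get_current().retried_io_errors in
# recent Ray versions -- verify the name for your version).
RETRIED_IO_ERRORS = [
    "AWS Error INTERNAL_FAILURE",
    "AWS Error SLOW_DOWN",
]

def call_with_retry(fn, retryable_substrings, max_attempts=3, backoff_s=0.0):
    """Retry fn() only when the error message matches a retryable substring."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except OSError as e:
            retryable = any(s in str(e) for s in retryable_substrings)
            if not retryable or attempt == max_attempts - 1:
                raise
            # Exponential backoff between attempts.
            time.sleep(backoff_s * (2 ** attempt))

# Opting in to retrying ACCESS_DENIED, as suggested above, amounts to
# appending the matching substring to the list:
RETRIED_IO_ERRORS.append("AWS Error ACCESS_DENIED")

# Simulate a HeadObject call that fails transiently twice, then succeeds.
attempts = {"n": 0}
def flaky_head_object():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError(
            "When getting information for key '...' in bucket '...': "
            "AWS Error ACCESS_DENIED during HeadObject operation: No response body."
        )
    return "ok"

print(call_with_retry(flaky_head_object, RETRIED_IO_ERRORS))  # prints "ok" after 2 retries
```

Only do this when you are confident your credentials are valid; with ACCESS_DENIED in the list, a genuine auth failure will be retried until the attempt budget runs out before surfacing.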