
[data] AWS ACCESS_DENIED errors due to transient network issues #47230

Open
raulchen opened this issue Aug 20, 2024 · 4 comments
Labels: data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks)

Comments

raulchen (Contributor) commented Aug 20, 2024

Sometimes we get ACCESS_DENIED errors when read task concurrency is high or when the network is unstable.
This can happen even when credentials are properly set. For example, in some cases the AWS service doesn't receive enough information to determine the actual error and defaults to an "ACCESS_DENIED" response.

Currently we don't retry on ACCESS_DENIED errors because we cannot distinguish transient errors from real authentication errors. In both cases we get the same error: `OSError: When getting information for key '...' in bucket '...': AWS Error ACCESS_DENIED during HeadObject operation: No response body`.

When this happens, reducing concurrency may help. If you are sure your credentials are set up correctly, another option is to manually add the ACCESS_DENIED error to the retry list (a sketch follows below).

See the comment below for a potential workaround.
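
For reference, a minimal sketch of the retry-list approach, assuming your Ray version exposes `DataContext.retried_io_errors` (the list of error-message substrings that Ray Data retries I/O operations on); the exact substring to match may differ across versions:

```python
import ray

# Only do this if you are confident your AWS credentials are valid;
# otherwise genuine auth failures will also be retried before surfacing.
ctx = ray.data.DataContext.get_current()
ctx.retried_io_errors.append("AWS Error ACCESS_DENIED")

# "s3://my-bucket/images/" is a placeholder path.
ds = ray.data.read_images("s3://my-bucket/images/")
```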

raulchen added the labels data (Ray Data-related issues) and P1 (Issue that should be fixed within a few weeks) on Aug 20, 2024
raulchen (Contributor Author) commented:

Another option is to retry ACCESS_DENIED for read tasks, but not for metadata-fetching tasks: if it's a real authentication issue, the metadata-fetching tasks will raise the error first.
However, this can still cause confusion if the user has permission for the directory but not for a file inside it (see the illustrative sketch below).
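
A purely illustrative sketch of that policy; this is not Ray's actual retry code, and `read_block`/`fetch_metadata` are hypothetical helper names:

```python
import time

# Substrings treated as transient only inside read tasks.
TRANSIENT_IN_READ_TASKS = ("AWS Error ACCESS_DENIED",)

def read_with_retries(read_block, path, max_attempts=5):
    """Retry ACCESS_DENIED here because metadata fetching already
    succeeded for this path, so the error is more likely transient."""
    for attempt in range(max_attempts):
        try:
            return read_block(path)
        except OSError as e:
            transient = any(s in str(e) for s in TRANSIENT_IN_READ_TASKS)
            if not transient or attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff

def fetch_metadata(filesystem, path):
    """Metadata fetching does NOT retry ACCESS_DENIED, so a genuine
    authentication problem surfaces immediately."""
    return filesystem.get_file_info(path)
```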

raulchen (Contributor Author) commented:

A more detailed explanation from ChatGPT of why this error can be raised mistakenly:

The “ACCESS_DENIED” error message during network operations like HeadObject can sometimes be misleading when dealing with intermittent network issues. Here’s why this might happen:

1. Network Interruption Leading to Misinterpretation

- Inconsistent Connectivity: If there's a brief network glitch or packet loss, the request might not reach the AWS server correctly, or the response might not be received properly. When the request is incomplete or garbled due to the network issue, the server might not be able to authenticate it correctly, causing it to respond with an "ACCESS_DENIED" error.
- Fallback Errors: In some cases, if the AWS service doesn't receive enough information to determine the actual error, it might default to an "ACCESS_DENIED" response. This is a conservative approach to avoid accidentally exposing resources.

2. Timeouts Misinterpreted as Access Issues

- Timeout Handling: When a request times out due to network issues, the client library (pyarrow in this case) might interpret the lack of a proper response as an "ACCESS_DENIED" error because it didn't receive the expected authorization confirmation from the server.
- Partial Responses: Sometimes the client might receive a partial response before the connection drops. If the response lacks the necessary authentication data, it could be interpreted as an access denial rather than a network failure.

3. Boto3 or Pyarrow Error Handling

- Error Mapping: The boto3 or pyarrow library might map certain low-level network errors to higher-level errors like "ACCESS_DENIED" if the error occurs during a critical authentication step. This can be due to how the libraries abstract away the complexity of handling AWS responses.
- Inconsistent Error Messages: The error-handling mechanism in these libraries may not always distinguish clearly between an access denial and a network issue, especially if the error occurs at a point where access checks are involved.

4. Load Balancer or CDN Issues

- AWS Infrastructure: If AWS's load balancers or edge nodes experience brief issues, requests might be routed in ways that cause them to fail. In such cases, the error might be incorrectly flagged as an access issue when it's actually a transient infrastructure problem.

5. DNS Resolution Problems

- DNS Resolution Failures: If there's an intermittent DNS resolution failure, the request might not reach the correct endpoint, leading to an incorrect "ACCESS_DENIED" response due to a failure in resolving the proper S3 bucket URL.

raulchen (Contributor Author) commented:

Linking a related issue: #42153

scottjlee added a commit that referenced this issue Aug 21, 2024
…for multi-node Data+Train benchmarks (#47232)

## Why are these changes needed?

For release tests like `read_images_train_1_gpu_5_cpu`,
`read_images_train_4_gpu`, `read_images_train_16_gpu`, and their
variants, we observe `AWS ACCESS_DENIED` errors somewhat consistently,
but not every time. By default, we do not retry on `ACCESS_DENIED`
because `ACCESS_DENIED` can be raised in multiple situations, and does
not necessarily stem from authentication failures; hence we cannot
distinguish auth errors from other unrelated transient errors. See
#47230 for more details on the
underlying issue.

For the purpose of this release test, we don't foresee authentication
issues, so we add `ACCESS_DENIED` as a retryable exception type, to
avoid failures for transient errors.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [x] Release tests - https://buildkite.com/ray-project/release/builds/21397
  - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
scottjlee (Contributor) commented Aug 27, 2024

The current theory behind the root cause is that the original credentials become unavailable in the middle of execution, possibly due to a pyarrow.fs bug. The suggested workaround is to explicitly define a filesystem using credentials generated with boto3, and pass it to the read method you are using. For example:

import ray

def get_s3fs_with_boto_creds():
    import boto3
    from pyarrow import fs

    # Resolve credentials through boto3's default chain (env vars,
    # shared config files, instance profile, etc.).
    credentials = boto3.Session().get_credentials()

    # Pass the resolved credentials explicitly to pyarrow's S3 filesystem.
    s3fs = fs.S3FileSystem(
        access_key=credentials.access_key,
        secret_key=credentials.secret_key,
        session_token=credentials.token,
        region=...,  # fill in your bucket's region
    )
    return s3fs

fs = get_s3fs_with_boto_creds()
ds = ray.data.read_images(..., filesystem=fs)

Potential downsides of this workaround:

- Credentials obtained from the driver may not work on remote nodes.
- This approach doesn't handle credential expiration (see the sketch below for one partial mitigation).
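
For the expiration downside, a hedged sketch of one partial mitigation: snapshot the boto3 credentials with botocore's `get_frozen_credentials()` and rebuild the filesystem right before each read, so the session token is as fresh as possible (it still won't refresh mid-read; the region and S3 path below are placeholders):

```python
import ray
import boto3
from pyarrow import fs

def fresh_s3fs(region="us-west-2"):  # placeholder region; use your bucket's
    # get_frozen_credentials() resolves a consistent snapshot of the
    # access key / secret key / session token at call time.
    creds = boto3.Session().get_credentials().get_frozen_credentials()
    return fs.S3FileSystem(
        access_key=creds.access_key,
        secret_key=creds.secret_key,
        session_token=creds.token,
        region=region,
    )

# Rebuild the filesystem immediately before starting the read so the
# temporary credentials are as recent as possible.
ds = ray.data.read_images("s3://my-bucket/images/", filesystem=fresh_s3fs())
```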
