
[data] AWS ACCESS_DENIED errors due to transient network issues #47230

Open
raulchen opened this issue Aug 20, 2024 · 4 comments
Labels: data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks)

Comments

raulchen (Contributor) commented Aug 20, 2024

Sometimes we get ACCESS_DENIED errors when read task concurrency is high or when the network is unstable.
This can happen even when credentials are properly set. For example, in some cases the AWS service doesn't receive enough information to determine the actual error and defaults to an "ACCESS_DENIED" response.

Currently we don't retry on ACCESS_DENIED errors because we cannot distinguish transient errors from real authentication errors. In both cases we get the same error: `OSError: When getting information for key '...' in bucket '...': AWS Error ACCESS_DENIED during HeadObject operation: No response body`.

When this happens, reducing concurrency may help. If you are sure your credentials are set up correctly, another option is to manually add the ACCESS_DENIED error to the retry list (a sketch follows below).

See the comment below for a potential workaround.
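
For reference, a minimal sketch of the retry-list approach, assuming your Ray version exposes `DataContext.retried_io_errors` (the list of error-message substrings that Ray Data retries I/O operations on); the exact substring to match may differ across versions:

```python
import ray

# Only do this if you are confident your AWS credentials are valid;
# otherwise genuine auth failures will also be retried before surfacing.
ctx = ray.data.DataContext.get_current()
ctx.retried_io_errors.append("AWS Error ACCESS_DENIED")

# "s3://my-bucket/images/" is a placeholder path.
ds = ray.data.read_images("s3://my-bucket/images/")
```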

raulchen added the labels data (Ray Data-related issues) and P1 (Issue that should be fixed within a few weeks) on Aug 20, 2024
raulchen (Contributor Author) commented:

Another option is to retry ACCESS_DENIED for read tasks, but not for metadata-fetching tasks: if it's a real authentication issue, the metadata-fetching tasks will raise the error first.
However, this can still cause confusion if the user has permission for the directory but not for a file inside it (see the illustrative sketch below).
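
A purely illustrative sketch of that policy; this is not Ray's actual retry code, and `read_block`/`fetch_metadata` are hypothetical helper names:

```python
import time

# Substrings treated as transient only inside read tasks.
TRANSIENT_IN_READ_TASKS = ("AWS Error ACCESS_DENIED",)

def read_with_retries(read_block, path, max_attempts=5):
    """Retry ACCESS_DENIED here because metadata fetching already
    succeeded for this path, so the error is more likely transient."""
    for attempt in range(max_attempts):
        try:
            return read_block(path)
        except OSError as e:
            transient = any(s in str(e) for s in TRANSIENT_IN_READ_TASKS)
            if not transient or attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff

def fetch_metadata(filesystem, path):
    """Metadata fetching does NOT retry ACCESS_DENIED, so a genuine
    authentication problem surfaces immediately."""
    return filesystem.get_file_info(path)
```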

raulchen (Contributor Author) commented:

A more detailed explanation from ChatGPT of why this error can be raised mistakenly:

The “ACCESS_DENIED” error message during network operations like HeadObject can sometimes be misleading when dealing with intermittent network issues. Here’s why this might happen:

1. Network Interruption Leading to Misinterpretation

- Inconsistent Connectivity: If there's a brief network glitch or packet loss, the request might not reach the AWS server correctly, or the response might not be received properly. When the request is incomplete or garbled due to the network issue, the server might not be able to authenticate it correctly, causing it to respond with an "ACCESS_DENIED" error.
- Fallback Errors: In some cases, if the AWS service doesn't receive enough information to determine the actual error, it might default to an "ACCESS_DENIED" response. This is a conservative approach to avoid accidentally exposing resources.

2. Timeouts Misinterpreted as Access Issues

- Timeout Handling: When a request times out due to network issues, the client library (pyarrow in this case) might interpret the lack of a proper response as an "ACCESS_DENIED" error because it didn't receive the expected authorization confirmation from the server.
- Partial Responses: Sometimes the client might receive a partial response before the connection drops. If the response lacks the necessary authentication data, it could be interpreted as an access denial rather than a network failure.

3. Boto3 or Pyarrow Error Handling

- Error Mapping: The boto3 or pyarrow library might map certain low-level network errors to higher-level errors like "ACCESS_DENIED" if the error occurs during a critical authentication step. This can be due to how the libraries abstract away the complexity of handling AWS responses.
- Inconsistent Error Messages: The error-handling mechanism in these libraries may not always distinguish clearly between an access denial and a network issue, especially if the error occurs at a point where access checks are involved.

4. Load Balancer or CDN Issues

- AWS Infrastructure: If AWS's load balancers or edge nodes experience brief issues, requests might be routed in ways that cause them to fail. In such cases, the error might be incorrectly flagged as an access issue when it's actually a transient infrastructure problem.

5. DNS Resolution Problems

- DNS Resolution Failures: If there's an intermittent DNS resolution failure, the request might not reach the correct endpoint, leading to an incorrect "ACCESS_DENIED" response due to a failure in resolving the proper S3 bucket URL.

raulchen (Contributor Author) commented:

Linking a related issue: #42153

scottjlee added a commit that referenced this issue Aug 21, 2024
…for multi-node Data+Train benchmarks (#47232)

## Why are these changes needed?

For release tests like `read_images_train_1_gpu_5_cpu`,
`read_images_train_4_gpu`, `read_images_train_16_gpu`, and their
variants, we observe `AWS ACCESS_DENIED` errors somewhat consistently,
but not every time. By default, we do not retry on `ACCESS_DENIED`
because `ACCESS_DENIED` can be raised in multiple situations, and does
not necessarily stem from authentication failures; hence we cannot
distinguish auth errors from other unrelated transient errors. See
#47230 for more details on the
underlying issue.

For the purpose of this release test, we don't foresee authentication
issues, so we add `ACCESS_DENIED` as a retryable exception type, to
avoid failures for transient errors.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [x] Release tests - https://buildkite.com/ray-project/release/builds/21397
  - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
scottjlee (Contributor) commented Aug 27, 2024

The current theory behind the root cause is that the original credentials become unavailable in the middle of execution, possibly due to a pyarrow.fs bug. The suggested workaround is to explicitly define a filesystem using credentials generated with boto3, and pass it to the read method you are using. For example:

import ray

def get_s3fs_with_boto_creds():
    import boto3
    from pyarrow import fs

    # Resolve credentials through boto3's default chain (env vars,
    # shared config files, instance profile, etc.).
    credentials = boto3.Session().get_credentials()

    # Pass the resolved credentials explicitly to pyarrow's S3 filesystem.
    s3fs = fs.S3FileSystem(
        access_key=credentials.access_key,
        secret_key=credentials.secret_key,
        session_token=credentials.token,
        region=...,  # fill in your bucket's region
    )
    return s3fs

fs = get_s3fs_with_boto_creds()
ds = ray.data.read_images(..., filesystem=fs)

Potential downsides of this workaround:

- Credentials obtained from the driver may not work on remote nodes.
- This approach doesn't handle credential expiration (see the sketch below for one partial mitigation).
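
For the expiration downside, a hedged sketch of one partial mitigation: snapshot the boto3 credentials with botocore's `get_frozen_credentials()` and rebuild the filesystem right before each read, so the session token is as fresh as possible (it still won't refresh mid-read; the region and S3 path below are placeholders):

```python
import ray
import boto3
from pyarrow import fs

def fresh_s3fs(region="us-west-2"):  # placeholder region; use your bucket's
    # get_frozen_credentials() resolves a consistent snapshot of the
    # access key / secret key / session token at call time.
    creds = boto3.Session().get_credentials().get_frozen_credentials()
    return fs.S3FileSystem(
        access_key=creds.access_key,
        secret_key=creds.secret_key,
        session_token=creds.token,
        region=region,
    )

# Rebuild the filesystem immediately before starting the read so the
# temporary credentials are as recent as possible.
ds = ray.data.read_images("s3://my-bucket/images/", filesystem=fresh_s3fs())
```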
