Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metaflow S3 retry behavior #953

Closed
shoelsch opened this issue Feb 15, 2022 · 0 comments · Fixed by #1001
Closed

Metaflow S3 retry behavior #953

shoelsch opened this issue Feb 15, 2022 · 0 comments · Fixed by #1001
Assignees

Comments

@shoelsch
Copy link

shoelsch commented Feb 15, 2022

Problem

I ran metaflow configure aws, configuring METAFLOW_DATASTORE_SYSROOT_S3 and METAFLOW_DATATOOLS_SYSROOT_S3 and setting METAFLOW_DEFAULT_DATASTORE to s3.

After, I ran python myflow.py run, and the process appeared to hang:

Metaflow 2.5.0 executing MyFlow for user:developer
Execute before run...
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!

I added some debug logging to _one_boto_op. It looks like S3 access is silently failing and retrying for up to 5 minutes.

Retry #0
Sleeping for 7 seconds...
Retry #1
Sleeping for 2 seconds...
Retry #2
Sleeping for 11 seconds...
Retry #3
Sleeping for 18 seconds...
Retry #4
Sleeping for 18 seconds...
Retry #5
Sleeping for 34 seconds...
Retry #6
Sleeping for 65 seconds...
Retry #7
Sleeping for 133 seconds...

We ran out of attempts (and slept for 288 seconds)!
    S3 access failed:
    S3 operation failed.
    Key requested: s3://metaflow-artifact-store/flows/MyFlow/1644904607105118/_parameters/0/0.attempt.json

On top of this, if I set METAFLOW_S3_RETRY_COUNT=0 to mitigate the retry behavior, I still need to wait for the sleep:

Retry #0
Sleeping for 8 seconds...
We ran out of attempts (and slept for 8 seconds)!
    S3 access failed:
    S3 operation failed.
    Key requested: s3://metaflow-artifact-store/flows/MyFlow/1644905069513467/_parameters/0/0.attempt.json
    Error: Unable to locate credentials

Expected Behavior

I was surprised that I had to wait almost 5 minutes for the process to fail and let me know I have an AWS configuration error.

I would expect setting the retry count to 0 (effectively turning it off) to also disable the sleep behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants