Uploading S3 with boto3 (crt) fails silently in multiprocess scenario #4443
Labels: bug, investigating, p2, s3
Describe the bug
We have a distributed multi-process scenario in which each process uploads some data to S3.
A certain percentage of the processes upload to S3 without issue, but there always seems to be at least one process that fails silently and gets permanently stuck in the upload_fileobj call, without any retry or timeout.
Regression Issue
Expected Behavior
Ideally, even if there is a failure due to throttling, connectivity, or some other issue, an exception should at least be raised.
Otherwise nothing can be done until the program hits its own timeouts, if any are even configured.
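Until the hang itself is fixed, one possible stopgap is to put a deadline around the blocking call so that a stuck upload surfaces as an exception. The following is a minimal sketch using only the standard library; call_with_deadline is a hypothetical helper, not part of boto3, and the abandoned worker thread keeps running in the background, so the caller should treat the process as unhealthy after a timeout.

```python
import threading


def call_with_deadline(func, timeout_s, *args, **kwargs):
    """Run func in a daemon thread; raise TimeoutError if it does not finish in time."""
    outcome = {}

    def runner():
        # Capture either the return value or the exception for the caller.
        try:
            outcome["value"] = func(*args, **kwargs)
        except BaseException as exc:
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        # The call is still blocked; report it instead of waiting forever.
        raise TimeoutError(f"call did not finish within {timeout_s} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Assuming an existing client s3 and a file object fileobj, a call could look like call_with_deadline(s3.upload_fileobj, 300, fileobj, "test-bucket", key), where the 300-second limit is an arbitrary example value.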
Current Behavior
For successful processes, this is the log:
For the silently failing processes, this is the log; the process hangs forever inside the upload_fileobj call:
Reproduction Steps
The client is set up with adaptive retry as follows:
Possible Solution
I have tried various TransferConfig options. Setting use_threads=False seems to be slightly better, but at least one process still fails 100% of the time.
There seems to be an older existing issue describing this too: #1067
#1067 (comment)
In this scenario, test-bucket is the bucket and experiments/testing/__0/0.distcp is the full key.
Prior to calling upload, the experiments/testing folder has already been created and exists.
Since experiments/testing/__{rank}/{proc}.distcp is the full key, there is an intermediate folder, __{rank}, that is missing.
Because of that, I have also tried changing the key so that it does not use an intermediate folder, making the full key:
experiments/testing/__{rank}_{proc}.distcp
But I get the exact same error, so the intermediate folder does not seem to be the issue; it may instead have something to do with how CRT handles locks.
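The two key layouts that were tried can be written out as follows. This is only a small sketch to make the difference concrete; the function names nested_key and flat_key are made up for illustration.

```python
def nested_key(prefix, rank, proc):
    """Key with an intermediate __{rank} folder, as in the original layout."""
    return f"{prefix}/__{rank}/{proc}.distcp"


def flat_key(prefix, rank, proc):
    """Key without the intermediate folder, as in the second attempt."""
    return f"{prefix}/__{rank}_{proc}.distcp"


print(nested_key("experiments/testing", 0, 0))  # experiments/testing/__0/0.distcp
print(flat_key("experiments/testing", 0, 0))    # experiments/testing/__0_0.distcp
```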
I am aware of the following limitation of boto3 and CRT:
But I don't think there is any way around multiprocessing in our environment and situation.
Additional Information/Context
This is also a cross-region upload, from an EC2 instance in ap-south-1 to an S3 bucket in us-east-1.
This is the trace for the "failed" stuck process:
SDK version used
boto3==1.36.3 botocore==1.36.3 aws_crt==0.23.4 s3transfer==0.11.2
Environment details (OS name and version, etc.)
5.10.217-205.860.amzn2.x86_64