Uploading S3 with boto3 (crt) fails silently in multiprocess scenario #4443

Open

ryxli opened this issue Feb 15, 2025 · 0 comments
Labels
bug This issue is a confirmed bug. · investigating This issue is being investigated and/or work is in progress to resolve the issue. · p2 This is a standard priority issue · s3

ryxli commented Feb 15, 2025

Describe the bug

I have a distributed multi-process scenario where each process uploads some data to S3.

A certain percentage of processes upload to S3 and succeed without issue, but there always seems to be at least one process that fails silently and gets permanently stuck in the upload_fileobj call, without any retry or timeout.

  # file_name.bucket = s3://test-bucket
  # file_name.key = f"experiments/testing/__{os.environ['GLOBAL_RANK']}/0.distcp"
  #    for example experiments/testing/__0/0.distcp, experiments/testing/__123/0.distcp, etc.

  # client is instantiated in the main process and passed to this section,
  # which is run in an mp fork context

  logger.info(f"{local_proc_idx} start uploading {file_name}")
  config = TransferConfig(  # boto3.s3.transfer.TransferConfig; MB = 1024 ** 2
      use_threads=False,
      multipart_chunksize=8 * MB
  )
  try:
      s3_client.upload_fileobj(file_or_obj, file_name.bucket, file_name.key, Config=config)
  except Exception as e:
      logger.error(f"Failed to upload {file_name.key} to {file_name.bucket}: {str(e)}")
      raise
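
Note that sharing a client created in the parent process across a fork is a likely trouble spot: boto3 clients are not documented to be fork-safe. A minimal sketch of the usual workaround, creating the client inside each worker after the process starts (the worker function and its arguments are hypothetical, shaped like the snippet above):

  import boto3
  import multiprocessing as mp
  from boto3.s3.transfer import TransferConfig

  MB = 1024 ** 2

  def upload_worker(bucket, key, path):
      # Hypothetical worker: the client is created *after* the process
      # starts, so no connection pool or CRT state crosses the fork.
      s3_client = boto3.client("s3", region_name="us-east-1")
      config = TransferConfig(use_threads=False, multipart_chunksize=8 * MB)
      with open(path, "rb") as f:
          s3_client.upload_fileobj(f, bucket, key, Config=config)

  if __name__ == "__main__":
      ctx = mp.get_context("spawn")  # spawn sidesteps fork-inherited state entirely
      p = ctx.Process(target=upload_worker,
                      args=("test-bucket", "experiments/testing/__0/0.distcp", "0.distcp"))
      p.start()
      p.join()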

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

Ideally, even if there is a failure due to throttling, connectivity, or some other issue, at least an exception should be thrown.

Otherwise nothing can be done until the program hits a timeout, if any is even configured.
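
In the meantime, a caller-side guard can at least surface the hang. A minimal sketch (the helper name and default timeout are assumptions, not part of the original code): run the blocking upload in a worker thread and raise TimeoutError instead of waiting forever.

  import concurrent.futures

  def upload_with_timeout(s3_client, fileobj, bucket, key, config, timeout_s=900):
      # Run the blocking call in a worker thread so the caller can time out.
      pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
      fut = pool.submit(s3_client.upload_fileobj, fileobj, bucket, key, Config=config)
      try:
          return fut.result(timeout=timeout_s)  # raises TimeoutError if stuck
      finally:
          # wait=False lets the timed-out caller proceed; note that a truly hung
          # upload thread is abandoned and may still keep the process alive at exit.
          pool.shutdown(wait=False)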

Current Behavior

For successful processes, this is the log:

  [0]:0 start uploading s3://test-bucket/experiments/testing/__176/0.distcp
  [0]:Attempting to use CRTTransferManager. Config settings may be ignored.

    [0]:Using default client. pid: 140906, thread: 139768785741632
    [0]:Acquiring 0
    [0]:UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f19e1f5ccd0>}) about to wait for the following futures []
    [0]:UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f19e1f5ccd0>}) done waiting for dependent futures
    [0]:Executing task UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f19e1f5ccd0>}) with kwargs {'client': <botocore.client.S3 object at 0x7f1b12b545e0>, 'config': <boto3.s3.transfer.TransferConfig object at 0x7f1a10122680>, 'osutil': <s3transfer.utils.OSUtils object at 0x7f1a10120b20>, 'request_executor': <s3transfer.futures.BoundedExecutor object at 0x7f1a10121810>, 'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f19e1f5ccd0>}
    [0]:Submitting task CreateMultipartUploadTask(transfer_id=0, {'bucket': 'test-bucket', 'key': '.../__176/0.distcp', 'extra_args': {}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f1a10121810> for transfer request: 0.
    [0]:Acquiring 0
    [0]:CreateMultipartUploadTask(transfer_id=0, {...}) about to wait for the following futures []
    [0]:CreateMultipartUploadTask(transfer_id=0, {...}) done waiting for dependent futures
    [0]:Executing task CreateMultipartUploadTask(transfer_id=0, {...}) with kwargs {'client': <botocore.client.S3 object at 0x7f1b12b545e0>, 'bucket': '...', 'key': '...', 'extra_args': {}}
    [0]:Event before-parameter-build.s3.CreateMultipartUpload: calling handler <function validate_ascii_metadata at 0x7f1d94465c60>
    [0]:Event before-parameter-build.s3.CreateMultipartUpload: calling handler <function sse_md5 at 0x7f1d94465090>
    [0]:Event before-parameter-build.s3.CreateMultipartUpload: calling handler <function validate_bucket_name at 0x7f1d94465000>
    [0]:Event before-parameter-build.s3.CreateMultipartUpload: calling handler <function remove_bucket_from_url_paths_from_model at 0x7f1d94466e60>
    [0]:Event before-parameter-build.s3.CreateMultipartUpload: calling handler <bound method S3RegionRedirectorv2.annotate_request_context of <botocore.utils.S3RegionRedirectorv2 object at 0x7f1b13ea1b10>>
    [0]:Event before-parameter-build.s3.CreateMultipartUpload: calling handler <bound method ClientCreator._inject_s3_input_parameters of <botocore.client.ClientCreator object at 0x7f1b13d311b0>>
    [0]:Event before-parameter-build.s3.CreateMultipartUpload: calling handler <function generate_idempotent_uuid at 0x7f1d94464e50>
    [0]:Event before-parameter-build.s3.CreateMultipartUpload: calling handler <function _handle_request_validation_mode_member at 0x7f1d94467520>
    [0]:Event before-endpoint-resolution.s3: calling handler <function customize_endpoint_resolver_builtins at 0x7f1d94467010>
    [0]:Event before-endpoint-resolution.s3: calling handler <bound method S3RegionRedirectorv2.redirect_from_cache of <botocore.utils.S3RegionRedirectorv2 object at 0x7f1b13ea1b10>>
    [0]:Calling endpoint provider with parameters: {'Bucket': '...', 'Region': 'us-east-1', 'UseFIPS': False, 'UseDualStack': False, 'ForcePathStyle': False, 'Accelerate': False, 'UseGlobalEndpoint': True, 'Key': '...', 'DisableMultiRegionAccessPoints': False, 'UseArnRegion': True}
    [0]:Endpoint provider result: <...>
...
    [0]:2025-02-15 00:37:49,485 botocore.hooks [DEBUG] Event request-created.s3.CompleteMultipartUpload: calling handler <function signal_transferring at 0x7f1b13f3fac0>
    [0]:2025-02-15 00:37:49,485 botocore.hooks [DEBUG] Event request-created.s3.CompleteMultipartUpload: calling handler <function add_retry_headers at 0x7f1d94466dd0>
    [0]:2025-02-15 00:37:49,485 botocore.endpoint [DEBUG] Sending http request: <AWSPreparedRequest stream_output=False, method=POST, url=..., 'amz-sdk-invocation-id': b'3aaea3d6-a6e1-4264-b773-f2f309315845', 'amz-sdk-request': b'attempt=1', 'Content-Length': '42759'}> 
 ...
    [0]:Event before-parse.s3.CompleteMultipartUpload: calling handler <function _handle_200_error at 0x7f1d944672e0>
    [0]:Event before-parse.s3.CompleteMultipartUpload: calling handler <function handle_expires_header at 0x7f1d94467130>
 ...
    [0]:Response body:
    [0]:b'<?xml version="1.0" encoding="UTF-8"?>\n\n<CompleteMultipartUploadResult xmlns=... ... <Bucket>test-bucket</Bucket><Key>...__176/0.distcp</Key><ETag>".."</ETag></CompleteMultipartUploadResult>'
    [0]:Event needs-retry.s3.CompleteMultipartUpload: calling handler <function _update_status_code at 0x7f1d94467400>
    [0]:Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method RetryHandler.needs_retry of <botocore.retries.standard.RetryHandler object at 0x7f1b13ea1d20>>
    [0]:Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method S3RegionRedirectorv2.redirect_from_error of <botocore.utils.S3RegionRedirectorv2 object at 0x7f1b13ea1b10>>
    [0]:Not retrying request.
    [0]:Event needs-retry.s3.CompleteMultipartUpload: calling handler <bound method ClientRateLimiter.on_receiving_response of <botocore.retries.adaptive.ClientRateLimiter object at 0x7f1b13ea1ba0>>
    [0]:Event after-call.s3.CompleteMultipartUpload: calling handler <bound method RetryQuotaChecker.release_retry_quota of <botocore.retries.standard.RetryQuotaChecker object at 0x7f1b13ea22f0>>
    [0]:Releasing acquire 0/None
    [0]:Releasing acquire 0/None

For the silently failing processes, this is the log; the process hangs forever inside the upload_fileobj call:

[4]:0 start uploading s3://test-bucket/experiments/testing/__180/0.distcp
[4]:Attempting to use CRTTransferManager. Config settings may be ignored.
[4]:Using CRT client. pid: 140888, thread: 140482685585216
[4]:Event before-parameter-build.s3.PutObject: calling handler <function validate_ascii_metadata at 0x7fc3cc075c60>
[4]:Event before-parameter-build.s3.PutObject: calling handler <function sse_md5 at 0x7fc3cc075090>
[4]:Event before-parameter-build.s3.PutObject: calling handler <function convert_body_to_file_like_object at 0x7fc3cc076560>
[4]:Event before-parameter-build.s3.PutObject: calling handler <function validate_bucket_name at 0x7fc3cc075000>
[4]:Event before-parameter-build.s3.PutObject: calling handler <function remove_bucket_from_url_paths_from_model at 0x7fc3cc076e60>
[4]:Event before-parameter-build.s3.PutObject: calling handler <bound method S3RegionRedirectorv2.annotate_request_context of <botocore.utils.S3RegionRedirectorv2 object at 0x7fbe1816df30>>
[4]:Event before-parameter-build.s3.PutObject: calling handler <bound method ClientCreator._inject_s3_input_parameters of <botocore.client.ClientCreator object at 0x7fbde8388ee0>>
[4]:Event before-parameter-build.s3.PutObject: calling handler <function generate_idempotent_uuid at 0x7fc3cc074e50>
[4]:Event before-parameter-build.s3.PutObject: calling handler <function _handle_request_validation_mode_member at 0x7fc3cc077520>
[4]:Event before-endpoint-resolution.s3: calling handler <function customize_endpoint_resolver_builtins at 0x7fc3cc077010>
[4]:Event before-endpoint-resolution.s3: calling handler <bound method S3RegionRedirectorv2.redirect_from_cache of <botocore.utils.S3RegionRedirectorv2 object at 0x7fbe1816df30>>
[4]:Calling endpoint provider with parameters: {'Bucket': 'test-bucket', 'Region': 'us-east-1', 'UseFIPS': False, 'UseDualStack': False, 'ForcePathStyle': False, 'Accelerate': False, 'UseGlobalEndpoint': True, 'Key': 'experiments/testing/__180/0.distcp', 'DisableMultiRegionAccessPoints': False, 'UseArnRegion': True}
[4]:Endpoint provider result: ...
[4]:Selecting from endpoint provider's list of auth schemes: "sigv4". User selected auth scheme is: "<botocore.UNSIGNED object at 0x7fc3cc8ed450>"
[4]:Event before-call.s3.PutObject: calling handler <bound method BotocoreCRTRequestSerializer._remove_checksum_context of <s3transfer.crt.BotocoreCRTRequestSerializer object at 0x7fbde83888e0>>
[4]:Event before-call.s3.PutObject: calling handler <function add_expect_header at 0x7fc3cc075360>
[4]:Adding expect 100 continue header to request.
[4]:Event before-call.s3.PutObject: calling handler <bound method S3ExpressIdentityResolver.apply_signing_cache_key of <botocore.utils.S3ExpressIdentityResolver object at 0x7fbe1816df00>>
[4]:Event before-call.s3.PutObject: calling handler <function add_recursion_detection_header at 0x7fc3cc074a60>
[4]:Event before-call.s3.PutObject: calling handler <function add_query_compatibility_header at 0x7fc3cc077490>
[4]:Event before-call.s3.PutObject: calling handler <function inject_api_version_header_if_needed at 0x7fc3cc076680>
[4]:Making request for OperationModel(name=PutObject) with params: {...}
[4]:Event request-created.s3.PutObject: calling handler <bound method BotocoreCRTRequestSerializer._capture_http_request of <s3transfer.crt.BotocoreCRTRequestSerializer object at 0x7fbde83888e0>>
[4]:Event request-created.s3.PutObject: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7fbe1819d8a0>>
[4]:Event choose-signer.s3.PutObject: calling handler <function set_operation_specific_signer at 0x7fc3cc074ca0>
[4]:Event before-sign.s3.PutObject: calling handler <function remove_arn_from_signing_path at 0x7fc3cc076f80>
[4]:Event before-sign.s3.PutObject: calling handler <function _set_extra_headers_for_unsigned_request at 0x7fc3cc0775b0>
[4]:Event before-sign.s3.PutObject: calling handler <bound method S3ExpressIdentityResolver.resolve_s3express_identity of <botocore.utils.S3ExpressIdentityResolver object at 0x7fbe1816df00>>
[4]:Event request-created.s3.PutObject: calling handler <function add_retry_headers at 0x7fc3cc076dd0>
[4]:Sending http request: <AWSPreparedRequest ...>
[4]:Event before-send.s3.PutObject: calling handler <bound method BotocoreCRTRequestSerializer._make_fake_http_response of <s3transfer.crt.BotocoreCRTRequestSerializer object at 0x7fbde83888e0>>
[4]:Event before-send.s3.PutObject: calling handler <bound method ClientRateLimiter.on_sending_request of <botocore.retries.adaptive.ClientRateLimiter object at 0x7fbe1816e0e0>>
[4]:Event before-parse.s3.PutObject: calling handler <function _handle_200_error at 0x7fc3cc0772e0>
[4]:Event before-parse.s3.PutObject: calling handler <function handle_expires_header at 0x7fc3cc077130>
[4]:Response headers: {}
[4]:Response body:
[4]:b''
... <repeats a few times>

[4]:botocore.hooks [DEBUG] Event needs-retry.s3.PutObject: calling handler <function _update_status_code at 0x7fc3cc077400>
[4]:botocore.hooks [DEBUG] Event needs-retry.s3.PutObject: calling handler <bound method RetryHandler.needs_retry of <botocore.retries.standard.RetryHandler object at 0x7fbe1816e620>>
[4]:botocore.retries.standard [DEBUG] Not retrying request.
[4]:botocore.hooks [DEBUG] Event needs-retry.s3.PutObject: calling handler <bound method S3RegionRedirectorv2.redirect_from_error of <botocore.utils.S3RegionRedirectorv2 object at 0x7fbe1816df30>>
[4]:botocore.hooks [DEBUG] Event needs-retry.s3.PutObject: calling handler <bound method ClientRateLimiter.on_receiving_response of <botocore.retries.adaptive.ClientRateLimiter object at 0x7fbe1816e0e0>>
[4]:botocore.hooks [DEBUG] Event after-call.s3.PutObject: calling handler <bound method BotocoreCRTRequestSerializer._change_response_to_serialized_http_request of <s3transfer.crt.BotocoreCRTRequestSerializer object at 0x7fbde83888e0>>
[4]:botocore.hooks [DEBUG] Event after-call.s3.PutObject: calling handler <bound method RetryQuotaChecker.release_retry_quota of <botocore.retries.standard.RetryQuotaChecker object at 0x7fbe1816f130>>

Reproduction Steps

The client is set up with adaptive retries as follows (retries and max_pool_connections go through botocore.config.Config):

region_name="us-east-1"
retries={"max_attempts": 5, "mode": "adaptive"}
max_pool_connections=96
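
For reference, a minimal sketch of how these settings map onto a client, assuming they are passed through botocore.config.Config (the variable name s3_client matches the snippet below):

  import boto3
  from botocore.config import Config

  s3_client = boto3.client(
      "s3",
      region_name="us-east-1",
      config=Config(
          retries={"max_attempts": 5, "mode": "adaptive"},
          max_pool_connections=96,
      ),
  )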
  # file_name.bucket = s3://test-bucket
  # file_name.key = f"experiments/testing/__{os.environ['GLOBAL_RANK']}/0.distcp"
  #    for example experiments/testing/__0/0.distcp, experiments/testing/__123/0.distcp, etc.

  # client is instantiated in the main process and passed to this section,
  # which is run in an mp fork context

  logger.info(f"{local_proc_idx} start uploading {file_name}")
  config = TransferConfig(
      use_threads=False,
      multipart_chunksize=8 * MB
  )
  try:
      s3_client.upload_fileobj(file_or_obj, file_name.bucket, file_name.key, Config=config)
      # some processes are stuck inside the s3 call forever
  except Exception as e:
      logger.error(f"Failed to upload {file_name.key} to {file_name.bucket}: {str(e)}")
      raise

  # some processes will be able to reach this point

Possible Solution

I have tried various TransferConfig options; setting use_threads=False seems slightly better, but at least one process still gets stuck 100% of the time.

There seems to be an older existing issue about this too: #1067
#1067 (comment)

In this scenario, test-bucket is the bucket and experiments/testing/__0/0.distcp is the full key.
Prior to calling upload, the experiments/testing folder has already been created and exists.
Since experiments/testing/__{rank}/{proc}.distcp is the full key, there is an intermediate folder __{rank} that is missing.

Because of that, I have also tried changing the key so it does not use an intermediate folder, making the full key:
experiments/testing/__{rank}_{proc}.distcp

But I get exactly the same failure, so this does not seem to be an issue with intermediate folders; it may instead have something to do with how CRT handles its locks.

I am aware of the following limitation of boto3 and the CRT:

In this first release, the CRT integrations in the AWS CLI and Boto3 automatically detect when multiple processes are creating CRT-based S3 clients, and fall back to their non-CRT-based S3 clients in these cases. 

But I don't think there is any way around multiprocessing in our environment and situation.
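
Given that limitation, one mitigation worth trying is to pin transfers to the classic, non-CRT manager explicitly rather than relying on auto-detection. A sketch, assuming boto3's TransferConfig option preferred_transfer_client is available in the installed version:

  from boto3.s3.transfer import TransferConfig

  MB = 1024 ** 2

  # Force the classic (non-CRT) transfer manager instead of letting
  # boto3 auto-select CRT ("auto" is the default).
  config = TransferConfig(
      use_threads=False,
      multipart_chunksize=8 * MB,
      preferred_transfer_client="classic",
  )
  s3_client.upload_fileobj(file_or_obj, file_name.bucket, file_name.key, Config=config)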

Additional Information/Context

This is also a cross-region upload, from an EC2 instance in ap-south-1 to an S3 bucket in us-east-1.

This is the stack trace for the "failed" stuck process (the main thread is blocked waiting on the CRT transfer future's result):

Thread 157998 (idle): "MainThread"
    0x7fed4be4d117 (libc.so.6)
    0x7fed4be58c78 (libc.so.6)
    PyThread_acquire_lock_timed (python3.10)
    wait (threading.py:320)
    result (concurrent/futures/_base.py:453)
    result (s3transfer/crt.py:697)
    result (s3transfer/crt.py:433)
    upload_fileobj (boto3/s3/inject.py:642)

SDK version used

boto3==1.36.3 botocore==1.36.3 awscrt==0.23.4 s3transfer==0.11.2

Environment details (OS name and version, etc.)

5.10.217-205.860.amzn2.x86_64 (Amazon Linux 2)

@ryxli ryxli added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Feb 15, 2025
@aemous aemous added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Feb 17, 2025
@khushail khushail added p2 This is a standard priority issue s3 and removed needs-triage This issue or PR still needs to be triaged. labels Feb 18, 2025