[BUG] - Timeout error when deploying to AWS - CreateBucket OperationAborted #2613

viniciusdc · 2024-08-06T19:21:41Z

Describe the bug

While testing our latest release candidate on AWS, the deployment got stuck while creating the S3 bucket for the Terraform state.

While inspecting the CloudTrail logs, I noticed that AWS was aborting the CreateBucket requests due to a conflicting conditional operation currently in progress:

{
    "eventSource": "s3.amazonaws.com",
    "eventName": "CreateBucket",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "***",
    "errorCode": "OperationAborted",
    "errorMessage": "A conflicting conditional operation is currently in progress against this resource. Please try again.",
    "requestParameters": {
        "bucketName": "***-dev-terraform-state",
        "Host": "***-dev-terraform-state.s3.amazonaws.com",
        "x-amz-acl": "private"
    }
}

I am still not entirely sure about the cause, but I think it is related to the order of operations performed by the AWS Terraform provider when creating the bucket and applying the encryption and public access block changes. Unfortunately, I didn't have trace logs enabled at the time, so I couldn't thoroughly inspect the concurrent requests made by the AWS provider. These are the requests that happened during apply:

PutBucketTagging (part of aws_s3_bucket)
CreateBucket (part of aws_s3_bucket)
PutBucketVersioning (part of aws_s3_bucket)
PutBucketPublicAccessBlock
PutBucketEncryption

We already enforce some dependency between these resources in our TF code:

graph TB
    A[aws_kms_key]
    B[aws_s3_bucket]
    C[aws_s3_bucket_server_side_encryption_configuration]
    D[aws_s3_bucket_public_access_block]
    A --> B
    B --> C
    B --> D

However, nothing seems to prevent aws_s3_bucket_server_side_encryption_configuration from running concurrently with, or out of order relative to, aws_s3_bucket_public_access_block.

In my opinion, we have two possible solutions:

Upgrade the AWS provider to the version that contains the upstream fix, v3.67.0 (ideal).
Add an explicit depends_on to the resources above to force the API requests to happen sequentially.
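
A minimal sketch of what the depends_on option could look like. The resource label `tf_state`, the bucket name, and the attribute values are hypothetical placeholders, not taken from the Nebari Terraform code. The direction of the dependency (encryption after the public access block) is also an assumption; the point is only that the two bucket sub-resources no longer run in parallel.

```hcl
# Hypothetical names and values throughout; only the resource types come from the issue.
resource "aws_kms_key" "tf_state" {
  description = "Key for encrypting the Terraform state bucket"
}

resource "aws_s3_bucket" "tf_state" {
  bucket = "example-dev-terraform-state" # placeholder bucket name
}

resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      kms_master_key_id = aws_kms_key.tf_state.arn
      sse_algorithm     = "aws:kms"
    }
  }

  # Assumption: wait for the public access block so both requests
  # against the bucket are issued one after the other, not concurrently.
  depends_on = [aws_s3_bucket_public_access_block.tf_state]
}
```

Both sub-resources already depend on the bucket implicitly through `aws_s3_bucket.tf_state.id`; the explicit depends_on adds the missing edge between the two siblings in the graph above.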

Expected behavior

The deployment succeeds without getting blocked and without requiring the user to redeploy for the sequence to self-heal.

OS and architecture in which you are running Nebari

Linux

How to Reproduce the problem?

This can be a bit tricky to reproduce, as it depends on the order in which the two resources are applied as well as the timing of the requests sent by the AWS provider. In theory, though, creating a deployment from scratch on AWS should be enough to reproduce it.

Command output

No response

Versions and dependencies used.

nebari 2024.7.1rc3

Compute environment

AWS

Integrations

No response

Anything else?

An in-depth discussion of similar behavior can be found here: hashicorp/terraform-provider-aws#7628.

This one is also interesting: hashicorp/terraform-provider-aws#14078.


viniciusdc · 2024-08-07T17:53:15Z

Regarding the first option ("As this seems to be fixed upstream, I suggest we test upgrading the AWS provider to the version with the fix (ideal) v3.67.0"): this is not feasible, as we are already running a newer version (5.33.0) than the one suggested to have the fix (3.67.0).

marcelovilla · 2024-08-14T19:11:11Z

@viniciusdc should we close this now that #2615 has been merged?

marcelovilla · 2024-08-27T16:51:33Z

@viniciusdc I'm closing this as it should have been resolved by #2615. Feel free to reopen the issue if needed.

viniciusdc added the "type: bug 🐛", "needs: triage 🚦", "provider: AWS", and "area: terraform 💾" labels Aug 6, 2024

viniciusdc mentioned this issue Aug 6, 2024: Testing checklist for 2024.7.1 #2596 (Closed)

marcelovilla self-assigned this Aug 7, 2024

marcelovilla removed the "needs: triage 🚦" label Aug 7, 2024

viniciusdc mentioned this issue Aug 7, 2024: Add depends_on for bucket encryption #2615 (Merged)

marcelovilla closed this as completed Aug 27, 2024
