Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] - Timeout error when deploying to AWS - CreateBucket OperationAborted #2613

Closed
viniciusdc opened this issue Aug 6, 2024 · 3 comments
Closed
Assignees

Comments

@viniciusdc
Copy link
Contributor

viniciusdc commented Aug 6, 2024

Describe the bug

While testing our latest RC candidate on AWS, the deployment got stuck at the S3 bucket creation for the terraform-state as seen below:

image

While inspecting the CloudTrail logs, I noticed that AWS was aborting the CreateBucket requests due to a conflicting conditional operation currently in progress:
Captura de tela de 2024-08-06 14 09 38

{
"eventSource": "s3.amazonaws.com",
    "eventName": "CreateBucket",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "***",
    "errorCode": "OperationAborted",
    "errorMessage": "A conflicting conditional operation is currently in progress against this resource. Please try again.",
    "requestParameters": {
        "bucketName": "***-dev-terraform-state",
        "Host": "***-dev-terraform-state.s3.amazonaws.com",
        "x-amz-acl": "private"
    },
}

I am still not entirely sure about the case, but I think it is related to the order of operations performed by the AWS Terraform provider when creating the bucket and applying the encryption and security visibility block changes. I, unfortunately, didn't have the trace logs on during that time, so I couldn't thoroughly inspect the concurrent requests being made by the AWS provider during that time, but these are the ones that happened during apply:

  • PutBucketTagging (part of aws_s3_bucket)
  • CreateBucket (part of aws_s3_bucket)
  • PutBucketVersioning (part of aws_s3_bucket)
  • PutBucketPublicAccessBlock
  • PutBucketEncryption

We already enforce some dependency between these resources in our TF code:

graph TB
	A[aws_kms_key]
    B[aws_s3_bucket]
    C[aws_s3_bucket_server_side_encryption_configuration]
    D[aws_s3_bucket_public_access_block]
    A --> B
    B --> C
    B --> D
Loading

However, nothing seems to prevent s3_bucket_server_side_encryption from not running concurrently or in incorrect order from s3_bucket_public_access_block.

We have two possible solutions in my opinion:

  • As this seems to be fixed upstream, I suggest we test upgrading the AWS provider to the version with the fix (ideal) v3.67.0
  • Add an explicit depends_on to the resources above to force the API requests to happen sequentially.

Expected behavior

Deployment succeeds without blocking sections or requiring the user to redeploy again for the sequence to self-heal. It succeeds without any blocking sections or the user requiring a redeployment again. The deployment succeeds without any blocking sections or without the user needing to redeploy again for the sequence to self-heal.

OS and architecture in which you are running Nebari

Linux

How to Reproduce the problem?

This can be a bit tricky to reproduce as it depends on the order in which both will apply the resources as well as the timing of the requests sent by the AWS provider. But in theory, just creating a deployment from scratch on AWS should be enough to reproduce. and

Command output

No response

Versions and dependencies used.

nebari 2024.7.1rc3

Compute environment

AWS

Integrations

No response

Anything else?

Similar behavior and in-depth discussion on a similar behavior can be found here: hashicorp/terraform-provider-aws#7628.

This is also an interesting one, hashicorp/terraform-provider-aws#14078:

@viniciusdc viniciusdc added type: bug 🐛 Something isn't working needs: triage 🚦 Someone needs to have a look at this issue and triage provider: AWS area: terraform 💾 labels Aug 6, 2024
@marcelovilla marcelovilla self-assigned this Aug 7, 2024
@marcelovilla marcelovilla removed the needs: triage 🚦 Someone needs to have a look at this issue and triage label Aug 7, 2024
@viniciusdc
Copy link
Contributor Author

viniciusdc commented Aug 7, 2024

This option is not feasible as we are already running a greater version (5.33.0") than the one suggested to have the fix (3.67.0).

As this seems to be fixed upstream, I suggest we test upgrading the AWS provider to the version with the fix (ideal) v3.67.0

@marcelovilla
Copy link
Member

@viniciusdc should we close this now that #2615 has been merged?

@marcelovilla
Copy link
Member

@viniciusdc I'm closing this as it should have been resolved by #2615. Feel free to reopen the issue if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants