aws_kinesis_firehose_delivery_stream intermittently fails to be created or destroyed before timeout #22645

Closed
camlow325 opened this issue Jan 18, 2022 · 6 comments · Fixed by #28469
Labels: bug (Addresses a defect in current functionality.), eventual-consistency (Pertains to eventual consistency issues.), service/kinesis (Issues and PRs that pertain to the kinesis service.)
Milestone: v4.55.0

Comments

@camlow325
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

> terraform -v
Terraform v1.0.8
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v3.67.0

Affected Resource(s)

  • aws_kinesis_firehose_delivery_stream

Terraform Configuration Files

I don't believe there's anything specific about the content of these resources in our configuration that leads to this problem, but I could provide a more specific example if that would be useful. A generic sketch is included below for reference.
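A minimal delivery stream along these lines (stream name, IAM role, and S3 bucket references are hypothetical placeholders, not our actual values) looks roughly like:

resource "aws_kinesis_firehose_delivery_stream" "example" {
  name        = "example-stream"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose.arn       # hypothetical IAM role
    bucket_arn = aws_s3_bucket.destination.arn   # hypothetical S3 bucket
  }
}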

Debug Output

Panic Output

Expected Behavior

Kinesis Firehose delivery stream is created successfully.

Actual Behavior

Intermittently, the Terraform apply fails with this error:

Error: error waiting for Kinesis Firehose Delivery Stream (...) creation: timeout while waiting for state to become 'ACTIVE' (last state: 'CREATING', timeout: 20m0s)

When this error occurs, creation of the delivery stream in AWS usually does succeed eventually. However, since the Terraform run errors out without committing the resource to state, subsequent attempts to repeat the apply fail because the resource already exists in AWS. For example:

Error: error creating Kinesis Firehose Delivery Stream: ResourceInUseException: Firehose ... under accountId ... already exists

The only ways to recover from this condition seem to be to either import the resource into Terraform state (roughly as shown below) or delete the delivery stream manually (outside of Terraform) before repeating a Terraform apply to recreate the stream. After the initial failure occurs, we've commonly seen that creating a few additional delivery streams with different names via Terraform can fail with the same error. We've assumed this may be due to some intermittent sluggishness in AWS, where a 20-minute wait timeout isn't always long enough.
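For the import route, the recovery step is roughly the following (the resource address and stream ARN are placeholders; if I recall correctly, the import ID is the delivery stream ARN):

terraform import aws_kinesis_firehose_delivery_stream.example arn:aws:firehose:us-east-1:123456789012:deliverystream/example-stream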

We don't yet have a good sense of how much the timeout would need to be increased to avoid the issue reliably. Ideally, we would be able to experiment with longer create and delete operation timeouts on the aws_kinesis_firehose_delivery_stream resources to see what works best. It does not appear, however, that the aws_kinesis_firehose_delivery_stream resource currently supports setting operation timeouts; they instead seem to be hardcoded to 20 minutes. If increasing the default timeout would not make sense for most users, we'd like to at least have the flexibility to set operation timeouts ourselves, along the lines of the sketch below.
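For illustration, what we have in mind is the standard timeouts block (not currently supported on this resource; the values are just examples we would experiment with):

resource "aws_kinesis_firehose_delivery_stream" "example" {
  # ... existing arguments ...

  timeouts {
    create = "45m"   # example value to experiment with
    delete = "30m"   # example value to experiment with
  }
}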

Steps to Reproduce

  1. terraform apply

Important Factoids

We've also intermittently seen terraform destroy commands for the aws_kinesis_firehose_delivery_stream resources fail for a similar reason:

Error: error waiting for Kinesis Firehose Delivery Stream (...) delete: timeout while waiting for resource to be gone (last state: 'DELETING', timeout: 20m0s)

In these cases, we have seen from the AWS console that the destroy does eventually succeed even after the terraform command has failed.

References

@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Jan 18, 2022
@justinretzolk justinretzolk added bug Addresses a defect in current functionality. eventual-consistency Pertains to eventual consistency issues. service/kinesis Issues and PRs that pertain to the kinesis service. and removed needs-triage Waiting for first response or review from a maintainer. labels Jan 20, 2022
@sjonpaulbrown

I have also experienced this issue before. If I remember correctly, on timeout, the aws_kinesis_firehose_delivery_stream resource is not written to state.

Specifically, if a timeout occurs when creating the resource for the first time, it is not written to state, and on subsequent runs the resource collides with the stream that has already been created. This is less than ideal because resources are actually being orphaned, which is why they have to be imported or manually deleted.

I would agree that having the ability to increase the timeout for this resource would be the easiest and safest solution.

@opub commented Jan 24, 2022

I am seeing this also with a Firehose stream going to Amazon OpenSearch Service and logging to S3. The timeout is happening 100% of the time for me. The Firehose stream never completes successfully. After about an hour it finally leaves the "creating" status and is left in an "inoperable" status with the error below. The only option is to delete it and try again. This seems like an AWS issue, but Terraform could at least handle the state better.

[screenshot of the delivery stream's error status in the AWS console]

@opub commented Jan 24, 2022

Through a process of elimination I was able to resolve the issue that was causing my timeout. I was able to recreate the timeout in the AWS Console, and after creating a dozen data streams I narrowed the issue down to having only one subnet in the Elasticsearch config.

Changing this in Terraform also fixed my timeout there. Specifically, having two subnets instead of one for aws_kinesis_firehose_delivery_stream.elasticsearch_configuration.vpc_config.subnet_ids works (roughly as sketched below), even though I'd been using the same single-subnet config without problems for over a year.
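Roughly, the change looked like the following (subnet, security group, and role references are placeholders, not my actual values):

elasticsearch_configuration {
  # ... other arguments ...

  vpc_config {
    subnet_ids         = [aws_subnet.a.id, aws_subnet.b.id]   # previously a single subnet
    security_group_ids = [aws_security_group.firehose.id]
    role_arn           = aws_iam_role.firehose.arn
  }
}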

@camlow325 camlow325 changed the title aws_kinesis_firehose_delivery_stream intemittently fails to be created or destroyed before timeout aws_kinesis_firehose_delivery_stream intermittently fails to be created or destroyed before timeout May 20, 2022
@josephpohlmann

This bug happens intermittently for me as well, and it appears to have nothing to do with VPC configuration, etc. It seems to be related to the sheer number of firehoses I have (the more I have, the longer the next one takes to provision). This seems to be a limitation of AWS, but is there no way to put retry logic in place?

ameddin73 added a commit to ameddin73/aws_kinesis_firehose_delivery_stream-add_timeouts that referenced this issue Dec 21, 2022
jar-b added a commit that referenced this issue Feb 13, 2023
* r/aws_kinesis_firehose_delivery_stream add timeouts (#22645)

Add timeout to kinesis firehose create

tmp

* r/aws_firehose_delivery_stream: add configurable delete timeout

* r/aws_firehose_delivery_stream(docs): update timeouts

* r/aws_firehose_delivery_stream: add configurable update timeout

* chore: changelog

---------

Co-authored-by: Jared Baker <jared.baker@hashicorp.com>
@github-actions github-actions bot added this to the v4.55.0 milestone Feb 13, 2023
@github-actions

This functionality has been released in v4.55.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.
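As a sketch, picking up the new configurable timeouts amounts to bumping the provider version constraint to at least this release, e.g.:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.55.0"   # release noted above as including configurable delivery stream timeouts
    }
  }
}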

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 20, 2023