Add wait_for_steady_state attribute to aws_ecs_service #3485
Conversation
aws/resource_aws_ecs_service.go (Outdated)

```go
steadyStateConf := &resource.StateChangeConf{
	Pending: []string{"false"},
	Target:  []string{"true"},
```
Using a bool state as a string feels a bit weird, but there's precedent. I considered returning the service state or "STABLE", but then we'd be relying on ECS never introducing another state.
```diff
@@ -63,6 +63,7 @@ into consideration during task placement. The maximum number of
 * `placement_constraints` - (Optional) rules that are taken into consideration during task placement. Maximum number of
   `placement_constraints` is `10`. Defined below.
 * `network_configuration` - (Optional) The network configuration for the service. This parameter is required for task definitions that use the awsvpc network mode to receive their own Elastic Network Interface, and it is not supported for other network modes.
+* `wait_for_steady_state` - (Optional) If `true`, Terraform will wait for the service to reach a steady state (like [`aws ecs wait services-stable`](https://docs.aws.amazon.com/cli/latest/reference/ecs/wait/services-stable.html)) before continuing. Default `false`.
```
I wanted to write an acceptance test for when this is `true`, but I couldn't come up with a reliable, repeatable way to test that we weren't hitting a race condition.
This would also be the first test case that required container instances (or that used Fargate and was locked to us-east-1).
I have verified manually that this waits for a new deployment to finish.
What race conditions were you thinking of?
I would have thought that you could simply create an ECS service in one step (using your `wait_for_steady_state` to make sure it doesn't complete until there is a task running somewhere), then another step that increases the `desired_count`, and then another step that changes the task definition by setting an environment variable or something.
Creating container instances shouldn't be difficult: I'd just use the latest ECS-optimised AMI and set the user data so that it joins the ECS cluster you create for the acceptance test.
The race condition I have in mind is the one we hit without this flag: in the case where the to-be-replaced ECS service definition references a resource that will be deleted, Terraform and ECS race and Terraform can delete the resource while there are still tasks referencing it.
The problem is finding a test that would reliably fail when this flag is unset and reliably pass when it is. The test you describe would pass without this flag, right?
The best idea I have for testing this is a service binary that takes N minutes to start responding to health checks, then verifying that the service update takes at least N minutes. Not great, and for small N it's entirely possible the test would pass without this flag if the service update were slow.
If we create an ALB and target group then the acceptance test can be something as simple as launching busybox with `sleep 600` in that service. Since it will never pass health checks, the service will never become stable.
Was just thinking about working on this because I have a service that needs to have finished deploying before I invalidate the cache on CloudFront, so thanks for picking this up. I've added a few comments on things I might have approached differently.
aws/resource_aws_ecs_service.go (Outdated)

```go
// Returns a StateRefreshFunc for a given service. That function will return "true" if the service is in a
// steady state, "false" if the service is running but not in a steady state, and an error otherwise.
func resourceAwsEcsServiceIsSteadyStateFunc(d *schema.ResourceData, meta interface{}) resource.StateRefreshFunc {
```
Any reason you're not just using https://docs.aws.amazon.com/sdk-for-go/api/service/ecs/#ECS.WaitUntilServicesStable here?
You probably also want a configurable timeout period here. Although I can't see it in the Go SDK, the botocore link lower down shows a 600s total wait time (15s * 40 attempts), which seems like a reasonable default.
> Any reason you're not just using https://docs.aws.amazon.com/sdk-for-go/api/service/ecs/#ECS.WaitUntilServicesStable here?

It seemed more in keeping with the Terraform code to use `StateChangeConf`, especially since that code has more elaborate and configurable timeout logic than the AWS SDK -- which will help in implementing configurable timeouts, as you describe.

```
$ grep -R . -e "\.WaitUntil" --exclude './vendor/*'
./aws/resource_aws_ebs_snapshot.go:            err := conn.WaitUntilSnapshotCompleted(req)
./aws/resource_aws_iam_instance_profile.go:    err = iamconn.WaitUntilInstanceProfileExists(waiterRequest)
```

> Although I can't see it in the Go SDK the botocore link lower down shows a 600s total wait time (15s * 40 attempts) which seems like a reasonable default.

That's exactly where I got `10 * time.Minute` from above. :)
Thanks for the comments!
Despite claiming to wait on creates and updates, my implementation previously only waited on updates. Sorry about that! I've updated the PR.
We've seen a problem on create where the ECS service seems to have disappeared due to eventual consistency, similar to the problem addressed in 7551f92. Change coming.
Now updated with code to handle stale reads. Ready for review.
Any news on this?
Is this trying to implement a similar concept to this AWS CLI command? https://docs.aws.amazon.com/cli/latest/reference/ecs/wait/services-stable.html
@bgiorgini yep, exactly like that. As a workaround for now I'm using a null resource to shell out to the ECS waiter:

```hcl
resource "null_resource" "wait_for_service_deploy" {
  triggers {
    task_definition = "${aws_ecs_service.web_service.task_definition}"
  }

  provisioner "local-exec" {
    command = "aws ecs wait services-stable --services ${aws_ecs_service.web_service.name} --cluster ${var.cluster_name} --region ${data.aws_region.current.name}"
  }
}
```
Thanks for the workaround @tomelliff! We found when testing that the … So we added a second … The triggers cause Terraform to only run …
In my experience it does seem to catch when a service is fundamentally unhealthy (the container is missing something so it fails to properly start), and this fails our deployment CI jobs when it happens.

What is unfortunate is that if another deployment happens that still doesn't fix things, then it seems to return as stable despite the new task definition still being broken. This seems to happen the majority of the time with back-to-back failing deploys, and it will also usually stop the original healthy tasks then. Oddly, this isn't 100% reproducible, and sometimes ECS keeps the original healthy tasks running until a new task definition is genuinely healthy.

I'm not sure what's exactly happening in these situations, but they're rare enough for us that I've not really looked into it other than a cursory glance when alerting points out 0 tasks are running. And it's not been an issue for us because deploy failures always get flushed out in staging, because we aren't yet (maybe never) going straight to production.
I've been looking for this exact feature, and would love to use this instead of the workaround. Any clue when this will be ready? I'd be happy to help. Let me know if there is anything I can do to help finish this.
If this gets attention from the Terraform folks I can probably try to help shepherd it through, but otherwise I've pretty much forgotten about it. To be practically deployable, this probably needs to allow configurable timeouts to accommodate large service deployments that will take a long time to update.
@mmdriley First passes on a feature normally add the minimal thing to get it working; then, if people have a requirement for something else (e.g. needing to wait more than 10 minutes for an insanely slow-starting service to start), it can be made configurable in a follow-up PR. @bflad any chance this can get a review? A few people are interested in it and we've had a couple of issues raised looking for this functionality.
aws/resource_aws_ecs_service_test.go (Outdated)

```hcl
data "aws_ami" "test" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-ecs-hvm-*"]
  }
}

data "aws_availability_zones" "available" {
  state = "available"

  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}

resource "aws_launch_template" "test" {
  image_id      = data.aws_ami.test.id
  instance_type = "t3.micro"
  name          = %[3]q
}

resource "aws_autoscaling_group" "test" {
  availability_zones = data.aws_availability_zones.available.names
  desired_capacity   = 1
  max_size           = 1
  min_size           = 1
  name               = %[3]q

  launch_template {
    id = aws_launch_template.test.id
  }

  tags = [
    {
      key                 = "foo"
      value               = "bar"
      propagate_at_launch = true
    },
  ]
}
```
Added b/c `DescribeServices` was throwing `was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster.`, but a new error persists: `was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your capacity provider.`
I originally followed these instructions https://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_container_instance.html but was hitting the first error.
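The "join the cluster via user data" approach suggested earlier in the thread can be sketched as below. This is a hedged config fragment, not the test's actual code: the `aws_ecs_cluster.test` reference is an assumption about how the test cluster is named.

```hcl
resource "aws_launch_template" "test" {
  image_id      = data.aws_ami.test.id
  instance_type = "t3.micro"

  # The instance registers itself with the test cluster on boot; without
  # this, the ECS agent joins the "default" cluster and task placement
  # fails with "No Container Instances were found in your cluster".
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo "ECS_CLUSTER=${aws_ecs_cluster.test.name}" >> /etc/ecs/ecs.config
  EOF
  )
}
```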
```hcl
network_configuration {
  security_groups  = [aws_security_group.test.id]
  subnets          = aws_subnet.test[*].id
  assign_public_ip = true
}
```
Required for the task to reach the `RUNNING` state -- otherwise the task throws `CannotPullContainerError: Error response from daemon: Get https://registry-name/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)`.
```hcl
resource "aws_internet_gateway" "test" {
  vpc_id = aws_vpc.test.id
}

resource "aws_route_table" "test" {
  vpc_id = aws_vpc.test.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.test.id
  }
}

resource "aws_route_table_association" "test" {
  count          = 2
  subnet_id      = element(aws_subnet.test.*.id, count.index)
  route_table_id = aws_route_table.test.id
}
```
Required to run the task -- otherwise it receives `CannotPullContainerError` and the task reaches the `STOPPED` state.
Output of added tests:
Output of pre-existing tests:
Looks good! 🚀
This has been released in version 3.13.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!
This allows for cases where the old and new service tasks have different resource dependencies -- for example, an old S3 bucket or database -- and the old resources should only be deleted once all old service tasks that depended on them are gone.
In the case that I hit, our service relied on an auxiliary `aws_ecs_task_definition` that became inactive while there were still service tasks that expected to use it with `RunTask`. Now we can set `wait_for_steady_state` and know the old task definition will stay alive as long as it's needed. This also allows authors to make the success of their deployment conditional on the success of the service update.
To the extent possible, I modeled this after the logic of `aws ecs wait services-stable`. That includes the definition of "stable" as well as the timeout (10m).
Fixes #3107