-
Notifications
You must be signed in to change notification settings - Fork 327
Convert ECS platform plugin to use resource manager #2098
Conversation
c2fe128
to
434dd63
Compare
9aa210e
to
74a3fd1
Compare
step terminal.Step | ||
} | ||
// DeploymentId is a unique ID to be consistently used throughout our deployment | ||
type DeploymentId string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this as an argmapper marker type
result["awslogs-multiline-pattern"] = aws.String(lo.MultilinePattern) | ||
result["mode"] = aws.String(lo.Mode) | ||
result["max-buffer-size"] = aws.String(lo.MaxBufferSize) | ||
// NOTE(izaak): The random part of ULIDs seems to be near the end, so for long app names, we might not get unique names here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving this for now because it's not new behavior, and because the future task to replace internal IDs with sequence IDs should help this considerably.
if container.HealthCheck != nil { | ||
c.SetHealthCheck(&ecs.HealthCheck{ | ||
Command: aws.StringSlice(container.HealthCheck.Command), | ||
Interval: aws.Int64(container.HealthCheck.Interval), | ||
Timeout: aws.Int64(container.HealthCheck.Timeout), | ||
Retries: aws.Int64(container.HealthCheck.Retries), | ||
StartPeriod: aws.Int64(container.HealthCheck.StartPeriod), | ||
}, | ||
Secrets: secrets, | ||
Environment: env, | ||
Memory: utils.OptionalInt64(int64(container.Memory)), | ||
MemoryReservation: utils.OptionalInt64(int64(container.MemoryReservation)), | ||
}) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit fixes #2090
for i := 0; i < 30; i++ { | ||
taskOut, err = ecsSvc.RegisterTaskDefinition(®isterTaskDefinitionInput) | ||
|
||
for i := 0; i <= awsCreateRetries; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Longer term it probably makes sense to have some kind of exponential backoff with a context cancel.
log hclog.Logger, | ||
src *component.Source, | ||
securityGroups *Resource_ExternalSecurityGroups, | ||
subnets *Resource_Subnets, // Required because we need to know which VPC we're in, and subnets discover it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love having vpcId live inside the subnet state. I tried having subnets return a "VpcId" marker type, but we're not currently incorporating the resource return values into the argmapper soup. I think this is good enough for now, and if this problem crops up repeatedly we can change the plugin sdk.
string id = 2; | ||
|
||
// Indicates if a security group was created by waypoint, and should be destroyed by waypoint | ||
bool managed = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're only setting (not using) this for now, but it'll be valuable in the future when we decide to implement delete on security groups.
} else if len(def) > 0 && def[0].ForwardConfig != nil && len(def[0].ForwardConfig.TargetGroups) > 1 { | ||
// Multiple target groups means we can keep the listener | ||
var active bool | ||
getOut, err := svc.GetRoleWithContext(ctx, queryInput) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the improved resource visibility, we can see that this turns out to be the slowest part of the ecs deployment most of the time. Go figure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:whoa: there's a lot here! This really illustrates how nice that StatusAll helper is back over on the sdk side! Nice. I haven't tried this yet but spent some time reading through the code and gave some feedback here.
Looks great! Re: rate limits, I know in Terraform we hit rate limits pretty early and while I'm not sure what it does not, back in the day we detected the error and implemented a backoff directly in the providers. I wonder if we want to do something similar here eventually. |
With the status checks, I'm mostly worried about pushing waypoint's steady-state usage over the limit. If that's too high, we'd be constantly doing backoffs. Backoffs in the deploy and destroy funcs seems like a great idea though. |
Co-authored-by: Brian Cain <bcain@hashicorp.com>
Co-authored-by: Brian Cain <bcain@hashicorp.com>
Co-authored-by: Brian Cain <bcain@hashicorp.com>
Co-authored-by: Brian Cain <bcain@hashicorp.com>
There's a bug in the overall refactor though - intermittent argmapper errors coming out of DestroyAll
eb23c5a
to
285a188
Compare
This converts the ECS plugin from its own resource lifecycle logic to use resource manager, and implements the Status plugin.
Addresses #1645
Closes #2061
What changes
From a user perspective, there are three changes:
More detailed deployment UI
Previous
waypoint deploy
console output:New console output:
Users can now see the progress on creating or discovering all of the resources that go into an ECS deployment.
Better rollback behavior on deployment failure
If a deployment fails partway through, waypoint will now call destroy on each resource that was created during that deployment, leaving fewer orphaned resources. Example:
Note that after the rate limit exception, waypoint destroyed the ALB listener and target group before exiting.
Status
I've implemented status functions on
cluster
andservice
resources. Theservice
resource produces additionaltask
resources. You can view them withwaypoint status -app=<app>
Example of a service coming online:
One drawback of this change: we run a status check immediately after a deployment, at which point generally the service exists, but the tasks do not yet exist. The initial status check will generally look like this, until a user enables app polling or runs the upcoming
waypoint status -refresh
command:Technical notes
There are so many of individual resources that an ECS deployment creates or needs to be directly aware of. The full list is:
This is also a bit of a nerve-wracking change, as it touches the entire surface area of the plugin and there aren't any tests. I tested:
That said, if anyone has any specific trial workflows in mind, please give them a whirl or let me know.
Future considerations
Improved destroy logic
We only have destroy implemented for ALB listener, target group, and service, which was the state before this change. Other resources are either app-scoped (security groups, etc), or globally scoped (log group). We have a
DestroyWorkspace
plugin func, which could be useful for this, but I feel like we might need aDestroyApp
plugin func too, as I don't think most of these resources are workspace scoped.Relevant issue: #805
More status functions
I've only implemented status on the
cluster
andservice
resources, which I think are the most dynamic and valuable. Implementing more status functions would result in more aws API usage, which bring us closer to hitting api rate limits for large waypoint installations. AWS rate limits to 20 RPS per region. With this change, each status check will result in 4 GET api calls, so with the default check interval of 30 seconds on the latest deploy only, waypoint won't be able to support more than 150 apps.Once we have a plan to address the rate limit problem, we can implement status functions on more resources.
ECS Releaser plugin
The ECS releaser plugin cannot be easily replaced by the ALB plugin (context: #1577 (comment)). The ECS plugin doesn't create any resources though - only modify existing deployment resources - so I don't think it needs resource manager.
Incidental changes
Apologies for not making these smaller separate PRs.
ingress_port
, that allows you to configure the external port the load balancer uses. It still defaults to 443 if a cert is configured, and 80 otherwise.withContext
, so if the aws api is slow or hanging it should be possible to cancel the operation.