This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Convert ECS platform plugin to use resource manager #2098

Merged
merged 24 commits on Sep 2, 2021

Conversation

@izaaklauer (Contributor) commented Aug 24, 2021

This converts the ECS plugin from its own resource lifecycle logic to use resource manager, and implements the Status plugin.

Addresses #1645
Closes #2061

What changes

From a user perspective, there are three changes:

More detailed deployment UI

Previous waypoint deploy console output:

[screenshot: previous console output]

New console output:

[screenshot: new console output]

Users can now see the progress on creating or discovering all of the resources that go into an ECS deployment.

Better rollback behavior on deployment failure

If a deployment fails partway through, waypoint will now call destroy on each resource that was created during that deployment, leaving fewer orphaned resources. Example:

[screenshot: rollback after a failed deploy]

Note that after the rate limit exception, waypoint destroyed the ALB listener and target group before exiting.
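
For reference, the deploy flow now follows roughly this rollback pattern (a hand-wavy sketch, not the literal plugin code; rm is the plugin SDK resource manager and the argument lists are illustrative):

    if err := rm.CreateAll(ctx, log, sg, ui, result); err != nil {
        // Destroy whatever was created before the failure, so we leave fewer
        // orphaned ALB listeners, target groups, etc. behind.
        if derr := rm.DestroyAll(ctx, log, sg, ui); derr != nil {
            log.Warn("error cleaning up partially-created resources", "error", derr)
        }
        return nil, err
    }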

Status

I've implemented status functions on the cluster and service resources. The service resource also produces additional task resources. You can view them with waypoint status -app=<app>.
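
Roughly, a per-resource status hook looks like this (a sketch from memory of the plugin SDK's resource framework; the exact type and field names may differ, and the AWS calls are elided):

    func (p *Platform) resourceServiceStatus(
        ctx context.Context,
        state *Resource_Service,
        sr *resource.StatusResponse, // collected by the SDK's StatusAll helper
    ) error {
        // Describe the ECS service, translate its state into a health value,
        // and report it (READY here is just a placeholder).
        sr.Resources = append(sr.Resources, &sdk.StatusReport_Resource{
            Name:   state.Name,
            Health: sdk.StatusReport_READY,
        })
        return nil
    }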

Example of a service coming online:
[screenshot: service coming online]

One drawback of this change: we run a status check immediately after a deployment, at which point generally the service exists, but the tasks do not yet exist. The initial status check will generally look like this, until a user enables app polling or runs the upcoming waypoint status -refresh command:

[screenshot: initial status check before tasks exist]

Technical notes

There are quite a few individual resources that an ECS deployment creates or needs to be directly aware of. The full list is:

  • cluster
  • execution IAM role
  • task IAM role (optional)
  • internal security group
  • external security group
  • log group
  • subnets
  • target group
  • ALB
  • ALB Listener
  • route53 record (optional)
  • task definition
  • service
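
Each of these is now declared to the plugin SDK's resource manager, roughly like this condensed sketch (the option names follow the SDK's resource framework, but the real declarations carry more options and resources):

    func (p *Platform) resourceManager(log hclog.Logger) *resource.Manager {
        return resource.NewManager(
            resource.WithLogger(log),
            resource.WithResource(resource.NewResource(
                resource.WithName("cluster"),
                resource.WithState(&Resource_Cluster{}),
                resource.WithCreate(p.resourceClusterCreate),
                resource.WithStatus(p.resourceClusterStatus),
            )),
            resource.WithResource(resource.NewResource(
                resource.WithName("service"),
                resource.WithState(&Resource_Service{}),
                resource.WithCreate(p.resourceServiceCreate),
                resource.WithDestroy(p.resourceServiceDestroy),
                resource.WithStatus(p.resourceServiceStatus),
            )),
            // ...and similar declarations for the IAM roles, security groups,
            // log group, subnets, target group, ALB, listener, route53 record,
            // and task definition.
        )
    }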

This is also a bit of a nerve-wracking change, as it touches the entire surface area of the plugin and there aren't any tests. I tested:

  • many config param combinations
  • backwards compat (i.e. this can destroy 0.5.1 deployments, and 0.5.1 can destroy and release these deployments)
  • app destroy
  • mid-deploy failure rollback
  • deleting AWS resources out of band before running a deployment destroy
  • status checks performed locally and by a remote runner

That said, if anyone has any specific trial workflows in mind, please give them a whirl or let me know.

Future considerations

Improved destroy logic

We only have destroy implemented for ALB listener, target group, and service, which was the state before this change. Other resources are either app-scoped (security groups, etc), or globally scoped (log group). We have a DestroyWorkspace plugin func, which could be useful for this, but I feel like we might need a DestroyApp plugin func too, as I don't think most of these resources are workspace scoped.

Relevant issue: #805

More status functions

I've only implemented status on the cluster and service resources, which I think are the most dynamic and valuable. Implementing more status functions would result in more AWS API usage, which brings us closer to hitting API rate limits for large waypoint installations. AWS rate limits to 20 RPS per region. With this change, each status check results in 4 GET API calls, so with the default check interval of 30 seconds on the latest deploy only, waypoint won't be able to support more than 150 apps.
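
Spelling that out: 20 requests/second × 30 seconds = 600 requests per polling interval per region, and at 4 GET calls per status check that's 600 / 4 = 150 apps before routine polling alone saturates the limit.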

Once we have a plan to address the rate limit problem, we can implement status functions on more resources.

ECS Releaser plugin

The ECS releaser plugin cannot be easily replaced by the ALB plugin (context: #1577 (comment)). The ECS releaser doesn't create any resources though - it only modifies existing deployment resources - so I don't think it needs resource manager.

Incidental changes

Apologies for not making these smaller separate PRs.

  • Closes #2090 (Specifying ECS sidecar containers without health checks causes a panic)
  • Improved error messages. Amazon errors are often something vague like "ARN validation failed"; if you're making three or four AWS calls as part of one resource, that doesn't tell you which call failed. Errors are now wrapped to give that context (see the sketch after this list).
  • The internal security group now authorizes ingress from the external security group instead of 0.0.0.0/0. This improves the security of ECS deploys, as it prevents the public from accessing disabled tasks or DDoSing individual tasks.
  • New config parameter for ecs: ingress_port, which allows you to configure the external port the load balancer uses. It still defaults to 443 if a cert is configured, and 80 otherwise.
  • More AWS calls are now checked for "404"-equivalent responses, and handled appropriately instead of erroring.
  • All AWS calls now use the WithContext variants, so if the AWS API is slow or hanging it should be possible to cancel the operation.
  • Security group creation no longer calls AuthorizeIngress to add ingress rules if the security group already exists. This reduces the prodigious number of API calls we make to create a deployment.
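
The error-wrapping change looks roughly like this (an illustrative sketch, not the exact diff; the real code may use gRPC status errors, and roleName is a stand-in for whatever identifier is at hand):

    getOut, err := svc.GetRoleWithContext(ctx, queryInput)
    if err != nil {
        // Wrap the vague AWS error with which call failed, and for what.
        return fmt.Errorf("failed to get the ECS execution IAM role %q: %w", roleName, err)
    }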

step terminal.Step
}
// DeploymentId is a unique ID to be consistently used throughout our deployment
type DeploymentId string
Contributor Author

Using this as an argmapper marker type
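
For anyone unfamiliar with the pattern: because DeploymentId is a distinct named type, argmapper can inject the value into any resource function that asks for it by type. A hypothetical illustration (the real create functions take more arguments than this):

    func (p *Platform) resourceClusterCreate(
        ctx context.Context,
        deploymentId DeploymentId, // satisfied by argmapper purely by type
        state *Resource_Cluster,
    ) error {
        // ...use deploymentId when naming or tagging the resources we create...
        return nil
    }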

result["awslogs-multiline-pattern"] = aws.String(lo.MultilinePattern)
result["mode"] = aws.String(lo.Mode)
result["max-buffer-size"] = aws.String(lo.MaxBufferSize)
// NOTE(izaak): The random part of ULIDs seems to be near the end, so for long app names, we might not get unique names here.
Contributor Author

Leaving this for now because it's not new behavior, and because the future task to replace internal IDs with sequence IDs should help this considerably.

Comment on lines +1355 to 1364
// (reconstructed from the flattened diff: the container definition keeps its
// existing fields, and the health check is now only set when the container
// actually configures one, which is the fix for the panic)
c := &ecs.ContainerDefinition{
	// ... other container fields ...
	Secrets:           secrets,
	Environment:       env,
	Memory:            utils.OptionalInt64(int64(container.Memory)),
	MemoryReservation: utils.OptionalInt64(int64(container.MemoryReservation)),
}

if container.HealthCheck != nil {
	c.SetHealthCheck(&ecs.HealthCheck{
		Command:     aws.StringSlice(container.HealthCheck.Command),
		Interval:    aws.Int64(container.HealthCheck.Interval),
		Timeout:     aws.Int64(container.HealthCheck.Timeout),
		Retries:     aws.Int64(container.HealthCheck.Retries),
		StartPeriod: aws.Int64(container.HealthCheck.StartPeriod),
	})
}
Contributor Author

This bit fixes #2090

// before: hard-coded retry count
for i := 0; i < 30; i++ {
	taskOut, err = ecsSvc.RegisterTaskDefinition(&registerTaskDefinitionInput)

// after: shared retry constant
for i := 0; i <= awsCreateRetries; i++ {
Contributor Author

Longer term it probably makes sense to have some kind of exponential backoff with a context cancel.
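
Something along these lines, for example (a sketch, not part of this PR):

    backoff := time.Second
    for attempt := 0; attempt <= awsCreateRetries; attempt++ {
        taskOut, err = ecsSvc.RegisterTaskDefinitionWithContext(ctx, &registerTaskDefinitionInput)
        if err == nil {
            break
        }
        select {
        case <-ctx.Done():
            return nil, ctx.Err() // deploy was cancelled; stop retrying
        case <-time.After(backoff):
            backoff *= 2 // exponential backoff between attempts
        }
    }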

log hclog.Logger,
src *component.Source,
securityGroups *Resource_ExternalSecurityGroups,
subnets *Resource_Subnets, // Required because we need to know which VPC we're in, and subnets discover it.
Contributor Author

I don't love having vpcId live inside the subnet state. I tried having subnets return a "VpcId" marker type, but we're not currently incorporating the resource return values into the argmapper soup. I think this is good enough for now, and if this problem crops up repeatedly we can change the plugin sdk.

string id = 2;

// Indicates if a security group was created by waypoint, and should be destroyed by waypoint
bool managed = 3;
Contributor Author

We're only setting (not using) this for now, but it'll be valuable in the future when we decide to implement delete on security groups.

} else if len(def) > 0 && def[0].ForwardConfig != nil && len(def[0].ForwardConfig.TargetGroups) > 1 {
// Multiple target groups means we can keep the listener
var active bool
getOut, err := svc.GetRoleWithContext(ctx, queryInput)
Contributor Author

With the improved resource visibility, we can see that this turns out to be the slowest part of the ecs deployment most of the time. Go figure.

@izaaklauer izaaklauer requested a review from a team August 24, 2021 19:51
@izaaklauer izaaklauer marked this pull request as ready for review August 24, 2021 19:51
@briancain briancain (Member) left a comment

:whoa: there's a lot here! This really illustrates how nice that StatusAll helper is back over on the sdk side! Nice. I haven't tried this yet but spent some time reading through the code and gave some feedback here.

builtin/aws/ecs/platform.go (10 inline review threads, now resolved; several marked outdated)
@mitchellh (Contributor)

Looks great! Re: rate limits, I know in Terraform we hit rate limits pretty early, and while I'm not sure what it does now, back in the day we detected the error and implemented a backoff directly in the providers. I wonder if we want to do something similar here eventually.

@izaaklauer (Contributor, Author)

With the status checks, I'm mostly worried about pushing waypoint's steady-state usage over the limit. If that's too high, we'd be constantly doing backoffs.

Backoffs in the deploy and destroy funcs seems like a great idea though.

izaaklauer and others added 9 commits September 1, 2021 15:50
Co-authored-by: Brian Cain <bcain@hashicorp.com>
Co-authored-by: Brian Cain <bcain@hashicorp.com>
Co-authored-by: Brian Cain <bcain@hashicorp.com>
Co-authored-by: Brian Cain <bcain@hashicorp.com>
There's a bug in the overall refactor though - intermittent argmapper errors coming out of DestroyAll