This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Backport of Convert ECS platform plugin to use resource manager into release/0.5.x #2196

Conversation

hc-github-team-waypoint

Backport

This PR is auto-generated from #2098 to be assessed for backporting due to the inclusion of the label backport/0.5.x.

The below text is copied from the body of the original PR.


This converts the ECS plugin from its own resource lifecycle logic to use resource manager, and implements the Status plugin.

Addresses #1645
Closes #2061

What changes

From a user perspective, there are three changes:

More detailed deployment UI

Previous waypoint deploy console output:

[screenshot]

New console output:

[screenshot]

Users can now see the progress on creating or discovering all of the resources that go into an ECS deployment.

Better rollback behavior on deployment failure

If a deployment fails partway through, waypoint will now call destroy on each resource that was created during that deployment, leaving fewer orphaned resources. Example:

[screenshot]

Note that after the rate limit exception, waypoint destroyed the ALB listener and target group before exiting.

Status

I've implemented status functions on the cluster and service resources. The service resource produces additional task resources. You can view them with waypoint status -app=<app>.

Example of a service coming online:
[screenshot]

One drawback of this change: we run a status check immediately after a deployment, at which point the service generally exists but the tasks do not yet. The initial status check will generally look like this until a user enables app polling or runs the upcoming waypoint status -refresh command:

[screenshot]
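The initial-status-check behavior can be modeled roughly as follows. The sketch is self-contained and the type and function names are illustrative - the real plugin returns the SDK's StatusReport - but the logic mirrors what's described: a service whose tasks haven't started yet reports as partial rather than ready:

```go
package main

import "fmt"

// health mirrors the coarse states a status report distinguishes
// (names here are illustrative, not the SDK's enums).
type health int

const (
	unknown health = iota
	partial
	ready
)

// serviceStatus combines the service's own state with its tasks' states:
// right after a deploy the service exists but has no tasks yet, so the
// report comes back partial until polling refreshes it.
func serviceStatus(serviceUp bool, readyTasks, desiredTasks int) health {
	switch {
	case !serviceUp:
		return unknown
	case readyTasks < desiredTasks:
		return partial
	default:
		return ready
	}
}

func main() {
	// Immediately after deploy: service is up, tasks haven't started.
	fmt.Println(serviceStatus(true, 0, 2) == partial) // true
}
```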

Technical notes

An ECS deployment creates, or needs to be directly aware of, a large number of individual resources. The full list:

  • cluster
  • execution IAM role
  • task IAM role (optional)
  • internal security group
  • external security group
  • log group
  • subnets
  • target group
  • ALB
  • ALB Listener
  • route53 record (optional)
  • task definition
  • service

This is also a bit of a nerve-wracking change, as it touches the entire surface area of the plugin and there aren't any tests. I tested:

  • many config param combinations
  • backwards compat (i.e. this can destroy 0.5.1 deployments, and 0.5.1 can destroy and release these deployments)
  • app destroy
  • mid-deploy failure rollback
  • deleting AWS resources out of band before running a deployment destroy
  • status checks performed locally and by a remote runner

That said, if anyone has any specific trial workflows in mind, please give them a whirl or let me know.

Future considerations

Improved destroy logic

We only have destroy implemented for ALB listener, target group, and service, which was the state before this change. Other resources are either app-scoped (security groups, etc), or globally scoped (log group). We have a DestroyWorkspace plugin func, which could be useful for this, but I feel like we might need a DestroyApp plugin func too, as I don't think most of these resources are workspace scoped.

Relevant issue: #805

More status functions

I've only implemented status on the cluster and service resources, which I think are the most dynamic and valuable. Implementing more status functions would mean more AWS API usage, which brings us closer to hitting API rate limits for large waypoint installations. AWS rate-limits to 20 RPS per region. With this change, each status check results in 4 GET API calls, so with the default check interval of 30 seconds on the latest deploy only, waypoint won't be able to support more than 150 apps.
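As a back-of-envelope check of the 150-app figure (the function name is invented for the example):

```go
package main

import "fmt"

// maxApps computes how many apps can be status-checked without exceeding
// an API rate limit: (limit in RPS * check interval in seconds) / API calls per check.
func maxApps(limitRPS, intervalSec, callsPerCheck int) int {
	return limitRPS * intervalSec / callsPerCheck
}

func main() {
	// AWS allows ~20 requests/second per region; each status check makes
	// 4 GET calls; the default check interval is 30 seconds.
	fmt.Println(maxApps(20, 30, 4)) // 150
}
```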

Once we have a plan to address the rate limit problem, we can implement status functions on more resources.

ECS Releaser plugin

The ECS releaser plugin cannot easily be replaced by the ALB plugin (context: #1577 (comment)). The ECS releaser doesn't create any resources, though - it only modifies existing deployment resources - so I don't think it needs resource manager.

Incidental changes

Apologies for not making these smaller separate PRs.

  • Closes Specifying ECS sidecar containers without health checks causes a panic #2090
  • Improved error messages. Amazon errors will often be something vague like "ARN validation failed". If you're making three or four AWS calls as part of one resource, that doesn't tell you which call failed. This now wraps errors to give that context.
  • Internal security group now authorizes ingress from the external security group, instead of from 0.0.0.0/0. This improves the security of ECS deploys, as it prevents the public from accessing disabled tasks or DDoSing individual tasks.
  • New config parameter for ecs: ingress_port, that allows you to configure the external port the load balancer uses. It still defaults to 443 if a cert is configured, and 80 otherwise.
  • More AWS calls are now checked for "404"-equivalent responses, and handled appropriately instead of erroring.
  • All AWS calls now use the WithContext variants, so if the AWS API is slow or hanging, it should be possible to cancel the operation.
  • Security group creation doesn't call AuthorizeIngress to add ingress rules if the security group already exists. This reduces the prodigious number of API calls we make to create a deployment.

@hc-github-team-waypoint hc-github-team-waypoint merged commit 518aa65 into release/0.5.x Sep 2, 2021
@hc-github-team-waypoint hc-github-team-waypoint deleted the backport/ecs/resources-actual/mutually-refined-penguin branch September 2, 2021 14:45