feat: Set `MinInstancesInService` via CFN parameters #1383

akash1810 · 2024-10-11T07:43:41Z

Context

In guardian/cdk#2417 support was added to @guardian/cdk (GuCDK) to deploy application updates to an autoscaling group (ASG) via CloudFormation (CFN) with the AutoScalingRollingUpdate update policy.

During the testing, we observed that for a service that horizontally scales, the MinInstanceInService property should be set relative to the current state of the ASG. Consequently when an autoscaling group has scaling policies, GuCDK adds a parameter to the CFN template in the form MinInstancesInServiceFor<APP>:

To support this dynamic property, services that horizontally scale should expose a CloudFormation parameter for Riff-Raff to set during deployment.

– https://github.com/guardian/testing-asg-rolling-update/blob/main/README.md

What does this change?

This change adds a new riff-raff.yaml property to the ami-cloudformation-parameter and cloud-formation deployment types called minInstancesInServiceParameters.

The CFN parameter added by GuCDK does not provide a direct connection the to ASG it relates to. The new minInstancesInServiceParameters property attempts to provide this connection via tag lookups. That is, it is a map of CFN parameter -> ASG tags.

In the below example, we're telling Riff-Raff to expect a CFN parameter named MinInstancesInServiceForcdkplayground in the CdkPlayground.template.json CFN template and to set it's value from the desired capacity of an ASG tagged App=cdk-playground, Stage=PROD, Stack=playground.

allowedStages:
  - PROD
stacks:
  - playground
regions:
  - eu-west-1
deployments:
  cfn-deployment:
    type: cloud-formation
    parameters:
      templateStagePaths:
        PROD: CdkPlayground.template.json
      minInstancesInServiceParameters:
        MinInstancesInServiceForcdkplayground:
          App: cdk-playground

Once the ASG has been found, the MinInstancesInService property to whichever is lower between desired or 75% of max, meaning:

When running normally, then we double capacity and half it again once healthy
When partially or fully scaled, we rotate instances in batches, which is slightly slower

This contrasts with the recommendation from https://github.com/guardian/testing-asg-rolling-update:

If the service is running at "normal" capacity (e.g. desired = min), then MinInstancesInService can be set to match min.

If the service is partially scaled, then MinInstancesInService should be set to match the current desired capacity.

If the service is fully scaled, then MinInstancesInService should be set to (at least) max -1.

– https://github.com/guardian/testing-asg-rolling-update/blob/main/README.md

This is because being percentage based offers faster deployment.

How to test

See added unit tests.

We can also perform some practical tests. Each of the following have been performed on https://github.com/guardian/cdk-playground.

"Normal" capacity (i.e. `desired = min`)

guardian/cdk-playground#548 tests this scenario. It demonstrates the CFN parameter was correctly set to desired.

Deployment log showing value of CFN parameter

Partial scaled (i.e. `min < desired < max`)

guardian/cdk-playground#548 tests this scenario. It demonstrates the CFN parameter was correctly set to desired.

Deployment log showing value of CFN parameter

Fully scaled (i.e. `desired = max`)

guardian/cdk-playground#548 tests this scenario. It demonstrates the CFN parameter was set to 7 (75% of 10, rounded down).

Deployment log showing value of CFN parameter

No scaling policy (i.e. our typical ASG)

This test amounts to omitting the minInstancesInServiceParameters property in the riff-raff.yaml file.

This has been done (guardian/cdk-playground#547), and the deployment behaved as expected: CFN doubled the ASG capacity, waited for SUCCESS signals, then removed old instances by halving the ASG capacity.

This can be seen on the dashboard (note: this link might not work in future).

`minInstancesInServiceParameters` describes a CFN parameter that does not exist

guardian/cdk-playground#549 tests this scenario.

The deployment should fail when referencing a CFN parameter that does not exist in the CFN template.

Deployment log showing deployment error

`minInstancesInServiceParameters` describes an ASG that does not exist

guardian/cdk-playground#550 tests this scenario.

The deployment should fail when referencing an ASG that does not exist.

Deployment log showing deployment error

Initial deployment (i.e. CFN stack does not yet exist)

This code path is the same as the "minInstancesInServiceParameters describes an ASG that does not exist" test.

How can we measure success?

This change means we're able to use the AutoScalingRollingUpdate deployment mechanism on all ASG architectures: those that horizontally scale and those that do not horizontally scale.

github-actions · 2024-10-11T07:58:20Z

Deploy build 3518 of `tools::riffraff` to CODE

All deployment options

From guardian/actions-riff-raff.

This is to help in the testing of guardian/riff-raff#1383.

As a test of guardian/riff-raff#1383.

This is to help in the testing of guardian/riff-raff#1383.

As a test of guardian/riff-raff#1383.

This to allow testing of guardian/riff-raff#1383 whilst the `riff-raff.yaml` generator from GuCDK hasn't been updated.

This is to allow testing of guardian/riff-raff#1383 whilst the `riff-raff.yaml` generator from GuCDK hasn't been updated.

magenta-lib/src/main/scala/magenta/tasks/AWS.scala

magenta-lib/src/main/scala/magenta/deployment_type/CloudFormationDeploymentTypeParameters.scala

magenta-lib/src/main/scala/magenta/tasks/ChangeSetTasks.scala

magenta-lib/src/test/scala/magenta/deployment_type/CloudFormationTest.scala

magenta-lib/src/main/scala/magenta/tasks/ChangeSetTasks.scala

magenta-lib/src/main/scala/magenta/tasks/AWS.scala

adamnfish

Lots of comments, sorry! Some tiny Scala style things, but a few other observations in there as well.

We have nice tests for the fiddly min / 75% logic, which is great. The other tricky bit is the way tags get added from multiple places now, it'd be great if we could make sure this is covered?

magenta-lib/src/main/scala/magenta/deployment_type/CloudFormationDeploymentTypeParameters.scala

magenta-lib/src/main/scala/magenta/tasks/AWS.scala

magenta-lib/src/main/scala/magenta/tasks/UpdateCloudFormationTask.scala

adamnfish · 2024-10-25T12:08:53Z

magenta-lib/src/test/scala/magenta/deployment_type/CloudFormationTest.scala

@@ -602,7 +604,8 @@ class CloudFormationTest
      Map.empty,
      Map.empty,
      (_: CfnParam) =>
-        (_: String) => (_: String) => (_: Map[String, String]) => None
+        (_: String) => (_: String) => (_: Map[String, String]) => None,


I realise it isn't from this PR, but this is a bit strange!

The type on the case class is the following:

latestImage: CfnParam => String => String => Map[String, String] => Option[String]

This says latestImage is a function that takes CfnParam, and returns a function that takes a String which in turn returns a function that takes a String, which in turn returns a function that takes a Map and returns Option[String].

This is called currying, and would be normal in an ML-style functional language but is a bit out of place in Scala. Simpler to use the following, which would make these tests much clearer.

latestImage: (CfnParam, String, String, Map[String, String]) => Option[String]

Now latest image is a function that takes 4 parameters and returns Option[String].

Up to you if you think now is a good time to change this for the extra test clarity, do it later, or leave it as-is 🤷 .

(This applies in a few place, whatever you choose)

If we leave it as-is, the type annotations aren't required here. Since we're discarding all these arguments anyway I don't think there's any benefit in making the types explicit, we could just write the following:

_ => _ => _ => _ => None

I propose we make these changes in a separate PR, for the benefit of a single purpose PR.

Yep and as I said no dramas if you want to leave it anyway.

👍🏽

I'll leave this comment "unresolved" to make it easier to locate/reference in any future PR(s).

magenta-lib/src/test/scala/magenta/tasks/ASGTest.scala

adamnfish · 2024-10-25T12:55:50Z

magenta-lib/src/test/scala/magenta/tasks/ASGTest.scala

@@ -398,6 +398,150 @@ class ASGTest extends AnyFlatSpec with Matchers with MockitoSugar {
    ) shouldBe Right(())
  }

+  it should "calculate MinInstancesInService as desired when there is capacity to double" in {


These feel like great candidates for property-based testing! This would let us assert that the described properties hold tue with a range of ASG configurations, rather than just providing a specific example of each. As well as giving us wider coverage, this would help us to cover off the fiddly edge cases without having to manually add lots of extra cases.

With the addition of the following in the build dpendency...

"org.scalatestplus" %% "scalacheck-1-18" % "3.2.19.0" % Test

... we'd have scalacheck available (we're actually pulling in Scalatest's scalachek integration, which in turn pulls in scalacheck). We can then mix the ScalaCheckDrivenPropertyChecks trait into our test to make this functionality available:

import org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks class ASGTest extends AnyFlatSpec with Matchers with MockitoSugar with ScalaCheckDrivenPropertyChecks

We can then use forAll in our tests to have the library generate random test data for us. Here's a simple hello world example to demonstrate the principle in general:

"the length of the concatenation of two strings should be the sum of the strings' lengths" in { forAll { (str1: String, str2: String) => val expected = str1.length + str2.length (str1 ++ str2).length shouldBe expected } }

This will provide random strings to our test and check the property holds true for all of them, automatically. If it finds a problem it will "shrink" the example to try and give a nice clean minimal example the demonstrates the problem. Try this out yourself! Break the implementation and see what happens!

However, in our case we don't want purely random data (Ints in our case). We want ASG configurations that make sense for the test case in question. We're looking at the "when there is capacity to double" case here, so let's take that one. Our test data needs to have a max that is at least double the desired.

Instead of asking for completely random test data, let's create our own generator and tell forAll to use that. In general terms:

val generateMyType = ??? forAll(generateMyType) { myType => // your test }

Our generator wants to make two ints. A desired, and a max. The latter depends on the value of desired, which is neatly expressed using flatMap (aka a for comprehension). Gen.choose is provided by the library to address the common use case of needing an number from a range.

import org.scalacheck.Gen val generateAsgSettings = for { desired <- Gen.choose(1, 100) max <- Gen.choose(desired * 2, desired * 4) // we can think about what values we want here! } yield (desired, max)

Now we can write the test as before, but instead of hard coding the max and desired to 6 and 3 respectively, we can use the generated values.

import org.scalacheck.Gen it should "calculate MinInstancesInService as desired when there is capacity to double" in { val genASGAllowingDouble = for { desired <- Gen.choose(1, 100) max <- Gen.choose(desired * 2, desired * 4) // we can think about what values we want here! } yield (desired, max) forAll(genASGAllowingDouble) { case (desired, max) => // ... when(asgDescribeIterableMock.autoScalingGroups()) thenReturn toSdkIterable( List( AutoScalingGroupWithTags( min = 3, max = max, desired = desired, "Stack" -> "playground", "Stage" -> "PROD", "App" -> "api", "aws:cloudformation:stack-name" -> "playground-PROD-api" ) ).asJava ) } }

Of course, now we start asking ourself other questions, like what should the "min" be for these tests (maybe the same as desired, maybe not)? This is one benefit of property based testing, it helps you to think about the data you're testing!

Sorry for the long comment, be happy to talk about this more with you elsewhere 😄

I'm unsure if the, albeit slight, indirection reduces accessibility/maintainability. Additionally, given this is a new deployment mechanism, might having explicit capacity values in the tests makes it easier to explain teams as they migrate?

Maybe we can implement these changes in a separate PR?

Yep just food for thought, we can do it later or not at all.

👍🏽

I'll leave this comment "unresolved" to make it easier to locate/reference in any future PR(s).

magenta-lib/src/main/scala/magenta/deployment_type/CloudFormationDeploymentTypeParameters.scala

akash1810 · 2024-10-28T13:13:59Z

The other tricky bit is the way tags get added from multiple places now, it'd be great if we could make sure this is covered?

I can't think of the best way to do this:

getMinInstancesInService take explicit tags, rather than a general list of tags, but that's quite a complex interface
Another option could be to cross reference the discovered ASG with a CFN list resources call and cross reference the two. However, this felt overly complex.
Tags prefixed aws: can only be created by AWS. Maybe getMinInstancesInService could assert that the list of tag requirements has at least one aws:cloudformation:* tag? This doesn't directly address your concern though.

Tricky!

Produce a single log line, not three.

adamnfish

We've chatted over all the feedback, and I think it looks fine! The main thing is to do a bunch of testing so let's get it merged and checked 👍

akash1810 force-pushed the aa/minInstancesInService branch from 6259545 to 0b49008 Compare October 11, 2024 07:54

akash1810 mentioned this pull request Oct 11, 2024

Testing https://github.com/guardian/riff-raff/pull/1383 guardian/cdk-playground#545

Closed

akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024

feat: Switch to manual riff-raff.yaml file

ed2a756

This is to help in the testing of guardian/riff-raff#1383.

akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024

test: Add minInstancesInServiceParameters to riff-raff.yaml

6d0d8d1

As a test of guardian/riff-raff#1383.

akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024

feat: Switch to manual riff-raff.yaml file

db89000

This is to help in the testing of guardian/riff-raff#1383.

akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024

test: Add minInstancesInServiceParameters to riff-raff.yaml

ef4cef5

As a test of guardian/riff-raff#1383.

akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024

Switch to manual riff-raff.yaml file

8223c44

This to allow testing of guardian/riff-raff#1383 whilst the `riff-raff.yaml` generator from GuCDK hasn't been updated.

akash1810 added a commit to guardian/cdk-playground that referenced this pull request Oct 11, 2024

Switch to manual riff-raff.yaml file

dbbe5f6

This is to allow testing of guardian/riff-raff#1383 whilst the `riff-raff.yaml` generator from GuCDK hasn't been updated.

akash1810 marked this pull request as ready for review October 11, 2024 17:04

akash1810 requested a review from a team as a code owner October 11, 2024 17:04

akash1810 force-pushed the aa/minInstancesInService branch 5 times, most recently from 1f7d109 to 0f62f2a Compare October 17, 2024 18:23

akash1810 force-pushed the aa/minInstancesInService branch from 0f62f2a to 6ff9612 Compare October 21, 2024 12:35

akash1810 mentioned this pull request Oct 21, 2024

feat(riff-raff.yaml): Add minInstancesInServiceParameters when applicable guardian/cdk#2476

Merged

2 tasks

akash1810 commented Oct 23, 2024

View reviewed changes

magenta-lib/src/main/scala/magenta/tasks/AWS.scala Outdated Show resolved Hide resolved

akash1810 force-pushed the aa/minInstancesInService branch from 6ff9612 to 91abb17 Compare October 23, 2024 14:15

akash1810 commented Oct 23, 2024

View reviewed changes

magenta-lib/src/main/scala/magenta/tasks/AWS.scala Outdated Show resolved Hide resolved

adamnfish reviewed Oct 25, 2024

View reviewed changes

akash1810 force-pushed the aa/minInstancesInService branch from 8909994 to a6bbbbf Compare October 28, 2024 09:08

akash1810 force-pushed the aa/minInstancesInService branch from a6bbbbf to b1c4ed3 Compare November 7, 2024 18:40

akash1810 added 5 commits November 8, 2024 11:16

feat: Set MinInstancesInService via CFN parameters

f9fedd6

refactor: More intention revealing log messages

8650fa0

lint: Post-review style changes

eddbbbd

refactor: Post-review copy improvements

504152a

refactor: Simplify logging

05c834f

Produce a single log line, not three.

akash1810 force-pushed the aa/minInstancesInService branch from b1c4ed3 to 05c834f Compare November 8, 2024 11:16

adamnfish approved these changes Nov 12, 2024

View reviewed changes

akash1810 merged commit b0db62d into main Nov 13, 2024
3 checks passed

akash1810 deleted the aa/minInstancesInService branch November 13, 2024 07:58

guardian-ci mentioned this pull request Nov 13, 2024

🦋 Release package updates guardian/cdk#2503

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Set `MinInstancesInService` via CFN parameters #1383

feat: Set `MinInstancesInService` via CFN parameters #1383

akash1810 commented Oct 11, 2024 •

edited

Loading

github-actions bot commented Oct 11, 2024 •

edited

Loading

adamnfish left a comment

adamnfish Oct 25, 2024

adamnfish Oct 25, 2024

akash1810 Oct 26, 2024

adamnfish Oct 28, 2024

akash1810 Oct 28, 2024

adamnfish Oct 25, 2024

akash1810 Oct 28, 2024 •

edited

Loading

adamnfish Oct 28, 2024

akash1810 Oct 28, 2024

akash1810 commented Oct 28, 2024

adamnfish left a comment

feat: Set MinInstancesInService via CFN parameters #1383

feat: Set MinInstancesInService via CFN parameters #1383

Conversation

akash1810 commented Oct 11, 2024 • edited Loading

Context

What does this change?

How to test

"Normal" capacity (i.e. desired = min)

Partial scaled (i.e. min < desired < max)

Fully scaled (i.e. desired = max)

No scaling policy (i.e. our typical ASG)

minInstancesInServiceParameters describes a CFN parameter that does not exist

minInstancesInServiceParameters describes an ASG that does not exist

Initial deployment (i.e. CFN stack does not yet exist)

How can we measure success?

github-actions bot commented Oct 11, 2024 • edited Loading

adamnfish left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

akash1810 Oct 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

akash1810 commented Oct 28, 2024

adamnfish left a comment

Choose a reason for hiding this comment

feat: Set `MinInstancesInService` via CFN parameters #1383

feat: Set `MinInstancesInService` via CFN parameters #1383

akash1810 commented Oct 11, 2024 •

edited

Loading

"Normal" capacity (i.e. `desired = min`)

Partial scaled (i.e. `min < desired < max`)

Fully scaled (i.e. `desired = max`)

`minInstancesInServiceParameters` describes a CFN parameter that does not exist

`minInstancesInServiceParameters` describes an ASG that does not exist

github-actions bot commented Oct 11, 2024 •

edited

Loading

akash1810 Oct 28, 2024 •

edited

Loading