Musings on AutoScaling #856

Closed
rix0rrr opened this issue Oct 5, 2018 · 12 comments · Fixed by #1021 or #933

Comments

@rix0rrr
Contributor

rix0rrr commented Oct 5, 2018

Note: this issue is not really an issue per se. It's a place for me to write down my thoughts and solicit feedback in a place where people can easily access the document and comment.

I imagine when people want to set up autoscaling, they want to set up something like this:

 Scaling        -3       -1                                     +1       +3        
 activity   │        │        │                            │         │         │   
            ├────────┼────────┼────────────────────────────┼─────────┼─────────┤   
            │        │        │                            │         │         │   
                                                                                   
CPU usage   0%      10%      30%                          70%        90%      100% 

Disregarding TargetTracking scaling for a moment, the way to set this up is with StepScaling. You'll make a StepScaling policy which is activated by a CloudWatch alarm. The StepScaling policy has different scaling tiers depending on the distance of the current metric value to its Alarm threshold. Configuration for a step scaling policy looks like this:

    Scaling           -2             -1               +1              +2           
    activity                  │               │               │                    
               ◀──────────────┼───────────────┼───────────────┼───────────────────▶
                              │               │               │                    
 Distance from                                                                     
alarm threshold              -10              0              +10                   
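
To make the "distance from alarm threshold" idea concrete, here is a minimal TypeScript sketch of how a step policy like the one in the diagram might pick a tier. `StepAdjustment` and `evaluateSteps` are hypothetical names used for illustration only, not a CDK or AWS API. Treating every lower bound as inclusive and every upper bound as exclusive is a simplifying assumption, and the threshold of 50 in the usage note is arbitrary.

```typescript
// One tier of a step scaling policy. Bounds are offsets from the alarm
// threshold; an omitted bound means "unbounded" on that side.
// (Hypothetical shape, for illustration only.)
interface StepAdjustment {
  lowerBound?: number; // inclusive (simplifying assumption)
  upperBound?: number; // exclusive (simplifying assumption)
  adjustment: number;  // change in capacity
}

// Returns the capacity change for a metric value, or 0 if no tier matches.
function evaluateSteps(metric: number, threshold: number, steps: StepAdjustment[]): number {
  const distance = metric - threshold;
  for (const step of steps) {
    const aboveLower = step.lowerBound === undefined || distance >= step.lowerBound;
    const belowUpper = step.upperBound === undefined || distance < step.upperBound;
    if (aboveLower && belowUpper) {
      return step.adjustment;
    }
  }
  return 0;
}

// The policy from the diagram: tier boundaries at -10, 0 and +10.
const diagramSteps: StepAdjustment[] = [
  { upperBound: -10, adjustment: -2 },
  { lowerBound: -10, upperBound: 0, adjustment: -1 },
  { lowerBound: 0, upperBound: 10, adjustment: +1 },
  { lowerBound: 10, adjustment: +2 },
];
```

With a threshold of 50, `evaluateSteps(65, 50, diagramSteps)` lands in the outermost tier on the right (distance +15), while `evaluateSteps(45, 50, diagramSteps)` lands in the first tier on the left (distance -5).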

Normally, CloudWatch Alarm Actions are edge-triggered (that is, an Action occurs only when the alarm transitions from OK to ALARM or vice versa). However, if the Alarm Action is an AutoScaling policy, the Alarm keeps triggering the AutoScaling policy periodically, so that if the alarm goes further out of spec, higher scaling tiers can be activated. (For example, the CPU usage goes to 75% and an instance is added. If that doesn't bring the load down enough, then after an (undefined) while the policy is activated again and another instance is added.)

The question is, how to set up step scaling policies to achieve the scaling that the user wants?

Solution 1: Two alarms, two policies

This is what I see most on the internet--a separate alarm and a separate scaling policy for scaling out and scaling in. Seems inefficient, resource-wise.

Also, AutoScaling policies seem to imply they can both scale in and scale out in a single policy, but you'd never take advantage of that in this way.

               -3       -1                                     +1       +3        
           │        │        │                            │         │         │   
  Metric   ├────────┼────────┼────────────────────────────┼─────────┼─────────┤   
           │        │        │                            │         │         │   
                                                                                  
           0%      10%      30%                          70%        90%      100% 
                                                                                  
                                                          ║                       
   Alarm1                                     > 70%       ║                       
                                                          ║                       
                                                               +1        +3       
                                                          │         │             
  Policy1                                                 ├─────────┼─────────▶   
                                                          │         │             
                                                                                  
                                                          0        +20            
                                                                                  
                             ║                                                    
   Alarm2           < 30%    ║                                                    
                             ║                                                    
               -3       -1                                                        
                    │        │                                                    
  Policy2 ◀─────────┼────────┤                                                    
                    │        │                                                    
                                                                                  
                  -20        0                                                    
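
As a rough sketch of the calculation Solution 1 implies, the snippet below takes tiers written in absolute metric values and derives one alarm plus one policy per direction. All type and function names (`AbsoluteTier`, `RelativeStep`, `AlarmAndPolicy`, `splitIntoTwoPolicies`) are hypothetical. It assumes each policy's bounds are offsets from its own alarm's threshold, and that the outermost tier on each side is left open-ended.

```typescript
// A tier in absolute metric values, the way a user might think about it.
interface AbsoluteTier {
  lower: number;
  upper: number;
  adjustment: number;
}

// A tier relative to an alarm threshold (omitted bound = unbounded).
interface RelativeStep {
  lowerBound?: number;
  upperBound?: number;
  adjustment: number;
}

interface AlarmAndPolicy {
  comparison: 'GreaterThanThreshold' | 'LessThanThreshold';
  threshold: number;
  steps: RelativeStep[];
}

// Hypothetical helper: positive tiers go to a scale-out alarm/policy pair,
// negative tiers go to a scale-in pair.
function splitIntoTwoPolicies(tiers: AbsoluteTier[]): { scaleOut: AlarmAndPolicy; scaleIn: AlarmAndPolicy } {
  const outTiers = tiers.filter(t => t.adjustment > 0).sort((a, b) => a.lower - b.lower);
  const inTiers = tiers.filter(t => t.adjustment < 0).sort((a, b) => a.lower - b.lower);
  if (outTiers.length === 0 || inTiers.length === 0) {
    throw new Error('Need at least one scale-out tier and one scale-in tier');
  }

  // Scale out fires above the lowest positive tier, scale in fires below
  // the highest negative tier.
  const outThreshold = outTiers[0].lower;
  const inThreshold = inTiers[inTiers.length - 1].upper;

  return {
    scaleOut: {
      comparison: 'GreaterThanThreshold',
      threshold: outThreshold,
      steps: outTiers.map((t, i) => ({
        lowerBound: t.lower - outThreshold,
        // Leave the last tier open-ended upwards.
        upperBound: i === outTiers.length - 1 ? undefined : t.upper - outThreshold,
        adjustment: t.adjustment,
      })),
    },
    scaleIn: {
      comparison: 'LessThanThreshold',
      threshold: inThreshold,
      steps: inTiers.map((t, i) => ({
        // Leave the first tier open-ended downwards.
        lowerBound: i === 0 ? undefined : t.lower - inThreshold,
        upperBound: t.upper - inThreshold,
        adjustment: t.adjustment,
      })),
    },
  };
}

// The CPU example from the top of this issue.
const desired: AbsoluteTier[] = [
  { lower: 0, upper: 10, adjustment: -3 },
  { lower: 10, upper: 30, adjustment: -1 },
  { lower: 70, upper: 90, adjustment: +1 },
  { lower: 90, upper: 100, adjustment: +3 },
];
const { scaleOut, scaleIn } = splitIntoTwoPolicies(desired);
```

For the CPU example this produces a scale-out alarm at 70 with a tier boundary at +20, and a scale-in alarm at 30 with a tier boundary at -20.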

Solution 2: One alarm, one policy

Ideally, it seems like I would want just a single alarm/metric/scalingpolicy configuration. But I don't know if this would even work: it would require that the scaling policy is activated on both sides of the CloudWatch alarm, and it might not do that.

               -3       -1                                     +1       +3        
           │        │        │                            │         │         │   
  Metric   ├────────┼────────┼────────────────────────────┼─────────┼─────────┤   
           │        │        │                            │         │         │   
                                                                                  
           0%      10%      30%                          70%        90%      100% 
                                                                                  
                                                          ║                       
   Alarm                                  Alarm at 70%    ║                       
                                                          ║                       
                                                                                  
               -3       -1                0                    +1        +3       
                    │        │                            │         │             
  Policy  ◀─────────┼────────┼────────────────────────────┼─────────┼─────────▶   
                    │        │                            │         │             
                                                                                  
                  -60       -40                           0        +20            

Solution 3: Two alarms, one policy

Another potential way to go would be to have two alarms trigger the same scaling policy (one lower-than-threshold and one greater-than-threshold) and just describe the scaling behavior with respect to the alarm thresholds on either side.

We can save on a ScalingPolicy in this way (with respect to solution 1). It's not as obvious what's going on though, and would rely on the fact that 2 different alarms give the deltas to 2 different thresholds to the same scaling policy ONLY when they're in alarm.

               -3       -1                                     +1       +3        
           │        │        │                            │         │         │   
  Metric   ├────────┼────────┼────────────────────────────┼─────────┼─────────┤   
           │        │        │                            │         │         │   
                                                                                  
           0%      10%      30%                          70%        90%      100% 
                                                                                  
                                                          ║                       
   Alarm1                                     > 70%       ║                       
                                                          ║                       
                             ║                                                    
   Alarm2           < 30%    ║                            ─                       
                             ║                         ┌ ┘                        
                             ─                        ─                           
                              └ ┐                  ┌ ┘                            
                                 ─                ─                               
                                  └ ┐          ┌ ┘                                
                                     ─        ─                                   
                                      └ ┐  ┌ ┘                                    
                                         ─┌                                       
                         -3        -1     ▼     +1           +3                   
                              │           │           │                           
  Policy  ◀───────────────────┼───────────┼───────────┼───────────────────────▶   
                              │           │           │                           
                                                                                  
                            -20           0          +20                          
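
The behavior Solution 3 relies on can be modeled in a few lines. This is only a simulation of the hypothesis (that the shared policy sees the distance to the threshold of whichever alarm is currently in ALARM), not a statement about what the service actually does. `Alarm`, `Step`, `inAlarm` and `sharedPolicyAdjustment` are hypothetical names.

```typescript
interface Alarm {
  comparison: 'GreaterThanThreshold' | 'LessThanThreshold';
  threshold: number;
}

interface Step {
  lowerBound?: number; // inclusive, relative to the triggering alarm's threshold
  upperBound?: number; // exclusive, relative to the triggering alarm's threshold
  adjustment: number;
}

function inAlarm(alarm: Alarm, metric: number): boolean {
  return alarm.comparison === 'GreaterThanThreshold'
    ? metric > alarm.threshold
    : metric < alarm.threshold;
}

// Hypothesized behavior: only an alarm that is in ALARM passes its delta
// to the shared policy; if no alarm is firing, nothing happens.
function sharedPolicyAdjustment(metric: number, alarms: Alarm[], steps: Step[]): number {
  for (const alarm of alarms) {
    if (!inAlarm(alarm, metric)) {
      continue;
    }
    const distance = metric - alarm.threshold;
    const match = steps.find(s =>
      (s.lowerBound === undefined || distance >= s.lowerBound) &&
      (s.upperBound === undefined || distance < s.upperBound));
    return match ? match.adjustment : 0;
  }
  return 0;
}

const twoAlarms: Alarm[] = [
  { comparison: 'GreaterThanThreshold', threshold: 70 },
  { comparison: 'LessThanThreshold', threshold: 30 },
];

// The single policy from the diagram: boundaries at -20, 0 and +20.
const sharedSteps: Step[] = [
  { upperBound: -20, adjustment: -3 },
  { lowerBound: -20, upperBound: 0, adjustment: -1 },
  { lowerBound: 0, upperBound: 20, adjustment: +1 },
  { lowerBound: 20, adjustment: +3 },
];
```

Under this model a CPU value of 95 is 25 above the upper alarm and a value of 5 is 25 below the lower alarm, so both reach the outermost tiers, while anything between 30 and 70 triggers neither alarm.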

Questions

  • Can an AutoScaling Policy be in the OKActions of a CloudWatch Alarm? Does that work? => yes
  • If it does work, will it also continue evaluating and triggering the Scaling Policy periodically if it's on the OKActions, just like it does in AlarmActions? => yes
  • Can a single AutoScaling policy even be the target of two CloudWatch alarms? If so, will it respect both threshold deltas as I'm expecting it to in scenario 3?
  • Is there a difference between Application AutoScaling and EC2 Instance AutoScaling?
  • Can I have multiple StepScalingPolicies on the same target that both scale the target at the same time? Are they going to fight? What if one has a ChangeInCapacity = 0 ?
@rix0rrr
Contributor Author

rix0rrr commented Oct 5, 2018

By the way, this is all predicated on the assumption that users would rather think in terms of the very first thing I showed: absolute metric values and the scaling behavior based on that; all the drilldown that happens to scaling policies to me feels just like implementation details and calculations that can and should be hidden.

I don't have data for that, it's just a gut feeling. Am I sorely wrong on that?

@allisaurus
Contributor

Can I ask why you're not considering TargetTracking here? From my (admittedly limited) understanding, this would be a way to scale in and out based on one policy/metric. Are there specifics re: how such a setup would behave that you're concerned about?

@rix0rrr
Contributor Author

rix0rrr commented Oct 5, 2018

I love the idea of target tracking, and I will definitely implement that as well.

But do you think I should not implement step scaling at all? Just leave it out of the API and force all customers to use target tracking if they want to autoscale?

@rix0rrr
Contributor Author

rix0rrr commented Oct 5, 2018

@allisaurus, by the way, do custom metrics work for target tracking? Because the CloudFormation docs seem to imply they don't.

@allisaurus
Contributor

It looks like custom metrics related to EC2 instance utilization are permissible with target tracking. Looks like there may be a gap in what's currently supported in the service vs. via CFN.

To your second point: accommodating use cases that require significantly different scale in/out behavior, or those that work off metrics incompatible w/ target tracking, would be a good argument for implementing step scaling as well (though I'm ill equipped to speak to how prevalent they are).

@rix0rrr
Contributor Author

rix0rrr commented Oct 6, 2018

I don't think we can afford to implement only part of the feature set. So that means we have to build an API for step scaling anyway, and I'd like it to be as good as possible.

@jungseoklee
Contributor

> @allisaurus, by the way, do custom metrics work for target tracking? Because the CloudFormation docs seem to imply they don't.

The link is about Application Auto Scaling, not EC2 Auto Scaling.

@rix0rrr
Contributor Author

rix0rrr commented Oct 8, 2018

@jungseoklee where did you get the impression this topic was about EC2 autoscaling? I don't think I've mentioned it anywhere, and in fact I did come at this by just looking at App AutoScaling so far (although I do believe the instance autoscaling API is very similar).

@rix0rrr
Contributor Author

rix0rrr commented Oct 8, 2018

Here's another question, feel free to provide input:

How are we going to model the API to represent thresholds and scaling actions?

Let's say for a fictitious CPU usage/scaling example:

Option 1: fluent API

scaling
    .at(0).scale(-2)
    .at(10).scale(-1)
    .at(20).scale(0)
    .at(80).scale(+1)
    .at(90).scale(+2)
    .end()
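
One possible reading of Option 1, sketched as a small builder: each `at(x).scale(n)` pair opens an interval starting at `x` that runs until the next `at()` value, and the last interval is open-ended. `ScalingBuilder` and `Tier` are hypothetical names, and this interpretation of the fluent calls is an assumption rather than a settled design.

```typescript
interface Tier {
  lower: number;
  upper?: number; // undefined = open-ended
  adjustment: number;
}

// Hypothetical builder illustrating the fluent style; not an existing CDK class.
class ScalingBuilder {
  private readonly points: Array<{ at: number; adjustment: number }> = [];
  private pendingAt?: number;

  at(value: number): this {
    if (this.pendingAt !== undefined) {
      throw new Error('at() must be followed by scale() before the next at()');
    }
    this.pendingAt = value;
    return this;
  }

  scale(adjustment: number): this {
    if (this.pendingAt === undefined) {
      throw new Error('scale() must follow at()');
    }
    this.points.push({ at: this.pendingAt, adjustment });
    this.pendingAt = undefined;
    return this;
  }

  // Turns the recorded points into intervals, sorted by their lower bound.
  end(): Tier[] {
    if (this.pendingAt !== undefined) {
      throw new Error('Dangling at() without scale()');
    }
    const sorted = [...this.points].sort((a, b) => a.at - b.at);
    return sorted.map((p, i) => ({
      lower: p.at,
      upper: i + 1 < sorted.length ? sorted[i + 1].at : undefined,
      adjustment: p.adjustment,
    }));
  }
}

const fluentTiers = new ScalingBuilder()
  .at(0).scale(-2)
  .at(10).scale(-1)
  .at(20).scale(0)
  .at(80).scale(+1)
  .at(90).scale(+2)
  .end();
```

A builder like this can reject a dangling `at()` at `end()` time, which is one thing the fluent form makes easy to check.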

Option 2: allow omitting bounds of bordering intervals

scaling.addTier({ upperBound: 10, adjustment: -2 });
scaling.addTier({ upperBound: 20, adjustment: -1 });
scaling.addTier({ lowerBound: 80, upperBound: 90, adjustment: +1 });
scaling.addTier({ adjustment: +2 });

or:

scaling.addTier({ upperBound: 10, adjustment: -2 });
scaling.addTier({ upperBound: 20, adjustment: -1 });
scaling.addTier({ lowerBound: 80, adjustment: +1 });
scaling.addTier({ lowerBound: 90, adjustment: +2 });

Option 3: mixing thresholds and scales in a single array:

scale([ 0, -2, 10, -1, 20, 0, 80, +1, 90, +2, 100 ])

Option 4: separate thresholds and scales

.thresholds([0, 10, 20, 80, 90, 100])
.scales([-2, -1, 0, 1, 2])

@jungseoklee
Contributor

> @jungseoklee where did you get the impression this topic was about EC2 autoscaling? I don't think I've mentioned it anywhere, and in fact I did come at this by just looking at App AutoScaling so far (although I do believe the instance autoscaling API is very similar).

True. You only mentioned two terms, "CPU usage" and "instance is added". Nevertheless, I unconsciously combined those terms with 1) the link in the comment stream about custom metrics related to EC2 instance utilization, and 2) my understanding that step scaling policies are not applicable to DynamoDB [1], which falls under Application Auto Scaling. That is how I got the impression.

I would like to understand this topic correctly. There are no other intentions.

[1] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-applicationautoscaling-scalingpolicy.html#cfn-applicationautoscaling-scalingpolicy-policytype

@jungseoklee
Contributor

> Here's another question, feel free to provide input:
>
> How are we going to model the API to represent thresholds and scaling actions?

I would vote for option 1 which is a great abstraction and modeling to me.

Regarding option 2, users need to understand what upperBound, lowerBound, and adjustment are, and upperBound and lowerBound seem to be optional members of the property passed to the addTier method.

Regarding option 3, it seems error-prone compared to other options. For example, what if I switch 0 scale with 20 threshold like scale([ 0, -2, 10, -1, 0, 20, 80, +1, 90, +2, 100 ])?

In the case of option 4, this option looks better than options 2 and 3, but we probably need to check that array size of scales == (array size of thresholds - 1).
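
The size check for option 4 could look roughly like this. `fromThresholdsAndScales` and `Interval` are hypothetical names for illustration, and the extra check that thresholds are strictly increasing is an added assumption about what a sensible input looks like.

```typescript
interface Interval {
  lower: number;
  upper: number;
  adjustment: number;
}

// Hypothetical helper: pairs N+1 thresholds with N scales.
function fromThresholdsAndScales(thresholds: number[], scales: number[]): Interval[] {
  // N+1 thresholds delimit exactly N intervals.
  if (scales.length !== thresholds.length - 1) {
    throw new Error(`Expected ${thresholds.length - 1} scales for ${thresholds.length} thresholds, got ${scales.length}`);
  }
  // Assumption: thresholds must be strictly increasing.
  for (let i = 1; i < thresholds.length; i++) {
    if (thresholds[i] <= thresholds[i - 1]) {
      throw new Error('Thresholds must be strictly increasing');
    }
  }
  return scales.map((adjustment, i) => ({
    lower: thresholds[i],
    upper: thresholds[i + 1],
    adjustment,
  }));
}

const intervals = fromThresholdsAndScales([0, 10, 20, 80, 90, 100], [-2, -1, 0, 1, 2]);
```

The two arrays can still drift apart when someone edits only one of them, but at least the mismatch is caught when the policy is defined rather than at deploy time.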

@allisaurus
Contributor

@rix0rrr 👍 agreed re: not implementing just part of the feature set (and wanting to make step scaling the best it can be). I only meant that since use cases exist which won't be well accommodated by target tracking, I think we should implement step scaling too vs. just target tracking. Thanks, btw, for creating this issue for discussion!

rix0rrr added a commit that referenced this issue Oct 25, 2018
Adds a construct library for Application AutoScaling.

The DynamoDB construct library has been updated to use the 
new AutoScaling mechanism, which allows more configuration
and uses a Service Linked Role instead of a role per table.

BREAKING CHANGE: instead of `addReadAutoScaling()`, call
`autoScaleReadCapacity()`, and similar for write scaling.

Fixes #856, #861, #640, #644.
rix0rrr pushed a commit that referenced this issue Oct 26, 2018
__IMPORTANT NOTE__: when upgrading to this version of the CDK framework, you must also upgrade
your installation of the CDK Toolkit to the matching version:

```shell
$ npm i -g aws-cdk
$ cdk --version
0.14.0 (build ...)
```

Bug Fixes
=========

* remove CloudFormation property renames ([#973](#973)) ([3f86603](3f86603)), closes [#852](#852)
* **aws-ec2:** fix retention of all egress traffic rule ([#998](#998)) ([b9d5b43](b9d5b43)), closes [#987](#987)
* **aws-s3-deployment:** avoid deletion during update using physical ids ([#1006](#1006)) ([bca99c6](bca99c6)), closes [#981](#981) [#981](#981)
* **cloudformation-diff:** ignore changes to DependsOn ([#1005](#1005)) ([3605f9c](3605f9c)), closes [#274](#274)
* **cloudformation-diff:** track replacements ([#1003](#1003)) ([a83ac5f](a83ac5f)), closes [#1001](#1001)
* **docs:** fix EC2 readme for "natgatway" configuration ([#994](#994)) ([0b1e7cc](0b1e7cc))
* **docs:** updates to contribution guide ([#997](#997)) ([b42e742](b42e742))
* **iam:** Merge multiple principals correctly ([#983](#983)) ([3fc5c8c](3fc5c8c)), closes [#924](#924) [#916](#916) [#958](#958)

Features
=========

* add construct library for Application AutoScaling ([#933](#933)) ([7861c6f](7861c6f)), closes [#856](#856) [#861](#861) [#640](#640) [#644](#644)
* add HostedZone context provider ([#823](#823)) ([1626c37](1626c37))
* **assert:** haveResource lists failing properties ([#1016](#1016)) ([7f6f3fd](7f6f3fd))
* **aws-cdk:** add CDK app version negotiation ([#988](#988)) ([db4e718](db4e718)), closes [#891](#891)
* **aws-codebuild:** Introduce a CodePipeline test Action. ([#873](#873)) ([770f9aa](770f9aa))
* **aws-sqs:** Add grantXxx() methods ([#1004](#1004)) ([8c90350](8c90350))
* **core:** Pre-concatenate Fn::Join ([#967](#967)) ([33c32a8](33c32a8)), closes [#916](#916) [#958](#958)

BREAKING CHANGES
=========

* DynamoDB AutoScaling: Instead of `addReadAutoScaling()`, call `autoScaleReadCapacity()`, and similar for write scaling.
* CloudFormation resource usage: If you use L1s, you may need to change some `XxxName` properties back into `Name`. These will match the CloudFormation property names.
* You must use the matching `aws-cdk` toolkit when upgrading to this version, or context providers will cease to work. All existing cached context values in `cdk.json` will be invalidated and refreshed.
@rix0rrr rix0rrr mentioned this issue Oct 26, 2018
rix0rrr added a commit that referenced this issue Oct 26, 2018
jonparker pushed a commit to jonparker/aws-cdk that referenced this issue Oct 29, 2018
3 participants