Add HPA #305
Conversation
LGTM
LGTM, I would suggest using the ScalingActive condition to determine deploy success. As the comment around its definition states: "ScalingActive indicates that the HPA controller is able to scale if necessary: it's correctly configured, can fetch the desired metrics, and isn't disabled." Though I don't particularly feel strongly one way or the other between this and AbleToScale.

I found the difference between the intent of ScalingActive and AbleToScale a bit confusing too, so I looked up all the possible reasons for them (this file): AbleToScale, ScalingActive. Based on that, AbleToScale is about the ability to scale right now in particular, whereas ScalingActive tells us whether the HPA is correctly configured. So the latter is what we want. We should probably ignore false values with that last reason though, or else deploys will fail when people using HPAs scale to zero manually.
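For illustration, a minimal sketch of what that could look like, assuming condition-parsing helpers like the ones added later in this PR (the ScalingDisabled special case here is the reviewer's suggestion, not merged code):

def deploy_succeeded?
  # Succeed when ScalingActive is True, but also when it is False only
  # because scaling was disabled (e.g. someone manually scaled to zero).
  scaling_active_condition["status"] == "True" ||
    scaling_active_condition["reason"] == "ScalingDisabled"
end

def scaling_active_condition
  conditions = (@instance_data || {}).dig("status", "conditions") || []
  conditions.detect { |c| c["type"] == "ScalingActive" } || {}
end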
# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    PRUNABLE = true
I do like the idea of defining this at the class level, but this doesn't do anything right now. Your test passes because, oddly enough, HPA was already on the whitelist.
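A hedged sketch of the kind of wiring that would make the flag meaningful (the prune_whitelist builder and the use of ActiveSupport's Class#descendants are assumptions, not the repo's current code):

def prune_whitelist
  # Hypothetical: collect the kinds of all resource classes that opt in
  # via a PRUNABLE constant, instead of maintaining a hardcoded list.
  KubernetesResource.descendants.select do |klass|
    klass.const_defined?(:PRUNABLE) && klass::PRUNABLE
  end.map(&:kind)
end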
    end

    def deploy_failed?
      !exists?
I think this should be checking the ScalingActive condition too (but the deploy should succeed when that's false because of ScalingDisabled).
    end

    def timeout_message
      UNUSUAL_FAILURE_MESSAGE
I'm not sure this message ("It is very unusual for this resource type to fail to deploy. Please try the deploy again. If that new deploy also fails, contact your cluster administrator.") is fitting, since HPAs have legitimate failure modes (unlike, say, configmaps).
    end

    def type
      'hpa.v2beta1.autoscaling'
Why this? Don't we want it called HorizontalPodAutoscaler?
We do, but this method determines how kubectl fetches the resource. And we don't get the status conditions until v2beta1. Do we want to add a new method?
test/fixtures/hpa/deployment.yml
Outdated
metadata:
  labels:
    name: web
    app: hello-cloud
nit: these labels should be "hpa" and "hpa-deployment"
@@ -1006,4 +1006,21 @@ def test_raise_on_yaml_missing_kind
      " datapoint: value1"
    ], in_order: true)
  end
  def test_hpa_can_be_successful
    skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
Your PR description says "The downside of using v2beta is no support in 1.8."; did you mean 1.7? If that's true, our unofficial deprecation policy (following GKE) would allow us to drop 1.7 at this point, and then the hpa fixtures could be folded into hello-cloud, as could the cronjob tests. If that isn't the case (i.e. the conditions aren't present in a version we need to support), don't we need alternate success/failure conditions in the code itself rather than here, to avoid breaking deploys for people with HPAs on that version?
It was added in 1.8 and I've updated the description. What does dropping support for 1.7 mean past not running 1.7 CI?
test/helpers/kubeclient_helper.rb
Outdated
@@ -25,4 +25,8 @@ def apps_v1beta1_kubeclient
  def batch_v1beta1_kubeclient
    @batch_v1beta1_kubeclient ||= build_batch_v1beta1_kubeclient(MINIKUBE_CONTEXT)
  end

  def autoscaling_v2beta1_kubeclient
    @autoscaling_v2beta1_kubeclient ||= build_autoscaling_v2beta1_kubeclient(MINIKUBE_CONTEXT)
I'm not seeing these clients being used
  def test_hpa_can_be_successful
    skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
    assert_deploy_success(deploy_fixtures("hpa"))
Please try to include at least one relevant logging assertion with your tests. The logs are our UI, so we should always check (PRINT_LOGS=1) what they look like and assert that we haven't broken them. Typically I'll assert on what gets printed about the resource in the summary.
> It looks like I used AbleToScale vs ScalingActive because scalingActive is false in minikube
Doesn't that mean that HPAs don't work in minikube? I don't think that's a good reason not to use the correct condition, even if testing will be a problem. Did you dig into the cause of that minikube problem? This issue seems to suggest the problem could be that the resource in question doesn't have req/limits set properly. Could that be the case with the fixture you're using?
> We do, but this method determines how kubectl fetches the resource. And we don't get the status conditions until v2beta1. Do we want to add a new method?
I guess so? Perhaps we should consider pinning our access GVs in general. 🤔 Bottom line for this PR though is I don't think we should print an unfriendly gvk string as part of the name in our output.
> It was added in 1.8 and I've updated the description. What does dropping support for 1.7 mean past not running 1.7 CI?
That's pretty much all it means (well, that and documenting it in our readme, and removing any test workarounds we have in place to support 1.7).
# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 30.seconds
Should we make this a little longer? Seemingly some metrics check is happening, and I have no idea how long that can take.
Turns out the issue was kubernetes/kubernetes#57673, i.e. we need to deploy the metric server from https://github.com/kubernetes-incubator/metrics-server. This is ready for 👀 again
  private

  def deploy_metric_server
Is there a helper file where we've been putting methods like this?
Sort of, yes. This is being deployed globally, so I think it should happen during the initial test suite setup, which is taken care of inside test_helper itself. I'd add it to this module, and then call it with the PV setup, i.e. when the file is loaded.
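Roughly this shape, as a sketch (the module name and method names here are hypothetical, not the actual contents of test_helper):

module TestProvisioner # hypothetical module name in test_helper.rb
  def self.prepare_cluster
    deploy_metric_server
    # ... the existing PV setup would also live here ...
  end

  def self.deploy_metric_server
    # see the kubectl-apply sketch later in this thread
  end
end

TestProvisioner.prepare_cluster # runs once, when test_helper.rb is loaded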
Did you have to tweak the metrics server configs at all to get them working, or are they a direct copy-paste from https://github.com/kubernetes-incubator/metrics-server?
@@ -146,6 +146,10 @@ def type
    @type || self.class.kind
  end

  def fetch_type
    type
re: the Policial check, we can disable Shopify/StrongerParameters for this repo since it doesn't use that gem / have controllers.

I found this name confusing though. What we have here is a "group version resource" string that we're underspecifying in all cases except HPA. This string is really kubectl-specific (e.g. if we were using kubeclient, we'd need GV paths), so maybe a name that reflects that would be appropriate, e.g. kubectl_resource_type.
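Something along these lines, where the base class defaults to the plain type and HPA pins its group/version (a sketch of the reviewer's suggestion, not the merged code):

# In KubernetesResource: default to the plain type name.
def kubectl_resource_type
  type
end

# In HorizontalPodAutoscaler: pin the group/version kubectl should use,
# since status conditions only exist from autoscaling/v2beta1 onward.
def kubectl_resource_type
  'hpa.v2beta1.autoscaling'
end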
    def deploy_failed?
      return false unless exists?
      recoverable = RECOVERABLE_CONDITIONS.any? { |c| scaling_active_condition.fetch("reason", "").start_with?(c) }
      able_to_scale_condition["status"] == "False" || (scaling_active_condition["status"] == "False" && !recoverable)
I don't think the "able to scale" condition is relevant at all (same in status
). It's talking about whether or not the target in question can be scaled at this particular moment (based on things like whether or not it was just scaled a moment ago), not on the general validity of the autoscaling configuration. I see you have a test that implies that "able to scale - false; FailedGet" means the target resource may not exist, but couldn't failing on that cause a race condition, since the HPA doesn't get deployed after deployments? Couldn't it also mean the request for that resource (transiently) failed, i.e. should be excluded on the same grounds as other FailedGetXs? Is "Scaling active" actually true when the target doesn't exist?
I actually think FailedGetScale is relevant, since when we are failing with that reason there isn't an AbleToScale condition to look at. In theory it's recoverable, but do we want the deploy to fail or time out?
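If we dropped the AbleToScale check and failed only on non-recoverable ScalingActive reasons, the method might look roughly like this (a sketch; the exact contents of RECOVERABLE_CONDITIONS are assumed, not taken from the PR):

RECOVERABLE_CONDITIONS = %w(ScalingDisabled FailedGetResourceMetric) # assumed list

def deploy_failed?
  return false unless exists?
  return false if scaling_active_condition.empty? # conditions not yet populated
  recoverable = RECOVERABLE_CONDITIONS.any? do |reason|
    scaling_active_condition.fetch("reason", "").start_with?(reason)
  end
  scaling_active_condition["status"] == "False" && !recoverable
end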
@@ -114,7 +114,7 @@ def file_path
    end

    def sync(mediator)
-     @instance_data = mediator.get_instance(type, name)
+     @instance_data = mediator.get_instance(fetch_type, name)
The sync mediator's caching should also use the new method, and we should make sure we add a test that would catch that.
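i.e. wherever the mediator builds its cache key or bulk-fetch list, it should go through the same accessor. A hypothetical shape (request_instance and the cache layout here are assumptions, not the mediator's actual internals):

# Cache by the same kubectl-facing type string the resource uses to
# fetch itself, so v2beta1 HPAs aren't looked up under the bare "hpa" type.
def get_instance(kubectl_resource_type, name)
  @cache.dig(kubectl_resource_type, name) ||
    request_instance(kubectl_resource_type, name)
end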
      if !exists?
        super
      elsif deploy_succeeded?
        "Succeeded"
"Configured"? "Succeeded" sounds like we're claiming it scaled something.
test/fixtures/hpa/hpa/deployment.yml
Outdated
@@ -0,0 +1,24 @@
apiVersion: apps/v1beta1
Why is this fixture set two layers deep? i.e. why not test/fixtures/hpa/thing.yml?

Edit: I see it's because the top-level dir contains another one with the metrics server config, which I commented on elsewhere.
test/fixtures/hpa/hpa/deployment.yml
Outdated
metadata:
  name: hpa-deployment
  annotations:
    shipit.shopify.io/restart: "true"
no need for this annotation
# Set up the metric server that the HPA needs: https://github.com/kubernetes-incubator/metrics-server
ns = @namespace
@namespace = "kube-system"
assert_deploy_success(deploy_fixtures("hpa/kube-system", allow_protected_ns: true, prune: false))
hpa/kube-system isn't really a fixture set; it's more like test infra components. I'd move all those configs somewhere else, like test/setup/metrics-server or something. Maybe we should even use kubeclient or kubectl apply to create it... seems a little weird to have the test setup itself depend on the correct functioning of the core code.
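For instance, something like this in the suite setup (a sketch of the kubectl-apply idea; the test/setup/metrics-server path is the hypothetical location suggested above):

require 'open3'

# Create test infra with kubectl directly, so suite setup doesn't
# depend on the deploy code under test.
_out, err, status = Open3.capture3(
  "kubectl", "--context=minikube", "apply", "-f", "test/setup/metrics-server/"
)
raise "metrics-server setup failed: #{err}" unless status.success?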
# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 5.minutes
Is this arbitrary or based on testing? 30s seemed short to me, but this seems really long 😄
Direct copy-paste
# frozen_string_literal: true
module KubernetesDeploy
  class HorizontalPodAutoscaler < KubernetesResource
    TIMEOUT = 3.minutes
One test deploy took 120.5s; this should give us enough buffer.
    skip if KUBE_SERVER_VERSION < Gem::Version.new('1.8.0')
    assert_deploy_failure(deploy_fixtures("hpa", subset: ["hpa.yml"]), :timed_out)
    assert_logs_match_all([
      "Deploying HorizontalPodAutoscaler/hello-hpa (timeout: 180s)",
This test is guaranteed to take at least 180s 😢
There's some value in asserting that the message displayed on timeout is helpful, but I'm not sure it's worth 3 minutes of CI. Timeouts are fallback behaviour, not something we're generally striving for. I'd be inclined to replace this with a test for a case that actually fails.
I'm having a hard time figuring out how to make the HPA pass validation, have AbleToScale be true, and have ScalingActive be false without the cause being in RECOVERABLE_CONDITIONS. The k8s tests (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal_test.go) weren't very instructional; any suggestions?
I can't find anything useful either. Kinda weird that so many conditions have been dedicated to situations that are not possible to create from the outside. My final observation is that in playing around with this locally, there seems to be a substantial delay before the conditions get populated, so maybe it would be safe to fail on ScalingActive/FailedGetThingMetric after all, i.e. the benefit of failing fast for those misconfigurations would outweigh the (small) possibility of a race condition causing spurious failures. Without much direct HPA experience, I'm not really sure what is best.

If we can't reproduce a failure scenario, maybe we can set a short timeout on the hpa resource? That won't really work if the condition message we'd look for in the logs can also take 3 minutes to appear, though.
How would you feel about using Timecop.scale to make this test faster in wall-clock time? In general I dislike trading correctness for an optimization like failing fast.
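Roughly this shape (a sketch; it assumes the deploy timeout logic reads Time.now, which Timecop stubs, and that scaled time outpaces the polling sleeps):

require 'timecop'

def test_hpa_times_out
  Timecop.scale(30) do # ~180s of timeout budget passes in ~6s of wall-clock time
    assert_deploy_failure(deploy_fixtures("hpa", subset: ["hpa.yml"]), :timed_out)
  end
end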
"Configured" | ||
elsif scaling_active_condition.present? || able_to_scale_condition.present? | ||
condition = scaling_active_condition.presence || able_to_scale_condition | ||
"#{condition['reason']} (#{condition['message']})" |
Looking at the test output, this makes for a really long status string. Usually those are a word or two. I'd suggest using the reason as the status and moving the message to failure_message/timeout_message when it is present. We're not currently setting those at all, and as a result we're getting the default "Kubernetes will continue to deploy this..." timeout message, which isn't great.
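Concretely, the split could look like this (a sketch of the suggestion; the condition helpers come from this PR, and the alias for timeout_message is an assumption):

def status
  if !exists?
    super
  elsif deploy_succeeded?
    "Configured"
  else
    scaling_active_condition["reason"] || able_to_scale_condition["reason"] || super
  end
end

def failure_message
  condition = scaling_active_condition.presence || able_to_scale_condition
  condition["message"]
end
alias_method :timeout_message, :failure_message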
test/unit/sync_mediator_test.rb
Outdated
class FakeHPA < MockResource
  def kubectl_resource_type
    'fakeHPA'
Isn't this the same as the value of type? If so the test isn't proving anything.
[1] pry(#)> hpa.kubectl_resource_type
=> "fakeHPA"
[2] pry(#)> hpa.type
=> "FakeHPA"
I'll update to make it more clear.
Force-pushed from c5c06ef to 4201e28
1.8 is failing but not 1.9 or 1.10. Do you think https://github.com/kubernetes-incubator/metrics-server supports 1.7 and >1.8 (not >=1.8)?
1.8 failing appears to be related to minikube defaults that were changed in 1.9. I was able to get the tests to pass by adding kubernetes/kubernetes#57673
Add HPA Resource (part of splitting up https://github.com/Shopify/kubernetes-deploy/pull/188/files)

In v2beta1 we get access to three condition statuses: AbleToScale, ScalingActive, and ScalingLimited. I think it makes sense to succeed when AbleToScale is true. The downside of using v2beta1 is no support in 1.7. I think it's worth the tradeoff.