K8s 110 ci with fixes #308
Conversation
👀 Thanks for looking at this; I dropped the ball on following up.
```diff
@@ -119,14 +119,14 @@ def rollout_data
   end

   def progress_condition
-    return unless exists?
+    return unless exists? && progress_deadline
```
I'm a bit rusty on this code - I understand why the `progress_deadline` body had to be changed, but why/how was `progress_condition` working with just the `exists?` check before this PR?
In 1.10 the `Progressing` condition (added by a k8s controller) is now being added to deployments even when `progressDeadlineSeconds` isn't in the deployed spec. This isn't happening in 1.9. (I'm not sure why it's happening. A guess, which may not be valid since I'm not sure where in the 1.9-to-1.10 k8s diff to look, is that the default value of `progressDeadlineSeconds` is getting merged in sooner than before?)
The test in question uses an extensions/v1beta1 Deployment. PDS is not supposed to get defaulted at all in that GV... maybe 1.10 added a bug? Can you confirm whether PDS itself is being defaulted now, or whether it's just the Progressing condition being added (and never expiring, I guess?)?
If we change our progress deadline to depend on `@definition['spec']['progressDeadlineSeconds'].present?` as in this PR, we'll have a behaviour change: previously, if your Deployment got a PDS via defaulting (e.g. an apps/v1 Deployment with no PDS specified), we'd use a PDS-based timeout. After this change, cases like that would use the hard timeout.
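A minimal sketch of the behaviour change being described, with illustrative method names (not the gem's actual API):

```ruby
# Before this PR (sketch): trust the live object's spec, so a PDS injected
# by API-server defaulting still yields a PDS-based timeout.
def progress_deadline_before(instance_spec)
  instance_spec["progressDeadlineSeconds"]
end

# After this PR (sketch): only honour a PDS the user wrote into the
# template; defaulted values fall through to the hard timeout instead.
def progress_deadline_after(definition)
  definition.dig("spec", "progressDeadlineSeconds")
end
```

Under the first version, a defaulted `progressDeadlineSeconds: 600` on the live object is used; under the second, the same deployment falls back to the hard timeout because the template never set it.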
Note that the 1.10 docs still claim there is no default value in extensions/v1beta1. And the controller code removes the Progressing condition if no PDS is set: https://github.com/kubernetes/kubernetes/blob/74bcefc8b2bf88a2f5816336999b524cc48cf6c0/pkg/controller/deployment/progress.go#L38-L41 😕
I took a look at an older deployment that didn't have PDS set and was also using `extensions/v1beta1`. It does indeed get its PDS set to the default 600 seconds in its spec. Could it be that the apiserver is using `apps/v1` under the hood and the default is being injected into the `extensions/v1beta1` spec?
The spec definition: https://github.com/Shopify/u2/blob/73c0ed1f461b319bd5adb35a5865f9beacf82a9e/config/deploy/staging/web.yml.erb#L62
Inside the cluster (PDS has been set automatically somewhere):
```yaml
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 2
  selector:
    ...
```
Viewed in this light, the controller code Katrina linked seems to suggest that you need to explicitly set PDS to nil to avoid the Progressing condition, rather than relying on the default.
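A rough Ruby paraphrase of the linked Go controller logic (which is in `progress.go`): the Progressing condition is only synced when a progress deadline is present on the spec, so a defaulted-in PDS makes the controller treat the deployment as having one.

```ruby
# Paraphrase of the linked controller check, not the actual Go code:
# the Progressing condition is only managed when the spec carries a
# progressDeadlineSeconds value; a nil/absent PDS skips it.
def syncs_progressing_condition?(spec)
  !spec["progressDeadlineSeconds"].nil?
end
```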
👍 that looks like good evidence. Can you open the issue upstream?
Issue created kubernetes/kubernetes#66135
> 1. If the resource has the `kubernetes-deploy.shopify.io/timeout-override` annotation on it, use that. This is currently supported for all resources except Deployment.

I agree we should add this, but I'd like to keep the scope of this PR to just what's needed to make CI work with 1.10.
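For reference, a hedged sketch of what the proposed override lookup could look like; only the annotation name comes from this thread, and the method name and value parsing are assumptions:

```ruby
# Annotation name taken from the comment above; everything else is a sketch.
TIMEOUT_OVERRIDE_ANNOTATION = "kubernetes-deploy.shopify.io/timeout-override"

# Returns the override in seconds, or nil when the annotation is absent.
# Accepts values like "300" or "300s" (format is assumed, not confirmed).
def timeout_override(definition)
  raw = definition.dig("metadata", "annotations", TIMEOUT_OVERRIDE_ANNOTATION)
  raw && Integer(raw.sub(/s\z/, ""))
end
```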
SGTM. We can just open an issue.
Issue created #315
Force-pushed dc3f729 to 2dd27a2.
```ruby
  %r{Deployment/bad-probe: TIMED OUT \(timeout: \d+s\)},
  "Timeout reason: hard deadline for Deployment"
]
end_bad_probe_logs = [%r{Unhealthy: Readiness probe failed: .* \(8 events \)}] # event
```
I think it's OK to ignore Policial here, since we've been using the `%r{}` syntax for regexes and I don't think we want to start using the `//` syntax.
The rule it is enforcing is the "mixed" style if you look at the rubocop cache file. If we don't want that, we need to change our rubocop rules. But in general I'm inclined to stick with the standard style guide.
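As an illustration of the "mixed" `Style/RegexpLiteral` enforced style as I understand it (this is my reading of rubocop's docs, not verified against our config): slashes for simple patterns, `%r{}` when the pattern itself contains slashes.

```ruby
# Mixed style, roughly: plain // when the pattern has no slashes...
simple  = /TIMED OUT/
# ...and %r{} when escaping slashes in // would hurt readability.
slashed = %r{Deployment/bad-probe: TIMED OUT}
```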
Force-pushed 2dd27a2 to 778ba0b.
Force-pushed 778ba0b to 8b68bee.
CI is passing, can I get 👀 again?
Seems straightforward now
dev.yml (Outdated)
```diff
@@ -8,7 +8,7 @@ up:
   - custom:
       name: Minikube Cluster
       met?: test $(minikube status | grep Running | wc -l) -eq 2 && $(minikube status | grep -q 'Correctly Configured')
-      meet: minikube start --kubernetes-version=v1.9.4 --vm-driver=hyperkit
+      meet: minikube start --kubernetes-version=v1.10.4 --vm-driver=hyperkit
```
1.10.0 is still the latest available in minikube afaik
> CI is passing

It is, but it's waiting 600s for that deployment. This shouldn't merge until it no longer causes a regression in our CI times. How to fix that will have to be decided case by case, depending on what the purpose of using that fixture was for a given test.
```ruby
bad_probe_timeout = if KUBE_SERVER_VERSION < Gem::Version.new("1.10.0")
  "Deployment/bad-probe: TIMED OUT (timeout: 5s)"
else
  "Deployment/bad-probe: GLOBAL WATCH TIMEOUT (20 seconds)"
```
In this version, `bad-probe` has no purpose, since `missing-volumes` is already covering the global watch timeout.
Force-pushed 4683381 to 18b2724.
Force-pushed 18b2724 to 1f909ae.
```diff
@@ -534,7 +534,11 @@ def test_pruning_of_existing_managed_secrets_when_ejson_file_has_been_deleted
   end

   def test_deploy_result_logging_for_mixed_result_deploy
     subset = ["bad_probe.yml", "init_crash.yml", "missing_volumes.yml", "config_map.yml"]
-    result = deploy_fixtures("invalid", subset: subset)
+    result = deploy_fixtures("invalid", subset: subset) do |f|
+      if KUBE_SERVER_VERSION >= Gem::Version.new("1.10.0")
```
Can you please add an inline comment by the first of these changes in each test, linking to the issue you opened?
In 1.10, progressDeadlineSeconds is in the instance data even if it's not in the spec. As a result, our progressDeadlineSeconds logic is triggered instead of the hard-timeout logic.
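The version gate used in the tests can be sketched as follows; the `KUBE_SERVER_VERSION` value here is illustrative, since in the suite it is parsed from the running cluster:

```ruby
require "rubygems" # Gem::Version

# Illustrative value; the test suite derives this from the cluster.
KUBE_SERVER_VERSION = Gem::Version.new("1.10.4")

def expects_defaulted_pds?
  # On 1.10+, the API server injects progressDeadlineSeconds: 600 even for
  # extensions/v1beta1 Deployments that omit it (kubernetes/kubernetes#66135),
  # so our PDS-based timeout logic fires instead of the hard timeout.
  KUBE_SERVER_VERSION >= Gem::Version.new("1.10.0")
end
```

`Gem::Version` compares segment-wise, so `"1.10.4"` correctly sorts after `"1.9.4"`, which a plain string comparison would get wrong.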
This is #299 with a commit to fix the issue.