
K8s 110 ci with fixes #308

Merged: 9 commits into master from k8s-110-ci-with-fixes on Jul 20, 2018
Conversation

@dturn (Contributor) commented Jun 29, 2018:

In 1.10, progressDeadlineSeconds appears in the instance data even if it's not in the spec. As a result, our progressDeadlineSeconds logic is triggered instead of the hard-timeout logic.

This is #299 with a commit to fix the issue.
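For reference, a minimal sketch of the intended behaviour (the instance-variable and helper names below are assumptions for illustration, not necessarily the exact kubernetes-deploy internals):

# Sketch only: gate the progress-deadline logic on the template we deployed rather than
# on what the apiserver reports back, so a server-defaulted progressDeadlineSeconds
# no longer suppresses the hard-timeout path.
def progress_deadline
  @definition.dig("spec", "progressDeadlineSeconds") # nil unless the deployed spec sets it
end

def progress_condition
  return unless exists? && progress_deadline
  @instance_data.dig("status", "conditions")&.find { |c| c["type"] == "Progressing" }
end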

@stefanmb (Contributor):

👀 Thanks for looking at this, I dropped the ball on following up.

@@ -119,14 +119,14 @@ def rollout_data
end

def progress_condition
return unless exists?
return unless exists? && progress_deadline
Contributor:

I'm a bit rusty on this code - I understand why the progress_deadline body had to be changed, but why/how was progress_condition working with just the exists? check before this PR?

Contributor Author:

In 1.10 the Progressing condition (added by a k8s controller) is now being added to deployments even when progressDeadlineSeconds isn't in the spec that was deployed. This isn't happening in 1.9. (I'm not sure why it's happening. One guess, which may not be valid since I'm not sure where in the k8s diff between 1.9 and 1.10 to look, is that the default value of progressDeadlineSeconds is getting merged in sooner than before.)

Contributor:

The test in question uses an extensions/v1beta1 Deployment. PDS is not supposed to get defaulted at all in that GV... maybe 1.10 added a bug? Can you confirm whether PDS itself is being defaulted now, or whether it's just the Progressing condition being added (and never expiring, I guess?)?

If we change our progress deadline to depend on whether @definition['spec']['progressDeadlineSeconds'].present? as in this PR, we'll have a behaviour change: previously, if your Deployment got a PDS via defaulting (e.g. an apps/v1 Deployment with no PDS specified), we'd use a PDS-based timeout. After this change, cases like that would use the hard timeout.
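To make the behaviour change concrete, here is a rough illustration (the hashes below are invented for the example, not taken from the codebase):

# apps/v1 Deployment with no PDS in the template: the apiserver defaults it to 600.
template = { "spec" => { "replicas" => 3 } }
live     = { "spec" => { "replicas" => 3, "progressDeadlineSeconds" => 600 } }

live.dig("spec", "progressDeadlineSeconds")      # => 600 (old check: PDS-based timeout)
template.dig("spec", "progressDeadlineSeconds")  # => nil (new check: hard timeout)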

Contributor:

Note that the 1.10 docs still claim there is no default value in extensions/v1beta1. And the controller code removes the Progressing condition if no PDS is set: https://github.com/kubernetes/kubernetes/blob/74bcefc8b2bf88a2f5816336999b524cc48cf6c0/pkg/controller/deployment/progress.go#L38-L41 😕

@timothysmith0609 (Contributor) commented Jul 10, 2018:

I took a look at an older deployment that didn't have PDS set and was also using extensions/v1beta1. It does indeed get its PDS set to the default 600 seconds in its spec. Could it be that the apiserver is using apps/v1 under the hood and the default is being injected into the extensions/v1beta1 spec?

The spec definition: https://github.com/Shopify/u2/blob/73c0ed1f461b319bd5adb35a5865f9beacf82a9e/config/deploy/staging/web.yml.erb#L62

Inside the cluster (PDS has been set automatically somewhere):

spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 2
  selector:
     ...

Viewed in this light, the controller code Katrina linked seems to suggest that you need to explicitly set PDS to nil to avoid the Progressing condition, rather than relying on there being no default.

Contributor:

👍 that looks like good evidence. Can you open the issue upstream?

Contributor Author:

@KnVerey

1 - If the resource has the kubernetes-deploy.shopify.io/timeout-override annotation on it, use that. This is currently supported for all resources except deployment.

I agree we should add this, but I'd like to keep the scope of this PR to just what's needed to make CI work with 1.10.
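For context, a rough sketch of what an annotation-based override could look like (the method name and parsing are assumptions, not the eventual implementation):

TIMEOUT_OVERRIDE_ANNOTATION = "kubernetes-deploy.shopify.io/timeout-override"

def timeout_override
  # Assumed helper: read the override (in seconds) from the deployed template's annotations.
  value = @definition.dig("metadata", "annotations", TIMEOUT_OVERRIDE_ANNOTATION)
  Integer(value) if value
end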

Contributor:

SGTM. We can just open an issue.

Contributor Author:

Issue created #315

@dturn force-pushed the k8s-110-ci-with-fixes branch 3 times, most recently from dc3f729 to 2dd27a2 on July 12, 2018 17:31
%r{Deployment/bad-probe: TIMED OUT \(timeout: \d+s\)},
"Timeout reason: hard deadline for Deployment"
]
end_bad_probe_logs = [%r{Unhealthy: Readiness probe failed: .* \(8 events \)}] # event
Contributor Author:

I think it's ok to ignore Policial here since we've been using the %r{} syntax for regexes and I don't think we want to start using the // syntax.

Contributor:

The rule it is enforcing is the "mixed" style if you look at the rubocop cache file. If we don't want that, we need to change our rubocop rules. But in general I'm inclined to stick with the standard style guide.
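For reference, the cop involved here is presumably Style/RegexpLiteral; a hypothetical .rubocop.yml override, if we ever wanted to always allow %r{}, would look roughly like:

Style/RegexpLiteral:
  EnforcedStyle: percent_r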

@dturn (Contributor Author) commented Jul 12, 2018:

CI is passing, can I get 👀 again?

@timothysmith0609 (Contributor) left a review comment:

Seems straightforward now

dev.yml Outdated
@@ -8,7 +8,7 @@ up:
- custom:
name: Minikube Cluster
met?: test $(minikube status | grep Running | wc -l) -eq 2 && $(minikube status | grep -q 'Correctly Configured')
meet: minikube start --kubernetes-version=v1.9.4 --vm-driver=hyperkit
meet: minikube start --kubernetes-version=v1.10.4 --vm-driver=hyperkit
Contributor:

1.10.0 is still the latest available in minikube afaik

@KnVerey (Contributor) left a review comment:

CI is passing

It is, but it is waiting 600s for that deployment. This should not merge until it no longer causes a regression in our CI times. How to fix that will have to be decided case by case, depending on what the purpose of using that fixture was for a given test.

bad_probe_timeout = if KUBE_SERVER_VERSION < Gem::Version.new("1.10.0")
"Deployment/bad-probe: TIMED OUT (timeout: 5s)"
else
"Deployment/bad-probe: GLOBAL WATCH TIMEOUT (20 seconds)"
Contributor:

In this version, bad-probe has no purpose since missing-volumes is already covering the global watch timeout.

@@ -534,7 +534,11 @@ def test_pruning_of_existing_managed_secrets_when_ejson_file_has_been_deleted

def test_deploy_result_logging_for_mixed_result_deploy
subset = ["bad_probe.yml", "init_crash.yml", "missing_volumes.yml", "config_map.yml"]
result = deploy_fixtures("invalid", subset: subset)
result = deploy_fixtures("invalid", subset: subset) do |f|
if KUBE_SERVER_VERSION >= Gem::Version.new("1.10.0")
Contributor:

Can you please add an inline comment by the first of these changes in each test, linking to the issue you opened?
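For illustration, the kind of per-test tweak plus inline comment being asked for might look roughly like this (the fixture structure and the chosen value are assumptions, not the PR's actual code):

result = deploy_fixtures("invalid", subset: subset) do |fixtures|
  if KUBE_SERVER_VERSION >= Gem::Version.new("1.10.0")
    # 1.10 defaults progressDeadlineSeconds even on extensions/v1beta1 Deployments,
    # which changes which timeout path bad-probe exercises (see the upstream issue).
    deployment = fixtures["bad_probe.yml"]["Deployment"].first
    deployment["spec"]["progressDeadlineSeconds"] = 10
  end
end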

@dturn merged commit e63a1a3 into master on Jul 20, 2018
@dturn deleted the k8s-110-ci-with-fixes branch on July 20, 2018 00:09