
Fix availability reporting for DC with MinReadySeconds set #14936

Merged: 2 commits merged into openshift:master from tnozicka:fix-minreadyseconds-for-dc on Jul 4, 2017

Conversation

@tnozicka (Contributor) commented Jun 28, 2017:

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1463594

Makes AvailableReplicas work with MinReadySeconds set. Also removes the obsolete counting of pods, which overlapped with AvailableReplicas.

@Kargakis PTAL. I want to be sure there was no gotcha behind not using the RC's MinReadySeconds, other than it possibly not being available at the time this code was written.

cc: @mfojtik
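
For context, the availability rule this change relies on can be summarized with a minimal sketch (not the PR's code; the helper name and parameters are illustrative): a replica only counts as available once it has been ready for at least MinReadySeconds.

package main

import (
    "fmt"
    "time"
)

// isAvailable illustrates the MinReadySeconds rule: a replica is available
// only if it is ready and has stayed ready for at least minReadySeconds.
func isAvailable(ready bool, readySince time.Time, minReadySeconds int32, now time.Time) bool {
    if !ready {
        return false
    }
    if minReadySeconds == 0 {
        return true
    }
    return now.Sub(readySince) >= time.Duration(minReadySeconds)*time.Second
}

func main() {
    readySince := time.Now().Add(-5 * time.Second)
    fmt.Println(isAvailable(true, readySince, 10, time.Now())) // false: ready for only ~5s
    fmt.Println(isAvailable(true, readySince, 3, time.Now()))  // true: ready for more than 3s
}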

@tnozicka added this to the 3.6.0 milestone Jun 28, 2017
@mfojtik (Contributor) commented Jun 28, 2017:

@tnozicka this seems like a pretty big change for 3.6

@0xmichalis (Contributor) commented:
Let's get in the fix that propagates MinReadySeconds to RCs; the rest can be deferred to 3.7 (I am in favor of deferring).

@tnozicka force-pushed the fix-minreadyseconds-for-dc branch 2 times, most recently from 00f6fb5 to de420ff on June 29, 2017 09:44
@tnozicka changed the title from "WIP - Fix minReadySeconds for DC" to "Fix availability reporting for DC with MinReadySeconds set" Jun 29, 2017
@tnozicka (Contributor, Author) commented:
@mfojtik @Kargakis I've forked the scary changes to #14954

[test]

// FIXME: remove when tests are migrated to the new client
// (the old one incorrectly translates nil into an empty array)
dc.Spec.Triggers = append(dc.Spec.Triggers, deployapi.DeploymentTriggerPolicy{Type: deployapi.DeploymentTriggerOnConfigChange})
dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)
Reviewer (Contributor) commented:
missing error check

Author (Contributor) replied:
bad rebase, fixing it. thx. And I guess I know where that check went :)

Author (Contributor) replied:
fixed
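
For reference, the fix presumably adds the usual error assertion right after the Create call, along these lines (a sketch; the exact line is not shown in this thread):

dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)
o.Expect(err).NotTo(o.HaveOccurred())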

dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)

g.By("verifying the deployment is created")
rcEvent, err := watch.Until(deploymentChangeTimeout, watcher, func(event watch.Event) (bool, error) {
Reviewer (Contributor) commented:
watch.Until is going to flake - use wait.Poll for now

Author (Contributor) replied:
@Kargakis why is it going to flake? If you mean that the watch can fail in such a short interval, I guess a retry would be an option. I can't do polling, because I would lose eventual consistency and end up with a bad resourceVersion, and that would cause a flake for sure.

Reviewer (Contributor) replied:
There is an issue with watch.Until that is yet to be fixed and causes stuff like rollout status to break unexpectedly: kubernetes/kubernetes#40224 cc: @mfojtik

Not sure what you mean re resourceVersion; the log above says "verifying the deployment is created", so how can polling not work in that case?
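
For reference, the polling approach being suggested here would look roughly like this (a sketch; getRC is a hypothetical closure wrapping whatever Get call the test already uses, and timeout stands in for deploymentChangeTimeout):

package deployments

import (
    "time"

    kerrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/util/wait"
)

// waitForRCToExist polls until getRC stops returning NotFound, i.e. until the
// replication controller has been created.
func waitForRCToExist(getRC func() error, timeout time.Duration) error {
    return wait.Poll(time.Second, timeout, func() (bool, error) {
        err := getRC()
        switch {
        case err == nil:
            return true, nil // got a 200 back: the RC exists
        case kerrors.IsNotFound(err):
            return false, nil // not created yet, keep polling
        default:
            return false, err // any other error aborts the poll
        }
    })
}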

return false, nil
}
g.By("verifying that the deployment is still running")
o.Expect(err).NotTo(o.HaveOccurred())
Reviewer (Contributor) commented:
redundant error check

Author (Contributor) replied:
that's where it went :)

@@ -1,6 +1,7 @@
package deployments

import (
//"errors"
Reviewer (Contributor) commented:
Drop it, not needed.

Author (Contributor) replied:
There are also commented-out parts of the code which use it; this is because this PR was split in half and the follow-up uncomments them. It also felt right to leave it in because I had to rename the kubernetes errors package to kerrors because of it.

ready++
g.By("waiting for the deployment to finish")
rc1, err = waitForRCModification(oc, namespace, rc1.Name,
deploymentChangeTimeout+time.Duration(dc.Spec.MinReadySeconds)*10*time.Second,
Reviewer (Contributor) commented:
For readability, put it on its own line.
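
That is, something along these lines (a sketch of one way to do it; the remaining arguments to waitForRCModification are unchanged and elided here):

minReadyTimeout := time.Duration(dc.Spec.MinReadySeconds) * 10 * time.Second
rc1, err = waitForRCModification(oc, namespace, rc1.Name,
    deploymentChangeTimeout+minReadyTimeout,
    // ...remaining arguments as before...
)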

@tnozicka (Contributor, Author) commented Jun 29, 2017:

@Kargakis I can't test this behaviour without a WATCH. It already revealed a major bug with a race between the RC controller and the deployer pod; it can't be tested without it. Polling is a poor fit for testing this and causes errors to go undiscovered.

@Kargakis Can we really hit that bug in tests? I thought it was closed because of a long timeout or too many events.

@0xmichalis (Contributor) commented:
> @Kargakis I can't test this behaviour without a WATCH. It already revealed a major bug with a race between the RC controller and the deployer pod; it can't be tested without it. Polling is a poor fit for testing this and causes errors to go undiscovered.

I am not sure this holds true. The bug that you uncovered regarding the RC controller and the deployer pod is unrelated to the poll vs watch argument. Here, you want to observe that the RC exists. Having a Poll that exits as soon as you get a 200 back has nothing to do with resourceVersion checks?

> @Kargakis Can we really hit that bug in tests? I thought it was closed because of a long timeout or too many events.

We have hit problems with watch.Until already upstream:
kubernetes/kubernetes#47756
kubernetes/kubernetes#41112

@tnozicka (Contributor, Author) commented:
> Here, you want to observe that the RC exists. Having a Poll that exits as soon as you get a 200 back has nothing to do with resourceVersion checks?

@Kargakis but the watches below do - you would miss that state when it's broken if you poll instead of watch. To be concrete: here you would miss the state where availableReplicas==0 and deployment-phase==Complete, because with polling you would only see the reconciled availableReplicas==2 and deployment-phase==Complete.
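
To make that transient state concrete, a condition for the watch might look roughly like this (a sketch, not the PR's test code; the deployment-phase annotation key and the internal API types are assumptions based on origin's conventions at the time):

package deployments

import (
    "fmt"

    "k8s.io/apimachinery/pkg/watch"
    kapi "k8s.io/kubernetes/pkg/api"
)

// brokenOrDone fails fast on the buggy transient state (phase Complete while
// zero replicas are available) and otherwise waits for the reconciled state.
// Polling would typically only ever observe the reconciled state.
func brokenOrDone(event watch.Event) (bool, error) {
    rc, ok := event.Object.(*kapi.ReplicationController)
    if !ok {
        return false, fmt.Errorf("unexpected object type %T", event.Object)
    }
    phase := rc.Annotations["openshift.io/deployment.phase"] // assumed annotation key
    if phase == "Complete" && rc.Status.AvailableReplicas == 0 {
        // exactly the broken state the test needs to be able to observe
        return false, fmt.Errorf("RC %s/%s is Complete with 0 available replicas", rc.Namespace, rc.Name)
    }
    return phase == "Complete" && rc.Status.AvailableReplicas == rc.Spec.Replicas, nil
}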

@tnozicka (Contributor, Author) commented:
(just removed a debug *10 seconds multiplier from the watch timeout)

@tnozicka (Contributor, Author) commented:
I believe we should not sacrifice correctness for a potential flake. I have tested it in a loop and haven't seen it. From what I have seen upstream, it happens with longer timeouts. These tests have the longest timeout of about 1m30s and a slow rate of events; upstream mentions 5 minutes (kubernetes/kubernetes#31345).

@Kargakis here is the deal: if it flakes, I will go and fix watch.Until myself!

It should also be simple to hot-fix if needed: have a wrapper that enforces the timeout, inspects events to track the last resourceVersion, and restarts the watch from there with the remaining timeout. I think @soltysh was suggesting a similar fix.

I don't want to fix watch as part of this PR so it can make code freeze, and due to the low timeout it shouldn't be affected.
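
A rough sketch of the wrapper idea described above (not part of this PR; newWatcher is a hypothetical factory that opens a watch at a given resourceVersion, and error handling is simplified):

package deployments

import (
    "time"

    "k8s.io/apimachinery/pkg/api/meta"
    "k8s.io/apimachinery/pkg/watch"
)

// untilWithRestart runs watch.Until with the remaining timeout and, if the
// watch ends early, re-establishes it from the last observed resourceVersion.
func untilWithRestart(timeout time.Duration, newWatcher func(resourceVersion string) (watch.Interface, error), condition watch.ConditionFunc) (*watch.Event, error) {
    deadline := time.Now().Add(timeout)
    lastRV := ""
    for {
        w, err := newWatcher(lastRV)
        if err != nil {
            return nil, err
        }
        // wrap the condition so the last seen resourceVersion is remembered
        tracking := func(event watch.Event) (bool, error) {
            if accessor, aerr := meta.Accessor(event.Object); aerr == nil {
                lastRV = accessor.GetResourceVersion()
            }
            return condition(event)
        }
        ev, err := watch.Until(deadline.Sub(time.Now()), w, tracking)
        if err == nil || deadline.Sub(time.Now()) <= 0 {
            // success, or the overall timeout has been spent
            return ev, err
        }
        // The watch ended early (e.g. closed by the apiserver): restart from
        // lastRV with whatever time is left. A real implementation would also
        // distinguish condition errors, which should not be retried.
    }
}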

@0xmichalis (Contributor) commented:

> I believe we should not sacrifice correctness for a potential flake. I have tested it in a loop and haven't seen it. From what I have seen upstream, it happens with longer timeouts. These tests have the longest timeout of about 1m30s and a slow rate of events; upstream mentions 5 minutes (kubernetes/kubernetes#31345).

kubernetes/kubernetes#47697 (comment) talks about 2s but I won't bikeshed on this anymore ... You own the test now :)

@tnozicka (Contributor, Author) commented Jul 3, 2017:

OK, I was kind of expecting that. Because @bparees removed the kind/test.flake label in #14043 (comment), I can't re-trigger the test now :(

I will rebase and push (no changes)

@tnozicka force-pushed the fix-minreadyseconds-for-dc branch from bc2690d to 953dac8 on July 3, 2017 07:51
@mfojtik (Contributor) left a review:

LGTM

@mfojtik (Contributor) commented Jul 3, 2017:

@stevekuznetsov seems like the node failed to restart but we don't know why:

Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See "systemctl status origin-node.service" and "journalctl -xe" for details.

re-[test]

@openshift-bot (Contributor) commented:
Evaluated for origin test up to 953dac8

@openshift-bot (Contributor) commented:
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2923/) (Base Commit: fc34104) (PR Branch Commit: 953dac8)

@mfojtik (Contributor) commented Jul 3, 2017:

[merge][severity:blocker]

@stevekuznetsov (Contributor) commented:
@mfojtik as per the e-mail out to aos-devel -- we capture the node logs. Look under artifacts/journals/origin-node.journal.

@mfojtik (Contributor) commented Jul 3, 2017:

this was GCE, so no fun :/

@tnozicka (Contributor, Author) commented Jul 3, 2017:

GCE Flake #15029
re-[merge]

@wanghaoran1988 (Member) commented:
/cc @zhouying7780

@mfojtik (Contributor) commented Jul 4, 2017:

[merge]

@openshift-bot (Contributor) commented:
Evaluated for origin merge up to 953dac8

@openshift-bot (Contributor) commented Jul 4, 2017:

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1223/) (Base Commit: ecd7763) (PR Branch Commit: 953dac8) (Extended Tests: blocker) (Image: devenv-rhel7_6420)

@openshift-bot merged commit 3cc73c6 into openshift:master Jul 4, 2017
openshift-merge-robot added a commit that referenced this pull request Aug 29, 2017
Automatic merge from submit-queue

Fix minReadySeconds for DC

Follow up to: #14936 (needs to be merged first)

Makes AvailableReplicas work with MinReadySeconds set.
Removes the obsolete counting of pods, which overlapped with AvailableReplicas from the RC. This was causing the RC to end up in a state where AvailableReplicas=0 and deployment-phase=Complete with about 50% probability. This state lasts for a very short time.

[Outdated] At this time ignore the first 2 commits which are part of #14936  (because that isn't merged yet)
@tnozicka deleted the fix-minreadyseconds-for-dc branch November 30, 2017 11:15