
Fix availability reporting for DC with MinReadySeconds set #14936

Merged: 2 commits merged into openshift:master from tnozicka:fix-minreadyseconds-for-dc on Jul 4, 2017

Conversation

@tnozicka (Contributor) commented Jun 28, 2017:

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1463594

Makes AvailableReplicas work with MinReadySeconds set. Also removes the obsolete counting of pods, which overlapped with AvailableReplicas.

@Kargakis PTAL. I want to be sure there was no gotcha behind not using the RC's MinReadySeconds, other than it possibly not being available at the time this code was written.

cc: @mfojtik
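
For context, the availability rule this change relies on can be summarized with a minimal sketch (not the PR's code; the helper name and parameters are illustrative): a replica only counts as available once it has been ready for at least MinReadySeconds.

package main

import (
    "fmt"
    "time"
)

// isAvailable illustrates the MinReadySeconds rule: a replica is available
// only if it is ready and has stayed ready for at least minReadySeconds.
func isAvailable(ready bool, readySince time.Time, minReadySeconds int32, now time.Time) bool {
    if !ready {
        return false
    }
    if minReadySeconds == 0 {
        return true
    }
    return now.Sub(readySince) >= time.Duration(minReadySeconds)*time.Second
}

func main() {
    readySince := time.Now().Add(-5 * time.Second)
    fmt.Println(isAvailable(true, readySince, 10, time.Now())) // false: ready for only ~5s
    fmt.Println(isAvailable(true, readySince, 3, time.Now()))  // true: ready for more than 3s
}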

@tnozicka added this to the 3.6.0 milestone Jun 28, 2017
@mfojtik (Contributor) commented Jun 28, 2017:

@tnozicka this seems like a pretty big change for 3.6

@0xmichalis (Contributor) commented:
Let's get in the fix that propagates MinReadySeconds to RCs; the rest can be deferred to 3.7 (I am in favor of deferring).

@tnozicka force-pushed the fix-minreadyseconds-for-dc branch 2 times, most recently from 00f6fb5 to de420ff on June 29, 2017 09:44
@tnozicka changed the title from "WIP - Fix minReadySeconds for DC" to "Fix availability reporting for DC with MinReadySeconds set" Jun 29, 2017
@tnozicka (Contributor, Author) commented:
@mfojtik @Kargakis I've forked the scary changes to #14954

[test]

// FIXME: remove when tests are migrated to the new client
// (the old one incorrectly translates nil into an empty array)
dc.Spec.Triggers = append(dc.Spec.Triggers, deployapi.DeploymentTriggerPolicy{Type: deployapi.DeploymentTriggerOnConfigChange})
dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)
Reviewer (Contributor) commented:
missing error check

Author (Contributor) replied:
bad rebase, fixing it. thx. And I guess I know where that check went :)

Author (Contributor) replied:
fixed
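
For reference, the fix presumably adds the usual error assertion right after the Create call, along these lines (a sketch; the exact line is not shown in this thread):

dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)
o.Expect(err).NotTo(o.HaveOccurred())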

dc, err = oc.Client().DeploymentConfigs(namespace).Create(dc)

g.By("verifying the deployment is created")
rcEvent, err := watch.Until(deploymentChangeTimeout, watcher, func(event watch.Event) (bool, error) {
Reviewer (Contributor) commented:
watch.Until is going to flake - use wait.Poll for now

Author (Contributor) replied:
@Kargakis why is it going to flake? If you mean that the watch can fail in such a short interval, I guess a retry would be an option. I can't do polling, because I would lose eventual consistency and end up with a bad resourceVersion, and that would cause a flake for sure.

Reviewer (Contributor) replied:
There is an issue with watch.Until that is yet to be fixed and causes stuff like rollout status to break unexpectedly: kubernetes/kubernetes#40224 cc: @mfojtik

Not sure what you mean re resourceVersion; the log above says "verifying the deployment is created", so how can polling not work in that case?
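
For reference, the polling approach being suggested here would look roughly like this (a sketch; getRC is a hypothetical closure wrapping whatever Get call the test already uses, and timeout stands in for deploymentChangeTimeout):

package deployments

import (
    "time"

    kerrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/util/wait"
)

// waitForRCToExist polls until getRC stops returning NotFound, i.e. until the
// replication controller has been created.
func waitForRCToExist(getRC func() error, timeout time.Duration) error {
    return wait.Poll(time.Second, timeout, func() (bool, error) {
        err := getRC()
        switch {
        case err == nil:
            return true, nil // got a 200 back: the RC exists
        case kerrors.IsNotFound(err):
            return false, nil // not created yet, keep polling
        default:
            return false, err // any other error aborts the poll
        }
    })
}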

return false, nil
}
g.By("verifying that the deployment is still running")
o.Expect(err).NotTo(o.HaveOccurred())
Reviewer (Contributor) commented:
redundant error check

Author (Contributor) replied:
that's where it went :)

@@ -1,6 +1,7 @@
package deployments

import (
//"errors"
Reviewer (Contributor) commented:
Drop it, not needed.

Author (Contributor) replied:
There are also commented-out parts of the code which use it; this is because this PR was split in half and the follow-up uncomments them. It also felt right to leave it in because I had to rename the kubernetes errors package to kerrors because of it.

ready++
g.By("waiting for the deployment to finish")
rc1, err = waitForRCModification(oc, namespace, rc1.Name,
deploymentChangeTimeout+time.Duration(dc.Spec.MinReadySeconds)*10*time.Second,
Reviewer (Contributor) commented:
For readability, put it on its own line.
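
That is, something along these lines (a sketch of one way to do it; the remaining arguments to waitForRCModification are unchanged and elided here):

minReadyTimeout := time.Duration(dc.Spec.MinReadySeconds) * 10 * time.Second
rc1, err = waitForRCModification(oc, namespace, rc1.Name,
    deploymentChangeTimeout+minReadyTimeout,
    // ...remaining arguments as before...
)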

@tnozicka (Contributor, Author) commented Jun 29, 2017:

@Kargakis I can't test this behaviour without a WATCH. It already revealed a major bug with a race between the RC controller and the deployer pod; it can't be tested without it. Polling is a poor fit for testing this and causes errors to go undiscovered.

@Kargakis Can we really hit that bug in tests? I thought it was closed because of a long timeout or too many events.

@0xmichalis (Contributor) commented:
> @Kargakis I can't test this behaviour without a WATCH. It already revealed a major bug with a race between the RC controller and the deployer pod; it can't be tested without it. Polling is a poor fit for testing this and causes errors to go undiscovered.

I am not sure this holds true. The bug that you uncovered regarding the RC controller and the deployer pod is unrelated to the poll vs watch argument. Here, you want to observe that the RC exists. Having a Poll that exits as soon as you get a 200 back has nothing to do with resourceVersion checks?

> @Kargakis Can we really hit that bug in tests? I thought it was closed because of a long timeout or too many events.

We have hit problems with watch.Until already upstream:
kubernetes/kubernetes#47756
kubernetes/kubernetes#41112

@tnozicka (Contributor, Author) commented:
> Here, you want to observe that the RC exists. Having a Poll that exits as soon as you get a 200 back has nothing to do with resourceVersion checks?

@Kargakis but the watches below do - you would miss that state when it's broken if you poll instead of watch. To be concrete: here you would miss the state where availableReplicas==0 and deployment-phase==Complete, because with polling you would only see the reconciled availableReplicas==2 and deployment-phase==Complete.
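
To make that transient state concrete, a condition for the watch might look roughly like this (a sketch, not the PR's test code; the deployment-phase annotation key and the internal API types are assumptions based on origin's conventions at the time):

package deployments

import (
    "fmt"

    "k8s.io/apimachinery/pkg/watch"
    kapi "k8s.io/kubernetes/pkg/api"
)

// brokenOrDone fails fast on the buggy transient state (phase Complete while
// zero replicas are available) and otherwise waits for the reconciled state.
// Polling would typically only ever observe the reconciled state.
func brokenOrDone(event watch.Event) (bool, error) {
    rc, ok := event.Object.(*kapi.ReplicationController)
    if !ok {
        return false, fmt.Errorf("unexpected object type %T", event.Object)
    }
    phase := rc.Annotations["openshift.io/deployment.phase"] // assumed annotation key
    if phase == "Complete" && rc.Status.AvailableReplicas == 0 {
        // exactly the broken state the test needs to be able to observe
        return false, fmt.Errorf("RC %s/%s is Complete with 0 available replicas", rc.Namespace, rc.Name)
    }
    return phase == "Complete" && rc.Status.AvailableReplicas == rc.Spec.Replicas, nil
}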

@tnozicka (Contributor, Author) commented:
(just removed a debug *10 seconds multiplier from the watch timeout)

@tnozicka (Contributor, Author) commented:
I believe we should not sacrifice correctness for a potential flake. I have tested it in a loop and haven't seen it. From what I have seen upstream, it happens with longer timeouts. These tests have the longest timeout of about 1m30s and a slow rate of events; upstream mentions 5 minutes (kubernetes/kubernetes#31345).

@Kargakis here is the deal: if it flakes, I will go and fix watch.Until myself!

It should also be simple to hot-fix if needed: have a wrapper that enforces the timeout, inspects events to track the last resourceVersion, and restarts the watch from there with the remaining timeout. I think @soltysh was suggesting a similar fix.

I don't want to fix watch as part of this PR so it can make code freeze, and due to the low timeout it shouldn't be affected.
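
A rough sketch of the wrapper idea described above (not part of this PR; newWatcher is a hypothetical factory that opens a watch at a given resourceVersion, and error handling is simplified):

package deployments

import (
    "time"

    "k8s.io/apimachinery/pkg/api/meta"
    "k8s.io/apimachinery/pkg/watch"
)

// untilWithRestart runs watch.Until with the remaining timeout and, if the
// watch ends early, re-establishes it from the last observed resourceVersion.
func untilWithRestart(timeout time.Duration, newWatcher func(resourceVersion string) (watch.Interface, error), condition watch.ConditionFunc) (*watch.Event, error) {
    deadline := time.Now().Add(timeout)
    lastRV := ""
    for {
        w, err := newWatcher(lastRV)
        if err != nil {
            return nil, err
        }
        // wrap the condition so the last seen resourceVersion is remembered
        tracking := func(event watch.Event) (bool, error) {
            if accessor, aerr := meta.Accessor(event.Object); aerr == nil {
                lastRV = accessor.GetResourceVersion()
            }
            return condition(event)
        }
        ev, err := watch.Until(deadline.Sub(time.Now()), w, tracking)
        if err == nil || deadline.Sub(time.Now()) <= 0 {
            // success, or the overall timeout has been spent
            return ev, err
        }
        // The watch ended early (e.g. closed by the apiserver): restart from
        // lastRV with whatever time is left. A real implementation would also
        // distinguish condition errors, which should not be retried.
    }
}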

@0xmichalis (Contributor) commented:

> I believe we should not sacrifice correctness for a potential flake. I have tested it in a loop and haven't seen it. From what I have seen upstream, it happens with longer timeouts. These tests have the longest timeout of about 1m30s and a slow rate of events; upstream mentions 5 minutes (kubernetes/kubernetes#31345).

kubernetes/kubernetes#47697 (comment) talks about 2s but I won't bikeshed on this anymore ... You own the test now :)

@tnozicka (Contributor, Author) commented Jul 3, 2017:

OK, I was kind of expecting that. Because @bparees removed the kind/test.flake label in #14043 (comment), I can't re-trigger the test now :(

I will rebase and push (no changes)

@tnozicka force-pushed the fix-minreadyseconds-for-dc branch from bc2690d to 953dac8 on July 3, 2017 07:51
@mfojtik (Contributor) left a review:

LGTM

@mfojtik (Contributor) commented Jul 3, 2017:

@stevekuznetsov seems like the node failed to restart but we don't know why:

Unable to restart service origin-node: Job for origin-node.service failed because the control process exited with error code. See "systemctl status origin-node.service" and "journalctl -xe" for details.

re-[test]

@openshift-bot (Contributor) commented:
Evaluated for origin test up to 953dac8

@openshift-bot (Contributor) commented:
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2923/) (Base Commit: fc34104) (PR Branch Commit: 953dac8)

@mfojtik (Contributor) commented Jul 3, 2017:

[merge][severity:blocker]

@stevekuznetsov (Contributor) commented:
@mfojtik as per the e-mail out to aos-devel -- we capture the node logs. Look under artifacts/journals/origin-node.journal.

@mfojtik (Contributor) commented Jul 3, 2017:

this was GCE, so no fun :/

@tnozicka (Contributor, Author) commented Jul 3, 2017:

GCE Flake #15029
re-[merge]

@wanghaoran1988 (Member) commented:
/cc @zhouying7780

@mfojtik (Contributor) commented Jul 4, 2017:

[merge]

@openshift-bot (Contributor) commented:
Evaluated for origin merge up to 953dac8

@openshift-bot (Contributor) commented Jul 4, 2017:

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1223/) (Base Commit: ecd7763) (PR Branch Commit: 953dac8) (Extended Tests: blocker) (Image: devenv-rhel7_6420)

@openshift-bot merged commit 3cc73c6 into openshift:master Jul 4, 2017
openshift-merge-robot added a commit that referenced this pull request Aug 29, 2017
Automatic merge from submit-queue

Fix minReadySeconds for DC

Follow up to: #14936 (needs to be merged first)

Makes AvailableReplicas work with MinReadySeconds set.
Removes the obsolete counting of pods, which overlapped with AvailableReplicas from the RC. This was causing the RC to end up in a state where AvailableReplicas=0 and deployment-phase=Complete with about 50% probability. This state lasts for a very short time.

[Outdated] At this time ignore the first 2 commits which are part of #14936  (because that isn't merged yet)
@tnozicka deleted the fix-minreadyseconds-for-dc branch November 30, 2017 11:15