flake: deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied #16025
On first look:
@smarterclayton Seems like we are out of memory - can we increase that for our jobs? |
That can happen for other reasons too. What's in the master logs? We can't really increase memory - nothing our tests are doing should be hitting 8 GB of total use. Maybe someone should trace a run. |
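If someone does trace a run, a heap profile is the usual starting point. A minimal Go sketch for pulling one, assuming the master exposes the standard net/http/pprof endpoints on its secure port and that a suitably privileged token is available; the URL and token below are placeholders, not values from this job:

```go
package main

// Sketch: fetch a heap profile from a pprof endpoint so it can be inspected
// with `go tool pprof heap.pprof`. Whether the master actually exposes
// /debug/pprof depends on how it was started; the URL and token are placeholders.

import (
	"crypto/tls"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	url := "https://master.example.com:8443/debug/pprof/heap" // hypothetical endpoint
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("TOKEN")) // assumes a valid token in $TOKEN

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // acceptable only for a throwaway test cluster
	}}
	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```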
@openshift/sig-continuous-infrastructure can someone please trace this? It is happening quite often lately. @smarterclayton any chance that your parallelism PR raised the memory expectations? |
Parallelism was integration, this is e2e. The master logs are in the artifacts dir. |
Is it possible that the test infra doesn't build the deployer image and uses the old one? I can see in the logs that it uses the old one. If that's the case I might have been derailed in my previous investigation. Actually, trying this with the 4-week-old deployer image gives me a flakiness of 1/3; using the one built from the same commit, 0/16 and counting |
So it seems to be true:
this isn't built from current master; that's why this test is flaky. It needs the updated 'openshift/origin-deployer' image. Why aren't we building the images and using them for the tests, since they are part of the codebase as well? @stevekuznetsov @Kargakis Is this a CI bug? |
Was there a bug fix in the image?
Is install_update failing as well?
|
The GCE test cannot and will not build images. How did the change to the image merge without GCE passing? |
From the most recent release job:
|
@smarterclayton we fixed how minReadySeconds is counted in the deployer pod a few days ago. The old deployer pod and the new one count it differently, so depending on timing and cluster utilization the results may or may not match. (They mostly do match.) |
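For context, this is roughly the availability check that minReadySeconds gates: a pod only counts as available once it has been Ready for at least minReadySeconds. A minimal, self-contained Go sketch; the names and fields are illustrative, not the actual deployer or controller code:

```go
package main

// Illustrative availability check gated by minReadySeconds: a pod is only
// "available" once its Ready condition has held for at least minReadySeconds.

import (
	"fmt"
	"time"
)

// isAvailable mimics the rule; readySince stands in for the LastTransitionTime
// of the pod's Ready condition.
func isAvailable(ready bool, readySince time.Time, minReadySeconds int32, now time.Time) bool {
	if !ready {
		return false
	}
	if minReadySeconds == 0 {
		return true
	}
	return readySince.Add(time.Duration(minReadySeconds) * time.Second).Before(now)
}

func main() {
	readySince := time.Now().Add(-3 * time.Second)
	fmt.Println(isAvailable(true, readySince, 5, time.Now())) // false: not ready long enough yet
	fmt.Println(isAvailable(true, readySince, 2, time.Now())) // true: ready past minReadySeconds
}
```

The flakiness comes from the old deployer and the RC controller each evaluating a check like this on their own, so under load their answers can briefly disagree.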
Do we know when this occurred for the first time? What changes did we make to deployments recently that would cause 3 tests to flake heavily? |
@smarterclayton looks like the release job is not tagging the images as 3.7 or whatever is necessary -- can you update that? |
@tnozicka thx, will see if this improves when we get the updated images out. |
BTW, this is the PR fixing the deployer: #14954 |
Not doing that for now |
Will this cause a problem when a user updates to master 3.7 while deployer
3.6 pods are running? If so this was not a safe change to make.
|
@smarterclayton I don't think it will. The deployer used to count minReadySeconds via pod availability itself. Over time the RC got minReadySeconds as well, counted separately by the controller. Then, when you deployed a DC, there was a brief moment when the RC wasn't available due to minReadySeconds but the DC was. We have unified it so the DC now uses the RC's minReadySeconds. The reason the test won't tolerate the old deployer is that it precisely checks the order of events and validates every state, to make sure it is working now, because it sometimes wasn't before.
Shouldn't cause any issues. (It just won't fix the issue described above without updating the deployer. That said, we should make sure QA tests it as well.) |
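To illustrate the ordering the test asserts after that unification, here is a small self-contained sketch of the completion rule: the DC must not transition to Complete before the RC, which now owns the minReadySeconds accounting, reports all replicas available. The names are illustrative, not the actual origin code:

```go
package main

// Illustrative completion rule: a deployment is only complete once the
// underlying RC reports every desired replica as available, i.e. after
// minReadySeconds has elapsed for each pod.

import "fmt"

type rcStatus struct {
	Replicas          int32
	AvailableReplicas int32
}

// deploymentComplete mirrors the rule the flaking test verifies: the DC must
// not be marked Complete while the RC's availability is still unsatisfied.
func deploymentComplete(desired int32, rc rcStatus) bool {
	return rc.Replicas == desired && rc.AvailableReplicas == desired
}

func main() {
	fmt.Println(deploymentComplete(2, rcStatus{Replicas: 2, AvailableReplicas: 1})) // false: still waiting on minReadySeconds
	fmt.Println(deploymentComplete(2, rcStatus{Replicas: 2, AvailableReplicas: 2})) // true: safe to mark Complete
}
```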
Looks like it is still failing even with the latest image
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16046/test_pull_request_origin_extended_conformance_gce/6910/#-deploymentconfigs-with-minimum-ready-seconds-set-conformance-should-not-transition-the-deployment-to-complete-before-satisfied
|
I can't get to the master logs because Jenkins has been down with a 503 for several hours now. (We don't have master logs on appspot.com AFAICS, only in the S3 artifacts in Jenkins.) From the build log this is a different (timeout) issue, which hopefully won't manifest that frequently, although I should fix it once I can see the master logs and can reproduce it. |
Master logs are all on GCS and are accessible via the artifacts link on Gubernator.
|
…nds-test-more-tolerant-to-infra Automatic merge from submit-queue: Make deployments minReadySeconds test more tolerant to infra. Fixes #16025. Will run several test runs here to see that the flake isn't occurring anymore.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16022/test_pull_request_origin_extended_conformance_gce/6827/#-deploymentconfigs-with-minimum-ready-seconds-set-conformance-should-not-transition-the-deployment-to-complete-before-satisfied