Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky unit tests #2904

Merged
merged 2 commits into from
Dec 6, 2022

Conversation

cblecker
Copy link
Contributor

@cblecker cblecker commented Dec 4, 2022

Description of the change:
This fixes two flaky unit tests TestResolver and TestUpdates.

First, TestResolver, specifically the FailForwardEnabled/3EntryReplacementChain/ReplacementChainBroken/NotSatisfiable test case.

This test was changed in #2788 and currently references one of the three failed CSVs in the test namespace.

However, when you run this test multiple times, the resolver cache is sometimes handing back a different CSV (the following was a purposefully broken match in order to always print the test result):

$ go test -count=10 -run TestResolver ./pkg/controller/registry/resolver | grep 'failed to populate resolver cache from source'
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v2 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v2 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v2 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"
                Error:          "failed to populate resolver cache from source @existing/catsrc-namespace: csv catsrc-namespace/a.v1 in phase Failed instead of Replacing" does not contain "failed to populate resolver cache from source @existing/catsrc-namespace: csv TEST"

Note how sometimes the resolver is returning catsrc-namespace/a.v1 and sometimes it's catsrc-namespace/a.v2.

As I understand this test, the specific CSV doesn't matter for the error as all three are in the CSVPhaseFailed state. Therefore, my proposed fix removes the CSV name from the error match, so that it will match on any of the three CSVs. I believe this was the original intent of the test, looking at what it was prior to #2788 (noting that the error matching was split in two, based on the same omission of the CSV name).

Second, TestUpdates. In this piece of code:

for current.Status.Phase != e.whenIn.phase {
fmt.Printf("waiting for (when) %s to be %s\n", e.whenIn.name, e.whenIn.phase)
csvsToSync = syncCSVs(csvsToSync, deletedCSVs(e.shouldBe))
current = csvsToSync[e.whenIn.name]
}

The for loop will keep hammering the fake operator with op.syncClusterServiceVersion(csv) and Get calls as fast as it possibly can. However, this is creating a race condition in the *RaceFreeFakeWatcher that is watching the fake OperatorGroup. Basically the watch channel is filling up (default watch channel length is 100 events, and if it fills up and closes, the go routine panics) more quickly than the operator can drain it.

The proposed fix here creates a sleep of 1ms between instances of this for loop. It only slows the test down negligibly, but it's enough to help the watch channel drain faster than it fills. In a real world Kubernetes API server, responses aren't going to be that fast anyways.

Motivation for the change:
Fix flaky unit tests, because they are the worst.

Architectural changes:

Testing remarks:

With fix, ran test 100 times to verify flake is gone:

$ go test -count=100 -run TestResolver ./pkg/controller/registry/resolver 
ok      github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver       391.615s
$ go test -count=100 -run TestUpdates ./pkg/controller/operators/olm
ok      github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/olm   13.768s

This compared to the current HEAD, which this test fails somewhere between 10-20% of the time.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@openshift-ci
Copy link

openshift-ci bot commented Dec 4, 2022

Hi @cblecker. Thanks for your PR.

I'm waiting for a operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 4, 2022
@cblecker
Copy link
Contributor Author

cblecker commented Dec 4, 2022

/cc @exdx @grokspawn
This is the flake that I ran into on #2903 :)

@openshift-ci openshift-ci bot requested review from exdx and grokspawn December 4, 2022 22:34
@cblecker cblecker changed the title Fix flaky TestResolver unit test Fix flaky unit tests Dec 5, 2022
Signed-off-by: Christoph Blecker <cblecker@redhat.com>
Signed-off-by: Christoph Blecker <cblecker@redhat.com>
@grokspawn
Copy link
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 5, 2022
Copy link
Contributor

@ankitathomas ankitathomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 5, 2022
@grokspawn grokspawn added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 6, 2022
@openshift-ci
Copy link

openshift-ci bot commented Dec 6, 2022

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: ankitathomas, cblecker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 4da774f into operator-framework:master Dec 6, 2022
@cblecker cblecker deleted the fix-flake branch December 6, 2022 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants