Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/controller: start informers before populating store in tests #457

Merged
merged 1 commit into from
Feb 20, 2019

Conversation

runcom
Copy link
Member

@runcom runcom commented Feb 19, 2019

Signed-off-by: Antonio Murdaca runcom@linux.com

- What I did

The indexer in the informer uses a store which manages object using a read-write lock, the flakes come from the fact that if we don't sync up the store before adding objects, the listing of objects can race against an add (verified this by adding a bunch of debug logs during investigation and I could spot that lister.List can return empty even if we added objects to the indexer before starting the informers themselves).

Fixes #417
Fixes #444
Fixes #451
Fixes #449

Cannot reproduce anymore with this patch (running in a loop since 3 hours now), I'm able to consistently reproduce by running go test -race w/o this patch.

- How to verify it

true && while [ $? -eq 0 ]; do GOCACHE=off go test -race -v ./cmd/... ./pkg/... ./lib/...; done

w/o this patch the command above fails as soon as the flake makes the test error out. With the patch, the above continues perpetually.

- Description for the changelog

Signed-off-by: Antonio Murdaca <runcom@linux.com>
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 19, 2019
@runcom
Copy link
Member Author

runcom commented Feb 19, 2019

CI glitch...

could not wait for build: the build machine-config-server failed after 6m6s with reason PullBuilderImageFailed: Failed pulling builder image.

Pulling image docker-registry.default.svc:5000/ci-op-v6y05...8fe8db228e015b2bd5ec62da99e0505f8fd6dbc7ce842e2153680dc ...

Pulling image registry.svc.ci.openshift.org/openshift/release:golang-1.10 ...
error: build error: failed to pull image: Get https://regi...t canceled (Client.Timeout exceeded while awaiting headers)

/retest

@runcom
Copy link
Member Author

runcom commented Feb 19, 2019

still glitch #457 (comment)

@runcom
Copy link
Member Author

runcom commented Feb 19, 2019

/retest

@runcom
Copy link
Member Author

runcom commented Feb 19, 2019

CI errors apart (not unit), this is fixing the flakes as my tests never failed in more than 5 hours

@runcom
Copy link
Member Author

runcom commented Feb 19, 2019

the glitch is a CI issue

@runcom
Copy link
Member Author

runcom commented Feb 19, 2019

/retest

@runcom
Copy link
Member Author

runcom commented Feb 19, 2019

aws limit

/retest

@cgwalters
Copy link
Member

LGTM, giving for another final review/merge:

/assign kikisdeliveryservice

@rphillips
Copy link
Contributor

lgtm (deferring to @kikisdeliveryservice for review). Nice find

@LorbusChris
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LorbusChris, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LorbusChris
Copy link
Member

/test e2e-aws

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@kikisdeliveryservice
Copy link
Contributor

HAProxy and other known flakes.

/test e2e-aws

@abhinavdahiya
Copy link
Contributor

known flake.

fail [github.com/openshift/origin/test/extended/bootstrap_user/bootstrap_user_login.go:52]: Expected error:
    <*util.ExitError | 0xc422561740>: {
        Cmd: "oc login --config=/tmp/configfile916071508 --namespace=e2e-test-bootstrap-login-cpctm -u kubeadmin -p 5LpcuRoMeuNntNkAusTRa269NuAbQoU9ptn4aloHbp4",
        StdErr: "Error from server (InternalError): Internal error occurred: unexpected response: 400",
        ExitError: {
            ProcessState: {
                pid: 11753,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 268025},
                    Stime: {Sec: 0, Usec: 66749},
                    Maxrss: 94440,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 15028,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 1198,
                    Nivcsw: 27,
                },
            },
            Stderr: nil,
        },
    }
    exit status 1
not to have occurred

openshift/origin#22088 is moving it to flake

@kikisdeliveryservice
Copy link
Contributor

Thanks for the update on that test, @abhinavdahiya !

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Copy link
Member Author

runcom commented Feb 20, 2019

/retest

1 similar comment
@runcom
Copy link
Member Author

runcom commented Feb 20, 2019

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Copy link
Member Author

runcom commented Feb 20, 2019

CI errors tracked in slack

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Copy link
Member Author

runcom commented Feb 20, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit 53923bc into openshift:master Feb 20, 2019
@runcom runcom deleted the flakes-fixes branch February 20, 2019 16:14
runcom added a commit to runcom/machine-config-operator that referenced this pull request Mar 17, 2019
This patch does various things, all related:

- start the informers for controllers before running them to follow what
https://github.com/kubernetes/sample-controller/blob/master/main.go#L64-L71
does and to avoid races already spotted in unit tests
(openshift#457)
- move the clients builder code from cmd/common to a new package just for that
under the new internal/ folder so nobody but us can use that
- move common controllers code under pkg/controller/common to be reused
- create an e2e only clientset, avoiding us to type client version
everytime (eg client.CoreV1()..). This is a need cause if in the future
the api change, we don't play grep for a week replacing old apis...

Signed-off-by: Antonio Murdaca <runcom@linux.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
9 participants