
Reduce raciness in test #1996

Merged
merged 1 commit into scionproto:master from the pubStabilizeTest branch
Oct 18, 2018

Conversation

lukedirtwalker
Collaborator

@lukedirtwalker lukedirtwalker commented Oct 17, 2018

fetchSegsFromDBRetry selects on ctx.Done() and time.After().
If the setup/calling of ctx.Done() takes longer than the duration we pass to time.After,
both channels can be ready at the same time, and the test might then fail.

By increasing the timeout in the test this should no longer be a problem.



Contributor

@sgmonroy sgmonroy left a comment


A bit more context: this is the result of investigating a CI failure on an unrelated change.
CI seems to be heavily resource-constrained, so race conditions show up more often there.

func (h *baseHandler) fetchSegsFromDBRetry(ctx context.Context,
        params *query.Params) ([]*seg.PathSegment, error) {

        for {
                upSegs, err := h.fetchSegsFromDB(ctx, params)
                if err != nil || len(upSegs) > 0 {
                        return upSegs, err
                }
                select {
                case <-ctx.Done():
                        return nil, ctx.Err()
                case <-time.After(h.retryInt):
                        // retry
                }
        }
}

We reasoned that each evaluation of the select creates a new timer and channel, and that the timer fires independently of the current goroutine. It is therefore a race between the timer and the current goroutine. If the current goroutine is not scheduled for long enough to evaluate the select cases, it is theoretically possible that the timer has already expired, in which case both cases are ready and one of them is chosen at random.
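To illustrate the last point, here is a minimal sketch (not code from this PR) of a select in which both cases are already ready. The function name `countSelectOutcomes` is hypothetical, and a pre-filled buffered channel stands in for a `time.After` channel whose timer has already fired, so the demo does not depend on timing.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// countSelectOutcomes runs a select in which both cases are already ready
// and reports how often each case was chosen.
func countSelectOutcomes(trials int) (ctxWins, timerWins int) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // ctx.Done() is closed from here on, so that case is always ready

	for i := 0; i < trials; i++ {
		// Assumption for the demo: this buffered channel stands in for a
		// time.After channel whose timer has already expired.
		fired := make(chan time.Time, 1)
		fired <- time.Now()

		select {
		case <-ctx.Done():
			ctxWins++
		case <-fired:
			timerWins++
		}
	}
	return ctxWins, timerWins
}

func main() {
	ctxWins, timerWins := countSelectOutcomes(1000)
	fmt.Printf("ctx.Done() chosen %d times, timer chosen %d times\n", ctxWins, timerWins)
}
```

Both counters end up non-zero, because Go picks among ready cases pseudo-randomly. In fetchSegsFromDBRetry this means that a cancelled context does not guarantee an immediate return if the retry timer has also fired by the time the select runs.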

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @kormat)

Contributor

@sgmonroy sgmonroy left a comment


:lgtm:

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @kormat)

Contributor

@kormat kormat left a comment


:lgtm:

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @kormat)

@lukedirtwalker lukedirtwalker merged commit a83dfac into scionproto:master Oct 18, 2018
@lukedirtwalker lukedirtwalker deleted the pubStabilizeTest branch October 18, 2018 10:22