Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: more aggressively replica GC learner replicas #41300

Merged

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Oct 3, 2019

This PR fixes a test flake in TestSystemZoneConfig:

    client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14]
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9]

The above flake happens because we set the expectation in the map to a
descriptor which contains a learner which has since been removed.

We shouldn't use a range descriptor which contains learners as the expectation.
To avoid that we return an error in the succeeds soon if we come across a
descriptor which contains learners. This behavior unvealed another issue,
we are way too conservative with replica GC for learners. Most of the time
when learners are removed they hear about their own removal, but if they don't
we won't consider the Replica for removal for 10 days! This commit changes
the replica gc queue behavior to treat learners line candidates.

Fixes #40980.

Release Justification: bug fixes and low-risk updates to new functionality.

Release note: None

@ajwerner ajwerner requested a review from tbg October 3, 2019 19:03
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: and thanks for looking into this!

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/replica_gc_queue.go, line 39 at r1 (raw file):

	// A Replica is suspected to have been removed if either it is in the
	// candidate Raft state (which is a typical sign of having been removed
	// from the group) or it is a leaner which has been removed but never heard

learner

This PR fixes a test flake in TestSystemZoneConfig:

```
    client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14]
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9]
```

The above flake happens because we set the expectation in the map to a
descriptor which contains a learner which has since been removed.

We shouldn't use a range descriptor which contains learners as the expectation.
To avoid that we return an error in the succeeds soon if we come across a
descriptor which contains learners. This behavior unvealed another issue,
we are way too conservative with replica GC for learners. Most of the time
when learners are removed they hear about their own removal, but if they don't
we won't consider the Replica for removal for 10 days! This commit changes
the replica gc queue behavior to treat learners line candidates.

Fixes cockroachdb#40980.

Release Justification: bug fixes and low-risk updates to new functionality.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/fix-TestSystemZoneConfigFlake branch from 7376180 to 2e9ce1c Compare October 4, 2019 13:58
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r=tbg

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)


pkg/storage/replica_gc_queue.go, line 39 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

learner

Done.

@craig
Copy link
Contributor

craig bot commented Oct 4, 2019

Build failed

@ajwerner
Copy link
Contributor Author

ajwerner commented Oct 4, 2019

Flaked on TestComposeGSS.

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 4, 2019

Build failed

@ajwerner
Copy link
Contributor Author

ajwerner commented Oct 4, 2019

bors r+

craig bot pushed a commit that referenced this pull request Oct 4, 2019
41300: storage: more aggressively replica GC learner replicas r=ajwerner a=ajwerner

This PR fixes a test flake in TestSystemZoneConfig:

```
    client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14]
        r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9]
```

The above flake happens because we set the expectation in the map to a
descriptor which contains a learner which has since been removed.

We shouldn't use a range descriptor which contains learners as the expectation.
To avoid that we return an error in the succeeds soon if we come across a
descriptor which contains learners. This behavior unvealed another issue,
we are way too conservative with replica GC for learners. Most of the time
when learners are removed they hear about their own removal, but if they don't
we won't consider the Replica for removal for 10 days! This commit changes
the replica gc queue behavior to treat learners line candidates.

Fixes #40980.

Release Justification: bug fixes and low-risk updates to new functionality.

Release note: None

41308: storage: remove error from Replica.applyTimestampCache() r=ajwerner a=ajwerner

Stumbled upon a function with an error in its return signature
that never returns an error. Better to remove it and the stale
comment that goes with it. The removal of the code paths which
could have returned an error occurred in #33396.

Release justification: Low risk, does not change logic. Could also
hold off.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Oct 4, 2019

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

teamcity: failed test: TestSystemZoneConfigs
3 participants