Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: rebalance-leases-by-load failed #31303

Closed
cockroach-teamcity opened this issue Oct 12, 2018 · 2 comments
Closed

roachtest: rebalance-leases-by-load failed #31303

cockroach-teamcity opened this issue Oct 12, 2018 · 2 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/62c976dafc7f48f0a52142310a8fdc42908c30a5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=rebalance-leases-by-load PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=960824&tab=buildLog

The test failed on master:
	test.go:575,rebalance_load.go:109,rebalance_load.go:124: timed out before leases were evenly spread

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Oct 12, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 12, 2018
@a-robinson
Copy link
Contributor

a-robinson commented Oct 12, 2018

This failure was caused by a combination of 3 factors:

  1. One of the range leaseholder's transferred its lease via the replicate queue to improve lease count balance just before the store rebalancers ran. This meant that the former leaseholder didn't count it anymore, but the new leaseholder didn't yet know how much qps it's responsible for and also still thought the other store had that qps (because gossip hadn't propagated the former leaseholder's new, reduced qps yet). Thus the new leaseholder didn't think it was overfull (even though it was) and didn't transfer any leases. storage: New leaseholder doesn't know range qps for some time #31320
    2a. During the next run of the StoreRebalancer on the overfull store, it hit storage: Load-based rebalancing should be more willing to make certain incremental improvements #31135 for one of its replicas, preventing it from transferring it.
    2b. ...and for its other replica it didn't transfer it because it thought that transferring it would bring itself below the minimum qps threshold. This was because the store's total qps was less than the sum of its hot replicas, due to the measurements of the store-level qps and the replicas' qps being taken at different times.
  2. The StoreRebalancer only got to run twice during the test due to the combination of the test's short running time. This was intentional, but I'm not sure how wise it is given that we don't make guarantees about the quickness of load-based rebalancing's decisions. Also, the recent change to add jitter (storage: Jitter the StoreRebalancer loop's timing #31227) makes it possible for the delay between runs to be more than a minute, meaning in unfortunate scenarios we may only get one run in during the test.

Problem 2b is a mistake that wasn't supposed to happen -- we're recomputing the qps instead of using the cached one from when the store's total qps was last computed, and that's not intentional. I'll fix that.

Problem 3 is also worth making a bit more lenient, especially since #31289 will make the test terminate as soon as it succeeds instead of always taking the full duration.

@a-robinson
Copy link
Contributor

It's also very possible that we should reduce the time between StoreRebalancer runs, but one minute was chosen initially to line up with the frequency of gossiping the local StoreDescriptor (and thus recomputing each replica's qps).

a-robinson added a commit to a-robinson/cockroach that referenced this issue Oct 12, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Oct 12, 2018
It's better to use the same measurement for everything, especially
because it's the one that was used to compute the store's total QPS.

Fixes cockroachdb#31303

Release note: None
craig bot pushed a commit that referenced this issue Oct 12, 2018
31323: roachtest: Slightly increase timeout in rebalance-leases-by-load r=a-robinson a=a-robinson

Touches #31303

Release note: None

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
craig bot pushed a commit that referenced this issue Oct 12, 2018
31324: storage: Avoid using different qps measurements in StoreRebalancer r=a-robinson a=a-robinson

It's better to use the same measurement for everything, especially
because it's the one that was used to compute the store's total QPS.

Fixes #31303

Release note: None

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
@craig craig bot closed this as completed in #31324 Oct 12, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Oct 15, 2018
It's better to use the same measurement for everything, especially
because it's the one that was used to compute the store's total QPS.

Fixes cockroachdb#31303

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

2 participants