roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed #71244

Closed
cockroach-teamcity opened this issue Oct 6, 2021 · 9 comments · Fixed by #75739
Labels: branch-master (Failures and bugs on the master branch), C-test-failure (Broken test, automatically or manually discovered), O-roachtest, O-robot (Originated from a bot), skipped-test, T-kv (KV Team)

Comments

@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ aef9aa08e12d50ca6401440cac5b57d5a5d6b720:

The test failed on branch=master, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

/cc @cockroachdb/kv-triage


@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 6, 2021
@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on master @ cc6296c24ddb048215dabe5cc41339f306db4f41:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:752,follower_reads.go:364,follower_reads.go:71,test_runner.go:777: too many intervals with more than 2 nodes with low follower read ratios: 5 intervals > 4 threshold. Bad intervals:
		interval 09:57:00-09:57:10: n1 ratio: 0.850 n2 ratio: 0.000 n3 ratio: 0.000 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 09:57:20-09:57:30: n1 ratio: 0.810 n2 ratio: 0.000 n3 ratio: 0.000 n4 ratio: 0.997 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 09:58:30-09:58:40: n1 ratio: 0.880 n2 ratio: 0.000 n3 ratio: 0.000 n4 ratio: 1.000 n5 ratio: 0.997 n6 ratio: 1.000 
		interval 09:58:40-09:58:50: n1 ratio: 0.000 n2 ratio: 0.607 n3 ratio: 0.193 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 09:59:10-09:59:20: n1 ratio: 0.000 n2 ratio: 1.000 n3 ratio: 0.343 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 0.740 
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage


@AlexTalks AlexTalks removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Oct 7, 2021
@AlexTalks

AlexTalks commented Oct 7, 2021

tl;dr: With respect to the first failure here, what's happening is that the ranges containing timeseries data are located entirely on the dead nodes 1-3, which indicates we do not enforce any constraints to keep the timeseries data on nodes that the test doesn't intentionally kill. This causes the test to hang when querying timeseries data.

Longer:
Specifically on the first failure here, it looks like the test hung on sending an HTTP request to gather timeseries data: https://gist.github.com/AlexTalks/a155206d20b005576c5b7d9c6598c0cd#file-71244_logs-txt-L3

It looks like this HTTP request doesn't have a timeout; adding one could be useful:

var response tspb.TimeSeriesQueryResponse
if err := httputil.PostJSON(http.Client{}, url, &request, &response); err != nil {
	t.Fatal(err)
}
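
For reference, a minimal sketch of one way to bound this call, assuming the httputil.PostJSON signature shown above plus a time import; the 2-minute value is an arbitrary illustration, not an existing setting:

// Sketch: give the HTTP client a timeout so a request to a node that is
// wedged (or intentionally killed by the test) fails instead of hanging
// the whole test. The timeout value here is illustrative only.
var response tspb.TimeSeriesQueryResponse
client := http.Client{Timeout: 2 * time.Minute}
if err := httputil.PostJSON(client, url, &request, &response); err != nil {
	t.Fatal(err)
}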

Node 4 received the HTTP request and hung waiting on a TimeSeries.Query gRPC request: https://gist.github.com/AlexTalks/a155206d20b005576c5b7d9c6598c0cd#file-71244_logs-txt-L35

This gRPC request was waiting on a channel for worker output: https://gist.github.com/AlexTalks/a155206d20b005576c5b7d9c6598c0cd#file-71244_logs-txt-L81

We can see there are 6 worker goroutines all waiting on reading timeseries data (waiting on sendPartialBatch): https://gist.github.com/AlexTalks/a155206d20b005576c5b7d9c6598c0cd#file-71244_logs-txt-L109

Tagging @aayushshah15, who is already looking into the follower-reads test.

@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on master @ 7e4ba61845bb47cc2d7146d1bbd70fd53eff5457:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:647,follower_reads.go:372,follower_reads.go:71,test_runner.go:777: 6 latency values ([25.821178ms 25.100282ms 25.821178ms 25.493498ms 25.362426ms 26.804218ms]) are above target latency 25ms, 4 permitted
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage


aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jan 6, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jan 12, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jan 15, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jan 15, 2022
craig bot pushed a commit that referenced this issue Jan 29, 2022
72296: kvserver: rebalance ranges to minimize QPS delta among stores  r=aayushshah15 a=aayushshah15

kvserver: rebalance ranges to minimize QPS delta among stores

This commit fixes the regression(s) introduced by
#65379 where we observed replica
thrashing in various workloads (#70396 and #71244).

The following is a description of the differences between the QPS based
rebalancing scheme used in the previous implementation of the store rebalancer
(release-21.2 and before) and the "new" implementation (22.1 and beyond).

**lease rebalancing**
***release 21.2 and before***
QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level
average QPS and computes underfull and overfull thresholds based off of this
average. For each range that the local store has a lease for, the store
rebalancer goroutine checks whether transferring said range's lease away will
bring the local store's QPS below the underfull threshold. If so, it ignores
the range and moves on to the next one. Otherwise, it iterates through the
stores of all the non-leaseholder voting replicas (in ascending order of their
QPS) and checks whether it would be reasonable to transfer the lease away to
such a store. It ensures that the receiving store would not become overfull
after the lease transfer. It checks that the receiving store doesn't have a
replica that's lagging behind the current leaseholder. It checks that the
receiving store is not in violation of lease preferences. Finally, it ensures
that the lease is not on the local store because of access locality
considerations (i.e. because of follow-the-workload).

All of this was bespoke logic that lived in the store rebalancer (using none of
the Allocator's machinery).

***master and this commit***
In #65379, we moved this decision making into the Allocator by adding a new
mode in `Allocator.TransferLeaseTarget` that tries to determine whether
transferring the lease to another voting replica would reduce the qps delta
between the hottest and the coldest stores in the replica set. This commit adds
some padding to this logic by ensuring that the qps difference between the
store relinquishing the lease and the store receiving the lease is at least
200qps. Furthermore, it ensures that the store receiving the lease won't become
significantly hotter than the current leaseholder.
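
For illustration only, a minimal sketch of what those two guards amount to; the identifiers and the exact "don't flip the imbalance" rule are assumptions, and only the 200 QPS minimum gap comes from the text above:

```go
// Illustrative sketch, not the Allocator's code.
const minQPSDelta = 200.0 // minimum QPS gap quoted above

// wouldTransferLease reports whether moving a lease that accounts for
// leaseQPS from a store serving srcQPS to a candidate serving dstQPS
// passes the guards described above.
func wouldTransferLease(srcQPS, dstQPS, leaseQPS float64) bool {
	// Guard 1: the source must be meaningfully hotter than the candidate.
	if srcQPS-dstQPS < minQPSDelta {
		return false
	}
	// Guard 2: the candidate must not end up hotter than the source is now,
	// which would just flip the imbalance and invite thrashing.
	if dstQPS+leaseQPS > srcQPS {
		return false
	}
	return true
}
```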

**replica rebalancing**
***release 21.2 and before***
QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing
logic. We first compute a cluster level QPS average, overfull and underfull
thresholds. Based on these thresholds we try to move replicas away from
overfull stores and onto stores that are underfull, all while ensuring that the
receiving stores would not become overfull after the rebalance. A critical
assumption that the store rebalancer made (and still does, in the approach
implemented by this commit) is that follower replicas serve the same traffic as
the leaseholder.

***master and this commit***
The approach implemented by #65379 and refined by this commit leverages
machinery in the Allocator that makes rebalancing decisions that converge
load-based statistics within each equivalence class. Previously, this machinery was only
used for range count based replica rebalancing (performed by the
`replicateQueue`) but not for qps-based rebalancing. This commit implements a
similar approach to what we do now for lease rebalancing, which is to determine
whether a rebalance action would reduce the qps delta between the hottest and
the coldest store in the equivalence class. This commit adds some safeguards
around this logic by ensuring that the store relinquishing the replica and the
store receiving it differ by at least 200 qps. Furthermore, it ensures that the
replica rebalance would not significantly switch the relative dispositions of
the two stores.
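
To make the "reduce the delta between the hottest and the coldest store" idea concrete, here is a toy sketch of picking a rebalance target within an equivalence class; it reuses minQPSDelta from the sketch above, and everything else is assumed rather than taken from the real Allocator:

```go
// Toy sketch, not the Allocator's code. candidates maps store IDs in the
// equivalence class to their current QPS; srcQPS is the store considering
// shedding a replica that serves replQPS.
func pickRebalanceTarget(srcQPS, replQPS float64, candidates map[int]float64) (target int, ok bool) {
	coldest := srcQPS - minQPSDelta // a target must be at least minQPSDelta colder
	for storeID, q := range candidates {
		if q > coldest {
			continue
		}
		// Skip targets that would end up hotter than the source after the
		// move; that would just swap the two stores' dispositions.
		if q+replQPS > srcQPS-replQPS {
			continue
		}
		coldest, target, ok = q, storeID, true
	}
	return target, ok
}
```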

An important thing to note with the 21.2 implementation of the store rebalancer
is that it was making all of its decisions based on cluster-level QPS averages.
This behaves poorly in heterogeneously sized or loaded clusters where some
localities are designed to receive more traffic than others. In such clusters,
heavily loaded localities can always be considered "overfull". This usually
means that all stores in such localities would be above the "overfull"
threshold in the cluster. The logic described above would effectively not do
anything since there are no underfull stores to move replicas to.
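
The failure mode described in the last paragraph is easy to see with toy numbers (the ±10% band below is an assumption, used only to make the arithmetic concrete): with a hot three-store locality at 1000 QPS per store and a cold one at 100 QPS per store, the cluster mean is 550, so every hot store sits above the 605 overfull threshold while the only underfull stores (below 495) are in the cold locality. If constraints pin the hot ranges to the hot locality, there is no valid underfull target and the rebalancer does nothing.

```go
// Toy illustration of 21.2-style cluster-mean thresholds; the ±10% band is
// an assumption, not the real setting.
func thresholds(storeQPS []float64) (underfull, overfull float64) {
	var sum float64
	for _, q := range storeQPS {
		sum += q
	}
	mean := sum / float64(len(storeQPS))
	return 0.9 * mean, 1.1 * mean
}

// thresholds([]float64{1000, 1000, 1000, 100, 100, 100}) returns (495, 605).
```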

**Manual testing**
This patch has been stress tested with the follower reads roachtests (~250 iterations of 
`follower-reads/survival=region/locality=global/reads=strong` and 100 iterations of 
`follower-reads/survival=zone/locality=regional/reads=exact-staleness`). It has also been 
stress tested with the `rebalance/by-load` roachtests (100 iterations for both `..leases` and 
`..replicas` tests). I also manually ran a TPCC 10K run with a small ramp (something we
know triggers #31135) a few times and
saw average QPS converge among stores fairly quickly.
![tpcc-with-low-ramp](https://user-images.githubusercontent.com/10788754/149742518-981825f4-6812-41c1-8320-519caafda9c1.png)
  

Release note (performance improvement): A set of bugs that rendered QPS-based
lease and replica rebalancing in CRDB 21.2 and prior ineffective under
heterogeneously loaded cluster localities has been fixed. Additionally, a
limitation that prevented CRDB from effectively alleviating extreme QPS hotspots
on nodes has also been fixed.


75624: kv: compare MVCC GC threshold against Refresh{Range}Request.RefreshFrom r=nvanbenschoten a=nvanbenschoten

Noticed by Sumeer in #74628.

A Refresh request needs to observe all MVCC versions between its
exclusive RefreshFrom time and its inclusive RefreshTo time. If it were
to permit MVCC GC between these times then it could miss conflicts that
should cause the refresh to fail. This could in turn lead to violations
of serializability. For example:

```
txn1 reads value k1@10
txn2 deletes (tombstones) k1@15
mvcc gc @ 20 clears versions k1@10 and k1@15
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful
```

In the example, the refresh erroneously succeeds because the request is
permitted to evaluate after part of the MVCC history it needs to read
has been GCed. By considering the RefreshFrom time to be the earliest
active timestamp of the request, we avoid this hazard. Instead of being
allowed to evaluate, the refresh request in the example would have hit
a BatchTimestampBeforeGCError.
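
A minimal sketch of the guard this implies; it is not the actual KV code (which works with hlc timestamps and structured errors), and plain integers stand in for timestamps to keep the example self-contained:

```go
package example

import "fmt"

// checkRefreshCanEvaluate rejects a refresh whose exclusive lower bound
// (RefreshFrom) is at or below the MVCC GC threshold: versions in that
// window may already have been removed, so the refresh could miss a
// conflicting write (as in the k1@10 / k1@15 example above) and must fail
// rather than evaluate.
func checkRefreshCanEvaluate(refreshFrom, gcThreshold int64) error {
	if refreshFrom <= gcThreshold {
		return fmt.Errorf("refresh from %d is not above GC threshold %d", refreshFrom, gcThreshold)
	}
	return nil
}
```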

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@aayushshah15

Closed by #72296

@aayushshah15

Actually, this issue captures a different failure mode. Re-opening.

@aayushshah15 aayushshah15 reopened this Jan 29, 2022
@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on master @ a5158c4c93e9dca858b38ef7e94321f0a0a2b5c5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:270,follower_reads.go:72,test_runner.go:780: failed to get follower read counts: Get "http://35.237.98.222:26258/_status/vars": dial tcp 35.237.98.222:26258: connect: connection refused
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches


@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ a5158c4c93e9dca858b38ef7e94321f0a0a2b5c5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
	follower_reads.go:270,follower_reads.go:72,test_runner.go:780: failed to get follower read counts: Get "http://35.243.247.122:26258/_status/vars": dial tcp 35.243.247.122:26258: connect: connection refused
Help

See: roachtest README

See: How To Investigate (internal)


@blathers-crl blathers-crl bot added the T-kv KV Team label Jan 31, 2022
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 31, 2022
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 31, 2022
craig bot pushed a commit that referenced this issue Feb 1, 2022
75739: sql: fix the sql.trace.stmt.enable_threshold cluster setting  r=andreimatei a=andreimatei

This setting had rotted, leading to a span use-after-Finish.

This was caught by the recently-unskipped follower reads roachtests
(roachtests panic on use-after-finish), accounting for all the issues
below. The last entry in these issues refers to the nodes crashing
because of the bug fixed here. Some of these issues have been open for a
long time for more relevant failures - those have been fixed in #72296.

Fixes #75716
Fixes #75715
Fixes #75714
Fixes #71244
Fixes #71126
Fixes #70818
Fixes #70191
Fixes #70011
Fixes #70010
Fixes #69952
Fixes #69951

Release note: None

75764: roachtest: add a variant of tpch_concurrency with no admission control r=yuzefovich a=yuzefovich

Informs: #74836.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on master @ 8548987813ff9e1b8a9878023d3abfc6911c16db:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:270,follower_reads.go:72,test_runner.go:780: failed to get follower read counts: Get "http://34.75.220.236:26258/_status/vars": dial tcp 34.75.220.236:26258: connect: connection refused
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches


@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 8548987813ff9e1b8a9878023d3abfc6911c16db:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
	follower_reads.go:270,follower_reads.go:72,test_runner.go:780: failed to get follower read counts: Get "http://35.231.172.61:26258/_status/vars": dial tcp 35.231.172.61:26258: connect: connection refused
Help

See: roachtest README

See: How To Investigate (internal)


@craig craig bot closed this as completed in f5d78e2 Feb 1, 2022