Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: nil pointer dereference in TestMergeQueue #63009

Closed
irfansharif opened this issue Apr 2, 2021 · 0 comments · Fixed by #65912
Closed

kvserver: nil pointer dereference in TestMergeQueue #63009

irfansharif opened this issue Apr 2, 2021 · 0 comments · Fixed by #65912
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@irfansharif
Copy link
Contributor

Describe the problem

See this build run against the 21.1 branch.


------- Stdout: -------
=== RUN   TestMergeQueue/sticky-bit
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1c1a2e7]

goroutine 196713 [running]:
testing.tRunner.func1.1(0x4182cc0, 0x737a9c0)
	/usr/local/go/src/testing/testing.go:1072 +0x30d
testing.tRunner.func1(0xc001b04f00)
	/usr/local/go/src/testing/testing.go:1075 +0x41a
panic(0x4182cc0, 0x737a9c0)
	/usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SetZoneConfig(0x0, 0xc008ca0fc0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica.go:689 +0x47
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestMergeQueue.func5(0xc005694700, 0xc0056946f0, 0xc0056946f8, 0x0, 0xc0056946e0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_merge_test.go:4144 +0xae
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestMergeQueue.func6(0xc001b04f00)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_merge_test.go:4155 +0x302
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestMergeQueue.func12(0xc001b04f00)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_merge_test.go:4234 +0x8f
testing.tRunner(0xc001b04f00, 0xc00539cf80)
	/usr/local/go/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1168 +0x2b3

+cc @tbg for routing.

@irfansharif irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 2, 2021
@tbg tbg removed their assignment Apr 8, 2021
@irfansharif irfansharif self-assigned this Apr 22, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 26, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic.
I don't know quite what's going on here, but when working on a patch for cockroachdb#62700,
I managed to hit this panic reliably by accidentally breaking all range merges.

After a bit of debugging, it became clear that we were always hitting a panic in
the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest,
`TestMergeQueue/non-collocated`, was moving the RHS range to a different node,
failing to merge the two range, and failing itself. This soft failure was being
drowned out by the hard failure in the next subtest.

This commit replaces the crash with a failure that looks something like the
following when range merges are completely disabled:
```
--- FAIL: TestMergeQueue (0.34s)
    test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestMergeQueue/both-empty (0.00s)
        client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/lhs-undersize (0.00s)
        client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/combined-threshold (0.00s)
        client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/non-collocated (0.03s)
        client_merge_test.go:4236: replica doesn't exist
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
        client_merge_test.go:4243: right-hand side range not found
    --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s)
        client_merge_test.go:4268: right-hand side range not found
```

I expect that under stress on master, we will see the
`TestMergeQueue/non-collocated` subtest fail.
craig bot pushed a commit that referenced this issue Apr 27, 2021
64199: kv: avoid panic on TestMergeQueue/non-collocated failure r=nvanbenschoten a=nvanbenschoten

Informs #63009.
Informs #64056.

In #63009/#64056, we saw that this test could flake with a nil pointer panic.
I don't know quite what's going on here, but when working on a patch for #62700,
I managed to hit this panic reliably by accidentally breaking all range merges.

After a bit of debugging, it became clear that we were always hitting a panic in
the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest,
`TestMergeQueue/non-collocated`, was moving the RHS range to a different node,
failing to merge the two range, and failing itself. This soft failure was being
drowned out by the hard failure in the next subtest.

This commit replaces the crash with a failure that looks something like the
following when range merges are completely disabled:
```
--- FAIL: TestMergeQueue (0.34s)
    test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestMergeQueue/both-empty (0.00s)
        client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/lhs-undersize (0.00s)
        client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/combined-threshold (0.00s)
        client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/non-collocated (0.03s)
        client_merge_test.go:4236: replica doesn't exist
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
        client_merge_test.go:4243: right-hand side range not found
    --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s)
        client_merge_test.go:4268: right-hand side range not found
```

I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.

The fact that `TestMergeQueue/non-collocated` is the test failing means that we may want to have @aayushshah15 take over this investigation, since he's made changes in that area recently. What do you two think?

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 2, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#64199, we found that the flake was likely due to the non-collocated
subtest, so this commit un-skips the parent test and skips only this
single subtest.
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jun 1, 2021
The test started flaking after ce73bc1. The issue was that there was a race
between when the merge queue was enabled via an `Override()` call and when
these tests assert that a given range actually got merged away. In other words,
these tests were sometime getting to their assertion before the relevant server
learned that the merge queue was enabled.

This patch modifies the test to instead enable the merge queue via SQL and not
through the `Override()` call.

Resolves cockroachdb#64866
Resolves cockroachdb#64056
Resolves cockroachdb#63009

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Jun 3, 2021
The test started flaking after ce73bc1. The issue was that there was a race
between when the merge queue was enabled via an `Override()` call and when
these tests assert that a given range actually got merged away. In other words,
these tests were sometime getting to their assertion before the relevant server
learned that the merge queue was enabled.

This patch modifies the test to instead enable the merge queue via SQL and not
through the `Override()` call.

Resolves cockroachdb#64866
Resolves cockroachdb#64056
Resolves cockroachdb#63009

Release note: None
craig bot pushed a commit that referenced this issue Jun 3, 2021
65912: kvserver: deflake TestMergeQueue r=aayushshah15 a=aayushshah15

The test started flaking after ce73bc1. The issue was that there was a race
between when the merge queue was enabled via an `Override()` call and when
these tests assert that a given range actually got merged away. In other words,
these tests were sometime getting to their assertion before the relevant server
learned that the merge queue was enabled.

This patch modifies the test to instead enable the merge queue via SQL and not
through the `Override()` call on the cluster setting.

Resolves #64866
Resolves #64056
Resolves #63009

Release note: None

66039: backupccl: fixing flaky TestCleanupIntentsDuringBackupPerformanceRegression test r=rickystewart,erikgrinaker a=aliher1911

Increasing test timeout from 10 to 20 sec to stop CI failing. This
value is safe as regressions take 60 sec when they start showing
quadratic behaviours.

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Oleg Afanasyev <oleg@cockroachlabs.com>
@craig craig bot closed this as completed in 0ba5a73 Jun 4, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 8, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic.
I don't know quite what's going on here, but when working on a patch for cockroachdb#62700,
I managed to hit this panic reliably by accidentally breaking all range merges.

After a bit of debugging, it became clear that we were always hitting a panic in
the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest,
`TestMergeQueue/non-collocated`, was moving the RHS range to a different node,
failing to merge the two range, and failing itself. This soft failure was being
drowned out by the hard failure in the next subtest.

This commit replaces the crash with a failure that looks something like the
following when range merges are completely disabled:
```
--- FAIL: TestMergeQueue (0.34s)
    test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestMergeQueue/both-empty (0.00s)
        client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/lhs-undersize (0.00s)
        client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/combined-threshold (0.00s)
        client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/non-collocated (0.03s)
        client_merge_test.go:4236: replica doesn't exist
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
        client_merge_test.go:4243: right-hand side range not found
    --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s)
        client_merge_test.go:4268: right-hand side range not found
```

I expect that under stress on master, we will see the
`TestMergeQueue/non-collocated` subtest fail.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 8, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#64199, we found that the flake was likely due to the non-collocated
subtest, so this commit un-skips the parent test and skips only this
single subtest.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 16, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic.
I don't know quite what's going on here, but when working on a patch for cockroachdb#62700,
I managed to hit this panic reliably by accidentally breaking all range merges.

After a bit of debugging, it became clear that we were always hitting a panic in
the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest,
`TestMergeQueue/non-collocated`, was moving the RHS range to a different node,
failing to merge the two range, and failing itself. This soft failure was being
drowned out by the hard failure in the next subtest.

This commit replaces the crash with a failure that looks something like the
following when range merges are completely disabled:
```
--- FAIL: TestMergeQueue (0.34s)
    test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestMergeQueue/both-empty (0.00s)
        client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/lhs-undersize (0.00s)
        client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/combined-threshold (0.00s)
        client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/non-collocated (0.03s)
        client_merge_test.go:4236: replica doesn't exist
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
        client_merge_test.go:4243: right-hand side range not found
    --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s)
        client_merge_test.go:4268: right-hand side range not found
```

I expect that under stress on master, we will see the
`TestMergeQueue/non-collocated` subtest fail.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants