roachtest: tpccbench/nodes=3/cpu=16 failed #59424

Closed
cockroach-teamcity opened this issue Jan 26, 2021 · 60 comments · Fixed by #61777

@cockroach-teamcity
Member

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@d86781c07065421f4a4d8bf5d988900ab07fdce5:

		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

	cluster.go:1666,context.go:140,cluster.go:1655,test_runner.go:848: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2615395-1611644173-19-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: dead
		2: dead
		1: 12227
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 2: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (4) 3: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 26, 2021
@aayushshah15
Contributor

Very similar symptoms to what we're seeing in #58298. It's unclear to me what's causing the crash. The dmesg file doesn't have any mention of oom-killer.

The heap profile doesn't seem to indicate any foul play, and it looks like 64d6d87 is doing what it's supposed to.
[screenshot: heap profile, 2021-01-26]

However, I am seeing a whole lot of gcTxnRecords in the stacktraces, as @nvanbenschoten previously observed.

cat goroutine_dump.2021-01-26T08_47_17.877.double_since_last_dump.000006950.txt | rg "gcTxnRecord" | wc -l
     251

@irfansharif
Contributor

Should no longer be a release blocker as long as the release is picking up #59431.

@irfansharif irfansharif removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 1, 2021
@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@5971ecb9dd1a25c81cd6012d6be1ff922802eae5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2219,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2676655-1613372337-19-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2676655-1613372337-19-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-2676655-1613372337-19-n4cpu16: stopping and waiting..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 2: exit status 255: Connection to 35.222.134.47 closed by remote host.
		  |
		  | I210215 09:00:36.471961 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@3c223f5f5162103110a790743b687ef2bf952489:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 3: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2675
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 3: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@83e70ce84b740e27e721c3b73c38a4b8b515094a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 3: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2675
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 3: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@nvanbenschoten
Member

The last two failures are likely fallout from #60765.

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@64c4aef909f4382523cd9248341ca9f4448d841a:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@bf9744bad5a416a4b06907f0f3dd42896f7342f3:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2702231-1613977007-25-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2702231-1613977007-25-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | .............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
........................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 0: exit status 255: 
		  | 1: exit status 255: 
		  | I210222 10:42:40.368808 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@5cfd7e5553a3072a1490d392390dddf968844215:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2707822-1614064242-26-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2707822-1614064242-26-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | .............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
........................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 1: exit status 255: 
		  | 2: exit status 255: 
		  | I210223 11:01:44.715647 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif
Contributor

08:41:31 test.go:196: test status: 
08:41:31 tpcc.go:911: --- SEARCH ITER FAIL: TPCC 2200 resulted in 6142.5 tpmC and failed due to efficiency value of 22.150252691927975 is below passing threshold of 85
08:41:31 tpcc.go:805: initializing cluster for 2188 warehouses (search attempt: 2)

I think this is a real failure that needs to be investigated. Partway through the workload we seem to hit 0 QPS due to crashed nodes, except the logs for these nodes aren't available because the machines themselves appear to be gone. Could be an infra flake, but I'm not sure why this is the only test (that I'm aware of) seeing a pattern like this.

 238.0s        0           84.8          228.4  22548.6  47244.6  81604.4 103079.2 newOrder
  238.0s        0           17.9           24.7  12348.0  38654.7  47244.6  47244.6 orderStatus
  238.0s        0          195.5          241.4  24696.1  49392.1  90194.3 103079.2 payment
  238.0s        0           24.9           24.3  15569.3  34359.7  47244.6  47244.6 stockLevel
  239.0s        0           16.0           23.5  38654.7  49392.1  51539.6  51539.6 delivery
  239.0s        0          107.3          227.9  31138.5  73014.4 103079.2 103079.2 newOrder
  239.0s        0            8.0           24.7  21474.8  45097.2  45097.2  45097.2 orderStatus
  239.0s        0          104.3          240.8  23622.3  47244.6  68719.5  85899.3 payment
  239.0s        0           14.0           24.2  23622.3  36507.2  45097.2  45097.2 stockLevel
  240.0s        0            0.0           23.4      0.0      0.0      0.0      0.0 delivery
  240.0s        0            0.0          227.0      0.0      0.0      0.0      0.0 newOrder
  240.0s        0            0.0           24.6      0.0      0.0      0.0      0.0 orderStatus
  240.0s        0            0.0          239.8      0.0      0.0      0.0      0.0 payment
  240.0s        0            0.0           24.1      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  241.0s        0            0.0           23.3      0.0      0.0      0.0      0.0 delivery
  241.0s        0            0.0          226.0      0.0      0.0      0.0      0.0 newOrder
  241.0s        0            0.0           24.5      0.0      0.0      0.0      0.0 orderStatus
  241.0s        0            0.0          238.8      0.0      0.0      0.0      0.0 payment
  241.0s        0            0.0           24.0      0.0      0.0      0.0      0.0 stockLevel
  242.0s        0            0.0           23.2      0.0      0.0      0.0      0.0 delivery
  242.0s        0            0.0          225.1      0.0      0.0      0.0      0.0 newOrder
  242.0s        0            0.0           24.4      0.0      0.0      0.0      0.0 orderStatus
  242.0s        0            0.0          237.8      0.0      0.0      0.0      0.0 payment
  242.0s        0            0.0           23.9      0.0      0.0      0.0      0.0 stockLevel
  243.0s        0            0.0           23.1      0.0      0.0      0.0      0.0 delivery
  243.0s        0            0.0          224.1      0.0      0.0      0.0      0.0 newOrder
  243.0s        0            0.0           24.3      0.0      0.0      0.0      0.0 orderStatus
  243.0s        0            0.0          236.9      0.0      0.0      0.0      0.0 payment
  243.0s        0            0.0           23.8      0.0      0.0      0.0      0.0 stockLevel
  244.0s        0            0.0           23.0      0.0      0.0      0.0      0.0 delivery
  244.0s        0            0.0          223.2      0.0      0.0      0.0      0.0 newOrder
  244.0s        0            0.0           24.2      0.0      0.0      0.0      0.0 orderStatus
  244.0s        0            0.0          235.9      0.0      0.0      0.0      0.0 payment
  244.0s        0            0.0           23.7      0.0      0.0      0.0      0.0 stockLevel
E210222 08:35:35.374278 1 workload/cli/run.go:471  [-] 3  error in delivery: ERROR: rpc error: code = Canceled desc = context canceled (SQLSTATE XXUUU)
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  245.0s       15            1.0           22.9  21474.8  21474.8  21474.8  21474.8 delivery
  245.0s       15            0.0          222.3      0.0      0.0      0.0      0.0 newOrder
  245.0s       15            0.0           24.1      0.0      0.0      0.0      0.0 orderStatus
  245.0s       15            2.0          234.9  21474.8  27917.3  27917.3  27917.3 payment
  245.0s       15            0.0           23.6      0.0      0.0      0.0      0.0 stockLevel
  246.0s       15            0.0           22.8      0.0      0.0      0.0      0.0 delivery
  246.0s       15            0.0          221.4      0.0      0.0      0.0      0.0 newOrder
  246.0s       15            0.0           24.0      0.0      0.0      0.0      0.0 orderStatus
  246.0s       15            0.0          234.0      0.0      0.0      0.0      0.0 payment
  246.0s       15            0.0           23.5      0.0      0.0      0.0      0.0 stockLevel
  247.0s       15            0.0           22.7      0.0      0.0      0.0      0.0 delivery
  247.0s       15            0.0          220.5      0.0      0.0      0.0      0.0 newOrder
  247.0s       15            0.0           23.9      0.0      0.0      0.0      0.0 orderStatus
  247.0s       15            0.0          233.0      0.0      0.0      0.0      0.0 payment
  247.0s       15            0.0           23.4      0.0      0.0      0.0      0.0 stockLevel
  248.0s       15            0.0           22.6      0.0      0.0      0.0      0.0 delivery
  248.0s       15            0.0          219.6      0.0      0.0      0.0      0.0 newOrder
  248.0s       15            0.0           23.8      0.0      0.0      0.0      0.0 orderStatus
  248.0s       15            0.0          232.1      0.0      0.0      0.0      0.0 payment
  248.0s       15            0.0           23.3      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  249.0s       15            0.0           22.5      0.0      0.0      0.0      0.0 delivery
  249.0s       15            2.0          218.8  16106.1  24696.1  24696.1  24696.1 newOrder
  249.0s       15            0.0           23.7      0.0      0.0      0.0      0.0 orderStatus
  249.0s       15            0.0          231.2      0.0      0.0      0.0      0.0 payment
  249.0s       15            0.0           23.2      0.0      0.0      0.0      0.0 stockLevel
  250.0s       15            0.0           22.4      0.0      0.0      0.0      0.0 delivery
  250.0s       15            0.0          217.9      0.0      0.0      0.0      0.0 newOrder
  250.0s       15            0.0           23.6      0.0      0.0      0.0      0.0 orderStatus
  250.0s       15            0.0          230.2      0.0      0.0      0.0      0.0 payment
  250.0s       15            0.0           23.1      0.0      0.0      0.0      0.0 stockLevel
  251.0s       15            0.0           22.3      0.0      0.0      0.0      0.0 delivery
  251.0s       15            1.0          217.0  38654.7  38654.7  38654.7  38654.7 newOrder
  251.0s       15            0.0           23.5      0.0      0.0      0.0      0.0 orderStatus
  251.0s       15            0.0          229.3      0.0      0.0      0.0      0.0 payment
  251.0s       15            0.0           23.1      0.0      0.0      0.0      0.0 stockLevel
  252.0s       15            0.0           22.3      0.0      0.0      0.0      0.0 delivery
  252.0s       15            0.0          216.2      0.0      0.0      0.0      0.0 newOrder
  252.0s       15            0.0           23.4      0.0      0.0      0.0      0.0 orderStatus
  252.0s       15            0.0          228.4      0.0      0.0      0.0      0.0 payment
  252.0s       15            0.0           23.0      0.0      0.0      0.0      0.0 stockLevel

@irfansharif irfansharif removed their assignment Feb 23, 2021
@nvanbenschoten
Member

These are exactly the symptoms we would expect to see due to the bug fixed by #60992, and the timing lines up. So I'd optimistically sit on this and wait to see if it continues reproducing before spending time investigating.

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@ec011620c7cf299fdbb898db692b36454defc4a2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@c7e088826bc079620dfd3b5ae75d1c15cd9cd16d:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@6601d827b814d4e85a1081b03bf2562d8ac2a4ab:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2721381-1614322964-25-n4cpu16:1-3 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2721381-1614322964-25-n4cpu16:1-3 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | .............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
........................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 0: exit status 255: 
		  | 2: exit status 255: 
		  | I210226 11:07:00.140381 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@9595a158f0233e1c3d86786ec4462dd39c7beb20:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 1: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich
Member

Thanks for the find about mistakenly attaching flow stats regardless of whether the execution stats are collected. I opened a PR to fix it.

@irfansharif #61532 was recently merged, which changed the code quoted here. We should now be using the span created here for stats propagation:

ctx, sp := tracing.EnsureChildSpan(
	ctx, ex.server.cfg.AmbientCtx.Tracer,
	// We print the type of command, not the String() which includes long
	// statements.
	cmd.command())

@irfansharif
Contributor

Gotcha, thanks for the heads-up. I think we'll want to revert #61532. I'm assuming the original motivation was to reduce the overhead of sampling by not creating any new spans? Well, with #61777 the statement's span will be a no-op one, so when we know we're sampling, we should go ahead and create a new span. I think in the end we'd be doing no worse for sampled statements (relative to #61532).

@yuzefovich
Member

Yes, the only motivation for #61532 was to reduce the perf hit of sampling, so if #61777 effectively makes the statement's span a no-op, reverting #61532 sounds good to me; the instrumentation would then be responsible for ensuring that a proper span is present when sampling is needed.
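
To make the plan above concrete, here is a minimal, self-contained Go sketch (the names and types are illustrative stand-ins, not the actual CockroachDB tracing API): ordinary statements keep a cheap no-op span, and a real span is only created when session tracing is on or the statement is picked for sampling.

package main

import (
	"fmt"
	"math/rand"
)

// span is a stand-in for a tracing span. A no-op span records nothing and is
// nearly free; a real span buffers a recording and carries stats payloads.
type span struct{ real bool }

// startStatementSpan only pays for a real span when one is actually needed,
// mirroring the idea that the instrumentation code, not every statement,
// ensures a proper span is present.
func startStatementSpan(sessionTracing bool, sampleRate float64, rng *rand.Rand) span {
	if sessionTracing || rng.Float64() < sampleRate {
		return span{real: true}
	}
	return span{real: false}
}

func main() {
	rng := rand.New(rand.NewSource(1))
	fmt.Println(startStatementSpan(false, 0.01, rng)) // almost always a no-op span
	fmt.Println(startStatementSpan(true, 0.01, rng))  // session tracing forces a real span
}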

tbg added a commit to tbg/cockroach that referenced this issue Mar 12, 2021
This isn't something I am planning to finish, but I wanted to see if I
would immediately run into any snags when trying to get `roachprod
start` to spin up a Docker container instead of running bare-metal.

The motivation is twofold:

- I'm generally interested in how we evolve roachprod over time.
  Specifically, the question is whether in the long run systems testing
  should be done in k8s and which, if any, stepping stones there are to
  evolve what we currently do in that direction.
- The current bare-metal setup goes [unresponsive] when CRDB goes into
  overdrive, and tests fail in the most opaque ways. We then incur a
  large tax for debugging these situations since we can't access the
  cluster in that state.

Work that would need to be done to really finish this:

- maintain ubuntu images that come with docker installed (right now this
  needs to be set up manually), with the `ubuntu` user set up and the images
  we need cached
- we hit the old problem of having to pass the uid/gid to the container
  to avoid creating files as root; I hacked around it by hard-coding
  them
- Setting the cluster settings, etc., is all done via the `./cockroach`
  binary, but that is no longer a thing. Ideally that should use SQL
- The whole premise of uploading binaries is out the window; we
  need to deal in CRDB containers exclusively, though we could
  conceivably use a wrapper container running the uploaded binary
  if we wanted to retain how roachprod/test work.

[unresponsive]: cockroachdb#59424 (comment)

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 12, 2021
This drastically reduces the memory overhead for tracing we're observing
in cockroachdb#59424. This commit does a few disparate things to make it happen:

1. We now access the tracing span through txnState.Ctx exclusively. This
   gives us a single point to hijack, which we'll later do. By default
   txns are initialized with a no-op span. If later on session tracing is
   enabled, we'll create a real (verbose) span and swap it out with the
   txn's no-op one. This gives us the same semantics as earlier, and on the
   plus side, we're not re-using the same tracing span when session tracing
   is toggled.
2. Harden tracing methods to work with no-op spans. Specifically
   GetRecording and TraceID.
3. Remove a crash vector through crdb_internal.trace_id. It was
   previously reaching into the first recording to retrieve a trace ID. But
   it's not guaranteed that recordings are non-empty. This could be used to
   induce panics in the server.

This PR will need to get backported to 21.1. Fixes cockroachdb#59424.

Release note: None
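
As an editorial illustration of items (2) and (3) in the commit message above, here is a hedged, self-contained sketch of what "working with no-op spans" amounts to; the types and function names are hypothetical, not the actual tracing package API.

package main

import (
	"errors"
	"fmt"
)

// recordedSpan is a simplified stand-in for one entry in a span recording.
type recordedSpan struct{ TraceID uint64 }

// traceIDFromRecording refuses to index into an empty recording. A no-op span
// produces no recording, so blindly reading recording[0] (as the old
// crdb_internal.trace_id path effectively did) could panic the server.
func traceIDFromRecording(rec []recordedSpan) (uint64, error) {
	if len(rec) == 0 {
		return 0, errors.New("no recording available; span may be a no-op span")
	}
	return rec[0].TraceID, nil
}

func main() {
	if _, err := traceIDFromRecording(nil); err != nil {
		fmt.Println("handled gracefully:", err)
	}
}
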
@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=3/cpu=16 failed on master@4b98115dfda02a9498f566958bd915c45ec7e449:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=3/cpu=16/run_1
	tpcc.go:917,tpcc.go:617,test_runner.go:767: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:894
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:803
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Artifacts: /tpccbench/nodes=3/cpu=16
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 12, 2021
This drastically reduces the memory overhead for tracing we're observing
in cockroachdb#59424. This commit does a few disparate things to make it happen:

1. We now access the tracing span through txnState.Ctx exclusively. This
   gives us a single point to hijack, which we'll later do. By default
   txns are initialized with a no-op span. If later on session tracing is
   enabled, we'll create a real (verbose) span and swap it out with the
   txn's no-op one. This gives us the same semantics as earlier, and on the
   plus side, we're not re-using the same tracing span when session tracing
   is toggled.
2. Harden tracing methods to work with no-op spans. Specifically
   GetRecording and TraceID.
3. Remove a crash vector through crdb_internal.trace_id. It was
   previously reaching into the first recording to retrieve a trace ID. But
   it's not guaranteed that recordings are non-empty. This could be used to
   induce panics in the server.

This PR will need to get backported to 21.1. Fixes cockroachdb#59424.

Release note: None
craig bot pushed a commit that referenced this issue Mar 13, 2021
61777:  sql: only create real spans when session tracing/sampling r=irfansharif a=irfansharif

This drastically reduces the memory overhead for tracing we're observing
in #59424. This commit does a few disparate things to make it happen:

1. We now access the tracing span through txnState.Ctx exclusively. This
   gives us a single point to hijack, which we'll later do. By default
   txns are initialized with a no-op span. If later on session tracing is
   enabled, we'll create a real (verbose) span and swap it out with the
   txn's no-op one. This gives us the same semantics as earlier, and on the
   plus side, we're not re-using the same tracing span when session tracing
   is toggled.
2. Harden tracing methods to work with no-op spans. Specifically
   GetRecording and TraceID.
3. Remove a crash vector through crdb_internal.trace_id. It was
   previously reaching into the first recording to retrieve a trace ID. But
   it's not guaranteed that recordings are non-empty. This could be used to
   induce panics in the server.

This PR will need to get backported to 21.1. Fixes #59424.

Release note: None

---

+cc @cockroachdb/kv-east.


Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
@craig craig bot closed this as completed in d7407d5 Mar 13, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 13, 2021
This drastically reduces the memory overhead for tracing we're observing
in cockroachdb#59424. This commit does a few disparate things to make it happen:

1. We now access the tracing span through txnState.Ctx exclusively. This
   gives us a single point to hijack, which we'll later do. By default
   txns are initialized with a no-op span. If later on session tracing is
   enabled, we'll create a real (verbose) span and swap it out with the
   txn's no-op one. This gives us the same semantics as earlier, and on the
   plus side, we're not re-using the same tracing span when session tracing
   is toggled.
2. Harden tracing methods to work with no-op spans. Specifically
   GetRecording and TraceID.
3. Remove a crash vector through crdb_internal.trace_id. It was
   previously reaching into the first recording to retrieve a trace ID. But
   it's not guaranteed that recordings are non-empty. This could be used to
   induce panics in the server.

This PR will need to get backported to 21.1. Fixes cockroachdb#59424.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Mar 22, 2021
---

sql: only create real spans when session tracing/sampling r=irfansharif a=irfansharif

This drastically reduces the memory overhead for tracing we're observing
in cockroachdb#59424. This commit does a few disparate things to make it happen:

1. We now access the tracing span through txnState.Ctx exclusively. This
   gives us a single point to hijack, which we'll later do. By default
   txns are initialized with a no-op span. If later on session tracing is
   enabled, we'll create a real (verbose) span and swap it out with the
   txn's no-op one. This gives us the same semantics as earlier, and on the
   plus side, we're not re-using the same tracing span when session tracing
   is toggled.
2. Harden tracing methods to work with no-op spans. Specifically
   GetRecording and TraceID.
3. Remove a crash vector through crdb_internal.trace_id. It was
   previously reaching into the first recording to retrieve a trace ID. But
   it's not guaranteed that recordings are non-empty. This could be used to
   induce panics in the server.

This PR will need to get backported to 21.1. Fixes cockroachdb#59424.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 1, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently, what cockroachdb#59379 ended up measuring (and finding
acceptable) was the performance hit we would incur for propagating
stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  -0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offender. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making its way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
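
For reference, the "diff" columns in the tables above are just the percentage change of each run against the rate=0.0 baseline; here is a small Go sketch of that arithmetic (the function is illustrative, not part of the benchmark harness):

package main

import "fmt"

// pctDiff returns the throughput change relative to the rate=0.0 baseline,
// matching the "diff" columns in the tables above.
func pctDiff(baseline, measured float64) float64 {
	return (measured - baseline) / baseline * 100
}

func main() {
	// The 0.01/read/#1 row: 69300.35 ops/sec vs. a baseline of 69817.90.
	fmt.Printf("%.2f%%\n", pctDiff(69817.90, 69300.35)) // prints -0.74%
}
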
craig bot pushed a commit that referenced this issue Apr 2, 2021
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif

We arrived at the previous default rate of 10% back in #59379. That was
back when we were creating real tracing spans for all statements and,
for sampled statements, propagating additional stats payloads.
Consequently, what #59379 ended up measuring (and finding acceptable)
was the performance hit we would incur for propagating stats payloads
for statements that were already using real tracing spans.

Since then, the landscape has changed. Notably, we introduced #61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in #59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.
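
In other words, the sampling decision now gates whether a statement
gets a real (recording) tracing span at all. A conceptual sketch of
that gate, not the actual CockroachDB code paths (all names below are
made up for illustration):

    package main

    import (
        "fmt"
        "math/rand"
    )

    // span is a stand-in for a tracing span; only the "real" variant records.
    type span interface{ Record(msg string) }

    type noopSpan struct{}

    func (noopSpan) Record(string) {} // unsampled statements: recording is free

    type realSpan struct{ events []string }

    func (s *realSpan) Record(msg string) { s.events = append(s.events, msg) } // sampled: pay for recording

    // startStatementSpan decides, per statement, whether to pay for a real span.
    func startStatementSpan(sampleRate float64) span {
        if rand.Float64() < sampleRate {
            return &realSpan{}
        }
        return noopSpan{}
    }

    func main() {
        const sampleRate = 0.01 // the proposed 1% default
        sampled := 0
        for i := 0; i < 100000; i++ {
            if _, ok := startStatementSpan(sampleRate).(*realSpan); ok {
                sampled++
            }
        }
        fmt.Printf("sampled %d of 100000 statements (~%.2f%%)\n",
            sampled, float64(sampled)*100/100000)
    }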

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture of what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so it should
give us a concise picture of the sampling overhead. We ran six
experiments in total (each corresponding to a pair of read+write rows),
done in groups of three (each group corresponding to a table below).
Each run in turn consists of 10 iterations of kv95, and what varies
between runs is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling (and
tracing) entirely, and measure the throughput degradation as we vary
the sampling rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably within the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  +0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in #62227), but they're much too invasive to backport to
21.1.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 2, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. That was
back when we were creating real tracing spans for all statements and,
for sampled statements, propagating additional stats payloads.
Consequently, what cockroachdb#59379 ended up measuring (and finding acceptable)
was the performance hit we would incur for propagating stats payloads
for statements that were already using real tracing spans.

Since then, the landscape has changed. Notably, we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture of what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so it should
give us a concise picture of the sampling overhead. We ran six
experiments in total (each corresponding to a pair of read+write rows),
done in groups of three (each group corresponding to a table below).
Each run in turn consists of 10 iterations of kv95, and what varies
between runs is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling (and
tracing) entirely, and measure the throughput degradation as we vary
the sampling rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably within the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  +0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. Unfortunately, we only discovered the overhead this late in the
development cycle, for two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been
  failing persistently for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offender. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making its way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None