
roachtest: c2c/BulkOps/full failed #114341

Closed
cockroach-teamcity opened this issue Nov 13, 2023 · 12 comments · Fixed by #114525
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot. T-disaster-recovery

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Nov 13, 2023

roachtest.c2c/BulkOps/full failed with artifacts on master @ 063fa2b8930019a16adc0bca5b8363d8e80f3132:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:750
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1650
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:746
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:728
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1650
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-33470

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Nov 13, 2023
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Nov 13, 2023
@msbutler
Collaborator

Cutover failed on a timeout because the replication lag never caught up during the test. I wonder why the auto replanner didn't help.

11:15:39 latency_verifier.go:139: could not record value 38m4.650553667s: value 2284650553667 is too large to be recorded
11:15:40 latency_verifier.go:139: could not record value 38m5.690941624s: value 2285690941624 is too large to be recorded
11:15:41 latency_verifier.go:139: could not record value 38m6.730999152s: value 2286730999152 is too large to be recorded
11:15:42 latency_verifier.go:139: could not record value 38m7.771010502s: value 2287771010502 is too large to be recorded
11:15:43 cluster_to_cluster.go:906: workload successfully finished
11:15:43 cluster_to_cluster.go:944: workload finished on its own
11:15:43 cluster_to_cluster.go:962: test status: cutover time chosen: 2023-11-13 11:10:43.563407 +0000 UTC
11:15:43 cluster_to_cluster.go:648: LogicalMegabytes: 36181.85
11:15:43 cluster_to_cluster.go:648: PhysicalMegabytes: 21536.65
11:15:43 cluster_to_cluster.go:648: PhysicalReplicatedMegabytes: 56270.69
11:15:43 cluster_to_cluster.go:970: test status: waiting for replication stream to cutover to 2023-11-13 11:10:43.563407 +0000 UTC

@msbutler
Collaborator

For some reason, node lag replanning was never executed:

❯ grep "replan" *.unredacted/cockroach.log | grep -v "import"
5.unredacted/cockroach.log:I231113 10:34:35.627903 5528 util/log/event_log.go:32 ⋮ [T1,Vsystem,n1,client=34.150.235.219:56372,hostssl,user=root] 216 ={"Timestamp":1699871675624390486,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"stream_replication.replan_flow_threshold\" = ‹0.1›","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"physical_replication.consumer.replan_flow_threshold","Value":"‹0.1›"}
5.unredacted/cockroach.log:I231113 10:34:35.644254 5528 util/log/event_log.go:32 ⋮ [T1,Vsystem,n1,client=34.150.235.219:56372,hostssl,user=root] 217 ={"Timestamp":1699871675640079685,"EventType":"set_cluster_setting","Statement":"SET CLUSTER SETTING \"physical_replication.consumer.node_lag_replanning_threshold\" = ‹'10m'›","Tag":"SET CLUSTER SETTING","User":"root","SettingName":"physical_replication.consumer.node_lag_replanning_threshold","Value":"‹00:10:00›"}

If it had been executed, we would have seen this log line: "node %d is %.2f minutes behind the next node. Try replanning".

So, either all nodes were equally far behind (i.e. no node was more than 10 minutes behind every other node), or there's some bug in our replan policy.
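
For reference, a rough, self-contained sketch of the kind of check being described (hypothetical names and helper, not the actual replanner code): a node only counts as lagging if its frontier trails the next-slowest node's frontier by more than the configured threshold, so a cluster where every node is uniformly behind never triggers a replan.

package main

import (
	"fmt"
	"sort"
	"time"
)

// laggingNode returns the slowest node and true only if that node's frontier
// trails the next-slowest node's frontier by more than maxLag.
func laggingNode(frontierByNode map[int]time.Time, maxLag time.Duration) (int, bool) {
	type entry struct {
		node int
		ts   time.Time
	}
	entries := make([]entry, 0, len(frontierByNode))
	for n, ts := range frontierByNode {
		entries = append(entries, entry{node: n, ts: ts})
	}
	if len(entries) < 2 {
		return 0, false
	}
	// Sort ascending by frontier timestamp: entries[0] is the laggiest node.
	sort.Slice(entries, func(i, j int) bool { return entries[i].ts.Before(entries[j].ts) })
	if entries[1].ts.Sub(entries[0].ts) > maxLag {
		return entries[0].node, true
	}
	return 0, false
}

func main() {
	now := time.Now()
	// Node 2 is ~36 minutes behind while the others are within seconds of each
	// other, so the 10 minute threshold fires. If every node were 36 minutes
	// behind, nothing would fire.
	frontiers := map[int]time.Time{
		1: now.Add(-15 * time.Second),
		2: now.Add(-36 * time.Minute),
		3: now.Add(-18 * time.Second),
		4: now.Add(-10 * time.Second),
	}
	if n, ok := laggingNode(frontiers, 10*time.Minute); ok {
		fmt.Printf("node %d is lagging behind the next node, try replanning\n", n)
	}
}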

@msbutler
Collaborator

One node has a huge backlog of catchup scans, which means that one node is indeed lagging behind the others:
[screenshot: catchup scan backlog per node]

@msbutler
Collaborator

@adityamaru One node (node 2) is lagging behind on catchup scans here, which suggests that its high-water mark should be far behind all the other nodes'. To confirm, it would be great to see the replication_frontier.txt, but it's not in the debug zip or artifacts. If I enabled periodic collection of the replication frontier in all c2c roachtests, would the debug zip machinery collect it?

@msbutler msbutler added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 13, 2023
@cockroach-teamcity
Member Author

roachtest.c2c/BulkOps/full failed with artifacts on master @ 7f96bac8ce44a128b1f98f5462369a9910c802e5:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:750
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1650
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:746
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:728
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1650
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@msbutler msbutler self-assigned this Nov 15, 2023
@msbutler
Collaborator

I was able to repro the failure. For some reason, the replication-frontier.txt file is empty...

@adityamaru
Contributor

For some reason, the replication-frontier.txt file is empty

Was the cluster setting to dump the frontier set to something other than 0?

@msbutler
Collaborator

oh wait

@msbutler
Collaborator

OK, the replication frontier looks fine. We should replan based on node 2's lag, but we don't:

Src Instance Dest Instance Span                                                         Frontier Timestamp                      Behind By
1            2             /Tenant/2/Table/106/1/15{1014753-2029505}                    2023-11-15 15:40:29.57541861 +0000 UTC  27m10s
1            2             /Tenant/2/Table/106/1/1{69711297-85943911}                   2023-11-15 15:40:27.969296666 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/22{5000001-7029411}                    2023-11-15 15:40:27.774149789 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/3{78042979-94709447}                   2023-11-15 15:31:31.926283872 +0000 UTC 36m8s
1            2             /Tenant/2/Table/106/1/45{2029029-3043555}                    2023-11-15 15:40:29.375254051 +0000 UTC 27m10s
1            2             /Tenant/2/Table/106/1/4{27172420-43404961}                   2023-11-15 15:31:31.926283872 +0000 UTC 36m8s
1            2             /Tenant/2/Table/106/1/4{69711525-85942823}                   2023-11-15 15:32:09.142240862 +0000 UTC 35m30s
1            2             /Tenant/2/Table/106/1/7{7028741-8043111}                     2023-11-15 15:32:45.181294853 +0000 UTC 34m54s
1            2             /Tenant/2/Table/106/1/{1023174-3069568}                      2023-11-15 15:40:27.969296666 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/{19854022-36089410}                    2023-11-15 15:40:28.118166001 +0000 UTC 27m11s
1            2             /Tenant/2/Table/106/1/{3069568-19854022}                     2023-11-15 15:40:28.064543615 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/{52321121-68552704}                    2023-11-15 15:40:28.064543615 +0000 UTC 27m12s
1            2             /Tenant/2/Table/106/1/{68552704-75000001}                    2023-11-15 15:32:45.181294853 +0000 UTC 34m54s
1            2             /Tenant/2/Table/106/1/{94709889-143402727}                   2023-11-15 15:32:45.181294853 +0000 UTC 34m54s
1            2             /Tenant/2/Table/106/2/1{1308045/379643975-2014656/321960419} 2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/2/{4949782/427552132-5657155/164005090}  2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/2/{7070599/373073184-9895573/142776448}  2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/2/{718696/49504160-2828092/178715332}    2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/3/8{486/322776101-844/513185414}         2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/3/9{323/155295265-805/186177284}         2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/{1/589347074-2/1/27843492}               2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
1            2             /Tenant/2/Table/106/{2/14867686/564436069-3/8371/181802916}  2023-11-15 15:32:39.881947565 +0000 UTC 35m0s
2            3             /Tenant/2/Table/106/1/1{53044257-69711297}                   2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/277{172512-719973}                     2023-11-15 16:07:20.765557784 +0000 UTC 18.8s
2            3             /Tenant/2/Table/106/1/2{02172357-18404001}                   2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/2{18404001-25000001}                   2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/2{77719973-92825318}                   2023-11-15 16:07:21.972875964 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/30{1014500-2028999}                    2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/30{2028999-3043522}                    2023-11-15 16:07:21.972875964 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/37{7028645-8042979}                    2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/3{35943207-52174211}                   2023-11-15 16:07:21.97978201 +0000 UTC  17.6s
2            3             /Tenant/2/Table/106/1/4{43404961-50000001}                   2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/52{5000001-6014503}                    2023-11-15 16:07:21.97978201 +0000 UTC  17.6s
2            3             /Tenant/2/Table/106/1/54{0653637-1203073}                    2023-11-15 16:07:23.781526617 +0000 UTC 15.8s
2            3             /Tenant/2/Table/106/1/54{1203073-8462787}                    2023-11-15 16:07:21.972875964 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/556{304519-884517}                     2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/5{18405057-25000001}                   2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/5{48462787-56304519}                   2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/5{56884517-73115557}                   2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/7{6014371-7028741}                     2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/1/{292825318-301014500}                  2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/2/12{014656/321960419-720244/578910049}  2023-11-15 16:07:21.772673264 +0000 UTC 17.8s
2            3             /Tenant/2/Table/106/2/1{2720244/578910049-4160751/43724224}  2023-11-15 16:07:21.972875964 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/2/{5657155/164005090-6362552/502968865}  2023-11-15 16:07:21.572075397 +0000 UTC 18s
2            3             /Tenant/2/Table/106/3/8{371/181802916-486/322776101}         2023-11-15 16:07:21.948562462 +0000 UTC 17.6s
2            3             /Tenant/2/Table/106/3/9{805/186177284-924/571393728}         2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
2            3             /Tenant/{2/Table/106/3/10413/13894371-3}                     2023-11-15 16:07:21.967109762 +0000 UTC 17.6s
3            1             /Tenant/2/Table/106/1/15{2029505-3044257}                    2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/1{-023174}                             2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/1{43402727-50000001}                   2023-11-15 16:07:24.982576296 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/260{361987-941701}                     2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/2{28044128-30798211}                   2023-11-15 16:07:24.957201003 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/2{30798211-44710023}                   2023-11-15 16:07:25.157539582 +0000 UTC 14.4s
3            1             /Tenant/2/Table/106/1/2{44710023-52519300}                   2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/2{52519300-60361987}                   2023-11-15 16:07:24.973114365 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/37{6014311-7028645}                    2023-11-15 16:07:24.982576296 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/3{52174211-68405217}                   2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/41{0940646-1471557}                    2023-11-15 16:07:27.985031386 +0000 UTC 11.6s
3            1             /Tenant/2/Table/106/1/45{1014503-2029029}                    2023-11-15 16:07:24.982576296 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/4{11471557-27172420}                   2023-11-15 16:07:24.973114365 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/52{6014503-7029029}                    2023-11-15 16:07:24.973114365 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/52{7029029-8043555}                    2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/5{28043555-39494084}                   2023-11-15 16:07:24.973114365 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/5{39494084-40653637}                   2023-11-15 16:07:24.982576296 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/1/{185943911-202172357}                  2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/2/4{239136/539234624-949782/427552132}   2023-11-15 16:07:27.988565601 +0000 UTC 11.6s
3            1             /Tenant/2/Table/106/2/{1/27843492-718696/49504160}           2023-11-15 16:07:27.778162906 +0000 UTC 11.8s
3            1             /Tenant/2/Table/106/2/{2828092/178715332-3534896/129169536}  2023-11-15 16:07:28.018700172 +0000 UTC 11.6s
3            1             /Tenant/2/Table/106/2/{3534896/129169536-4239136/539234624}  2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/3/104{04/84903972-13/13894371}           2023-11-15 16:07:25.173432415 +0000 UTC 14.4s
3            1             /Tenant/2/Table/106/3/10{164/174863682-404/84903972}         2023-11-15 16:07:24.957201003 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/3/8{844/513185414-963/576430853}         2023-11-15 16:07:24.957201003 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106/3/9{203/417237639-323/155295265}         2023-11-15 16:07:24.981910965 +0000 UTC 14.6s
3            1             /Tenant/2/Table/106{-/1/1}                                   2023-11-15 16:07:24.957201003 +0000 UTC 14.6s
3            1             /Tenant/2{-/Table/106}                                       2023-11-15 16:07:27.951796081 +0000 UTC 11.6s
4            4             /Tenant/2/Table/106/1/15{0000001-1014753}                    2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/22{7029411-8044128}                    2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/26{0941701-1503584}                    2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/2{61503584-77172512}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/37601431{0-1}                          2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/37{5000001-6014310}                    2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/3{03043522-19710177}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/3{19710177-35943207}                   2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/3{68405217-75000001}                   2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/410{360676-940646}                     2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/45{0000001-1014503}                    2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/4{02518566-10360676}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/4{53043555-69711525}                   2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/58{0923716-8767557}                    2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/58{8767557-9347074}                    2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/5{02173190-18405057}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/5{73115557-80923716}                   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/7{5000001-6014371}                     2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/{36089410-52321121}                    2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/{394709447-402518566}                  2023-11-15 16:07:29.79323158 +0000 UTC  9.8s
4            4             /Tenant/2/Table/106/1/{485942823-502173190}                  2023-11-15 16:07:29.786431561 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/1/{78043111-94709889}                    2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/2/14{160751/43724224-867686/564436069}   2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/2/1{0602851/135163616-1308045/379643975} 2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/2/{6362552/502968865-7070599/373073184}  2023-11-15 16:07:27.966339919 +0000 UTC 11.6s
4            4             /Tenant/2/Table/106/2/{9895573/142776448-10602851/135163616} 2023-11-15 16:07:29.97979045 +0000 UTC  9.6s
4            4             /Tenant/2/Table/106/3/9{083/414592929-203/417237639}         2023-11-15 16:07:29.785138662 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/3/{8963/576430853-9083/414592929}        2023-11-15 16:07:29.772458998 +0000 UTC 9.8s
4            4             /Tenant/2/Table/106/3/{9924/571393728-10164/174863682}       2023-11-15 16:07:29.79323158 +0000 UTC  9.8s

@msbutler
Collaborator

msbutler commented Nov 15, 2023

Aha, I think I figured out the problem after rerunning the test with some extra logging: when we consider checking for lagging nodes, we always update sf.lastNodeLagCheck, even if we exit early from the check because not enough time has passed since the last one (timeutil.Since(sf.lastNodeLagCheck) < checkFreq). If we call maybeCheckForLaggingNodes frequently enough, we will never actually consider replanning. This should be a quick fix.

func (sf *streamIngestionFrontier) maybeCheckForLaggingNodes() error {
	defer func() {
		sf.lastNodeLagCheck = timeutil.Now()
	}()
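	// NB: because this runs in a defer, lastNodeLagCheck is reset even when we
	// take the early return below, so frequent checkpoints keep postponing the
	// actual lag check (this is the bug described above).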
	log.Infof(sf.Ctx(), "Consider replanning check")
	checkFreq := streamingccl.ReplanFrequency.Get(&sf.FlowCtx.Cfg.Settings.SV)
	maxLag := streamingccl.InterNodeLag.Get(&sf.FlowCtx.Cfg.Settings.SV)
	if checkFreq == 0 || maxLag == 0 || timeutil.Since(sf.lastNodeLagCheck) < checkFreq {
		log.Infof(sf.Ctx(), "Don't replan: maxLag %d; checkFreq %.2f; last node check %s; time since last check %.2f",
			maxLag, checkFreq.Minutes(), sf.lastNodeLagCheck, timeutil.Since(sf.lastNodeLagCheck).Minutes())
		return nil
	}
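
A minimal sketch of that quick fix, reusing the identifiers from the snippet above (the actual patch in #114525 may differ in details): move the timestamp update past the early-return guards so it only advances when a lag check is actually performed.

func (sf *streamIngestionFrontier) maybeCheckForLaggingNodes() error {
	checkFreq := streamingccl.ReplanFrequency.Get(&sf.FlowCtx.Cfg.Settings.SV)
	maxLag := streamingccl.InterNodeLag.Get(&sf.FlowCtx.Cfg.Settings.SV)
	if checkFreq == 0 || maxLag == 0 || timeutil.Since(sf.lastNodeLagCheck) < checkFreq {
		// Bail out without touching lastNodeLagCheck, so a steady stream of
		// checkpoints can no longer starve the check.
		return nil
	}
	// Record the check time only once we are actually going to compute the lag.
	defer func() {
		sf.lastNodeLagCheck = timeutil.Now()
	}()
	// ... compute per-node lag and decide whether to replan ...
	return nil
}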

msbutler added a commit to msbutler/cockroach that referenced this issue Nov 15, 2023
This patch prevents the lastNodeLagCheck time from updating every time the
frontier processor receives a checkpoint, which can happen every few seconds.
Previously, this prevented the node lag replanning check from triggering,
because that timestamp needed to be older than 10 minutes. Instead, the
timestamp should only be updated when we actually perform the lag check.

Fixes cockroachdb#114341

Release note: none
@cockroach-teamcity
Member Author

roachtest.c2c/BulkOps/full failed with artifacts on master @ e19c24fb62d24595e74c0bae0aaad0a736c2bdc7:

(test_runner.go:1141).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.c2c/BulkOps/full failed with artifacts on master @ aa812a63b8023d35b7bd62d12c6c47708f48a817:

(assertions.go:333).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:750
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            				github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            				main/pkg/cmd/roachtest/monitor.go:119
	            				golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            				src/runtime/asm_amd64.s:1650
	Error:      	Received unexpected error:
	            	expected job status succeeded, but got running
	            	(1) attached stack trace
	            	  -- stack trace:
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream.func1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:746
	            	  | github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
	            	  | 	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:213
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).stopReplicationStream
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:728
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:971
	            	  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterToCluster.func1.1
	            	  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1264
	            	  | main.(*monitorImpl).Go.func1
	            	  | 	main/pkg/cmd/roachtest/monitor.go:119
	            	  | golang.org/x/sync/errgroup.(*Group).Go.func1
	            	  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75
	            	  | runtime.goexit
	            	  | 	src/runtime/asm_amd64.s:1650
	            	Wraps: (2) expected job status succeeded, but got running
	            	Error types: (1) *withstack.withStack (2) *errutil.leafError
	Test:       	c2c/BulkOps/full
(require.go:1360).NoError: FailNow called
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/BulkOps/full/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_metamorphicBuild=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Nov 16, 2023
114525: streamingccl: prevent node lag replanning starvation r=stevendanna a=msbutler

This patch prevents the lastNodeLagCheck time from updating every time the frontier processor receives a checkpoint, which can happen every few seconds. Previously, this prevented the node lag replanning check from triggering, because that timestamp needed to be older than 10 minutes. Instead, the timestamp should only be updated when we actually perform the lag check.

Fixes #114341

Release note: none

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@craig craig bot closed this as completed in b61cbb9 Nov 16, 2023
blathers-crl bot pushed a commit that referenced this issue Nov 16, 2023
This patch prevents the lastNodeLagCheck time from updating every time the
frontier processor receives a checkpoint, which can happen every few seconds.
Previously, this prevented the node lag replanning check from triggering,
because that timestamp needed to be older than 10 minutes. Instead, the
timestamp should only be updated when we actually perform the lag check.

Fixes #114341

Release note: none