
roachtest: cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100k/protocol=rangefeed/format=json/sink=null failed #113489

Closed
cockroach-teamcity opened this issue Oct 31, 2023 · 9 comments · Fixed by #113966

@cockroach-teamcity

cockroach-teamcity commented Oct 31, 2023

roachtest.cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100k/protocol=rangefeed/format=json/sink=null failed with artifacts on release-23.2 @ 348d11425a1184ce4de8a6f8dc85995cd3c653bc:

(monitor.go:153).Wait: monitor failure: unexpected node event: n6: cockroach process for system interface died (exit code 15)
test artifacts and logs in: /artifacts/cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100k/protocol=rangefeed/format=json/sink=null/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/cdc


Jira issue: CRDB-33017

Epic CRDB-26372

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cdc labels Oct 31, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Oct 31, 2023
@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Oct 31, 2023
@miretskiy miretskiy self-assigned this Nov 2, 2023
@miretskiy

The node oomed:
[screenshot]

@erikgrinaker
As part of #109346, we changed the client-side semaphore from a fixed one (8) to a rate-based one (100/s). More importantly, we changed the condition under which we release the catchup scan allocation, from

// If we see the first non-empty checkpoint, we know we're done with the catchup scan.
if !t.ResolvedTS.IsEmpty() && active.catchupRes != nil {
  active.catchupRes.Release()
  active.catchupRes = nil
}

to `if active.catchupRes != nil {...`. That is, we now release the allocation as soon as we see any checkpoint, not just a non-empty one.
Do you recall why this was done? I'm under the impression that the empty checkpoint is emitted before any of the catchup events -- it's the first event the range will receive.
As a result, if the consumer is slower than the rate of incoming events, we could wind up in a situation where we keep starting catchup scans even before the previous ones have completed.

I think the goal was to manage the goroutine creation rate, and we didn't consider the indirect impact the fixed-size semaphore had on the (untracked) memory usage in http2/gRPC land.

I wonder if we should add the "non-empty" checkpoint condition back.
Thoughts?
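For illustration, a minimal sketch of the two admission strategies under discussion (not the actual kvcoord code; the names are invented): a fixed semaphore bounds the number of in-flight catchup scans, while a pure rate limiter only paces their starts and leaves the in-flight count -- and thus the untracked buffer memory -- unbounded.

// Sketch only: illustrative types, not CockroachDB's implementation.
import (
	"context"

	"golang.org/x/time/rate"
)

// fixedSem allows at most cap(s) catchup scans in flight; a new scan
// blocks until a previous one releases its slot.
type fixedSem chan struct{}

func (s fixedSem) acquire(ctx context.Context) error {
	select {
	case s <- struct{}{}: // take a slot
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (s fixedSem) release() { <-s } // free the slot

// startLimiter paces scan starts at 100/sec. But if scans (or the data
// parked in their network buffers) outlive the admission interval, the
// number of concurrent scans grows without bound.
var startLimiter = rate.NewLimiter(100, 100)

func admitCatchupScan(ctx context.Context) error {
	return startLimiter.Wait(ctx)
}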

@erikgrinaker

The empty checkpoint is emitted after the catchup scan. Its purpose is precisely to signal to the client that the catchup scan has completed.

@miretskiy

Hmm... then it's rather strange in this case, and I'm confused how we could have used up so much memory in http2 buffers.
We clearly allocated more than 2MB * 100 worth of memory... And because we're doing a catchup scan, we're not running a workload (i.e. we are running with end_time); there shouldn't be any events other than checkpoints coming in after the catchup scan completes.

@miretskiy

We definitely started lots of catchup ranges:
[screenshot]

but it's very strange to see them on node 6... which has no ranges.

@miretskiy

Okay, so there were a few strange things going on. First, the captured metrics seem to indicate that 5 nodes (all but the 6th) were dead -- which sort of makes sense, since after the restart all of the ranges were assigned to node 6.
I don't know if there was some partitioning or whatnot, but the changefeed was restarted:

W231031 12:52:59.162781 5334 ccl/changefeedccl/changefeed_stmt.go:1310 ⋮ [T1,Vsystem,n6,job=‹CHANGEFEED id=913195922462900230›] 127  Changefeed job 913195922462900230 encountered transient error: ‹inbox communication error›: grpc: ‹context canceled› [code 1/Canceled] (attempt 1)

And after that, all ranges were assigned to node 6 -- why is not clear:
[screenshot]

Finally, the OOM. The change to acquire catchup scan quota based on a rate works, I think, exactly as it should.
The catchup scan for each range completes -- and completes quickly, because the data each one produces fits in the network buffers (2MB). So we keep starting more and more rangefeeds, while the consumer -- even though it's writing to a null sink -- cannot keep up with the large fan-in factor; the undrained buffers accumulate, and eventually the node OOMs.
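For scale, a rough back-of-envelope (my numbers, assuming the 2MB per-stream window mentioned above): at 100 admitted catchup scans per second, a few minutes of this loop starts tens of thousands of rangefeed streams against the 100k ranges; if each stream parks up to 2MB of undrained data in http2 buffers, 10,000 such streams alone hold ~20GB of untracked memory -- more than enough to OOM the node.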

We could consider lowering 100/sec to something smaller -- but the reality is that there was always a possibility of OOMs, catchup scan or not: if enough ranges generate enough data and the consumer cannot keep up, regular rangefeeds can cause an OOM. That's why there is mux rangefeed.

I suspect it would be prudent to tune this setting down when using regular rangefeed.

@erikgrinaker

erikgrinaker commented Nov 7, 2023

I think this was a cascading failure.

The first node that started to struggle was n3. At 12:53:00 it saw some kind of transient stall, which caused high Raft commit latencies and a spike in runnable goroutines.

[screenshots]

We also see signs of this in the logs:

I231031 12:52:59.018028 501 kv/kvserver/replica_raft.go:1571 ⋮ [T1,Vsystem,n3,s3,r42/4:‹/Table/{39-40}›,raft] 100701  slow non-blocking raft commit: commit-wait 993.925532ms sem 812ns
I231031 12:53:00.022247 446 kv/kvserver/store_raft.go:699 ⋮ [T1,Vsystem,n3,s3,r4/2:‹/System{/tsd-tse}›,raft] 100709  raft ready handling: 0.99s [append=0.00s, apply=0.99s, , other=0.00s], wrote [apply=913 KiB (4 in 3 batches)]; node might be overloaded
W231031 12:53:00.064912 1974 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n3,rnode=2,raddr=‹10.142.0.177:26257›,class=system,rpc] 911  latency jump (prev avg 0.41ms, current 2039.55ms)

This in turn led to it disconnecting from all RPC peers, likely because it wasn't responding to gRPC pings:

E231031 12:52:59.032693 39 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=?,raddr=‹34.138.110.78:26257›,class=system,rpc] 100703  disconnected (was healthy for 1h8m15.801s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E231031 12:53:00.065741 285 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=1,raddr=‹10.142.0.214:26257›,class=default,rpc] 100755  disconnected (was healthy for 1h8m16.814s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E231031 12:53:00.067112 1871 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=2,raddr=‹10.142.0.177:26257›,class=default,rpc] 100763  disconnected (was healthy for 1h8m14.548s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E231031 12:53:00.068378 1855 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=4,raddr=‹10.142.0.104:26257›,class=default,rpc] 100765  disconnected (was healthy for 1h8m14.577s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E231031 12:53:00.069328 537 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=1,raddr=‹10.142.0.214:26257›,class=system,rpc] 100768  disconnected (was healthy for 1h8m16.835s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E231031 12:53:00.069540 2116 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=5,raddr=‹10.142.0.113:26257›,class=system,rpc] 100770  disconnected (was healthy for 1h8m13.402s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E231031 12:53:00.072918 6597153 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=6,raddr=‹10.142.1.180:26257›,class=default,rpc] 100772  disconnected (was healthy for 18m14.231s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]

When this happens, n6's changefeed restarts:

W231031 12:52:59.162781 5334 ccl/changefeedccl/changefeed_stmt.go:1310 ⋮ [T1,Vsystem,n6,job=‹CHANGEFEED id=913195922462900230›] 127  Changefeed job 913195922462900230 encountered transient error: ‹inbox communication error›: grpc: ‹context canceled› [code 1/Canceled] (attempt 1)

Note that at this point, n6 is still connected to all other RPC nodes except n3. However, it still schedules all changefeed processors locally:

[screenshot]

At this point, n6 is slammed by traffic (probably checkpoints?).

[screenshot]

20 seconds later, n6 is so overwhelmed that it loses RPC connections with all other peers since it can't respond to gRPC heartbeats.

E231031 13:00:19.555282 6712057 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n1,rnode=6,raddr=‹10.142.1.180:26257›,class=default,rpc] 103881  disconnected (was healthy for 29m1.404s): grpc: ‹error reading from server: read tcp 10.142.0.214:57394->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 13:00:19.556763 6711980 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n1,rnode=6,raddr=‹10.142.1.180:26257›,class=system,rpc] 103883  disconnected (was healthy for 29m1.432s): grpc: ‹error reading from server: read tcp 10.142.0.214:57378->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 13:00:19.554896 6406128 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n2,rnode=6,raddr=‹10.142.1.180:26257›,class=default,rpc] 100127  disconnected (was healthy for 28m1.354s): grpc: ‹error reading from server: read tcp 10.142.0.177:51436->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 13:00:19.555154 6234295 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n2,rnode=6,raddr=‹10.142.1.180:26257›,class=system,rpc] 100128  disconnected (was healthy for 29m1.419s): grpc: ‹error reading from server: read tcp 10.142.0.177:40926->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 12:53:00.072918 6597153 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=6,raddr=‹10.142.1.180:26257›,class=default,rpc] 100772  disconnected (was healthy for 18m14.231s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E231031 13:00:19.554872 6597153 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=6,raddr=‹10.142.1.180:26257›,class=default,rpc] 100925  disconnected (was healthy for 7m19.481s): grpc: ‹error reading from server: read tcp 10.142.1.181:58698->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 13:00:19.556042 6277791 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=6,raddr=‹10.142.1.180:26257›,class=system,rpc] 100926  disconnected (was healthy for 29m1.425s): grpc: ‹error reading from server: read tcp 10.142.1.181:44466->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 13:00:19.554924 7063443 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n4,rnode=6,raddr=‹10.142.1.180:26257›,class=default,rpc] 99954  disconnected (was healthy for 25m33.715s): grpc: ‹error reading from server: read tcp 10.142.0.104:35132->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 13:00:19.555191 6745845 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n4,rnode=6,raddr=‹10.142.1.180:26257›,class=system,rpc] 99955  disconnected (was healthy for 29m1.419s): grpc: ‹error reading from server: read tcp 10.142.0.104:38034->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 13:00:19.554976 7029439 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n5,rnode=6,raddr=‹10.142.1.180:26257›,class=default,rpc] 101430  disconnected (was healthy for 25m33.713s): grpc: ‹error reading from server: read tcp 10.142.0.113:60984->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
E231031 13:00:19.555238 6705254 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n5,rnode=6,raddr=‹10.142.1.180:26257›,class=system,rpc] 101431  disconnected (was healthy for 29m1.325s): grpc: ‹error reading from server: read tcp 10.142.0.113:50134->10.142.1.180:26257: read: connection reset by peer› [code 14/Unavailable]
  1. Why did losing contact with a single node restart the entire changefeed? Is this because DistSQL has to restart the entire flow if it loses contact with a single node?

  2. When we restart the flow, why do we schedule it on the local node? n6 was still connected to all other nodes except n3 at this time, and could have started the flows there.

  3. Can we reproduce this simply by disconnecting (e.g. pausing) a single node while a changefeed is running?

@miretskiy

Why did losing contact with a single node restart the entire changefeed? Is this because DistSQL has to restart the entire flow if it loses contact with a single node?

Yes.

When we restart the flow, why do we schedule it on the local node? n6 was still connected to all other nodes except n3 at this time, and could have started the flows there.

No, that shouldn't happen -- but we use DistSQL to plan the flow, and if that information is out of whack on n6, we can get into this mode (we have zero observability here -- something that should be fixed).

Can we reproduce this simply by disconnecting (e.g. pausing) a single node while a changefeed is running?

Perhaps we can.

@miretskiy

We are going to reintroduce the catchup semaphore (in addition to the rate limiter) for regular rangefeeds.
This will need to be backported; the PR will be sent out soon.
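Roughly, the intended shape (a hedged sketch of the fix, not the actual PR code; names are illustrative) is to gate each catchup scan start on both limits:

// Sketch: pace scan starts with the rate limiter, then take a slot from a
// fixed semaphore so the number of in-flight scans (and their untracked
// network-buffer memory) stays bounded.
func acquireCatchupAlloc(ctx context.Context, lim *rate.Limiter, sem chan struct{}) (release func(), _ error) {
	if err := lim.Wait(ctx); err != nil { // rate limit: smooth the start rate
		return nil, err
	}
	select {
	case sem <- struct{}{}: // hard limit: bound concurrency
		return func() { <-sem }, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}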

@miretskiy

I just confirmed that we got a pretty bad plan when the changefeed restarted:

{
  "sql": "job:913195922462900230",
  "nodeNames": [
    "6"
  ],
  "processors": [
    {
      "nodeIdx": 0,
      "inputs": [],
      "core": {
        "title": "ChangeAggregator/0",
        "details": [
          "/Table/106/{1-2}"
        ]
      },
      "outputs": [],
      "stage": 1,
      "processorID": 0
    },
    {
      "nodeIdx": 0,
      "inputs": [],
      "core": {
        "title": "ChangeFrontier/1",
        "details": [
          "Out: @2,@3,@4"
        ]
      },
      "outputs": [],
      "stage": 2,
      "processorID": 1
    },
    {
      "nodeIdx": 0,
      "inputs": [],
      "core": {
        "title": "Response",
        "details": []
      },
      "outputs": [],
      "stage": 0,
      "processorID": -1
    }
  ],
  "edges": [
    {
      "sourceProc": 0,
      "sourceOutput": 0,
      "destProc": 1,
      "destInput": 0,
      "streamID": 0
    },
    {
      "sourceProc": 1,
      "sourceOutput": 0,
      "destProc": 2,
      "destInput": 0,
      "streamID": 0
    }
  ],
  "flow_id": "36459508-8538-40de-997f-5e600240b460",
  "flags": {
    "ShowInputTypes": false,
    "MakeDeterministic": false
  }
}

craig bot pushed a commit that referenced this issue Nov 8, 2023
113966: kvcoord: Reintroduce catchup scan semaphore for regular rangefeed r=miretskiy a=miretskiy

Re-introduce the catchup scan semaphore limit, removed by #110919, for regular rangefeed. This hard limit on the number of catchup scans is necessary to avoid OOMs when handling rangefeeds with a large fan-in factor that execute many non-local ranges.

Fixes #113489

Release note: None

114000: colfetcher: disable metamorphic randomization for direct scans r=yuzefovich a=yuzefovich

This commit makes it so that we no longer -- for now -- use metamorphic randomization for the default value of the `sql.distsql.direct_columnar_scans.enabled` cluster setting, which controls whether direct columnar scans (aka "KV projection pushdown") are enabled. It appears that we might be missing some memory accounting in the local fast path of this feature, and some backup-related roachtests run into OOMs on binaries with "enabled assertions". Disabling this metamorphization for now seems a good way to silence failures from this now-known issue.

Informs: #113816

Epic: None

Release note: None

114026: kvnemesis: bump default steps to 100 r=erikgrinaker a=erikgrinaker

50 steps is usually too small to trigger interesting behaviors. Bump it to 100, which is still small enough to be easily debuggable.

The nightlies already run with 1000 steps.

Epic: none
Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@craig craig bot closed this as completed in d870c93 Nov 8, 2023
blathers-crl bot pushed a commit that referenced this issue Nov 8, 2023
Re-introduce the catchup scan semaphore limit, removed by #110919, for regular rangefeed. This hard limit on the number of catchup scans is necessary to avoid OOMs when handling rangefeeds with a large fan-in factor that execute many non-local ranges.

Fixes #113489

Release note: None