server: `use of closed network connection` when quiescing #98523

renatolabs · 2023-03-13T20:26:39Z

During a (likely unrelated to this failure) roachtest, a node is stopped gracefully and restarted with a new binary. In the process of terminating the node, the following panic was observed:

@server/server_http.go:227 ⋮ [T1,n1] 280  close tcp ‹[::]:26258›: ‹use of closed network connection›
@server/server_http.go:227 ⋮ [T1,n1] 280 !goroutine 150 [running]:
@server/server_http.go:227 ⋮ [T1,n1] 280 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x1)
@server/server_http.go:227 ⋮ [T1,n1] 280 ! github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x89
@server/server_http.go:227 ⋮ [T1,n1] 280 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc000816420, {{{0xc00b004a20, 0x24}, {0x0, 0x0}, {0x5a5ba7a, 0x1}, {0x5a5ba7a, 0x1}}, 0x174c0a264829ca96, ...})
@server/server_http.go:227 ⋮ [T1,n1] 280 ! github.com/cockroachdb/cockroach/pkg/util/log/clog.go:262 +0x97
@server/server_http.go:227 ⋮ [T1,n1] 280 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x6ee4260, 0xc000e0ef00}, 0x2, 0x4, 0x1, 0x0?, {0x59891a6, 0x2}, {0xc001671f10, 0x1, ...})
@server/server_http.go:227 ⋮ [T1,n1] 280 ! github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x645
@server/server_http.go:227 ⋮ [T1,n1] 280 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
@server/server_http.go:227 ⋮ [T1,n1] 280 ! github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
@server/server_http.go:227 ⋮ [T1,n1] 280 !github.com/cockroachdb/cockroach/pkg/util/log.loggerOps.Fatalf(...)
@server/server_http.go:227 ⋮ [T1,n1] 280 ! github.com/cockroachdb/cockroach/bazel-out/k8-fastbuild/bin/pkg/util/log/log_channels_generated.go:1378
@server/server_http.go:227 ⋮ [T1,n1] 280 !github.com/cockroachdb/cockroach/pkg/server.startHTTPService.func1({0x6ee4260, 0xc000e0ef00})
@server/server_http.go:227 ⋮ [T1,n1] 280 ! github.com/cockroachdb/cockroach/pkg/server/server_http.go:227 +0xbf
@server/server_http.go:227 ⋮ [T1,n1] 280 !github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
@server/server_http.go:227 ⋮ [T1,n1] 280 ! github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 +0x146
@server/server_http.go:227 ⋮ [T1,n1] 280 !created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
@server/server_http.go:227 ⋮ [T1,n1] 280 ! github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:461 +0x43b

For reference, this happened while running a roachtest that has not been merged to master yet [1], but it's quite likely this crash is unrelated to that. A similar failure has also been observed in a different roachtest [2], although we don't see the same panic logs in that case [2].

Apologies for the lack of artifacts here; I have them but they are too large for a GitHub attachment. I can upload them somewhere as needed.

@cockroachdb/obs-inf-prs @cockroachdb/server

[1] #96991
[2] #98436

Jira issue: CRDB-25308

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-03-13T21:21:50Z

Hi @dhartunian, please add branch-* labels to identify which branch(es) this release-blocker affects.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

dhartunian · 2023-03-13T21:33:54Z

@knz can you check my understanding here.

First, it seems like the error here doesn't need to call Fatalf. Other places where we close connections in response to shouldQuiesce we log a warning:

cockroach/pkg/server/server_sql.go

Lines 1671 to 1683 in 311c452

    
           waitQuiesce := func(ctx context.Context) { 
        
           	<-stopper.ShouldQuiesce() 
        
           	// NB: we can't do this as a Closer because (*Server).ServeWith is 
        
           	// running in a worker and usually sits on accept() which unblocks 
        
           	// only when the listener closes. In other words, the listener needs 
        
           	// to close when quiescing starts to allow that worker to shut down. 
        
           	if err := unixLn.Close(); err != nil { 
        
           		log.Ops.Warningf(ctx, "closing unix socket: %v", err) 
        
           	} 
        
           	if err := socketLock.Unlock(); err != nil { 
        
           		log.Ops.Warningf(ctx, "removing unix socket lock: %v", err) 
        
           	} 
        
           }

Second, connManager.Serve() will itself close the connection in a defer. If we don't want to downgrade the log to a Warningf, then is there a way to guarantee that the connection isn't closed during quiesce? Should I check in advance of closing it? That seems odd too.

knz · 2023-03-13T22:14:47Z

This can be a warning.

nicktrav · 2023-03-14T14:58:42Z

I hit this again in a manual roachtest run as part of #97428.

  |   | Error: ERROR: result is ambiguous: error=ba: Put [/Table/112/1/-4732293137230184541/0,/Min), EndTxn(parallel commit) [/Table/112/1/-4732293137230184541/0], [txn: 428b963d], [can-forward-ts] RPC error: grpc: error reading from server: read tcp 10.142.0.54:51128->10.142.0.102:26257: use of closed network connection [code 14/Unavailable] [propagate] (SQLSTATE 40003)
  |   | COMMAND_PROBLEM: ssh verbose log retained in ssh_075738.008347902_n10_cockroach-workload-r.log: exit status 1

95789: pkg/util/log: don't falsify tenant ID tag in logs if none in ctx r=andreimatei a=abarganier Previously, I made the decision to always tag a log entry with a tenant ID, even if no tenant ID was found in the context associated with the log entry. In this case, the system tenant ID was used in the tag, instead of omitting a tenant ID tag altogether. I received some feedback that this is confusing. For example, imagine testing a feature, expecting log entries to come from a secondary tenant, and the context being used in that feature is not annotated with a tenant ID. With the previous behavior, the log entry would default to being tagged with the system tenant ID instead of having empty tags (or at least, no tenant ID tag). In this scenario, how do I tell the actual state of the log entry? Did the log entry indeed come from a goroutine belonging to the system tenant? Or was the context just missing the tenant ID annotation, but otherwise came from the correct tenant? This ambiguity is not helpful. By falsifying a tenant ID tag we confuse the log reader about the actual state of the system. Furthermore, our eventual goal should be that almost no context objects in the system exist without a tenant ID (except for perhaps at startup before tenant initialization). Tagging with the system tenant ID in the case of a missing tenant ID annotation in the context makes it difficult to track down offending context objects. This patch removes this default behavior from the logging package. Now, if no tenant ID is found in the context, we do not tag the entry with a tenant ID. Note however that on the *decode* side, we will maintain this default tenant ID tagging behavior. If a log entry does not have a tenant ID tag, then we must assume that only the system tenant has privilege to view said log entry, since the owner is ambiguous. Release note: none Epic CRDB-14486 98175: cdc: show all changefeed jobs in `SHOW CHANGEFEED JOBS` r=HonoreDB a=jayshrivastava ### cdc: show all changefeed jobs in SHOW CHANGEFEED JOBS Release note (general change): Previously, the output of `SHOW CHANGEFEED JOBS` was limited to show unfinished jobs and finished jobs from the last 14 days. This change makes the command show all changefeed jobs, regardless of if they finished and when they finished. Note that jobs still obey the cluster setting `jobs.retention_time`. Completed jobs older than that time are deleted. Fixes: #97883 ### jobs: add virtual index for job_type in crdb_internal.jobs This change adds a virtual index on the `job_type` column of `crdb_internal.jobs`. This change should make queries on that table which filter on job type (such as `SHOW CHANGEFEED JOBS`) more efficient. Release note: None Epic: None 98515: kvserver: deflake test store capacity after split r=andrewbaptist a=kvoli This commit defales `TestStoreCapacityAfterSplit`. Previously it was possible for the replica load stats which underpins Capacity to be reset. The reset caused the recording duration to fall short of min stats duration, which led to a 0 value being reported for writes in store capacity. This commit bumps the manual clock twice and removes redundant leaseholder checks within a retry loop. The combination of these two changes makes the test much less likely to flake. The test is now unskipped. ``` dev test pkg/kv/kvserver -f TestStoreCapacityAfterSplit -v --stress ... 4410 runs so far, 0 failures, over 6m10s ``` Resolves: #92677 Release note: None 98521: ui: don't continue polling endpoints that return 403 errors r=dhartunian a=abarganier It was brought to our attention that endpoints such as `v1/settings` would continue to be polled by DB Console even if they returned 403 errors. If an endpoint returns 403 errors, we should not continue to poll it since the required access is not present for the current user. This patch updates the polling mechanism to short-circuit the `refresh` process if a 403 error is encountered throughout the lifecycle of the poller. Release note: none Fixes: #98356 98536: kvserver: deflake learner joint cfg relocate range r=andrewbaptist a=kvoli Previously, in `TestLearnerOrJointConfigAdminRelocateRange` it was possible for there to be an in-flight snapshot towards a learner, prior to sending `AdminRelocateRange` command. When this occurred, the test would fail as `AdminRelocateRange` returns an error when finding any in-flight snapshots to learners. This situation occurred infrequently, causing the test to flake. This commit updates the `TestLearnerOrJointConfigAdminRelocateRange` test to first assert that there are the expected number of learners, then assert that there are no in-flight snapshots towards learners before beginning the main testing logic. The test is now unskipped. ``` dev test pkg/kv/kvserver \ -f TestLearnerOrJointConfigAdminRelocateRange \ -v --stress ... 5652 runs so far, 0 failures, over 12m30s ``` Resolves: #95500 Release note: None 98542: storage: remove MVCCIterator.Key method r=jbowens a=jbowens The MVCCIterator interface previously exposed two methods for accessing the current iterator postion as a MVCC key—UnsafeKey and Key. Key() was equivalent to UnsafeKey().Clone(). This commit removes the Key() variant, pushing the onus of key copying onto the caller. This reduces the interface surface area, avoids accidental key copying (some of which is addressed within this commit), and does not impose any unreasonable burden on callers. Epic: None Informs #82589. Release note: None 98543: allocator: fix lease io enforcement setting typo r=andrewbaptist a=kvoli This commit updates the "do nothing" lease IO overload enforcement (`kv.allocator.lease_io_overload_threshold_enforcement`) setting to be correctly spelled "ignore" rather than "ingore". Part of: #96508 Release note (ops change): The `kv.allocator.lease_io_overload_threshold_enforcement` setting value which disables enforcement is updated to be spelled correctly as "ignore" rather than "ingore". 98600: server: change conn close error to warning r=knz,abarganier a=dhartunian Resolves: #98523 Epic: None Release note: None Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com> Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: Austen McClernon <austen@cockroachlabs.com> Co-authored-by: Jackson Owens <jackson@cockroachlabs.com> Co-authored-by: David Hartunian <davidh@cockroachlabs.com>

Resolves: #98523 Epic: None Release note: None

renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf labels Mar 13, 2023

blathers-crl bot added the A-observability-inf label Mar 13, 2023

renatolabs mentioned this issue Mar 13, 2023

roachtest: clearrange/checks=false/rangeTs=true failed #98436

Closed

dhartunian added T-multitenant Issues owned by the multi-tenant virtual team GA-blocker labels Mar 13, 2023

dhartunian added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Mar 13, 2023

dhartunian self-assigned this Mar 13, 2023

dhartunian mentioned this issue Mar 14, 2023

server: change conn close error to warning #98600

Merged

craig bot closed this as completed in 75e6c31 Mar 14, 2023

craig bot closed this as completed in #98600 Mar 14, 2023

exalate-issue-sync bot removed the T-observability-inf label Mar 14, 2023

renatolabs mentioned this issue Mar 29, 2023

roachtest: kv95/enc=false/nodes=3/cpu=32/size=64kb/no-admission failed #99610

Closed

blathers-crl bot mentioned this issue Apr 5, 2023

release-22.2: server: change conn close error to warning #100772

Merged

blathers-crl bot pushed a commit that referenced this issue Apr 5, 2023

server: change conn close error to warning

d773e9e

Resolves: #98523 Epic: None Release note: None

dhartunian added a commit that referenced this issue Apr 11, 2023

server: change conn close error to warning

408f841

Resolves: #98523 Epic: None Release note: None

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: `use of closed network connection` when quiescing #98523

server: `use of closed network connection` when quiescing #98523

renatolabs commented Mar 13, 2023 •

edited

Loading

blathers-crl bot commented Mar 13, 2023

dhartunian commented Mar 13, 2023

knz commented Mar 13, 2023

nicktrav commented Mar 14, 2023

server: use of closed network connection when quiescing #98523

server: use of closed network connection when quiescing #98523

Comments

renatolabs commented Mar 13, 2023 • edited Loading

blathers-crl bot commented Mar 13, 2023

dhartunian commented Mar 13, 2023

knz commented Mar 13, 2023

nicktrav commented Mar 14, 2023

server: `use of closed network connection` when quiescing #98523

server: `use of closed network connection` when quiescing #98523

renatolabs commented Mar 13, 2023 •

edited

Loading