kv: log slow client requests with replica information #114431

nvanbenschoten · 2023-11-14T18:59:30Z

We currently log slow/hanging requests from the client at the level of a range, but not at the level of a replica:

cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go

Lines 1922 to 1941 in ce3f78b

    
           const slowDistSenderThreshold = time.Minute 
        
           if dur := timeutil.Since(tBegin); dur > slowDistSenderThreshold && !tBegin.IsZero() { 
        
           	{ 
        
           		var s redact.StringBuilder 
        
           		slowRangeRPCWarningStr(&s, ba, dur, attempts, routingTok.Desc(), err, reply) 
        
           		log.Warningf(ctx, "slow range RPC: %v", &s) 
        
           	} 
        
           	// If the RPC wasn't successful, defer the logging of a message once the 
        
           	// RPC is not retried any more. 
        
           	if err != nil || reply.Error != nil { 
        
           		ds.metrics.SlowRPCs.Inc(1) 
        
           		defer func(tBegin time.Time, attempts int64) { 
        
           			ds.metrics.SlowRPCs.Dec(1) 
        
           			var s redact.StringBuilder 
        
           			slowRangeRPCReturnWarningStr(&s, timeutil.Since(tBegin), attempts) 
        
           			log.Warningf(ctx, "slow RPC response: %v", &s) 
        
           		}(tBegin, attempts) 
        
           	} 
        
           	tBegin = time.Time{} // prevent reentering branch for this RPC 
        
           }

As a result, it can be difficult to determine which replica a request was executing on when it got stuck. Ex. #112373 (comment).

We should push similar logging into DistSender.sendToReplicas, surrounding the call to transport.SendNext.

Jira issue: CRDB-33510

Epic CRDB-34227

The text was updated successfully, but these errors were encountered:

Previously, slow requests were only logged at the range level, but the logs did not indicate which replica is slow. Moreover, the SlowRPC metric attempted to represent the number of requests currently being retried, however it was done on the range level and therefore missed a second level of replica-level retries being done underneath. This PR adds logging on the replica level, removes a confusing log line, and changes the metric to count the number of slow requests in a simpler manner. Epic: https://cockroachlabs.atlassian.net/browse/CRDB-33510 Fixes: cockroachdb#114431

117117: kv: log slow requests on replica level in addition to range level r=shralex a=shralex Previously, slow requests were only logged at the range level, but the logs did not indicate which replica is slow. Moreover, the SlowRPC metric attempted to represent the number of requests currently being retried, however it was done on the range level and therefore missed a second level of replica-level retries being done underneath. This PR adds logging on the replica level, removes a confusing log line, and changes the metric to count the number of slow requests in a simpler manner. Epic: https://cockroachlabs.atlassian.net/browse/CRDB-33510 Fixes: #114431 117693: changefeedccl: add observability metrics into sarama code r=rharding6373 a=wenyihu6 Now that this patch (#117544) has been merged, sarama now acknowledges and reacts to kafka server's throttling messages by slowing down. To provide better observability into sarama code, this patch adds a metrics registry interceptor and a new metrics `changefeed.kafka_throttling_hist_nanos` which tracks time (in nanos) spent in sarama's throttling when cockroachdb exceed the kafka quota. Fixes: #117618 Release note: changefeed.kafka_throttling_hist_nanos has now been added to metrics to monitor sarama throttling behavior resulting from exceeding kafka quota. 118372: sql: fix flake in TestTxnContentionEventsTable r=yuzefovich a=michae2 In causeContention we deliberately hold a transaction open using pg_sleep to block an update statement. The timing we're trying to achieve is: 1. transaction insert 2. update starts and blocks 3. transaction held open using pg_sleep We were using a WaitGroup to order (2) after (1), but there was no synchronization to ensure (3) came after (2). This commit adds a retry loop that checks `crdb_internal.cluster_queries` to ensure (3) comes after (2). Fixes: #118236 Release note: None 118760: builtins: allow VIEWACTIVITY priv to use crdb_internal.request_statem… r=xinhaoz a=xinhaoz …ent_bundle Previously only those with the VIEWACTIVITY role could use the crdb_internal.request_statement_bundle builtin. We should allow the VIEWACTIVITY privilege as well since role options are now deprecated. This allow also allow stmt bundle requests to be made from db-console for users with this granted privilege. Epic: none Fixes: #118759 Release note (bug fix): Those with VIEWACTIVITY privilege can now request statement bundles using crdb_internal.requets_statement_bundle or via db-console's sql activity page. 118767: release: confirm yum install r=celiala a=rail This adds `-y` flag to install `yum` without user prompt. Epic: none Release note: None 118789: jobs,application_api: replace calls to `skip.Stress` with `skip.Duress` r=celiala a=rickystewart `skip.Duress()` seems like it should have been used in this case as it gives more time under both `race` and `deadlock`. This will give these tests some extra time if they run in a heavyweight configuration but not "under stress". Epic: CRDB-8308 Release note: None 118792: kvfollowerreadsccl: skip test under `race` not `stressrace` r=celiala a=rickystewart Epic: CRDB-8308 Release note: None 118797: bincheck: do not run geos tests on Windows r=celiala a=rail In #106642 we stopped shipping libgeos on Windows, but didn't update the bincheck test to reflect the change. Epic: none Release note: None Co-authored-by: shralex <shralex@gmail.com> Co-authored-by: Wenyi Hu <wenyi@cockroachlabs.com> Co-authored-by: Michael Erickson <michae2@cockroachlabs.com> Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com> Co-authored-by: Rail Aliiev <rail@iqchoice.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>

Previously, slow requests were only logged at the range level, but the logs did not indicate which replica is slow. Moreover, the SlowRPC metric attempted to represent the number of requests currently being retried, however it was done on the range level and therefore missed a second level of replica-level retries being done underneath. This PR adds logging on the replica level, removes a confusing log line, and changes the metric to count the number of slow requests in a simpler manner. Epic: https://cockroachlabs.atlassian.net/browse/CRDB-33510 Fixes: cockroachdb#114431

nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-observability T-kv KV Team E-quick-win Likely to be a quick win for someone experienced. labels Nov 14, 2023

nvanbenschoten mentioned this issue Nov 14, 2023

roachtest: backup-restore/mixed-version failed #112373

Closed

shralex mentioned this issue Dec 27, 2023

kv: log slow requests on replica level in addition to range level #117117

Merged

exalate-issue-sync bot assigned shralex Jan 2, 2024

craig bot closed this as completed in b034714 Feb 6, 2024

github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Closed in KV Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: log slow client requests with replica information #114431

kv: log slow client requests with replica information #114431

nvanbenschoten commented Nov 14, 2023 •

edited by exalate-issue-sync bot

Loading

kv: log slow client requests with replica information #114431

kv: log slow client requests with replica information #114431

Comments

nvanbenschoten commented Nov 14, 2023 • edited by exalate-issue-sync bot Loading

nvanbenschoten commented Nov 14, 2023 •

edited by exalate-issue-sync bot

Loading