
sql: only create real spans when session tracing/sampling #61777

Merged: 2 commits merged into cockroachdb:master from 210309.tpccbench-stability on Mar 13, 2021

Conversation

@irfansharif (Contributor) commented Mar 10, 2021

This drastically reduces the memory overhead for tracing we're observing
in #59424. This commit does a few disparate things to make it happen:

  1. We now access the tracing span through txnState.Ctx exclusively. This
    gives us a single point to hijack, which we'll later do. By default,
    txns are initialized with a no-op span. If session tracing is later
    enabled, we'll create a real (verbose) span and swap it in for the
    txn's no-op one (see the sketch below). This gives us the same semantics
    as earlier, and on the plus side, we're no longer re-using the same
    tracing span when session tracing is toggled.
  2. Harden tracing methods to work with no-op spans, specifically
    GetRecording and TraceID.
  3. Remove a crash vector through crdb_internal.trace_id. It was
    previously reaching into the first recording to retrieve a trace ID, but
    it's not guaranteed that recordings are non-empty. This could be used to
    induce panics in the server.

This PR will need to get backported to 21.1. Fixes #59424.

Release note: None


+cc @cockroachdb/kv-east.
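To make point 1 concrete, here's a minimal sketch of the swap, with illustrative variable names and assuming a StartSpanCtx-style helper on the tracer; WithForceRealSpan and SetVerbose are the actual tracing options this diff uses when session tracing is turned on.

// Sketch only: by default the txn's ctx carries a no-op span. When session
// tracing is toggled on, start a real (verbose) span and hijack the txn's
// ctx so subsequent statements record into it.
newTxnCtx, sp := tracer.StartSpanCtx(st.ex.state.Ctx, "session tracing",
	tracing.WithForceRealSpan())
sp.SetVerbose(true)
st.ex.state.Ctx = newTxnCtx // swap out the ctx carrying the no-op span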

@cockroach-teamcity (Member):

This change is Reviewable

@asubiotto (Contributor):

Looks like the failure is an out-of-bounds (OOB) access on this line:

return tree.NewDInt(tree.DInt(sp.GetRecording()[0].TraceID)), nil

A non-nil span that was expected to have a recording has none.
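A guarded version of that lookup would look roughly like the sketch below; this is schematic rather than the exact fix (per point 2 of the PR description, the PR instead hardens GetRecording and TraceID themselves), and the NULL fallback is an assumption.

// Schematic: never index into a recording without checking that it's
// non-empty; no-op spans record nothing, so GetRecording() can be empty.
rec := sp.GetRecording()
if len(rec) == 0 {
	return tree.DNull, nil // assumed fallback when there's nothing recorded
}
return tree.NewDInt(tree.DInt(rec[0].TraceID)), nil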

@asubiotto (Contributor):

Looking just at the testrace failure, it looks like it's hitting a testing assertion:

panic: unhandled field in local meta or all fields are nil

goroutine 516368 [running]:
github.com/cockroachdb/cockroach/pkg/sql/execinfrapb.LocalMetaToRemoteProducerMeta(0x7fb4fe0, 0xc0c7050750, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfrapb/data.go:317 +0x5d9
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).sendMetadata(0xc0b69d5860, 0x7fb4fe0, 0xc0c7050750, 0x7f8a255f57d0, 0xc0b1784e40, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:312 +0x3be
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).runWithStream(0xc0b69d5860, 0x7fb4fe0, 0xc0c7050750, 0x7f8a255f57d0, 0xc0b1784e40, 0xc0b3605bf0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:351 +0x18d
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).Run(0xc0b69d5860, 0x7fb4f20, 0xc0a1c60dc0, 0x7f1ad40, 0xc0b72fd620, 0x88c8b94d00000001, 0x8c0168ba99459d87, 0xce1e8947, 0xd, 0xc0b3605bf0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:194 +0x431
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).setupRemoteOutputStream.func1(0x7fb4f20, 0xc0a1c60dc0, 0xc0b36056f0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:663 +0x255
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreatorHelper).accumulateAsyncComponent.func1.1(0xc0c70504e0, 0x7fb4f20, 0xc0a1c601c0, 0xc0b36056f0, 0xc0b5b69520)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:1291 +0x52
created by github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreatorHelper).accumulateAsyncComponent.func1
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:1290 +0x87

My first guess is that a ProducerMetadata object is created somewhere and a component tries to set the TraceData field with the assumption that it will be non-nil, but your change breaks that assumption.

@irfansharif (Contributor, author):

Heh, I saw your comment too late. I was just about to ask for your (or @yuzefovich's) second pair of eyes. I don't think the colflow outbox code correctly deals with the possibility that the span available to it is a no-op one.

https://github.com/cockroachdb/cockroach/pull/61777/files#diff-d71b6d7d7e13687ca705600b9ecb837ed0f445f80ceb78a8b939504e0d6caa36R757-R763

What do you suppose should be done? Does it make sense to force real span creation at outboxes? I suppose not, which is exactly the behavior #61380 looks to change.

@irfansharif (Contributor, author):

case *RemoteProducerMetadata_TraceData_:
meta.TraceData = v.TraceData.CollectedSpans

This is where the empty TraceData is being set.

@irfansharif (Contributor, author):

Looks like we were willing to propagate empty traces before. We'll need the following diff, I think:

diff --git a/pkg/sql/colflow/colrpc/outbox.go b/pkg/sql/colflow/colrpc/outbox.go
index f9395956b0..b97d494943 100644
--- a/pkg/sql/colflow/colrpc/outbox.go
+++ b/pkg/sql/colflow/colrpc/outbox.go
@@ -298,7 +298,7 @@ func (o *Outbox) sendMetadata(ctx context.Context, stream flowStreamClient, errT
                        o.span.RecordStructured(s)
                }
        }
-       if trace := execinfra.GetTraceData(ctx); trace != nil {
+       if trace := execinfra.GetTraceData(ctx); trace != nil && len(trace) != 0 {
                msg.Data.Metadata = append(msg.Data.Metadata, execinfrapb.RemoteProducerMetadata{
                        Value: &execinfrapb.RemoteProducerMetadata_TraceData_{
                                TraceData: &execinfrapb.RemoteProducerMetadata_TraceData{
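As an aside on the diff above: in Go, the length of a nil slice is already zero, so the len(trace) != 0 check on its own would also cover the nil case; the explicit nil comparison is just belt-and-suspenders. A tiny self-contained illustration:

package main

import "fmt"

func main() {
	var trace []string // nil slice, standing in for the real trace-data type
	fmt.Println(trace == nil, len(trace) == 0) // prints: true true
}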

@irfansharif force-pushed the 210309.tpccbench-stability branch 2 times, most recently from 7159e29 to b5f52f9, on March 11, 2021 21:27
@yuzefovich (Member):

I don't think the colflow outbox code correctly deals with the possibility that the span available to it is a no-op one.

I'm not sure what to say here yet, but I'd encourage you to take a look at modifying execinfra.ProcessorSpan and execinfra.GetTraceData methods rather than adjusting different components (like colrpc.Outbox) in isolation since those methods are used by most components in the execution flows (the exception is ProcessorBase.moveToTrailingMeta where we call GetRecording explicitly).

What do you suppose should be done? Does it make sense to force real span creation at outboxes? I suppose not, which is exactly the behavior #61380 looks to change.

Also, I'm not quite sure how remote child spans interact with root spans, but I imagine we do need to have "real" spans in the outboxes when the execution stats are being collected. Yeah, #61380 was trying to avoid doing that, but also to fix the misuse of the API (where we started and stopped spans only for the purposes of stats propagation). Overall, I think I might not have the right terminology and understanding of the impact here; possibly Alfonso will have some suggestions/clarifications.

@irfansharif force-pushed the 210309.tpccbench-stability branch 2 times, most recently from f721a58 to 2b70653, on March 12, 2021 14:54
@irfansharif changed the title from "[wip] sql: only create real spans when already recording" to "sql: only create real spans when session tracing/sampling" on Mar 12, 2021
@irfansharif requested review from yuzefovich, asubiotto, and tbg and removed the request for yuzefovich on March 12, 2021 14:55
@irfansharif (Contributor, author):

This should now be ready for a real look. It's a surprisingly small diff, but it was a bit unwieldy to get right (most of sql is unfamiliar to me). Given we're hoping to backport this to release-21.1, we should give it a close look and maybe let it bake for a few days.

@tbg (Member) left a comment:

Looks good to me, but I'm not an expert on the SQL side, so my LGTM is really just for the tracing package changes.

Reviewed 1 of 1 files at r1, 11 of 11 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)

@asubiotto (Contributor) left a comment:

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @yuzefovich)


pkg/sql/exec_util.go, line 1656 at r2 (raw file):

			"session tracing", tracing.WithForceRealSpan())
		sp.SetVerbose(true)
		st.ex.state.Ctx = newTxnCtx

Does it make sense to use a ctxHolder here instead of overwriting the ctx directly?


pkg/sql/instrumentation.go, line 171 at r1 (raw file):

	if sp := tracing.SpanFromContext(ctx); sp != nil {
		ih.sp = sp
		ih.finishSpan = false

I don't think we want to remove finishSpan here. cc @yuzefovich. The case I'm thinking about specifically is when statement tracing threshold is enabled. A span is created at a higher level so the instrumentation helper should not Finish it, but it is still worth it to get the span's recording and record the stats. We should probably add a regression test for this case to instrumentation_test.go


pkg/sql/logictest/testdata/logic_test/contention_event, line 30 at r2 (raw file):


statement ok
SET TRACING = on

This is more of a macro question but I thought that the idea of the background tracing work was that we would always have some sort of tracing enabled, so having to turn on session tracing to get contention events feels wrong. How should I think about this?

@yuzefovich (Member) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @irfansharif)


pkg/sql/instrumentation.go, line 171 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

I don't think we want to remove finishSpan here. cc @yuzefovich. The case I'm thinking about specifically is when statement tracing threshold is enabled. A span is created at a higher level so the instrumentation helper should not Finish it, but it is still worth it to get the span's recording and record the stats. We should probably add a regression test for this case to instrumentation_test.go

Note that the first commit is the revert of #61532 which introduced finishSpan.

AFAICT the only remaining piece from that PR is a few changes in execCmd. The revert looks good to me, but I wonder why you chose to do a "by-hand" revert?

@irfansharif (Contributor, author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)


pkg/sql/exec_util.go, line 1656 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Does it make sense to use a ctxHolder here instead of overwriting the ctx directly?

Originally I wanted to avoid a larger diff than necessary, but meh, done. PTAL.


pkg/sql/instrumentation.go, line 171 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Note that the first commit is the revert of #61532 which introduced finishSpan.

AFAICT the only remaining piece from that PR is a few changes in execCmd. The revert looks good to me, but I wonder why you chose to do a "by-hand" revert?

I probably caused more confusion than necessary. I wanted to keep the comments introduced in that commit; everything else was mechanical. So if I'm reading everything correctly, I don't think there's anything left to do in this review thread?


pkg/sql/logictest/testdata/logic_test/contention_event, line 30 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

This is more of a macro question but I thought that the idea of the background tracing work was that we would always have some sort of tracing enabled, so having to turn on session tracing to get contention events feels wrong. How should I think about this?

The way to think about it now is that we'll only get contention events for sampled statements/txns. The sql.txn_stats.sample_rate knob lets us traverse the spectrum between "always-on tracing", as you're thinking of it, and "always off". For these tests, SET TRACING = on is effectively doing the same thing as sql.txn_stats.sample_rate = 1.0.
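Illustratively (the 1.0 value just pins the "always sample" end of the spectrum for the sake of the example; it isn't a recommended production setting):

-- Cluster-wide: sample (and therefore trace) every transaction.
SET CLUSTER SETTING sql.txn_stats.sample_rate = 1.0;

-- What the logic test does instead, scoped to the session:
SET TRACING = on;
-- ... statements whose contention events we want to observe ...
SET TRACING = off;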

@irfansharif (Contributor, author):

GitHub is down (https://www.githubstatus.com), and my latest fixups aren't showing up in this PR.

@yuzefovich (Member) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)


pkg/sql/instrumentation.go, line 171 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I probably caused more confusion than necessary. I wanted to keep the comments introduced in that commit; everything else was mechanical. So if I'm reading everything correctly, I don't think there's anything left to do in this review thread?

Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.

@yuzefovich (Member) left a comment:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/sql/instrumentation.go, line 171 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

OK. So we should re-introduce finish span in another PR.

Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.

Alfonso, which "finish span" are you thinking about? The finishSpan boolean was introduced in #61532 only because ih.sp could be of two flavours: the txn span (which we don't want to finish) and the sampling span (which we explicitly created in Setup and want to finish).

@asubiotto (Contributor) left a comment:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/sql/instrumentation.go, line 171 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.

Alfonso, which "finish span" are you thinking about? The finishSpan boolean was introduced in #61532 only because ih.sp could be of two flavours: the txn span (which we don't want to finish) and the sampling span (which we explicitly created in Setup and want to finish).

Right, and I think we still have a case where we want to collect stats from a higher level span without finishing it. Concretely, when tracing is enabled for the statement or txn threshold cases. So in the if statement below, we want to set ih.sp, return needsFinish=true but set finishSpan=false. Does that make sense? I feel like all this state can probably be simplified.
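A schematic of that suggestion, using the names from this thread rather than the actual instrumentationHelper code:

// Reuse a span created at a higher level (e.g. statement/txn threshold
// tracing): collect its recording for stats, but leave Finish() to its creator.
if sp := tracing.SpanFromContext(ctx); sp != nil {
	ih.sp = sp
	ih.finishSpan = false // the span's creator owns Finish()
	return ctx, true      // needsFinish=true: still harvest the recording later
}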

@yuzefovich (Member) left a comment:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/sql/instrumentation.go, line 171 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Right, and I think we still have a case where we want to collect stats from a higher level span without finishing it. Concretely, when tracing is enabled for the statement or txn threshold cases. So in the if statement below, we want to set ih.sp, return needsFinish=true but set finishSpan=false. Does that make sense? I feel like all this state can probably be simplified.

I see, yeah, it makes sense, but that would be a new behavior, so I think it is orthogonal to the revert commit and the main commit of this PR.

@irfansharif force-pushed the 210309.tpccbench-stability branch 2 times, most recently from 8d4f5dd to 4e48293, on March 13, 2021 00:20
@irfansharif (Contributor, author) left a comment:

Looks like the txnState's Ctx lifetime is such that it's cleaned up before SessionTracing is. Specifically, we finishSQLTxn before we StopTracing:

ex.state.finishSQLTxn()

if err := ex.sessionTracing.StopTracing(); err != nil {
log.Warningf(ctx, "error stopping tracing: %s", err)
}

Does it make sense to use a ctxHolder here instead of overwriting the ctx directly?

The above makes this unwieldy. Yes, ideally we'd like to "unhijack" the context we just injected, but given it's cleaned up before we get to it, we won't have anything to clean up. This isn't true for the other usage of ctxHolder, where its lifetime is tied only to SessionTracer. For that reason, I'll keep things as they were before and hijack Ctx directly instead of going through a ctxHolder. In the future, we can probably clean up the SessionTracer's lifecycle before we finishSQLTxn; when we do, we can start using a ctxHolder instead. I'm not doing that in this PR to keep the blast radius as small as possible.
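For reference, the general shape of the ctxHolder hijack/unhijack pattern being discussed; this is a sketch with illustrative field and method names (using the standard library context package), not the exact type in conn_executor.go.

// A holder that can temporarily substitute ("hijack") a session-tracing ctx
// and later restore the original connection ctx.
type ctxHolder struct {
	connCtx           context.Context
	sessionTracingCtx context.Context
}

func (ch *ctxHolder) ctx() context.Context {
	if ch.sessionTracingCtx != nil {
		return ch.sessionTracingCtx
	}
	return ch.connCtx
}

func (ch *ctxHolder) hijack(ctx context.Context) { ch.sessionTracingCtx = ctx }
func (ch *ctxHolder) unhijack()                  { ch.sessionTracingCtx = nil }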

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)

@irfansharif (Contributor, author):

GitHub statuses are being goofy, and bors is confused.

bors r+

@craig (craig bot, Contributor) commented Mar 13, 2021:

Build succeeded:

@craig (craig bot) merged commit 283a21e into cockroachdb:master on Mar 13, 2021
@irfansharif deleted the 210309.tpccbench-stability branch on March 13, 2021 03:31
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 1, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently, what cockroachdb#59379 ended up measuring (and finding acceptable)
was the performance hit we would incur for propagating stats payloads for
statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so it should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  -0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offender. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making its way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 2, 2021
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 2, 2021
Labels: O-community (Originated from the community)
Linked issue (may be closed by this PR): roachtest: tpccbench/nodes=3/cpu=16 failed
5 participants