
[wip,dnm] tracing: peephole optimizations #62227

Conversation

irfansharif
Contributor

No description provided.

Fixes cockroachdb#62081. It had rotted after we removed the trace.mode setting. It
now makes use of a debug-only cluster setting that, if enabled, installs
real spans for all SQL statements. We'll defer the actual investigation into
the cause of the slow-down to future PRs.

    $ make bench PKG=./pkg/bench  TESTFLAGS='-benchtime=10000x -count 20' \
        BENCHES='BenchmarkTracing' BENCHTIMEOUT=50m > benchmark-tracing.txt
    $ cat benchmark-tracing.txt | grep -v tracing=t | sed 's/tracing=f//' > old.txt
    $ cat benchmark-tracing.txt | grep -v tracing=f | sed 's/tracing=t//' > new.txt
    $ benchstat old.txt new.txt

    name                                   old time/op    new time/op    delta
    Tracing/Cockroach//Scan1-24               382µs ± 2%     412µs ± 8%   +7.68%  (p=0.000 n=10+10)
    Tracing/Cockroach//Insert-24              579µs ± 2%     609µs ± 6%   +5.10%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach//Scan1-24      799µs ± 2%     885µs ± 1%  +10.76%  (p=0.000 n=10+9)
    Tracing/MultinodeCockroach//Insert-24    1.09ms ± 1%    1.14ms ± 2%   +5.04%  (p=0.000 n=9+10)

    name                                   old alloc/op   new alloc/op   delta
    Tracing/Cockroach//Scan1-24              25.3kB ± 5%    32.0kB ± 4%  +26.55%  (p=0.000 n=10+10)
    Tracing/Cockroach//Insert-24             38.0kB ± 6%    42.2kB ± 5%  +11.02%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach//Scan1-24     77.7kB ± 8%    97.1kB ±12%  +25.05%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach//Insert-24     107kB ± 8%     115kB ± 7%   +7.66%  (p=0.023 n=10+10)

    name                                   old allocs/op  new allocs/op  delta
    Tracing/Cockroach//Scan1-24                 256 ± 1%       280 ± 1%   +9.10%  (p=0.000 n=9+9)
    Tracing/Cockroach//Insert-24                301 ± 2%       321 ± 1%   +6.64%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach//Scan1-24        787 ± 2%       921 ± 2%  +17.04%  (p=0.000 n=9+9)
    Tracing/MultinodeCockroach//Insert-24       748 ± 1%       805 ± 2%   +7.61%  (p=0.000 n=9+9)

Release note: None
Use the parent span's log tags instead of the incoming ctx's. Going through
the ctx package is expensive, and log tags are unlikely to change between
parent and child spans.
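
A minimal sketch of the idea, with hypothetical names (the real span type and constructor differ): reuse the parent span's already-materialized log tags instead of walking the incoming ctx's value chain on every child span creation.

    package tracing

    import "github.com/cockroachdb/logtags"

    // span is a hypothetical, stripped-down span used only to illustrate the
    // commit message above.
    type span struct {
        operation string
        logTags   *logtags.Buffer
    }

    // startChildSpan copies the parent's already-computed log tags. The more
    // expensive alternative is logtags.FromContext(ctx), which traverses the
    // context's value chain on every span creation.
    func startChildSpan(parent *span, operation string) *span {
        return &span{
            operation: operation,
            logTags:   parent.logTags, // log tags rarely differ between parent and child
        }
    }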

---

Actually, drop log tags altogether.

Release note: None
- Reduce lock contention by only deleting root spans from the registry.
- Don't materialize span recordings unless absolutely needed.
- Compute the span duration outside of the mutex (see the sketch below).
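
For the last bullet, a minimal sketch (hypothetical struct layout) of moving the duration computation out of the critical section: the clock read happens before the lock, and the mutex only guards the store.

    package tracing

    import (
        "sync"
        "time"
    )

    // span is a hypothetical, stripped-down span; the real one carries far
    // more state under its mutex.
    type span struct {
        startTime time.Time
        mu        struct {
            sync.Mutex
            duration time.Duration
        }
    }

    // Finish computes the span duration before taking the mutex, so the lock
    // is held only long enough to record the result.
    func (s *span) Finish() {
        duration := time.Since(s.startTime) // clock read outside the lock

        s.mu.Lock()
        s.mu.duration = duration
        s.mu.Unlock()
    }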

Release note: None
Tracer.AlwaysTrace is redundant. We can also avoid doing opentracing
stuff if we have no external sinks configured.
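
A hedged sketch of that short-circuit (field and method names are made up): when no external, opentracing-compatible sink is configured and nobody explicitly asked for a real span, span creation can hand back a shared no-op span and skip the opentracing bookkeeping entirely.

    package tracing

    // Tracer is a hypothetical, stripped-down tracer used only to illustrate
    // the commit message above.
    type Tracer struct {
        // shadowTracer stands in for an external, opentracing-compatible sink
        // (e.g. a Zipkin collector); nil means no sink is configured.
        shadowTracer interface{}
        noopSpan     *span
    }

    type span struct {
        operation string
    }

    func (t *Tracer) hasExternalSinks() bool {
        return t.shadowTracer != nil
    }

    // StartSpan takes the allocation-free fast path when there is no external
    // sink and no explicit request for a real (recording) span.
    func (t *Tracer) StartSpan(operation string, forceRealSpan bool) *span {
        if !t.hasExternalSinks() && !forceRealSpan {
            return t.noopSpan
        }
        return &span{operation: operation} // opentracing bookkeeping elided
    }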

Release note: None
@irfansharif irfansharif requested a review from a team as a code owner March 19, 2021 03:59
@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif changed the title 210318.tracing peephole optimizations [wip,dnm] tracing: peephole optimizations Mar 19, 2021
@irfansharif irfansharif removed the request for review from a team March 19, 2021 04:00
@irfansharif irfansharif force-pushed the 210318.tracing-peephole-optimizations branch 5 times, most recently from 4b10a0f to bc9ff32 Compare March 22, 2021 19:47
- use an optimizedCtx for collapsed spans
- avoid stepping through interface for carrier
- mutate the incoming ctx's metadata.MD directly (see the sketch below)
- disable the empty registry check; we broke it a few commits ago
([wip,dnm] tracing: move from interface type; avoid string.ToLower)
- skip some common span tags
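
A hedged sketch of the metadata.MD bullet (the carrier key below is made up): rather than building a fresh metadata map and re-attaching it via metadata.NewOutgoingContext, the tracing keys are written into the MD already hanging off the ctx. gRPC's documentation asks callers not to mutate the returned MD, so this is only safe when the caller owns that map.

    package tracing

    import (
        "context"

        "google.golang.org/grpc/metadata"
    )

    // traceIDKey is a hypothetical carrier key, used for illustration only.
    const traceIDKey = "crdb-trace-id"

    // injectTraceID writes the trace ID into the ctx's outgoing gRPC metadata,
    // mutating the existing MD in place when one is present instead of copying
    // it into a new map.
    func injectTraceID(ctx context.Context, traceID string) context.Context {
        md, ok := metadata.FromOutgoingContext(ctx)
        if !ok {
            return metadata.AppendToOutgoingContext(ctx, traceIDKey, traceID)
        }
        md.Set(traceIDKey, traceID) // avoids the copies metadata.NewOutgoingContext would make
        return ctx
    }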

Release note: None
@irfansharif irfansharif force-pushed the 210318.tracing-peephole-optimizations branch from bc9ff32 to f0f6d58 Compare March 22, 2021 20:46
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 23, 2021
Fixes cockroachdb#62081. It had rotted after we removed the trace.mode setting. It
now makes use of a testing knob that, if enabled, installs real spans for all
SQL statements. We'll defer the actual investigation into the cause of the
slow-down to future PRs (prototyped in cockroachdb#62227).

    $ make bench PKG=./pkg/bench  TESTFLAGS='-benchtime=10000x -count 20' \
        BENCHES='BenchmarkTracing' BENCHTIMEOUT=50m > benchmark-tracing.txt
    $ cat benchmark-tracing.txt | grep -v tracing=t | sed 's/tracing=f\///' > old.txt
    $ cat benchmark-tracing.txt | grep -v tracing=f | sed 's/tracing=t\///' > new.txt
    $ benchstat old.txt new.txt

    name                                   old time/op    new time/op    delta
    Tracing/Cockroach/Scan1-24                382µs ± 2%     412µs ± 8%   +7.68%  (p=0.000 n=10+10)
    Tracing/Cockroach/Insert-24               579µs ± 2%     609µs ± 6%   +5.10%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach/Scan1-24       799µs ± 2%     885µs ± 1%  +10.76%  (p=0.000 n=10+9)
    Tracing/MultinodeCockroach/Insert-24     1.09ms ± 1%    1.14ms ± 2%   +5.04%  (p=0.000 n=9+10)

    name                                   old alloc/op   new alloc/op   delta
    Tracing/Cockroach/Scan1-24               25.3kB ± 5%    32.0kB ± 4%  +26.55%  (p=0.000 n=10+10)
    Tracing/Cockroach/Insert-24              38.0kB ± 6%    42.2kB ± 5%  +11.02%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach/Scan1-24      77.7kB ± 8%    97.1kB ±12%  +25.05%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach/Insert-24      107kB ± 8%     115kB ± 7%   +7.66%  (p=0.023 n=10+10)

    name                                   old allocs/op  new allocs/op  delta
    Tracing/Cockroach/Scan1-24                  256 ± 1%       280 ± 1%   +9.10%  (p=0.000 n=9+9)
    Tracing/Cockroach/Insert-24                 301 ± 2%       321 ± 1%   +6.64%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach/Scan1-24         787 ± 2%       921 ± 2%  +17.04%  (p=0.000 n=9+9)
    Tracing/MultinodeCockroach/Insert-24        748 ± 1%       805 ± 2%   +7.61%  (p=0.000 n=9+9)

Release note: None
@irfansharif
Contributor Author

irfansharif commented Mar 23, 2021

The sum of these changes looks promising-ish. Here are 5 runs with kv95.

roachtest run kv95/enc=false/nodes=1/cpu=32  --workload ./workload-linux --cockroach ./cockroach-linux-62227 --count 5 --cpu-quota 320
roachtest run kv95/enc=false/nodes=1/cpu=32  --workload ./workload-linux --cockroach ./cockroach-linux-pre-62227 --count 5 --cpu-quota 320
# 62227 with sample_rate=0.1 vs. pre-62227 with sample_rate=0.1

# 62227
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0       39558963        65931.4      0.8      0.7      1.6      3.3     31.5  read
  600.0s        0       38167539        63612.4      0.9      0.7      1.6      3.4     27.3  read
  600.0s        0       38060299        63433.8      0.9      0.8      1.6      3.5     32.5  read
  600.0s        0       39589903        65983.1      0.8      0.7      1.6      3.3     29.4  read
  600.0s        0       39046176        65076.9      0.9      0.8      1.6      3.4     30.4  read
  600.0s        0        2008289         3347.1      2.7      1.9      7.6     11.5     32.5  write
  600.0s        0        2083240         3472.1      2.3      1.8      5.5      8.9     30.4  write
  600.0s        0        2082753         3471.3      2.3      1.9      5.8      9.4     28.3  write
  600.0s        0        2053367         3422.3      2.2      1.7      5.5      8.9     28.3  write
  600.0s        0        2003421         3339.0      2.5      2.0      6.0      9.4     31.5  write

# pre-62227
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0       36898941        61498.0      0.9      0.8      1.6      3.8     29.4  read
  600.0s        0       34131762        56886.2      1.0      0.9      1.8      4.5     39.8  read
  600.0s        0       37442791        62404.6      0.9      0.7      1.6      3.7     26.2  read
  600.0s        0       37646562        62744.1      0.9      0.8      1.6      3.7     35.7  read
  600.0s        0       38124218        63540.3      0.9      0.8      1.6      3.5     37.7  read
  600.0s        0        2004687         3341.1      2.4      1.9      6.0      9.4     29.4  write
  600.0s        0        1796091         2993.5      2.7      2.2      6.8     11.0     58.7  write
  600.0s        0        1940652         3234.4      2.5      2.0      6.3     10.0     28.3  write
  600.0s        0        1972014         3286.7      2.8      1.9      7.9     11.5     35.7  write
  600.0s        0        1980551         3300.9      2.5      2.0      6.3     10.0     35.7  write

median (read): 65076.90 vs. 62404.60 ops/sec ------ +4.28%
median (writ):  3422.30 vs.  3286.70 ops/sec ------ +4.13%

I'll work on polish in separate PRs.

craig bot pushed a commit that referenced this pull request Mar 23, 2021
62118: sql: revive BenchmarkTracing r=irfansharif a=irfansharif

Fixes #62081. It had rotted after we removed the trace.mode setting. It
now makes use of a testing knob that, if enabled, installs real spans for all
SQL statements. We'll defer the actual investigation into the cause of the
slow-down to future PRs (prototyped in #62227).

    $ make bench PKG=./pkg/bench  TESTFLAGS='-benchtime=10000x -count 20' \
        BENCHES='BenchmarkTracing' BENCHTIMEOUT=50m > benchmark-tracing.txt
    $ cat benchmark-tracing.txt | grep -v tracing=t | sed 's/tracing=f//' > old.txt
    $ cat benchmark-tracing.txt | grep -v tracing=f | sed 's/tracing=t//' > new.txt
    $ benchstat old.txt new.txt

    name                                   old time/op    new time/op    delta
    Tracing/Cockroach//Scan1-24               382µs ± 2%     412µs ± 8%   +7.68%  (p=0.000 n=10+10)
    Tracing/Cockroach//Insert-24              579µs ± 2%     609µs ± 6%   +5.10%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach//Scan1-24      799µs ± 2%     885µs ± 1%  +10.76%  (p=0.000 n=10+9)
    Tracing/MultinodeCockroach//Insert-24    1.09ms ± 1%    1.14ms ± 2%   +5.04%  (p=0.000 n=9+10)

    name                                   old alloc/op   new alloc/op   delta
    Tracing/Cockroach//Scan1-24              25.3kB ± 5%    32.0kB ± 4%  +26.55%  (p=0.000 n=10+10)
    Tracing/Cockroach//Insert-24             38.0kB ± 6%    42.2kB ± 5%  +11.02%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach//Scan1-24     77.7kB ± 8%    97.1kB ±12%  +25.05%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach//Insert-24     107kB ± 8%     115kB ± 7%   +7.66%  (p=0.023 n=10+10)

    name                                   old allocs/op  new allocs/op  delta
    Tracing/Cockroach//Scan1-24                 256 ± 1%       280 ± 1%   +9.10%  (p=0.000 n=9+9)
    Tracing/Cockroach//Insert-24                301 ± 2%       321 ± 1%   +6.64%  (p=0.000 n=10+10)
    Tracing/MultinodeCockroach//Scan1-24        787 ± 2%       921 ± 2%  +17.04%  (p=0.000 n=9+9)
    Tracing/MultinodeCockroach//Insert-24       748 ± 1%       805 ± 2%   +7.61%  (p=0.000 n=9+9)

Release note: None

62193: geo/geomfn: add point in polygon optimization to st_intersects r=otan a=andyyang890

This patch improves the performance of st_intersects for the
common use case of testing whether a (multi)polygon and a
(multi)point intersect.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Andy Yang <ayang@cockroachlabs.com>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 1, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently, what cockroachdb#59379 ended up measuring (and finding acceptable) was
the overhead of propagating stats payloads for statements that were already
using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so it should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn comprises 10 iterations of kv95, and what's varied
between runs is the default sampling rate. We pin a sampling rate of
0.0 as the baseline, which effectively switches off sampling (and
tracing) entirely, and measure the throughput degradation as we vary the
sampling rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  -0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.
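
The tables above are measuring the cost of a per-statement coin flip plus the tracing it switches on. A minimal sketch of that decision (names are illustrative; the real rate comes from a cluster setting):

    package sql

    import "math/rand"

    // shouldSampleStatement decides whether a statement gets a real (verbose)
    // tracing span. A rate of 0.0 disables sampling (and with it tracing),
    // which is the baseline row above; 0.01 and 0.10 correspond to the other rows.
    func shouldSampleStatement(sampleRate float64, rng *rand.Rand) bool {
        return sampleRate > 0 && rng.Float64() < sampleRate
    }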

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. This was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offender. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making its way
  to roachperf, which further hid possible indicators that we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 2, 2021
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif

We arrived at the previous default rate of 10% back in #59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently, what #59379 ended up measuring (and finding acceptable) was
the overhead of propagating stats payloads for statements that were already
using real tracing spans.

Since then, the landscape has changed. Notably we introduced #61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in #59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so it should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn comprises 10 iterations of kv95, and what's varied
between runs is the default sampling rate. We pin a sampling rate of
0.0 as the baseline, which effectively switches off sampling (and
tracing) entirely, and measure the throughput degradation as we vary the
sampling rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  -0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in #62227), but they're much too invasive to backport to
21.1.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 2, 2021
@irfansharif irfansharif deleted the 210318.tracing-peephole-optimizations branch April 26, 2021 19:36