
release-21.1: sql: lower default sampling rate to 1% #63006

Merged

Commits on Apr 2, 2021

  1. sql: lower default sampling rate to 1%

    We arrived at the previous default rate of 10% in cockroachdb#59379, back
    when we were creating real tracing spans for all statements and, for
    sampled statements, propagating additional stats payloads.
    Consequently, what cockroachdb#59379 ended up measuring (and finding
    acceptable) was the overhead of propagating stats payloads for
    statements that were already using real tracing spans.
    
    Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
    which made it so that we were only using real tracing spans for sampled
    statements. This was done after performance analysis in cockroachdb#59424 showed
    that the use of real tracing spans in all statements resulted in
    tremendous overhead, for no real benefit.
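
    As a rough illustration of the mechanism (not the actual CockroachDB
    code; the helper name is ours), the per-statement decision is
    effectively a coin flip against the configured rate:

        package main

        import (
            "fmt"
            "math/rand"
        )

        // shouldSampleStatement sketches the per-statement decision described
        // above: only a `rate` fraction of statements get a real tracing span
        // (and the extra stats payload); the rest run without one. Hypothetical
        // helper, not the actual implementation.
        func shouldSampleStatement(rate float64, rng *rand.Rand) bool {
            return rng.Float64() < rate
        }

        func main() {
            rng := rand.New(rand.NewSource(42))
            sampled := 0
            for i := 0; i < 100000; i++ {
                if shouldSampleStatement(0.01, rng) { // the new 1% default
                    sampled++
                }
            }
            fmt.Printf("sampled %d of 100000 statements (~1%%)\n", sampled)
        }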
    
    That leaves us with a sampling rate tuned only against the stats
    payload overhead. What we want now is to also account for the overhead
    of using real tracing spans for sampled statements vs. not using them
    at all. Doing that analysis gives a very different picture of what the
    default rate should be.
    
    ---
    
    To find out what the overhead for sampled statements currently is, we
    experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
    benchmark that does little more than one-off statements, so it should
    give us a clear picture of the sampling overhead. We ran six
    experiments in total (each corresponding to a pair of read+write rows),
    done in groups of three (each group corresponding to a table below).
    Each run in turn comprises 10 iterations of kv95, and what's varied
    between runs is the default sampling rate. We pin a sampling rate of
    0.0 as the baseline that effectively switches off sampling (and
    tracing) entirely, and measure the throughput degradation as we vary
    the sampling rate.
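
    For reference, the diff columns in the tables below are the relative
    change of each run against the rate=0.0 baseline. A minimal sketch of
    that computation (the helper name is ours; the two values are taken
    from the first table's read rows):

        package main

        import "fmt"

        // pctDiff returns the relative throughput change of a run against the
        // rate=0.0 baseline, matching the diff columns below.
        func pctDiff(baseline, observed float64) float64 {
            return (observed - baseline) / baseline * 100
        }

        func main() {
            fmt.Printf("rate=0.01: %+.2f%%\n", pctDiff(69817.90, 69300.35)) // -0.74%
            fmt.Printf("rate=0.10: %+.2f%%\n", pctDiff(69817.90, 67743.35)) // -2.97%
        }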
    
                              ops/sec            ops/sec
        --------------------|------------------|------------------
        rate   op      grp  | median    diff   | mean      diff
        --------------------|------------------|------------------
        0.00 / read  / #1   | 69817.90         | 69406.37
        0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
        0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
        0.00 / write / #1   |  3672.55         |  3653.63
        0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
        0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%
    
                              ops/sec            ops/sec
        --------------------|------------------|------------------
        rate   op      grp  | median    diff   | mean      diff
        --------------------|------------------|------------------
        0.00 / read  / #2   | 69440.80         | 68893.24
        0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
        0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
        0.00 / write / #2   |  3652.45         |  3625.24
        0.01 / write / #2   |  3657.55  +0.14% |  3654.34  +0.80%
        0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%
    
    The results above suggest that the current default rate of 10% is too
    high, and a 1% rate is much more acceptable.
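
    The rate itself is a cluster setting, so operators who prefer the old
    behavior can raise it back up. A minimal sketch of doing so from Go,
    assuming the setting is named sql.txn_stats.sample_rate (our reading of
    this change; verify against your version) and a local insecure cluster:

        package main

        import (
            "database/sql"
            "log"

            _ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
        )

        func main() {
            // Hypothetical connection string for a local insecure node.
            db, err := sql.Open("postgres",
                "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
            if err != nil {
                log.Fatal(err)
            }
            defer db.Close()

            // Restore the previous 10% sampling rate cluster-wide.
            if _, err := db.Exec(
                `SET CLUSTER SETTING sql.txn_stats.sample_rate = 0.1`,
            ); err != nil {
                log.Fatal(err)
            }
        }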
    
    ---
    
    The fact that the cost of sampling is largely dominated by tracing is
    extremely unfortunate. We have ideas for how that can be improved
    (prototyped in cockroachdb#62227), but they're much too invasive to backport to
    21.1. It's unfortunate that we only discovered the overhead this late
    in the development cycle; that was due to two major reasons:
    - cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
      propagating real tracing spans across rpc boundaries). We had done
      sanity checking for the tracing overhead before this point, but failed
      to realize that cockroachdb#59992 would merit re-analysis.
    - The test that alerted us to the degradation (tpccbench) had been
      persistently failing for a myriad of other reasons, so we didn't learn
      until too late that tracing was the latest offender. tpccbench also
      doesn't deal with VM overload well (something cockroachdb#62361 hopes to
      address), and after tracing was enabled for realsies, this was the
      dominant failure mode. This resulted in perf data not making its way
      to roachperf, which further hid possible indicators we had a major
      regression on our hands. We also didn't have a healthy process looking
      at roachperf on a continual basis, something we're looking to rectify
      going forward. We would've picked up on this regression had we been
      closely monitoring the kv95 charts.
    
    Release note: None
    irfansharif committed Apr 2, 2021 (commit bfee1a6)