
YCSB/E performance regression on December 4th #62198

Closed
erikgrinaker opened this issue Mar 18, 2021 · 14 comments · Fixed by #62282
Labels
C-investigation (Further steps needed to qualify. C-label will change.), C-performance (Perf of queries or internals. Solution not expected to change functional behavior.), GA-blocker

Comments

@erikgrinaker
Contributor

The ycsb/E benchmarks had a sharp drop of 52% (1580→755) between December 3rd and 5th:

https://roachperf.crdb.dev/?filter=&view=ycsb%2FE%2Fnodes%3D3&tab=gce

@erikgrinaker added the C-investigation, C-performance, and GA-blocker labels on Mar 18, 2021
@erikgrinaker self-assigned this on Mar 18, 2021
@erikgrinaker
Contributor Author

erikgrinaker commented Mar 18, 2021

This seems to have been caused by defaulting to the vectorized SQL execution engine for all queries in #55713, which was merged on December 3rd. The vectorize_row_count_threshold was subsequently removed in #62164.

Using commit 5766c37 from March 17th, I ran ycsb/E/nodes=3 benchmarks on GCE with DefaultVectorizeRowCountThreshold set to 0 and 1000; representative results are below:

DefaultVectorizeRowCountThreshold = 0

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0          24692           41.2     19.0     15.7     48.2     71.3    125.8  insert

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0         467824          779.7    122.1     75.5    419.4    570.4   1006.6  scan

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0         492516          820.9    116.9     67.1    419.4    570.4   1006.6

DefaultVectorizeRowCountThreshold = 1000

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0          56267           93.8     17.8      8.4     67.1    109.1    302.0  insert

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0        1068114         1780.2     53.0     37.7    159.4    234.9    570.4  scan

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  600.0s        0        1124381         1874.0     51.2     33.6    159.4    226.5    570.4  

Reverting the setting from 0 to 1000 increases the ops/sec from 821 to 1874, in line with the YCSB/E graph drop on December 4th.
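
For context, here is roughly what that threshold controls (a minimal sketch of the selection logic, not the actual CockroachDB code; DefaultVectorizeRowCountThreshold and vectorize_row_count_threshold are from the source, everything else is illustrative):

    // Sketch: a row-count threshold gating engine selection. With the
    // threshold at 0 every query runs through the vectorized engine;
    // with it at 1000 only plans estimated to process at least 1000
    // rows do, and short scans stay on the row-by-row engine.
    package main

    import "fmt"

    const defaultVectorizeRowCountThreshold = 1000 // the pre-#55713 default

    func useVectorizedEngine(estimatedRowCount, threshold uint64) bool {
        return estimatedRowCount >= threshold
    }

    func main() {
        for _, rows := range []uint64{1, 250, 500, 1000} {
            fmt.Printf("rows=%4d: threshold=0 -> %v, threshold=%d -> %v\n",
                rows,
                useVectorizedEngine(rows, 0),
                uint64(defaultVectorizeRowCountThreshold),
                useVectorizedEngine(rows, defaultVectorizeRowCountThreshold))
        }
    }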

@yuzefovich @asubiotto You may want to have a look at this.

@asubiotto
Contributor

asubiotto commented Mar 18, 2021

Hmm, this is interesting and not good. AFAICT workload E is 95% reads and 5% writes, which is the workload profile we tested with on kv95, so I wonder what the difference is here. @yuzefovich do you mind taking a look at this? I'll label it as a GA blocker for now.

@yuzefovich
Member

I can clearly see the difference there, but the profiles are quite puzzling to me - it appears as if things at the storage level become slower when the vectorized engine is used, as if the cFetcher somehow sends different batch requests (but I don't think that's the case). In the 30s profile I'm looking at with threshold=0, this accounts for the largest chunk:

    1250ms  1.38% 53.00%    13610ms 15.04%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).nextKey

whereas in the 30s profile with threshold=1000 that function accounts for only 2s. Will look more.

@yuzefovich
Member

yuzefovich commented Mar 18, 2021

I think the difference is that in the vectorized engine we perform a deep copy of BYTES and STRING values (when calling coldata.Bytes.Set), whereas in the row engine we don't (in rowenc.EncDatumValueFromBufferWithOffsetsAndType we just take the relevant part of the slice and put it into the EncDatum directly). And I don't know what to do here; to me, the latter approach seems like a possibly-unsafe optimization that is not compatible with our flat bytes representation in the vectorized engine. cc @asubiotto @jordanlewis in case you have any ideas.

Note that in both cases we also perform another deep copy when creating a DString object from []byte to push into the DistSQLReceiver.
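
To illustrate the aliasing-versus-copying difference being described (a minimal sketch only; coldata.Bytes.Set and rowenc.EncDatumValueFromBufferWithOffsetsAndType are the real functions, the code below is just an analogy in plain Go):

    // Sketch: extracting a BYTES/STRING value out of a shared KV result
    // buffer either by aliasing the slice or by deep-copying it into a
    // flat, vector-owned buffer.
    package main

    import "fmt"

    func main() {
        buf := []byte("....hello-world....") // pretend KV result buffer
        offset, end := 4, 15                 // the value lives at [offset:end)

        // Row-engine style: alias the slice. No copy is made, but the
        // datum now pins the whole underlying buffer.
        aliased := buf[offset:end]

        // Vectorized style: the flat Bytes representation owns a single
        // contiguous buffer, so setting a value appends a copy into it.
        var flat []byte
        start := len(flat)
        flat = append(flat, buf[offset:end]...)
        copied := flat[start:]

        fmt.Printf("aliased=%q copied=%q\n", aliased, copied)
    }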

@yuzefovich
Member

vectorize_row_count_threshold=0: pprof.cockroach.samples.cpu.052.pb.gz
vectorize_row_count_threshold=1000: pprof.cockroach.samples.cpu.053.pb.gz
(I used the tip of the release-21.1 branch)

@yuzefovich
Member

Never mind, that behavior is expected (we allocate many STRING objects; the same goes for the row engine), so I deleted the comment.

@yuzefovich
Member

Simply adding a deep copy to rowenc.EncDatumValueFromBufferWithOffsetsAndType didn't change anything.

I think the problem might be that we still keep references to the tree.Datum objects in VecToDatumConverter.convertedVecs, even after those objects are released.

@yuzefovich
Member

Hm, unsetting those values doesn't seem to help either. My best guess at the moment is that in the vectorized engine we create more pointers, which puts pressure on the GC. In the profiles we do see a significant increase in GC pressure, but I'm still not sure how to alleviate that.

@jordanlewis
Member

Investigating this further, I found a more minimal reproduction of the difference. The benchmarks were run on a GCEWorker machine against a single-node cluster:

./cockroach workload run ycsb  --workload=E --concurrency=1 --duration=1m --select-for-update=true --min-scan-length=500 --max-scan-length=500

Interestingly, pinning the scan length to 1000 shows no difference with and without vectorize, and pinning it to 1 also shows no difference.

So I'm hypothesizing that there is some kind of pathological behavior in the way we're doing the batch resizing that triggers slowness at 500. 250 is also a terrible value: you see more than a 2x difference with and without vectorize.

At 250, I see something like ~300 ops/s with vectorized off, and ~150 ops/s with vectorized on.

@erikgrinaker changed the title from "ycsb/E performance regression on December 4th" to "YCSB/E performance regression on December 4th" on Mar 20, 2021
@craig (bot) closed this as completed in a56498f on Mar 24, 2021
@yuzefovich
Member

yuzefovich commented Mar 24, 2021

I've just had a quick run of YCSB/E on 685751f, and it showed the following:

  • vectorize=off: 600.0s 0 1127317 1878.8 51.1 35.7 151.0 226.5 570.4
  • vectorize=on: 600.0s 0 1133363 1888.9 50.8 37.7 151.0 218.1 536.9
  (columns: _elapsed, errors, ops(total), ops/sec(cum), avg(ms), p50(ms), p95(ms), p99(ms), pMax(ms))

This shows that #62282 addressed the largest part of the regression on this benchmark that was caused by the switch to the vectorized engine back in December, so I'm considering this issue closed.

@yuzefovich
Member

Kicked off 10 runs of YCSB/E on a 3-node n1-standard-8 cluster in GCE with concurrency 96 (basically using the config from the roachtest):

  • vectorize=off: 1876.24 ops/sec
  • vectorize=on: 1886.24 ops/sec, +0.53%

So it looks like on this benchmark in particular the vectorized engine now even slightly outruns the row-by-row engine with the recent improvement. I think this is somewhat expected given that YCSB/E performs 95% scans, with each scan randomly reading up to 1000 rows, and we know that the vectorized read path is faster once a certain threshold on the number of rows is exceeded (I don't know the exact number, though; it was on the order of 1000 before - thus the original default value of vectorize_row_count_threshold - but I imagine it has decreased).

I'd call it a major success :) cc @jordanlewis

@erikgrinaker
Contributor Author

Not so fast - I found the cause of the kv95 regression: #62524. Basically, if there is no estimate (i.e. it is 0), we default to allocating batches of size 1024, which is really expensive when we're just fetching a single row.

I'm trying out a patch that defaults to 1 if there's no estimate. It fixed kv95, and I'm running ycsb/E now -- hopefully we have actual estimates there so it won't take a hit.

@yuzefovich
Member

yuzefovich commented Mar 24, 2021

Ah, indeed. I think we want to default to the previous behavior of exponentially increasing the batch sizes starting from 1.
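
For illustration, a minimal sketch of that sizing strategy (assumptions: the real allocator lives elsewhere in the codebase, and the 1024 cap is based on the default batch size mentioned above; none of the names below are the actual API):

    // Sketch: choose the next batch capacity, starting from the
    // optimizer's row estimate when one exists and from 1 otherwise,
    // then doubling up to a fixed cap. A single-row lookup never pays
    // for a 1024-value batch, while a large scan ramps up quickly.
    package main

    import "fmt"

    const maxBatchSize = 1024 // assumed cap

    func nextBatchCapacity(current, estimate int) int {
        if current == 0 { // first batch
            switch {
            case estimate <= 0:
                return 1 // no estimate: start small instead of at maxBatchSize
            case estimate < maxBatchSize:
                return estimate // trust the optimizer's estimate
            default:
                return maxBatchSize
            }
        }
        if next := 2 * current; next < maxBatchSize {
            return next
        }
        return maxBatchSize
    }

    func main() {
        capacity, sizes := 0, []int{}
        for i := 0; i < 12; i++ {
            capacity = nextBatchCapacity(capacity, 0 /* no estimate */)
            sizes = append(sizes, capacity)
        }
        fmt.Println(sizes) // [1 2 4 8 16 32 64 128 256 512 1024 1024]
    }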

@erikgrinaker
Contributor Author

Luckily it didn't affect YCSB/E much: #62534
