db: read-triggered compactions causing excessive write amplification #1143

Closed

sumeerbhola opened this issue May 14, 2021 · 18 comments

@sumeerbhola (Collaborator) commented May 14, 2021

We observed the following in the CockroachDB telemetry cluster when switching from 20.2 to 21.1: L4 shrank in size from 2GB to 120MB, and from then on we see hugely imbalanced compactions with tiny byte counts in L4 and huge byte counts in L5 (see example below). These keep happening despite the L4 score being low and compactions into L4 only adding ~30MB every ~3min. The likely cause is read-triggered compactions. The write amplification keeps growing to significantly > 100.

(internal links. more details: https://cockroachlabs.atlassian.net/browse/SREOPS-2002, discussion: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1621008284293900)

  • We need the ability to change the ReadSamplingMultiplier for Pebble running in a CockroachDB node. To do this, pebble.Options.Parse needs to support this option (see the sketch after this list).
  • The compactionPicker should log the reason for choosing a compaction. Even the current limited logging does not show up in the logs, so we may have a plumbing problem too:
    p.opts.Logger.Infof("pickAuto: L%d->L%d\n%s",
  • We need to better understand why we have such aggressive read-triggered compactions. The workload is not necessarily read-heavy, though there are periodic backups that could be reading a lot. It is possible that we have a bug where we are suggesting read-triggered compactions when we shouldn't.
  • Even if there is a suggestion to do a read-triggered compaction, imbalanced levels like the ones here should block the compaction from happening.
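
For reference, a minimal sketch of the mitigation side of the first bullet: disabling read sampling (and with it read-triggered compactions) by setting a negative ReadSamplingMultiplier. This assumes the option lives under Options.Experimental, as in recent Pebble versions; "demo-db" is just a placeholder path. The bullet above is additionally about supporting the option in the text form parsed by pebble.Options.Parse.

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	opts := &pebble.Options{}
	// A negative multiplier disables read sampling entirely, and with it
	// read-triggered compactions; this mirrors the mitigation applied to
	// CockroachDB 21.1 later in this thread.
	opts.Experimental.ReadSamplingMultiplier = -1

	db, err := pebble.Open("demo-db", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```
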
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +    WAL         1    21 M       -    31 G       -       -       -       -    32 G       -       -       -     1.0
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +      0         4   7.3 M    2.22    32 G    39 M       3     0 B       0   4.3 G   2.5 K     0 B       1     0.1
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +      2         3    12 M    0.45   4.3 G    26 M       1    16 M       8    20 G   7.7 K    21 G       1     4.6
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +      3         4    23 M    0.21   3.1 G     0 B       0   329 M      87    28 G    11 K    29 G       1     9.1
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +      4       253   2.0 G    0.16   2.7 G     0 B       0   550 M      95   110 G    12 K   110 G       1    41.0
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +      5      1247    33 G    1.07   4.6 G    10 K       8   471 M      84    95 G   3.5 K    96 G       1    20.8
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +      6      5443   278 G       -   4.8 G     0 B       0     0 B       0    30 G     540    32 G       1     6.2
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +  total      6954   313 G       -    32 G    65 M      12   1.3 G     274   320 G    37 K   287 G       6    10.0
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +  flush       573
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +compact      3668    20 M           5.0 G  (size == estimated-debt, in = in-progress-bytes)
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 + memtbl         1    64 M
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +zmemtbl         0     0 B
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 +   ztbl         0     0 B
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 + bcache     186 K   3.0 G   82.6%  (score == hit-rate)
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 + tcache     6.8 K   4.0 M  100.0%  (score == hit-rate)
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 + titers         9
I210514 10:59:46.038948 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5483737 + filter         -       -   91.5%  (score == utility)

Ten minutes later, L4 shrinks:

I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +    WAL         1   1.7 M       -    32 G       -       -       -       -    33 G       -       -       -     1.0
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +      0         8    14 M    3.11    33 G    39 M       3     0 B       0   4.4 G   2.5 K     0 B       2     0.1
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +      2         3    12 M    0.64   4.4 G    26 M       1    16 M       8    20 G   7.7 K    21 G       1     4.6
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +      3         7    39 M    0.49   3.1 G     0 B       0   364 M      96    28 G    11 K    29 G       1     9.1
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +      4       195   120 M    0.16   2.7 G     0 B       0   550 M      95   110 G    12 K   110 G       1    40.5
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +      5      1146    33 G    1.00   6.5 G    10 K       8   471 M      84   123 G   4.4 K   123 G       1    19.0
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +      6      5438   280 G       -   6.7 G     0 B       0     0 B       0    40 G     701    42 G       1     5.9
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +  total      6797   313 G       -    33 G    65 M      12   1.4 G     283   358 G    38 K   325 G       7    11.0
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +  flush       583
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +compact      3750    26 M           4.7 G  (size == estimated-debt, in = in-progress-bytes)
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 + memtbl         1    64 M
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +zmemtbl         3   192 M
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 +   ztbl       124   2.6 G
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 + bcache     168 K   2.8 G   82.0%  (score == hit-rate)
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 + tcache     6.8 K   4.0 M  100.0%  (score == hit-rate)
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 + titers        37
I210514 11:09:46.079775 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 5586217 + filter         -       -   91.5%  (score == utility)

An imbalanced compaction:

cockroach-pebble.cockroachdb-0.root.2021-05-14T08_44_37Z.000012.log:I210514 12:30:14.020728 39821675 3@vendor/github.com/cockroachdb/pebble/compaction.go:1881 ⋮ [n1,pebble] 111993  [JOB 9125] compacted L4 [18347336 18347337 18347338 18347339 18347340 18347341 18347342 18347343 18347344 18347345 18347346 18347347 18347348 18347349 18347350 18347351 18347352 18347353 18347354 18347355 18347356 18347357 18347358 18347359 18347360 18347361 18347362 18347363 18347364 18347365 18347366 18347367 18347368 18347369 18347374 18347375 18347376 18347377 18347378 18347379] (12 M) + L5 [18346572 18346573 18346574 18346575 18346576 18346577 18346578 18346579 18346580 18346581 18346582 18346583 18346584 18346585 18346586 18346587 18346588 18346589 18346590 18346591 18346592 18346593 18346594 18346595 18346596 18346597 18346598 18346599 18346600 18346601 18346602 18346603 18346604 18346605 18346606 18346607 18346608 18346609 18346610 18346611 18346612 18346613 18346614 18346615 18346616 18346617 18346618 18346619 18346620 18346621 18346622 18346623 18346624 18346625 18346626 18346627 18346628 18346629 18346630 18346631 18346632 18346633 18346634 18346635 18346636 18346637 18346638 18346639 18346640 18346641 18346642 18346643 18346644 18346645 18346646 18346647 18346648 18346649 18346650 18346651 18346652 18346653 18346654 18346655 18346656 18346657 18346658 18346659 18346660 18346661 18346662 18346663 18346664 18346665 18346666 18346667 18346668 18346669 18346670 18346671 18346672 18346673 18346674 18346675 18346676 18346677 18346678 18346679 18346680 18346681 18346682 18346683 18346684 18346685 18346686 18346687 18346688 18346689 18346690 18346691 18346692 18346693 18346694 18346695 18346696 18346697 18346698 18346699 18346700 18346701 18346702 18346703 18346704 18346705 18346706 18346707 18346708 18346709 18346710 18346711 18346712 18346713 18346714 18346715 18346716 18346717 18346718 18346719 18346720 18346721 18346722 18346723 18345093 18345094 18345095 18345096 18345097 18345098 18345099 18345100 18345101 18345102 18345103 18345104 18345110 18345111 18345112 18345113 18345114 18345115 18345116 18345117 18345118 18345119 18345120 18345121 18345122 18345123 18345124 18345125 18345126 18346983 18346984 18346985 18346986 18346987 18346988 18346989 18346990 18346991 18346992 18346993 18346994 18346995 18346996 18346997 18346998 18346999 18347000 18347001 18347002 18347003 18347004 18347005 18347006 18347007 18347008 18347009 18347010 18347011 18347012 18347013 18347014 18347015 18347016 18347017 18347018 18347030 18347031 18347032 18347033 18347034 18347035 18347036 18347037 18347038 18347039 18347040 18347041 18347042 18347043 18347044 18347045 18347046 18347047 18347048 18347049 18347050 18347051 18347052 18347053 18347054 18347055 18347056 18347057 18347058 18347059 18347060 18347061 18347062 18347063 18347064 18347065 18347066 18347067 18347068 18347069 18347070] (6.7 G) -> L5 [18347433 18347434 18347435 18347436 18347437 18347438 18347439 18347440 18347441 18347442 18347443 18347444 18347445 18347446 18347447 18347448 18347449 18347450 18347451 18347452 18347453 18347454 18347455 18347456 18347457 18347458 18347459 18347460 18347461 18347462 18347463 18347464 18347465 18347466 18347467 18347468 18347469 18347470 18347471 18347472 18347473 18347474 18347475 18347476 18347477 18347478 18347479 18347480 18347481 18347482 18347483 18347484 18347485 18347486 18347487 18347488 18347489 18347490 18347491 18347492 18347493 18347494 18347495 18347496 18347497 18347498 18347499 18347500 18347501 18347502 
18347503 18347504 18347505 18347506 18347507 18347508 18347509 18347510 18347511 18347512 18347513 18347514 18347515 18347516 18347517 18347518 18347519 18347520 18347521 18347522 18347523 18347524 18347525 18347526 18347527 18347528 18347529 18347530 18347531 18347532 18347533 18347534 18347535 18347536 18347537 18347538 18347539 18347540 18347541 18347542 18347543 18347544 18347545 18347546 18347547 18347548 18347549 18347550 18347551 18347552 18347553 18347554 18347555 18347556 18347557 18347558 18347559 18347560 18347561 18347562 18347569 18347570 18347571 18347572 18347573 18347574 18347575 18347576 18347577 18347578 18347579 18347580 18347581 18347582 18347583 18347584 18347585 18347586 18347587 18347588 18347589 18347590 18347591 18347592 18347593 18347594 18347595 18347596 18347597 18347598 18347599 18347600 18347601 18347602 18347603 18347604 18347605 18347606 18347607 18347608 18347609 18347610 18347611 18347612 18347613 18347614 18347615 18347616 18347617 18347618 18347619 18347620 18347621 18347622 18347623 18347624 18347625 18347626 18347627 18347628 18347629 18347630 18347631 18347632 18347633 18347634 18347635 18347636 18347637 18347638 18347639 18347640 18347641 18347642 18347643 18347644 18347645 18347646 18347647 18347648 18347649 18347650 18347651 18347652 18347653 18347654 18347655 18347656 18347657 18347658 18347659 18347660 18347661 18347662 18347663 18347664 18347665 18347666 18347667 18347668 18347669 18347670 18347671 18347672 18347673 18347674 18347675 18347676 18347677 18347678 18347679 18347680 18347681 18347682 18347683 18347684 18347685 18347686 18347687 18347688 18347689 18347690 18347691 18347692 18347693 18347694 18347695 18347696 18347697] (6.7 G), in 140.3s, output rate 49 M/s

Growth in write amplification:

I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +    WAL         1   3.9 M       -    42 G       -       -       -       -    43 G       -       -       -     1.0
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +      0         9    14 M    3.45    43 G    39 M       4     0 B       0   5.7 G   3.4 K     0 B       2     0.1
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +      2         4    17 M    0.58   5.7 G    26 M       1    18 M       9    22 G   8.4 K    24 G       1     3.9
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +      3        13    63 M    0.57   3.7 G     0 B       0   790 M     219    30 G    11 K    30 G       1     8.1
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +      4       193    94 M    0.14   3.5 G   157 M       4   554 M      98   113 G    21 K   113 G       1    32.0
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +      5      1154    33 G    1.01   7.4 G    12 K       9   471 M      84   874 G    31 K   874 G       1   118.4
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +      6      5450   280 G       -   7.4 G     0 B       0     0 B       0    46 G     823    49 G       1     6.3
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +  total      6823   314 G       -    43 G   222 M      18   1.8 G     410   1.1 T    75 K   1.1 T       7    26.3
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +  flush       768
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +compact      4291    31 M           7.7 G  (size == estimated-debt, in = in-progress-bytes)
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 + memtbl         1    64 M
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +zmemtbl         0     0 B
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 +   ztbl         0     0 B
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 + bcache     183 K   3.0 G   74.6%  (score == hit-rate)
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 + tcache     6.5 K   3.8 M  100.0%  (score == hit-rate)
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 + titers         2
I210514 14:59:46.106086 20815 kv/kvserver/store.go:2662 ⋮ [n1,s1] 7432754 + filter         -       -   92.5%  (score == utility)
@sumeerbhola (Collaborator, Author):

  • The L4 file sizes were small (mean < 500KB), which may not be surprising given that we use L5 boundaries to split output for L3=>L4 compactions and do not let L4 bulk up before compacting down to L5. With files that small, allowedSeeks should clamp at its lower bound of 100, so more read activity per file is required before a read compaction is suggested. So at least file size should not be hurting us here in terms of compaction picking (see the sketch after this list for the heuristic being assumed).

  • DB.mu.compact.readCompactions is a simple slice of readCompaction, each of which contains a level and a key span. Iterator appends to this slice, and it looks like no de-duping happens at append or picking time. This means a read burst could have appended the same key span many times to the slice, and we will slowly keep doing repeated compactions for that key span even after the read load is gone. There was backup load on this cluster, which may have triggered this behavior. Also, updateReadCompaction, when it finds an empty overlap, starts from level 0 when trying to find overlap instead of starting from readCompaction.level -- the latter seems more principled if the file has been compacted and moved lower.

  • These huge compactions (see the example above) tend to have 20+ L4 files (all of them tiny). This is because, unlike normal auto compactions, which start with a single file in pickedCompaction.startLevel.files that is carefully expanded in pickedCompaction.setupInputs while respecting pickedCompaction.maxExpandedBytes, read compactions use a key span that at some point belonged to a single file but may now overlap with many files.
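
For context, a minimal sketch of the allowedSeeks initialization heuristic referred to in the first bullet, assuming the LevelDB-style rule of one allowed seek per 16KB of file size with a floor of 100; the exact divisor in Pebble may differ, but the floor is what the tiny L4 files here end up with.

```go
// initialAllowedSeeks sketches how many "free" seeks a newly created
// sstable gets before it becomes a candidate for a read-triggered
// compaction. The 16KB divisor is an assumption borrowed from the
// LevelDB heuristic; the floor of 100 is the clamp mentioned above,
// which is what tiny (<500KB) L4 files receive.
func initialAllowedSeeks(fileSizeBytes int64) int64 {
	allowed := fileSizeBytes / (16 << 10)
	if allowed < 100 {
		allowed = 100
	}
	return allowed
}
```
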

@petermattis (Collaborator):

> Iterator appends to this slice, and it looks like no de-duping happens at append or picking time. This means a read burst could have appended the same key span many times to the slice, and we will slowly keep doing repeated compactions for that key span even after the read load is gone.

If I'm reading the code correctly, we only append to the slice the first time an sstable's AllowedSeeks drops to zero. If we keep reading that same sstable, further decrements of AllowedSeeks won't cause further appends.

@sumeerbhola (Collaborator, Author):

> If I'm reading the code correctly, we only append to the slice the first time an sstable's AllowedSeeks drops to zero. If we keep reading that same sstable, further decrements of AllowedSeeks won't cause further appends.

You are right, I misread!
I think we need to log whenever we append to that slice, including the key interval and level. We should also log, when removing from that slice, both the original level and the level that was actually picked for the compaction. I still suspect that the large number of L4 files suggests the original addition was for a higher level. Maybe we are seeing enough file churn in L0 that the file changes, which would allow for repeated addition of the same key interval.

@tbg (Member) commented May 17, 2021:

Looks as though this came up during our pre-release performance regression checks, starting around here

It's hard to piece together how we convinced ourselves that there wasn't more to look at, at the time. The YCSB-A graphs definitely took a hit right when read compactions were introduced and never quite recovered:

[image: YCSB-A throughput graph]

@itsbilal (Member):

@tbg That was a separate issue that we later addressed: #1098. It's hard to evaluate the benefit of that change on that graph, as it's just tagged as "21.1 regressions fixed", but the PR shows how it brought back the same level of performance.

YCSB-A likely didn't have quite the same kind of workload that'd trigger this issue. The missing piece was likely a backup.

@itsbilal (Member):

Never mind, I see you linked to the thread that discussed #1098. It is worth confirming whether the remaining small difference in YCSB-A is emblematic of the same issue we see on the telemetry cluster.

itsbilal added a commit that referenced this issue May 17, 2021
This change, which is *not* meant for merging into a release
or master branch, is solely to add a logging event for when
a read compaction is appended to the db's slice. This is to help
inform #1143.
itsbilal added a commit to itsbilal/pebble that referenced this issue May 17, 2021
This change, which is *not* meant for merging into a release
or master branch, is solely to add logging events when
a read compaction is appended to the db's slice, and when
a read compaction is picked off of the slice. This is to help
inform cockroachdb#1143.
itsbilal added a commit to cockroachdb/cockroach that referenced this issue May 17, 2021
This change is not destined for any release. It's
solely to add logging to inform
cockroachdb/pebble#1143

Pulls in:
cockroachdb/pebble#1151

Changes pulled in:

```
85e0b39a8d1b695815781eb1f1fe006179177914 *: Log info about read compactions
```

Release note: None.
itsbilal added a commit to itsbilal/cockroach that referenced this issue May 17, 2021
This change sets the ReadSamplingMultiplier to a negative
value, effectively disabling read sampling and therefore
read-triggered compactions.

This is a mitigation of
cockroachdb/pebble#1143 to unblock
the 21.1 release while we try to understand the issue.

Release note (general change): Disable read-triggered compactions
to avoid instances where the storage engine would compact excessively.
@itsbilal (Member):

Looking at the debug logs after #1147 was deployed to the telemetry cluster, all the imbalanced compactions are read compactions. That effectively confirms this as a bug, so I'll go ahead and disable read compactions on 21.1.

itsbilal added a commit to cockroachdb/cockroach that referenced this issue May 17, 2021
This change sets the ReadSamplingMultiplier to a negative
value, effectively disabling read sampling and therefore
read-triggered compactions.

This is a mitigation of
cockroachdb/pebble#1143 to unblock
the 21.1 release while we try to understand the issue.

Release note (general change): Disable read-triggered compactions
to avoid instances where the storage engine would compact excessively.
@sumeerbhola (Collaborator, Author):

Some quick observations from looking at the logs:

  • As speculated earlier, the level is being updated in many cases, sometimes resulting in huge compactions: starting around 21:04:22 there is a long sequence of read compactions that were originally queued for L0, but the file is no longer there and the level gets updated. There are multiple cases of multi-GB compactions from L4 => L5 due to this.
  • But there are also compactions that are huge despite the level not being updated; the original file must have been wide. At 20:34:33.779872 there is an L5 => L6 compaction with 46GB where the original level was L5.
  • There is one very odd-looking huge compaction:
    I210517 20:03:35.921473 70621 3@vendor/github.com/cockroachdb/pebble/compaction.go:1846 ⋮ [n11,pebble] 905 [JOB 205] compacting(read) L5 [4490003 4490004 4490005 4490006] (53 M) + L6 [4014561 4014562 4487853 4487854 4487855 4487856 4487857 4487858 4487114] (48 G)
    There is 48GB in a handful of L6 files! This compaction produces > 500 output files, as expected, so it doesn't look like a size misconfiguration.

@itsbilal (Member):

Looking at the badly balanced compactions, I can think of a couple of solutions that would be worthwhile to implement:

  1. Limiting the size of the readCompactions queue. I think 5 sounds like a good limit here. Any new additions beyond that remove the oldest elements in the queue. The logic is that the more recent additions are the more relevant ones. Currently we process the oldest elements in the queue, which could refer to key ranges that have already been compacted away (due to other heuristics).

  2. Since the queue would now be bounded in size, the addition of readCompactions to the queue could do a quick de-duplication check against the key ranges already in the queue before adding a new entry. We already ensure that we don't add the same file twice (through the AllowedSeeks semantics), but we don't check whether we added the same key range twice across different levels, or whether the same key range is already in the queue tied to an earlier file. When a duplicate key range entry is found, the new one could replace the old one, since the newer one is likely to be more relevant. (A sketch of such a bounded, de-duplicating queue follows this comment.)

  3. Not processing a readCompaction in the queue if the file it refers to doesn't exist in the latest Version. This likely means the file was compacted away earlier. This can be a quick check and could even be a good starter project.

And here's something worth exploring, but maybe not implementing right away:

  1. Scaling the starting value of AllowedSeeks by the number of read compactions in the queue at the time of file creation. This can just be an in-memory measure mostly affecting shorter-lived files; a node restart will still reset this effect down to the default. In a world where read compactions are few and far between, and the workload is sufficiently read-heavy to warrant read compactions, we want a higher bar for the number of times a file is read before we schedule a read compaction on it. Alternatively, the decrement of AllowedSeeks could be reduced in magnitude when the queue is too full, instead of changing the starting value, but this might be a bit more involved as it'd affect the read path.

Open to more ideas about this, but these were some I could think of.
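
A minimal sketch of what a bounded, de-duplicating queue along these lines might look like. The type names, the inclusive-span overlap check, and the capacity are illustrative, not Pebble's actual implementation.

```go
// readCompaction describes a suggested read-triggered compaction: a level
// and the key span of the file whose AllowedSeeks were exhausted.
type readCompaction struct {
	level      int
	start, end []byte
}

// readCompactionQueue is a small bounded queue of suggested read
// compactions. New entries replace overlapping ones; on overflow the
// oldest entry is dropped, on the theory that recent suggestions reflect
// the current read load and LSM shape.
type readCompactionQueue struct {
	cmp     func(a, b []byte) int // key comparator
	maxSize int
	items   []readCompaction
}

func (q *readCompactionQueue) add(rc readCompaction) {
	// De-duplicate: replace any queued entry whose key span overlaps the
	// new one, since the new entry is the more recent suggestion.
	for i, existing := range q.items {
		if q.cmp(rc.start, existing.end) <= 0 && q.cmp(existing.start, rc.end) <= 0 {
			q.items[i] = rc
			return
		}
	}
	if len(q.items) >= q.maxSize {
		// Drop the oldest suggestion; it may describe a key range that
		// has already been compacted away.
		q.items = q.items[1:]
	}
	q.items = append(q.items, rc)
}

// pop returns the oldest suggestion, or false if the queue is empty.
func (q *readCompactionQueue) pop() (readCompaction, bool) {
	if len(q.items) == 0 {
		return readCompaction{}, false
	}
	rc := q.items[0]
	q.items = q.items[1:]
	return rc, true
}
```

With something like q := &readCompactionQueue{cmp: bytes.Compare, maxSize: 5}, the read-sampling path would call add and the compaction picker would call pop; the check in item 3 above (skipping entries whose file no longer exists in the latest Version) would then live at pop time.
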

@sumeerbhola (Collaborator, Author):

(Just adding stuff from my notes, which overlap)

  • I think the de-duplication issue needs to tie in with the "freshness" of the reads that are triggering the compaction, and with doing a compaction only at the level it was originally intended for (if there is no overlap at that level, then a compaction has happened and we need fresher reads to enqueue another one). A small "queue", as you mention, where each element is a (range, level, age, score) tuple, and where each addition de-dups and replaces on overflow, is certainly one way to accomplish that. A duplicate could prefer the newer entry, since that represents the current highest level with that range. When there isn't a duplicate, I'm not sure about simply replacing the oldest one -- we could maintain a hit count as a "score" whenever an addition finds the range already in the queue. And since this queue is going to be touched more often by Iterators than by compactions, it should probably be structured for fast de-duplication and fast eviction.
  • There is also the problem that, when we get to doing a compaction of something in the queue, even if we are not switching levels, it could overlap with many files. We could just discard the entry in that case.
  • And there is the case of the size ratio of the levels being out of whack. We could simply discard entries from the queue when that is the case.

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 7, 2021
Read-triggered compactions are already disabled on 21.1.
As the fixes to address known shortcomings with read-triggered
compactions are a bit involved (see
cockroachdb/pebble#1143 ), disable
the feature on master until that issue is fixed. That prevents
this known issue from getting in the way of performance
experiments.

Release note: None.
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jun 8, 2021
66152: storage: Disable read sampling and read compactions r=sumeerbhola a=itsbilal

Read-triggered compactions are already disabled on 21.1.
As the fixes to address known shortcomings with read-triggered
compactions are a bit involved (see
cockroachdb/pebble#1143 ), disable
the feature on master until that issue is fixed. That prevents
this known issue from getting in the way of performance
experiments.

Release note: None.

@bananabrick self-assigned this Jul 7, 2021
@jbowens (Collaborator) commented Jul 9, 2021:

Have we tested read-triggered compactions on workloads with heterogeneous access patterns for various regions of the keyspace?

I'm thinking about a read-heavy, write-little SQL table, flanked within the keyspace by two read-little, write-heavy SQL tables. Flushes to the write-heavy tables may increase read-amplification for the reads, while read-triggered compactions would increase the write amplification for the write tables.

@bananabrick (Contributor) commented Jul 28, 2021:

@jbowens

> Have we tested read-triggered compactions on workloads with heterogeneous access patterns for various regions of the keyspace?
>
> I'm thinking about a read-heavy, write-little SQL table, flanked within the keyspace by two read-little, write-heavy SQL tables. Flushes to the write-heavy tables may increase read-amplification for the reads, while read-triggered compactions would increase the write amplification for the write tables.

Haven't looked into this yet.

@itsbilal
A queue size of 5 kills the read-triggered compaction performance.
Here's a comparison of queue size 5 vs. queue size 100:

name                old ops/sec  new ops/sec  delta
ycsb/C/values=1024    416k ± 3%    639k ± 6%  +53.57%  (p=0.008 n=5+5)
name                old read     new read     delta
ycsb/C/values=1024   42.1G ± 7%   57.6G ± 4%  +36.80%  (p=0.008 n=5+5)
name                old write    new write    delta
ycsb/C/values=1024   42.1G ± 7%   57.6G ± 4%  +36.80%  (p=0.008 n=5+5)
name                old r-amp    new r-amp    delta
ycsb/C/values=1024    5.10 ± 1%    3.16 ± 3%  -38.14%  (p=0.008 n=5+5)
name                old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00          ~     (all equal)

Both benchmarks were run for over 10 minutes.

  1. Essentially, I run YCSB C with 1e7 initial keys. When the benchmark starts, for the first ~3 minutes, compactions which aren't read compactions occur. These probably occur to balance the LSM after the initial inserts.

  2. Then, at the ~3 minute mark, read compactions start to happen. All this while, read compactions were being queued. So the queue with capacity 100 ends up doing 100+ compactions over a minute or two, while the queue with capacity ~5 does only ~5 compactions over 20-30 seconds.

Note that the small queue effectively throws away ~100 suggested compactions which were otherwise providing value, because it is those compactions which really improve the performance of the YCSB benchmark.

  3. After this point, read compactions are still added to the queue, but compactions don't really happen anymore (from the point when the read compactions finish until the 10 minute mark). So the pickAuto function in compaction_picker.go is never called.

  4. When I look at the benchmark logs, at some point the read compaction queue becomes empty (the rate at which compactions are added to the queue can sometimes be lower than the rate at which compactions happen). After this point, no more compactions are queued. This makes sense to me, looking at the code.

I think the solution is to make sure that we keep trying to compact whenever we add to the read compaction queue.

@itsbilal (Member):

If the queue empties out, what's likely happening is that the AllowedSeeks have been decremented all the way to or below 0 and so the file can never be re-queued for read compactions. That plus a bounded queue is probably resulting in a black hole of files that will never again be eligible for read compactions; it's why the queue is clearing out.

One thing you could do to address this is to have the same reader that sees allowedSeeks == 0 in sampleReads also reset AllowedSeeks back to what it was at the file's creation. That way, if the file falls out of the queue because a lot of other files are eligible for read-triggered compactions, but it keeps being read, it'll eventually become eligible again when AllowedSeeks falls to 0 again.

Another option is to have all readers that see a negative AllowedSeeks enqueue the file, and to let the queue's bounded size plus deduplication take care of things from there. The allowedSeeks == 0 logic made sense when the queue did absolutely no deduplication, but now that it does, that constraint can be relaxed.

Did you make the change to better account for overlapping spans instead of just looking for matching start+end strings? Are these findings after that suggested change?
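
A rough sketch of the reset idea above, assuming AllowedSeeks is tracked as an atomic counter on the per-file metadata and that the creation-time value is retained; the field and method names are illustrative, not Pebble's actual code.

```go
import "sync/atomic"

// fileMeta is a stand-in for the per-sstable metadata that tracks read
// sampling state.
type fileMeta struct {
	allowedSeeks        int64 // decremented by sampled reads
	initialAllowedSeeks int64 // the value assigned at file creation
}

// sampleRead is called when read sampling lands on this file. It returns
// true when the file should be (re-)queued for a read-triggered
// compaction. Resetting the counter lets a file that keeps being read
// become eligible again, even if its earlier suggestion fell out of a
// bounded queue.
func (f *fileMeta) sampleRead() bool {
	if atomic.AddInt64(&f.allowedSeeks, -1) > 0 {
		return false
	}
	// Reset back to the creation-time value so continued reads can
	// re-trigger eligibility later.
	atomic.StoreInt64(&f.allowedSeeks, f.initialAllowedSeeks)
	return true
}
```
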

@bananabrick (Contributor) commented Jul 28, 2021:

> Did you make the change to better account for overlapping spans instead of just looking for matching start+end strings? Are these findings after that suggested change?

No, it was only looking for perfect overlaps.

@itsbilal (Member):

> The change I was benchmarking only has a limited sized queue, and none of the other changes.

I was referring to the change I had advised on the limited size queue PR, the one where you deduplicate by key spans using the comparator and not the string method. That would result in better deduplication as well as better comparison performance.

@bananabrick (Contributor):

> The change I was benchmarking only has a limited sized queue, and none of the other changes.
>
> I was referring to the change I had advised on the limited size queue PR, the one where you deduplicate by key spans using the comparator and not the string method. That would result in better deduplication as well as better comparison performance.

Oops! Yeah, I edited that comment right when you posted this.

The issue is that we stop doing compactions even though the queue still has entries. This happens because nothing is triggering a call to maybeScheduleCompactionPicker, as there are no ongoing flushes.

I'll also change the AllowedSeeks logic a bit, to see if that helps.
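
A minimal sketch of that fix, using stand-ins for the Pebble internals named in this thread (DB.mu.compact.readCompactions and the maybeScheduleCompactionPicker entry point); addReadCompaction is a hypothetical helper, not an existing Pebble function.

```go
import "sync"

// readCompaction is a stand-in for Pebble's suggested read compaction:
// a level plus the key span of the file whose AllowedSeeks ran out.
type readCompaction struct {
	level      int
	start, end []byte
}

// DB is a stand-in for the parts of pebble.DB discussed in this thread.
type DB struct {
	mu struct {
		sync.Mutex
		compact struct {
			readCompactions []readCompaction
		}
	}
}

// maybeScheduleCompaction stands in for Pebble's scheduler entry point:
// it consults the compaction picker and starts a compaction if one is
// warranted, including read-triggered ones.
func (d *DB) maybeScheduleCompaction() {
	// Omitted: pick and launch a compaction.
}

// addReadCompaction appends a suggested read compaction and immediately
// kicks the scheduler, so that queued read compactions are picked up even
// when no flush would otherwise trigger a compaction pick.
func (d *DB) addReadCompaction(rc readCompaction) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.mu.compact.readCompactions = append(d.mu.compact.readCompactions, rc)
	d.maybeScheduleCompaction()
}
```
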

@bananabrick (Contributor):

@itsbilal

The AllowedSeeks fix was a great suggestion. Fixed my issue.

bananabrick added a commit that referenced this issue Oct 18, 2021
We were seeing some issues with read compactions as described here:
#1143

To prevent the issues, we came up with the following fixes:
1. We bound the size of the read compactions queue, and also add some
   de-duplication logic to the queue.
2. We prevent read compactions which are too wide.
3. We prevent read compactions where the file which is being compacted
   overlaps with too much of the output level, relative to its size.
4. We prevent read compactions if the file associated with a range has
   changed.

Results from ycsb/size=64 run using the pebble ycsb roachtest:

name              old ops/sec  new ops/sec  delta
ycsb/E/values=64    229k ± 4%    221k ± 3%     ~     (p=0.151 n=5+5)
ycsb/A/values=64    825k ± 6%    825k ± 6%     ~     (p=1.000 n=5+5)
ycsb/D/values=64    877k ± 6%    858k ± 6%     ~     (p=0.310 n=5+5)
ycsb/F/values=64    237k ± 0%    237k ± 0%   -0.31%  (p=0.008 n=5+5)
ycsb/C/values=64   2.63M ±17%   2.48M ±20%     ~     (p=0.310 n=5+5)
ycsb/B/values=64   1.09M ± 5%   1.10M ± 6%     ~     (p=0.690 n=5+5)

name              old read     new read     delta
ycsb/E/values=64   21.5G ± 6%   20.0G ± 6%   -6.82%  (p=0.032 n=5+5)
ycsb/A/values=64   68.1G ± 3%   67.7G ± 3%     ~     (p=0.421 n=5+5)
ycsb/D/values=64   36.8G ± 7%   34.6G ± 7%     ~     (p=0.095 n=5+5)
ycsb/F/values=64    160G ± 1%    160G ± 1%     ~     (p=1.000 n=5+5)
ycsb/C/values=64   2.75G ± 4%   2.92G ± 4%     ~     (p=0.056 n=5+5)
ycsb/B/values=64   47.4G ±15%   22.7G ± 6%  -52.10%  (p=0.008 n=5+5)

name              old write    new write    delta
ycsb/E/values=64   22.8G ± 6%   21.3G ± 6%   -6.67%  (p=0.032 n=5+5)
ycsb/A/values=64   98.9G ± 4%   98.6G ± 4%     ~     (p=0.548 n=5+5)
ycsb/D/values=64   42.2G ± 7%   39.9G ± 6%     ~     (p=0.095 n=5+5)
ycsb/F/values=64    190G ± 0%    190G ± 0%     ~     (p=1.000 n=5+5)
ycsb/C/values=64   2.67G ± 4%   2.83G ± 4%   +6.31%  (p=0.032 n=5+5)
ycsb/B/values=64   51.5G ±14%   27.0G ± 6%  -47.71%  (p=0.008 n=5+5)

name              old r-amp    new r-amp    delta
ycsb/E/values=64    2.66 ± 2%    3.09 ± 2%  +16.34%  (p=0.008 n=5+5)
ycsb/A/values=64    6.57 ± 2%    6.61 ± 2%     ~     (p=0.452 n=5+5)
ycsb/D/values=64    4.13 ± 1%    4.21 ± 0%   +1.89%  (p=0.008 n=5+5)
ycsb/F/values=64    0.00         0.00          ~     (all equal)
ycsb/C/values=64    1.28 ±27%    1.55 ±36%     ~     (p=0.190 n=5+5)
ycsb/B/values=64    3.71 ± 2%    3.75 ± 9%     ~     (p=0.579 n=5+5)

name              old w-amp    new w-amp    delta
ycsb/E/values=64    26.9 ± 3%    26.0 ± 2%     ~     (p=0.095 n=5+5)
ycsb/A/values=64    3.24 ± 2%    3.23 ± 2%     ~     (p=0.246 n=5+5)
ycsb/D/values=64    13.0 ± 1%    12.6 ± 1%   -3.19%  (p=0.008 n=5+5)
ycsb/F/values=64    10.8 ± 1%    10.8 ± 0%     ~     (p=0.198 n=5+5)
ycsb/C/values=64    0.00         0.00          ~     (all equal)
ycsb/B/values=64    12.8 ±11%     6.6 ± 1%  -48.17%  (p=0.008 n=5+5)
@bananabrick (Contributor):

Closing this out as I've tested the fixes with our benchmarks. I'll keep an eye on the telemetry cluster.
