Garbage collection is unacceptably slow #2806
bowenwang1996 added the P-critical (Priority: critical) and A-storage (Area: storage and databases) labels on Jun 7, 2020.
@mikhailOK @frol @SkidanovAlex please help if you know anything related to RocksDB options.
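For context, here is a minimal sketch of the kind of RocksDB tuning being asked about, written against the rust-rocksdb crate; the specific option values are illustrative assumptions, not nearcore's actual configuration:

```rust
// Sketch only: RocksDB options that affect how much compaction work
// deletions can trigger. Values are assumptions, not nearcore's config.
use rocksdb::{Options, DB};

fn open_db(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // More background threads let compactions triggered by deletions
    // run in parallel instead of stalling foreground work.
    opts.set_max_background_jobs(4);
    // Larger memtables mean fewer flushes and fewer small L0 files,
    // which in turn means fewer compaction triggers.
    opts.set_write_buffer_size(256 * 1024 * 1024);
    // Raise the L0 compaction trigger so a burst of deletes does not
    // immediately kick off a compaction.
    opts.set_level_zero_file_num_compaction_trigger(8);
    DB::open(&opts, path)
}
```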
bowenwang1996 added a commit that referenced this issue on Jun 7, 2020:
A hot fix that mitigates #2806 by reducing the garbage collection step to the absolute minimum. However, even if we only advance the tail by 1 height at each step, `clear_data` still takes about 0.5s to execute, which is very suboptimal.

Test plan: deploy on betanet and observe that garbage collection speeds up from more than 30s to 0.5s at each step.
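A sketch of the mitigation the commit describes, advancing the GC tail by a single height per step so each call does a bounded amount of work; `ChainStore`, `clear_data_step`, and `clear_height` are hypothetical stand-ins, not nearcore's real API:

```rust
// Hypothetical sketch of the hot fix: advance the GC tail by one height
// per step so each garbage-collection call is bounded in cost.
struct ChainStore {
    tail: u64,
}

impl ChainStore {
    /// Delete block and chunk data at exactly one height, then advance
    /// the tail. Called once per garbage-collection step.
    fn clear_data_step(&mut self) -> Result<(), String> {
        let height = self.tail;
        self.clear_height(height)?; // delete blocks/chunks at `height`
        self.tail = height + 1;     // advance the tail by 1, the minimum step
        Ok(())
    }

    fn clear_height(&mut self, _height: u64) -> Result<(), String> {
        // ... issue deletes for all keys belonging to `_height` (elided) ...
        Ok(())
    }
}
```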
It seems I can't resolve this right now. Please assign it to me when it becomes relevant.
bowenwang1996 added a commit that referenced this issue on Jun 22, 2020:
Garbage collection is slow because we do not persist the chunk tail, and therefore `clear_chunk_data` iterates from 0 to `min_chunk_height` every time it is called. Fixes #2806.
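A sketch of the fix as the commit message describes it: persist a chunk tail so `clear_chunk_data` resumes from the last cleared height instead of scanning from 0. The key name and helper function below are assumptions for illustration, not nearcore's real schema:

```rust
use std::convert::TryInto;
use rocksdb::DB;

// Hypothetical key under which the chunk tail is persisted.
const CHUNK_TAIL_KEY: &[u8] = b"CHUNK_TAIL";

fn clear_chunk_data(db: &DB, min_chunk_height: u64) -> Result<(), rocksdb::Error> {
    // Resume from the persisted chunk tail; without it, every call
    // re-scanned heights starting at 0, which is what made GC slow.
    let start: u64 = match db.get(CHUNK_TAIL_KEY)? {
        Some(v) => u64::from_be_bytes((&v[..]).try_into().unwrap()),
        None => 0,
    };
    for height in start..min_chunk_height {
        // Delete chunk data belonging to `height` (details elided).
        delete_chunks_at_height(db, height)?;
    }
    // Persist the new chunk tail so the next call skips cleared heights.
    db.put(CHUNK_TAIL_KEY, min_chunk_height.to_be_bytes())
}

// Stand-in for the per-height deletion logic; not nearcore's real code.
fn delete_chunks_at_height(_db: &DB, _height: u64) -> Result<(), rocksdb::Error> {
    Ok(())
}
```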
Original issue description:

When the node garbage collects data, it invokes `clear_data`, which deletes old block and chunk data. However, in practice we have observed extraordinary slowness in this function. Under the current setup, at every call to `clear_data` we garbage collect 100 heights' worth of data, and it takes 30-60s to execute after the network has been running for a while, which is absolutely devastating and unacceptable. I suspect that this has something to do with RocksDB compaction: the deletion of data might have triggered compaction, which can take a while to run, depending on the size of the data. To address this issue, we should […]