
feat:splitstore:Configure max space used by hotstore and GC makes best effort to respect #10391

Merged
merged 19 commits into from
Mar 9, 2023

Conversation

ZenGround0
Contributor

@ZenGround0 ZenGround0 commented Mar 5, 2023

Related Issues

#10388

Proposed Changes

  • Add size reporting to chain walking subfunctions
  • Record sizes of the different DAGs walked during compaction
  • New config value for targeting max total space used by hotstore
  • GC logic to trigger moving GC adaptively based on hotstore size
  • Most importantly: splitstore refuses to do moving GC if disk usage would overflow the targeted max
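A rough sketch of what the new knob could look like in the node config. HotstoreMaxSpaceTarget is the name used in this PR's code; the section path and example value here are assumptions — see documentation/en/default-lotus-config.toml in the PR for the authoritative form:

```toml
[Chainstore]
  [Chainstore.Splitstore]
    # Hypothetical illustration of the setting described above: a soft cap,
    # in bytes, on total hotstore disk usage. GC makes a best effort to stay
    # under it, and moving GC is refused when the temporary copy made during
    # the move would push usage past this target.
    HotstoreMaxSpaceTarget = 650000000000
```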

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • New features have usage guidelines and / or documentation updates
  • Tests exist for new functionality or change in behavior
  • CI is green

Base automatically changed from feat/lotus-badger-gc to master March 7, 2023 07:52
@ZenGround0 ZenGround0 changed the title WIP hotstore reflects on space usage and targets max feat:splitstore:Configure max space used by hotstore and GC makes best effort to respect Mar 7, 2023
@ZenGround0 ZenGround0 marked this pull request as ready for review March 7, 2023 14:43
@ZenGround0 ZenGround0 requested a review from a team as a code owner March 7, 2023 14:43
@ZenGround0 ZenGround0 requested a review from vyzo March 7, 2023 14:44
Contributor

@vyzo vyzo left a comment


hmmmm, ok, looks sane.

Potential concern: can we end up continuously moving gc in every compaction?

Contributor

@magik6k magik6k left a comment


General idea makes a ton of sense. Couple of comments

Would be great to hook up those sizes / size estimates to the metrics endpoint somehow - https://github.com/filecoin-project/lotus/blob/master/metrics/metrics.go#L160

documentation/en/default-lotus-config.toml
blockstore/splitstore/splitstore_gc.go
blockstore/splitstore/splitstore_compact.go
blockstore/splitstore/splitstore_compact.go
blockstore/splitstore/splitstore_gc.go
Comment on lines 55 to 57
log.Warnf("Attention! Estimated moving GC size %d is not within safety buffer %d of target max %d, performing aggressive online GC to attempt to bring hotstore size down safely", copySizeApprox, targetBuffer, s.cfg.HotstoreMaxSpaceTarget)
log.Warn("If problem continues you can 1) temporarily allocate more disk space to hotstore and 2) reflect in HotstoreMaxSpaceTarget OR trigger manual move with `lotus chain prune hot-moving`")
log.Warn("If problem continues and you do not have any more disk space you can run continue to manually trigger online GC at aggressive thresholds (< 0.01) with `lotus chain prune hot`")
Contributor


Probably worth using the alert system to raise a (generally much more visible / loud) alert.

See e.g. https://github.com/filecoin-project/lotus/blob/master/node/modules/alerts.go

(alerts are for things which tell the user that something either is, or is about to be severely broken, and they need to take action)

@ZenGround0
Contributor Author

Potential concern: can we end up continuously moving gc in every compaction?

Reasonable concern; this is definitely possible. In my opinion, though, this is more of a feature than a bug. Misconfigurations are certainly possible, but I would say we don't expose any worse footgun than the current config already does by allowing FullGCFreq = 1.

What I've come to believe is that there are only two real problems users face with splitstore: disk overflow and chain out of sync. The only risk of too much moving GC, when it is done without overflowing the disk, is that it gets the chain out of sync. But I think we can address chain out of sync separately in #10392, or in a follow-up, after solving disk overflow.
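The disk-overflow protection being discussed can be sketched as a small decision function. This is illustrative only, not the actual Lotus implementation; the function name and the exact headroom math are assumptions based on this PR's description:

```go
package main

import "fmt"

// chooseGC sketches the adaptive decision this PR describes. Moving GC
// copies the live set into a fresh store, so while it runs the disk must
// hold roughly currentSize + copySizeApprox bytes. If that sum would
// overflow the configured space target, refuse moving GC and fall back
// to aggressive online (in-place) GC instead of risking disk exhaustion.
func chooseGC(currentSize, copySizeApprox, maxSpaceTarget int64) string {
	if currentSize+copySizeApprox <= maxSpaceTarget {
		return "moving" // enough headroom for the temporary copy
	}
	return "online-aggressive" // refuse moving GC; reclaim in place
}

func main() {
	// 40 GiB hotstore, ~35 GiB estimated live set, 100 GiB target:
	// the temporary copy fits, so moving GC is safe.
	fmt.Println(chooseGC(40<<30, 35<<30, 100<<30))
	// Same store with only a 60 GiB target: moving GC would overflow,
	// so only online GC is attempted.
	fmt.Println(chooseGC(40<<30, 35<<30, 60<<30))
}
```

Under this scheme a tight target simply means moving GC is skipped more often, which matches the "best effort to respect" wording in the PR title.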

@ribasushi
Collaborator

FullGCFreq = 1 with current config.

This is how I configure my experimental splitstore nodes; anything besides "do moving GC every time" makes no sense from an operational perspective: you want your dataset to always be as small as possible.

@ZenGround0
Contributor Author

anything besides "do moving GC every time" makes no sense from an operational perspective

While I agree that "do moving GC every time" often makes sense operationally, I think this is overstated. Moving GC is objectively more expensive than online GC in terms of memory, compute, and disk I/O. Paying that cost on every compaction to keep the datastore small is often a good idea, but keeping more disk around and doing moving GC in batches is also a reasonable point in the tradeoff space.

@ZenGround0
Contributor Author

@magik6k ready for another round. I ran out of time so filing this follow up to get alerting and metrics in place after the merge: #10428.

Contributor

@magik6k magik6k left a comment


One question and two small nits, but other than that looks good.

If it's not perfect, bugs can be fixed in the RC process.

blockstore/splitstore/splitstore_compact.go
blockstore/splitstore/splitstore_gc.go
@magik6k magik6k merged commit faedc12 into master Mar 9, 2023
@magik6k magik6k deleted the feat/record-hotstore-space branch March 9, 2023 16:41
@ZenGround0
Contributor Author

Thanks for the `make gen` assist @magik6k !!
