services/horizon/internal/ingest: reap lookup tables without blocking ingestion #5405

tamirms · 2024-07-30T19:15:31Z

PR Checklist

PR Structure

This PR has reasonably narrow scope (if not, break it down into smaller PRs).
This PR avoids mixing refactoring changes with feature changes (split into two PRs
otherwise).
This PR's title starts with name of package that is most changed in the PR, ex.
services/friendbot, or all or doc if the changes are broad or impact many
packages.

Thoroughness

This PR adds tests for the most critical parts of the new functionality or fixes.
I've updated any docs (developer docs, .md
files, etc.) affected by this change. Take a look in the docs folder for a given service,
like this one.

Release planning

I've updated the relevant CHANGELOG (here for Horizon) if
needed with deprecations, added features, breaking changes, and DB schema changes.
I've decided if this PR requires a new major/minor version according to
semver, or if it's mainly a patch change. The PR is targeted at the next
release branch if it's not a patch change.

What

Close #4870

This PR improves reaping of history lookup tables (e.g. history_accounts, history_claimable_balances) so that it can run safely in parallel with ingestion. Currently, reaping of history lookup tables blocks ingestion, so if the reaping queries take too long, ingestion lags. With this PR, reaping of history lookup tables can run concurrently with ingestion with minimal contention. Note also that this PR does not degrade the performance of either reingestion or live ingestion.

When reviewing this PR it would be helpful to read this design doc:

https://docs.google.com/document/d/1CGfBCS99MTEZDP4mMhV1o6Z5NE_Tlg7ENCcWTwzhlio/edit

Known limitations

After running a full vacuum on history_accounts, the reaping query sped up dramatically. Previously, the duration of reaping the history_accounts table peaked at ~1.9 seconds:

https://grafana.stellar-ops.com/d/x8xDSQQIk/stellar-horizon?orgId=1&from=1722295775773&to=1722400061302&var-environment=stg&var-cluster=pubnet&var-network=All&var-route=All&viewPanel=2531

After the vacuum, the average duration for reaping history_accounts was ~20 ms and the peak duration was ~400 ms:

https://grafana.stellar-ops.com/d/x8xDSQQIk/stellar-horizon?orgId=1&from=1724782666959&to=1724869066959&var-environment=stg&var-cluster=pubnet&var-network=All&var-route=All&viewPanel=2531

This means that the risk of history lookup table reaping taking long enough to introduce ingestion lag is much less of a concern.

Update:

After running reaping of history lookup tables on staging for 24 hours I have observed that the peak duration actually reaches 600 ms.

https://grafana.stellar-ops.com/d/x8xDSQQIk/stellar-horizon?orgId=1&from=1724866821793&to=1724953221793&var-environment=stg&var-cluster=pubnet&var-network=All&var-route=All&viewPanel=2531

services/horizon/internal/db2/history/account_loader.go

services/horizon/internal/db2/history/key_value.go

services/horizon/internal/db2/history/loader_concurrency_test.go

services/horizon/internal/db2/history/main.go

services/horizon/internal/db2/history/verify_lock.go

services/horizon/internal/ingest/main.go

sreuland · 2024-08-30T21:48:39Z

One edge case I wanted to check on: if a user reingests an older range that goes further back than the retention period cutoff, and reaping of the data and lookup tables has already completed for that retention period, will the next iteration of the lookup reaper detect those rows and delete the qualifying (orphaned) lookup ids? I ask because of the reaper offsets stored in key-value: it seems that once those advance, the reaper won't inspect that older id range anymore.

Co-authored-by: shawn <sreuland@users.noreply.github.com>

tamirms · 2024-09-03T13:20:23Z

@sreuland

will the next iteration of lookup reaper sense those and delete the qualified (orphaned) lookup ids in that case? I ask b/c of the offsets for reapers that are stored in key-value, it seems like once those advance, the reaper won't inspect that older id range anymore?

No, in that scenario those rows will not be deleted in the next iteration. However, the reaper will eventually traverse all rows of the history lookup tables. Once it does, it starts from 0 again. So the reaper will eventually wrap around and pick up those orphaned rows (though that might take a long time for very large tables like history_claimable_balances).
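The wrap-around behaviour described above can be illustrated with a small in-memory sketch. This is not Horizon's implementation: `lookupTable`, `reaper`, `reapBatch`, the `referenced` set, and the batch size are all hypothetical names and values chosen for illustration, and the real reaper runs SQL against Postgres and persists its offset in the key-value table. The sketch only shows why a row that becomes orphaned behind the stored offset is skipped until the scan reaches the end of the table and restarts from 0.

```go
package main

import (
	"fmt"
	"sort"
)

// lookupTable is a hypothetical in-memory stand-in for a history lookup
// table such as history_accounts: just a set of row ids.
type lookupTable struct {
	ids map[int64]bool
}

// reaper holds the state that would be persisted between runs.
// Assumption: one offset per lookup table, kept here as a plain field.
type reaper struct {
	offset    int64 // the next batch only looks at ids greater than this
	batchSize int
}

// reapBatch inspects up to batchSize ids above the stored offset, deletes
// the ones that are no longer referenced, and advances the offset.
// When the batch comes back short, the end of the table was reached and
// the offset wraps around to 0.
func (r *reaper) reapBatch(t *lookupTable, referenced map[int64]bool) []int64 {
	var candidates []int64
	for id := range t.ids {
		if id > r.offset {
			candidates = append(candidates, id)
		}
	}
	sort.Slice(candidates, func(i, j int) bool { return candidates[i] < candidates[j] })
	if len(candidates) > r.batchSize {
		candidates = candidates[:r.batchSize]
	}

	var deleted []int64
	for _, id := range candidates {
		if !referenced[id] {
			delete(t.ids, id)
			deleted = append(deleted, id)
		}
	}

	if len(candidates) < r.batchSize {
		r.offset = 0 // reached the end of the table: start over
	} else {
		r.offset = candidates[len(candidates)-1]
	}
	return deleted
}

func main() {
	table := &lookupTable{ids: map[int64]bool{}}
	referenced := map[int64]bool{}
	for id := int64(1); id <= 6; id++ {
		table.ids[id] = true
		referenced[id] = true
	}

	r := &reaper{batchSize: 2}
	r.reapBatch(table, referenced) // scans ids 1 and 2; offset becomes 2

	// id 1 becomes orphaned after the reaper has already moved past it.
	delete(referenced, 1)

	for run := 2; table.ids[1]; run++ {
		deleted := r.reapBatch(table, referenced)
		fmt.Printf("run %d: deleted=%v offset=%d\n", run, deleted, r.offset)
	}
}
```

In this toy run the orphaned id 1 survives the batches that scan ids 3 to 6 and the short batch that resets the offset, and is only deleted on the fifth call, once the scan has wrapped around. With a table as large as history_claimable_balances, the equivalent full pass could take correspondingly longer.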

services/horizon/internal/db2/history/key_value.go

tamirms · 2024-09-03T13:22:11Z

@sreuland I believe I have addressed your feedback. PTAL, thanks!

sreuland

looks great, nice work!

tamirms force-pushed the concurrent-reap branch 3 times, most recently from 37b3e5b to c671a7a on August 15, 2024 10:31

tamirms added 9 commits August 27, 2024 11:46

reap lookup tables without blocking ingestion

6f9397e

refactor

90e56fc

fix missing error check

acbe6e0

fix duration metrics

0d2c17d

add log

6922a39

fix delete query

a169e72

fix rows_deleted log

513ad1b

lock pre-existing rows

4f36d70

change order to fix race condition with reaper

93be300

tamirms force-pushed the concurrent-reap branch from c671a7a to 93be300 on August 27, 2024 10:46

tamirms added 3 commits August 27, 2024 13:14

use CTE above delete query

27afb89

add concurrency modes

2c17f4f

add tests for concurrency mode

399818f

tamirms marked this pull request as ready for review August 28, 2024 18:16

tamirms requested a review from a team August 29, 2024 17:47

sreuland reviewed Aug 29, 2024
