Add Horizon metric for keeping track of slow ingestion restarts #5417

tamirms · 2024-08-08T22:06:06Z

While rolling out the limited-history Horizon instance to production, we realized that some of the production ingesting Horizon nodes were configured to run captive core with BucketDB disabled. Consequently, captive core was running in in-memory mode, which made Horizon very slow to resume ingestion after restarts.

We should have a metric that is incremented whenever captive core has to catch up from scratch instead of quickly resuming from the last closed ledger (LCL) recorded in BucketDB. Once this metric is in place, we can modify our release testing checklist to make sure that the release branch of Horizon does not regress by restarting captive core from scratch at an unusually high rate.
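To make the proposal concrete, here is a minimal, dependency-free sketch of the kind of counter described above. It is not the implementation that landed in Horizon: the names `restartMetrics`, `recordStart`, `fromScratchRate`, `startResumed`, and `startFromScratch` are all hypothetical, and plain atomic counters stand in for whatever metrics library is actually used. How captive core would report which kind of start it performed is also assumed rather than taken from the codebase.

```go
package main

import "sync/atomic"

// startKind is a hypothetical label for how captive core began ingesting.
type startKind int

const (
	// startResumed: captive core resumed from the LCL recorded in BucketDB.
	startResumed startKind = iota
	// startFromScratch: captive core had to catch up from scratch.
	startFromScratch
)

// restartMetrics is a hypothetical stand-in for the proposed metric.
// It keeps one counter per kind of start so that a rate can be derived.
type restartMetrics struct {
	resumed     atomic.Int64
	fromScratch atomic.Int64
}

// recordStart increments the counter that matches the observed start kind.
func (m *restartMetrics) recordStart(kind startKind) {
	if kind == startFromScratch {
		m.fromScratch.Add(1)
		return
	}
	m.resumed.Add(1)
}

// fromScratchRate returns the fraction of recorded starts that were
// from-scratch catch-ups, or 0 when no start has been recorded yet.
func (m *restartMetrics) fromScratchRate() float64 {
	scratch := m.fromScratch.Load()
	total := scratch + m.resumed.Load()
	if total == 0 {
		return 0
	}
	return float64(scratch) / float64(total)
}
```

A release checklist item could then compare `fromScratchRate()` for the release branch against the rate seen on the previous release and flag an unusual increase. Keeping a separate counter for resumed starts is a design choice of this sketch: it lets the check reason about a rate rather than a raw count, which depends on how often nodes were restarted.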

tamirms added this to Platform Scrum Aug 13, 2024

github-project-automation bot moved this to Backlog in Platform Scrum Aug 13, 2024

tamirms added the cdp-horizon-scrum label Aug 13, 2024

tamirms added this to the platform sprint 50 milestone Aug 13, 2024

tamirms self-assigned this Aug 29, 2024

tamirms mentioned this issue Aug 30, 2024

ingest/ledgerbackend: Add prometheus metrics to track captive core startup time #5449

Merged

7 tasks

tamirms closed this as completed in #5449 Sep 3, 2024

github-project-automation bot moved this from Needs Review to Done in Platform Scrum Sep 3, 2024
