ingest/ledgerbackend: Debug slow ingestion during catchup in captive-core #3132

Closed
bartekn opened this issue Oct 15, 2020 · 1 comment · Fixed by #3187

bartekn commented Oct 15, 2020

What version are you using?

1.10.0

What did you do?

Start Horizon + Captive Core from an empty DB. Wait for state ingestion. Then it should catch up from the checkpoint ledger to the latest ledger.

What did you expect to see?

Ingestion is fast.

What did you see instead?

Every 4-5 ledgers, ingestion is slower even when the number of ops/changes is comparable to the previous ledgers. See my catchup from earlier today:

time="2020-10-14T21:14:48.059Z" level=info msg="Processed ledger" commit=true duration=0.0413032 ledger=true pid=1 sequence=1228494 service=expingest state=true stats_accounts_created=0 stats_accounts_removed=0 stats_accounts_updated=2 stats_data_created=0 stats_data_removed=0 stats_data_updated=0 stats_offers_created=0 stats_offers_removed=0 stats_offers_updated=0 stats_operations=1 stats_operations_account_merge=0 stats_operations_allow_trust=0 stats_operations_begin_sponsoring_future_reserves=0 stats_operations_bump_sequence=0 stats_operations_change_trust=0 stats_operations_claim_claimable_balance=0 stats_operations_create_account=0 stats_operations_create_claimable_balance=0 stats_operations_create_passive_sell_offer=0 stats_operations_end_sponsoring_future_reserves=0 stats_operations_in_failed=0 stats_operations_in_successful=1 stats_operations_inflation=0 stats_operations_manage_buy_offer=0 stats_operations_manage_data=0 stats_operations_manage_sell_offer=0 stats_operations_path_payment_strict_receive=0 stats_operations_path_payment_strict_send=0 stats_operations_payment=1 stats_operations_revoke_sponsorship=0 stats_operations_set_options=0 stats_transactions=1 stats_transactions_failed=0 stats_transactions_successful=1 stats_trust_lines_created=0 stats_trust_lines_removed=0 stats_trust_lines_updated=1
time="2020-10-14T21:14:48.060Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=1228493)" next_state="resume(latestSuccessfullyProcessedLedger=1228494)" pid=1 service=expingest
GCH27 [Herder INFO] Quorum information for 1228565 : {"agree":3,"delayed":0,"disagree":0,"fail_at":2,"hash":"59d361","lag_ms":13,"ledger":1228565,"missing":0,"phase":"EXTERNALIZE"}

time="2020-10-14T21:14:48.083Z" level=info msg="Processing ledger" commit=true ledger=true pid=1 sequence=1228495 service=expingest state=true
GCH27 [Ledger INFO] Got consensus: [seq=1228567, prev=f58820, txs=8, ops=8, sv: [ SIGNED@sdf_testnet_3 txH: 8ef9b8, ct: 1602710085, upgrades: [ ] ]]
GCH27 [Ledger INFO] Close of ledger 1228567 buffered
time="2020-10-14T21:14:48.089Z" level=info msg="Trimmed offers table" offer_rows_removed=0 pid=1 service=expingest
GCH27 [Tx INFO] applying ledger 1228499 (txs:2, ops:2)
GCH27 [Tx INFO] applying ledger 1228500 (txs:5, ops:84)
time="2020-10-14T21:14:49.118Z" level=warning msg="error updating stellar-core cursor" error="Setting stellar-core cursor failed: http request errored: Get \"http://localhost:11626/setcursor?cursor=1228495&id=HORIZON\": context deadline exceeded" pid=1 service=expingest
time="2020-10-14T21:14:49.119Z" level=info msg="Processed ledger" commit=true duration=1.0357502 ledger=true pid=1 sequence=1228495 service=expingest state=true stats_accounts_created=4 stats_accounts_removed=0 stats_accounts_updated=14 stats_data_created=0 stats_data_removed=0 stats_data_updated=0 stats_offers_created=0 stats_offers_removed=0 stats_offers_updated=0 stats_operations=5 stats_operations_account_merge=0 stats_operations_allow_trust=0 stats_operations_begin_sponsoring_future_reserves=0 stats_operations_bump_sequence=0 stats_operations_change_trust=0 stats_operations_claim_claimable_balance=0 stats_operations_create_account=4 stats_operations_create_claimable_balance=0 stats_operations_create_passive_sell_offer=0 stats_operations_end_sponsoring_future_reserves=0 stats_operations_in_failed=0 stats_operations_in_successful=5 stats_operations_inflation=0 stats_operations_manage_buy_offer=0 stats_operations_manage_data=0 stats_operations_manage_sell_offer=0 stats_operations_path_payment_strict_receive=0 stats_operations_path_payment_strict_send=0 stats_operations_payment=1 stats_operations_revoke_sponsorship=0 stats_operations_set_options=0 stats_transactions=5 stats_transactions_failed=0 stats_transactions_successful=5 stats_trust_lines_created=0 stats_trust_lines_removed=0 stats_trust_lines_updated=1

Ledger 1228494 was ingested in 0.04s, but 1228495, with almost the same number of ops/changes, took more than 1s!

I think this may be related to the buffering of ledgers loaded in the captive core backend, which I thought had been fixed in #2763, but apparently it can still be very slow.

bartekn commented Oct 30, 2020

I wrote a small script based on captive-core-start-tester to help me debug this. I think it comes down to the read-ahead buffer size (currently just 2 ledgers). There are two situations in which this issue can occur:

  • If ingestion is faster than Stellar-Core applies ledgers, Horizon waits 1s for the next available ledger because GetLedger is non-blocking in online mode (instead of blocking, it just returns false in its first return value). Horizon could ingest more, but there are only 2 ledgers in the buffer (see the sketch after this list).
  • If Stellar-Core is faster, it can't keep applying ledgers because the ingestion package doesn't read more than 2 ahead.
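
For reference, the polling pattern from the first bullet looks roughly like the sketch below. This is illustrative only: `ledgerSource` and `processLedger` are hypothetical stand-ins, the 1s sleep mirrors the wait described above, and the `GetLedger(sequence) (bool, xdr.LedgerCloseMeta, error)` shape is an assumption about the backend interface of that era.

```go
package main

import (
	"time"

	"github.com/stellar/go/xdr"
)

// ledgerSource is a pared-down stand-in for the ledger backend interface
// discussed above: in online mode GetLedger returns ok=false when the
// requested ledger is not in the read-ahead buffer yet.
type ledgerSource interface {
	GetLedger(sequence uint32) (bool, xdr.LedgerCloseMeta, error)
}

// ingestLoop sketches the non-blocking polling described in the first
// bullet: when the ledger is not available, the caller sleeps ~1s and
// retries, even though the ledger may arrive a moment later.
func ingestLoop(backend ledgerSource, sequence uint32, processLedger func(xdr.LedgerCloseMeta) error) error {
	for {
		ok, meta, err := backend.GetLedger(sequence)
		if err != nil {
			return err
		}
		if !ok {
			// Not in the read-ahead buffer yet; poll again after a delay.
			time.Sleep(time.Second)
			continue
		}
		if err := processLedger(meta); err != nil {
			return err
		}
		sequence++
	}
}
```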

I think we should increase the buffer size, but it's not trivial because there are two constraints that must be met:

  • The buffer shouldn't exceed X amount of RAM, but...
  • It should hold at least one ledger no matter how big it is (even if it exceeds the allowed memory constraint).

This is because some ledgers can exceed the allowed size, and having only a memory constraint would result in a deadlock.
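
A minimal sketch of a buffer that satisfies both constraints (the type, method names, and the size cap are hypothetical, not the eventual implementation): total buffered bytes are capped, but an empty buffer always admits the next ledger, so a single oversized ledger can never deadlock the reader.

```go
package main

import "sync"

// ledgerBuffer caps total buffered bytes but always admits at least one
// ledger, so an oversized ledger cannot deadlock the pipeline.
// All names and the size limit are illustrative.
type ledgerBuffer struct {
	mu       sync.Mutex
	cond     *sync.Cond
	frames   [][]byte // raw ledger close meta frames read from Stellar-Core
	size     int      // total bytes currently buffered
	maxBytes int      // soft memory cap, e.g. a few hundred MB
}

func newLedgerBuffer(maxBytes int) *ledgerBuffer {
	b := &ledgerBuffer{maxBytes: maxBytes}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// push blocks while the buffer is over the cap, unless the buffer is empty:
// an empty buffer must always accept the next ledger, however large.
func (b *ledgerBuffer) push(frame []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.frames) > 0 && b.size+len(frame) > b.maxBytes {
		b.cond.Wait()
	}
	b.frames = append(b.frames, frame)
	b.size += len(frame)
	b.cond.Broadcast()
}

// pop blocks until a ledger is available, then frees its space so that a
// waiting push can proceed.
func (b *ledgerBuffer) pop() []byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.frames) == 0 {
		b.cond.Wait()
	}
	frame := b.frames[0]
	b.frames = b.frames[1:]
	b.size -= len(frame)
	b.cond.Broadcast()
	return frame
}
```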

I'll try to code a solution for this next week.

bartekn added a commit that referenced this issue Nov 9, 2020
…ore backend (#3187)

This commit introduces `bufferedLedgerMetaReader`, which decouples
buffering and unmarshaling from `stellarCoreRunner` and
`CaptiveStellarCore`.

`bufferedLedgerMetaReader` fixes multiple issues:

* It fixes #3132 by increasing the internal buffers' sizes to hold more
ledgers. This makes the catchup code much faster.
* It fixes #3158 - `bufferedLedgerMetaReader` allowed rewriting the shutdown
code to a much simpler version. Now `bufferedLedgerMetaReader` and
`CaptiveStellarCore` listen to a single shutdown signal:
`stellarCoreRunner.getProcessExitChan()`. When the Stellar-Core process
terminates, the `bufferedLedgerMetaReader.Start` goroutine will stop and
`CaptiveStellarCore` will return a user-friendly error in the `PrepareRange`
and `GetLedger` methods. When `CaptiveStellarCore.Close()` is called, it
kills the Stellar-Core process, triggering the shutdown code explained
above.
* It decouples buffering and unmarshaling into a single struct. This makes
`stellarCoreRunner` and `CaptiveStellarCore` simpler.
* It fixes a possible OOM issue when the network closes a series of large
ledgers. In such a case `bufferedLedgerMetaReader` will wait for the buffer
to be consumed before reading more ledgers into memory, preventing
increased memory usage.
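
A rough sketch of the shape this describes (the fields and helper below are simplified and hypothetical, not the actual code in #3187): one goroutine reads and unmarshals meta frames into a bounded channel, stops as soon as the process-exit signal fires, and the blocking send into the bounded channel provides the backpressure that prevents the OOM scenario.

```go
package main

import "github.com/stellar/go/xdr"

// bufferedLedgerMetaReader is a simplified, hypothetical sketch of the idea
// described in the commit message.
type bufferedLedgerMetaReader struct {
	// metaChan is the bounded read-ahead buffer consumed by the captive backend.
	metaChan chan xdr.LedgerCloseMeta
	// readMeta reads and unmarshals one frame from the Stellar-Core meta pipe.
	readMeta func() (xdr.LedgerCloseMeta, error)
}

// start runs until the meta pipe errors or the Stellar-Core process exits.
// Sending into the bounded metaChan blocks when the buffer is full, so large
// ledgers are only read into memory once previous ones have been consumed.
func (r *bufferedLedgerMetaReader) start(processExit <-chan struct{}) {
	defer close(r.metaChan)
	for {
		meta, err := r.readMeta()
		if err != nil {
			return // pipe closed or frame could not be unmarshaled
		}
		select {
		case r.metaChan <- meta:
		case <-processExit:
			// Stellar-Core terminated; stop so callers can surface an error.
			return
		}
	}
}
```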