ingest/ledgerbackend: Debug slow ingestion during catchup in captive-core #3132

Closed
bartekn opened this issue Oct 15, 2020 · 1 comment · Fixed by #3187

bartekn commented Oct 15, 2020

What version are you using?

1.10.0

What did you do?

Start Horizon + Captive Core from an empty DB. Wait for state ingestion. Then it should catch up from the checkpoint ledger to the latest ledger.

What did you expect to see?

Ingestion is fast.

What did you see instead?

Every 4-5 ledgers, ingestion is slower even when the number of ops/changes is comparable to the previous ledgers. See my catchup from earlier today:

time="2020-10-14T21:14:48.059Z" level=info msg="Processed ledger" commit=true duration=0.0413032 ledger=true pid=1 sequence=1228494 service=expingest state=true stats_accounts_created=0 stats_accounts_removed=0 stats_accounts_updated=2 stats_data_created=0 stats_data_removed=0 stats_data_updated=0 stats_offers_created=0 stats_offers_removed=0 stats_offers_updated=0 stats_operations=1 stats_operations_account_merge=0 stats_operations_allow_trust=0 stats_operations_begin_sponsoring_future_reserves=0 stats_operations_bump_sequence=0 stats_operations_change_trust=0 stats_operations_claim_claimable_balance=0 stats_operations_create_account=0 stats_operations_create_claimable_balance=0 stats_operations_create_passive_sell_offer=0 stats_operations_end_sponsoring_future_reserves=0 stats_operations_in_failed=0 stats_operations_in_successful=1 stats_operations_inflation=0 stats_operations_manage_buy_offer=0 stats_operations_manage_data=0 stats_operations_manage_sell_offer=0 stats_operations_path_payment_strict_receive=0 stats_operations_path_payment_strict_send=0 stats_operations_payment=1 stats_operations_revoke_sponsorship=0 stats_operations_set_options=0 stats_transactions=1 stats_transactions_failed=0 stats_transactions_successful=1 stats_trust_lines_created=0 stats_trust_lines_removed=0 stats_trust_lines_updated=1
time="2020-10-14T21:14:48.060Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=1228493)" next_state="resume(latestSuccessfullyProcessedLedger=1228494)" pid=1 service=expingest
GCH27 [Herder INFO] Quorum information for 1228565 : {"agree":3,"delayed":0,"disagree":0,"fail_at":2,"hash":"59d361","lag_ms":13,"ledger":1228565,"missing":0,"phase":"EXTERNALIZE"}

time="2020-10-14T21:14:48.083Z" level=info msg="Processing ledger" commit=true ledger=true pid=1 sequence=1228495 service=expingest state=true
GCH27 [Ledger INFO] Got consensus: [seq=1228567, prev=f58820, txs=8, ops=8, sv: [ SIGNED@sdf_testnet_3 txH: 8ef9b8, ct: 1602710085, upgrades: [ ] ]]
GCH27 [Ledger INFO] Close of ledger 1228567 buffered
time="2020-10-14T21:14:48.089Z" level=info msg="Trimmed offers table" offer_rows_removed=0 pid=1 service=expingest
GCH27 [Tx INFO] applying ledger 1228499 (txs:2, ops:2)
GCH27 [Tx INFO] applying ledger 1228500 (txs:5, ops:84)
time="2020-10-14T21:14:49.118Z" level=warning msg="error updating stellar-core cursor" error="Setting stellar-core cursor failed: http request errored: Get \"http://localhost:11626/setcursor?cursor=1228495&id=HORIZON\": context deadline exceeded" pid=1 service=expingest
time="2020-10-14T21:14:49.119Z" level=info msg="Processed ledger" commit=true duration=1.0357502 ledger=true pid=1 sequence=1228495 service=expingest state=true stats_accounts_created=4 stats_accounts_removed=0 stats_accounts_updated=14 stats_data_created=0 stats_data_removed=0 stats_data_updated=0 stats_offers_created=0 stats_offers_removed=0 stats_offers_updated=0 stats_operations=5 stats_operations_account_merge=0 stats_operations_allow_trust=0 stats_operations_begin_sponsoring_future_reserves=0 stats_operations_bump_sequence=0 stats_operations_change_trust=0 stats_operations_claim_claimable_balance=0 stats_operations_create_account=4 stats_operations_create_claimable_balance=0 stats_operations_create_passive_sell_offer=0 stats_operations_end_sponsoring_future_reserves=0 stats_operations_in_failed=0 stats_operations_in_successful=5 stats_operations_inflation=0 stats_operations_manage_buy_offer=0 stats_operations_manage_data=0 stats_operations_manage_sell_offer=0 stats_operations_path_payment_strict_receive=0 stats_operations_path_payment_strict_send=0 stats_operations_payment=1 stats_operations_revoke_sponsorship=0 stats_operations_set_options=0 stats_transactions=5 stats_transactions_failed=0 stats_transactions_successful=5 stats_trust_lines_created=0 stats_trust_lines_removed=0 stats_trust_lines_updated=1

Ledger 1228494 was ingested in 0.04s, but 1228495, with almost the same number of ops/changes, took more than 1s!

I think this may be related to the buffering of ledgers loaded in the captive core backend, which I thought had been fixed in #2763, but apparently it can still be very slow.

bartekn commented Oct 30, 2020

I wrote a small script based on captive-core-start-tester to help me debug this. I think it comes down to the read-ahead buffer size (currently just 2 ledgers). There are two situations in which this issue can occur:

  • If ingestion is faster than Stellar-Core applies ledgers, Horizon waits 1s for the next available ledger because GetLedger is non-blocking in online mode (instead of blocking, it just returns false in its first return value). Horizon could ingest more, but there are only 2 ledgers in the buffer (see the sketch after this list).
  • If Stellar-Core is faster, it can't keep applying ledgers because the ingestion package doesn't read more than 2 ahead.
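
For reference, the polling pattern from the first bullet looks roughly like the sketch below. This is illustrative only: `ledgerSource` and `processLedger` are hypothetical stand-ins, the 1s sleep mirrors the wait described above, and the `GetLedger(sequence) (bool, xdr.LedgerCloseMeta, error)` shape is an assumption about the backend interface of that era.

```go
package main

import (
	"time"

	"github.com/stellar/go/xdr"
)

// ledgerSource is a pared-down stand-in for the ledger backend interface
// discussed above: in online mode GetLedger returns ok=false when the
// requested ledger is not in the read-ahead buffer yet.
type ledgerSource interface {
	GetLedger(sequence uint32) (bool, xdr.LedgerCloseMeta, error)
}

// ingestLoop sketches the non-blocking polling described in the first
// bullet: when the ledger is not available, the caller sleeps ~1s and
// retries, even though the ledger may arrive a moment later.
func ingestLoop(backend ledgerSource, sequence uint32, processLedger func(xdr.LedgerCloseMeta) error) error {
	for {
		ok, meta, err := backend.GetLedger(sequence)
		if err != nil {
			return err
		}
		if !ok {
			// Not in the read-ahead buffer yet; poll again after a delay.
			time.Sleep(time.Second)
			continue
		}
		if err := processLedger(meta); err != nil {
			return err
		}
		sequence++
	}
}
```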

I think we should increase the buffer size, but it's not trivial because there are two constraints that must be met:

  • The buffer shouldn't exceed X amount of RAM, but...
  • It should hold at least one ledger no matter how big it is (even if it exceeds the allowed memory constraint).

This is because some ledgers can exceed the allowed size, and having only a memory constraint would result in a deadlock.
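
A minimal sketch of a buffer that satisfies both constraints (the type, method names, and the size cap are hypothetical, not the eventual implementation): total buffered bytes are capped, but an empty buffer always admits the next ledger, so a single oversized ledger can never deadlock the reader.

```go
package main

import "sync"

// ledgerBuffer caps total buffered bytes but always admits at least one
// ledger, so an oversized ledger cannot deadlock the pipeline.
// All names and the size limit are illustrative.
type ledgerBuffer struct {
	mu       sync.Mutex
	cond     *sync.Cond
	frames   [][]byte // raw ledger close meta frames read from Stellar-Core
	size     int      // total bytes currently buffered
	maxBytes int      // soft memory cap, e.g. a few hundred MB
}

func newLedgerBuffer(maxBytes int) *ledgerBuffer {
	b := &ledgerBuffer{maxBytes: maxBytes}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// push blocks while the buffer is over the cap, unless the buffer is empty:
// an empty buffer must always accept the next ledger, however large.
func (b *ledgerBuffer) push(frame []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.frames) > 0 && b.size+len(frame) > b.maxBytes {
		b.cond.Wait()
	}
	b.frames = append(b.frames, frame)
	b.size += len(frame)
	b.cond.Broadcast()
}

// pop blocks until a ledger is available, then frees its space so that a
// waiting push can proceed.
func (b *ledgerBuffer) pop() []byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.frames) == 0 {
		b.cond.Wait()
	}
	frame := b.frames[0]
	b.frames = b.frames[1:]
	b.size -= len(frame)
	b.cond.Broadcast()
	return frame
}
```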

I'll try to code a solution for this next week.

bartekn added a commit that referenced this issue Nov 9, 2020
…ore backend (#3187)

This commit introduces `bufferedLedgerMetaReader`, which decouples
buffering and unmarshaling from `stellarCoreRunner` and
`CaptiveStellarCore`.

`bufferedLedgerMetaReader` fixes multiple issues:

* It fixes #3132 by increasing the internal buffers' sizes to hold more
ledgers. This makes the catchup code much faster.
* It fixes #3158 - `bufferedLedgerMetaReader` allowed rewriting the shutdown
code to a much simpler version. Now `bufferedLedgerMetaReader` and
`CaptiveStellarCore` listen to a single shutdown signal:
`stellarCoreRunner.getProcessExitChan()`. When the Stellar-Core process
terminates, the `bufferedLedgerMetaReader.Start` goroutine will stop and
`CaptiveStellarCore` will return a user-friendly error in the `PrepareRange`
and `GetLedger` methods. When `CaptiveStellarCore.Close()` is called, it
kills the Stellar-Core process, triggering the shutdown code explained
above.
* It decouples buffering and unmarshaling into a single struct. This makes
`stellarCoreRunner` and `CaptiveStellarCore` simpler.
* It fixes a possible OOM issue when the network closes a series of large
ledgers. In such a case `bufferedLedgerMetaReader` will wait for the buffer
to be consumed before reading more ledgers into memory, preventing
increased memory usage.
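
A rough sketch of the shape this describes (the fields and helper below are simplified and hypothetical, not the actual code in #3187): one goroutine reads and unmarshals meta frames into a bounded channel, stops as soon as the process-exit signal fires, and the blocking send into the bounded channel provides the backpressure that prevents the OOM scenario.

```go
package main

import "github.com/stellar/go/xdr"

// bufferedLedgerMetaReader is a simplified, hypothetical sketch of the idea
// described in the commit message.
type bufferedLedgerMetaReader struct {
	// metaChan is the bounded read-ahead buffer consumed by the captive backend.
	metaChan chan xdr.LedgerCloseMeta
	// readMeta reads and unmarshals one frame from the Stellar-Core meta pipe.
	readMeta func() (xdr.LedgerCloseMeta, error)
}

// start runs until the meta pipe errors or the Stellar-Core process exits.
// Sending into the bounded metaChan blocks when the buffer is full, so large
// ledgers are only read into memory once previous ones have been consumed.
func (r *bufferedLedgerMetaReader) start(processExit <-chan struct{}) {
	defer close(r.metaChan)
	for {
		meta, err := r.readMeta()
		if err != nil {
			return // pipe closed or frame could not be unmarshaled
		}
		select {
		case r.metaChan <- meta:
		case <-processExit:
			// Stellar-Core terminated; stop so callers can surface an error.
			return
		}
	}
}
```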