Change ingest.ledgerbackend.LedgerBackend.GetLedger to be always blocking #3549

bartekn opened this issue Apr 16, 2021 · 7 comments · Closed
Labels: ingest (New ingestion system)

bartekn commented Apr 16, 2021

What problem does your feature solve?

Currently, the LedgerBackend interface has two methods for reading ledgers: a non-blocking GetLedger and a blocking GetLedgerBlocking.

After a short discussion with @tamirms we realized that maybe we don't need a non-blocking version at all.

There are several advantages to a blocking GetLedger:

  • It will return ledger data faster. Currently, when using GetLedger in non-blocking mode, the caller usually calls time.Sleep between consecutive calls to GetLedger. This extends the time between a request for ledger data and the response.
  • Extra code was required in the catch-up code to fast-forward to the latest ledger. Users had to call GetLedger with the latest ledger sequence in the buffer to make Captive-Core proceed; without that, Captive-Core would stop, because it keeps only a small buffer of ledger data it can return to the caller.
  • The usage was confusing because the behaviour depended on the range and required extra caution. This was the primary reason the bug fixed in services/horizon/internal/ingest: Add retries in historyRangeState when ledger is not available #3531 was introduced.
  • In general, GetLedger was made non-blocking because it originally read from the Stellar-Core DB. Now, with Captive-Core, we no longer have this requirement.

What would you like to see?

  • We remove the GetLedgerBlocking method.
  • We change GetLedger (see the sketch below):
    • New definition: GetLedger(ctx context.Context, sequence uint32) (xdr.LedgerCloseMeta, error)
    • Remove the bool return value and add a ctx argument.
    • Make it always blocking.
    • If ctx is cancelled or LedgerBackend.Close() is called, GetLedger returns an empty xdr.LedgerCloseMeta and context.Canceled or context.DeadlineExceeded.
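
To make this concrete, here is a minimal sketch of the proposed signature and a typical caller. It assumes the existing github.com/stellar/go/xdr package; streamLedgers is a hypothetical caller shown only for illustration, not part of the proposal:

```go
package example

import (
	"context"
	"errors"

	"github.com/stellar/go/xdr"
)

// LedgerBackend shows only the proposed blocking GetLedger; the other
// methods of the real interface (PrepareRange, Close, ...) are omitted.
type LedgerBackend interface {
	// GetLedger blocks until the requested ledger is available, the
	// context is cancelled, or the backend is closed.
	GetLedger(ctx context.Context, sequence uint32) (xdr.LedgerCloseMeta, error)
}

// streamLedgers is a hypothetical caller: no time.Sleep polling is needed,
// and it stops cleanly when ctx is cancelled or its deadline passes.
func streamLedgers(ctx context.Context, backend LedgerBackend, from uint32) error {
	for seq := from; ; seq++ {
		meta, err := backend.GetLedger(ctx, seq)
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return nil // shutdown was requested
		}
		if err != nil {
			return err
		}
		_ = meta // process the ledger here
	}
}
```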

This breaking change should be released in accordance with semver.

What alternatives are there?


paulbellamy commented Apr 19, 2021

So, it turns out we do still need a non-blocking version for serving web requests at https://github.com/stellar/go/blob/master/exp/services/captivecore/internal/api.go#L173. Or we need to change how that is handled.

We could either:

  • Rename:
    • GetLedger -> GetLedgerAsync (or some other name. MaybeGetLedger?)
    • GetLedgerBlocking -> GetLedger
  • Leave the names, and just change the places we call it to use GetLedgerBlocking
    • Leaves a bit of legacy stuff, but should we do this anyway?
  • Add the context.Context param into the GetLedger call, so that if the request is cancelled, the blocking goroutine is also cancelled.


bartekn commented Apr 20, 2021

> Add the context.Context param into the GetLedger call, so that if the request is cancelled, the blocking goroutine is also cancelled.

Yeah, this is what I suggested in the first comment. So a blocking GetLedger can be changed to non-blocking if needed.
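
For illustration, if a non-blocking flavour is ever needed again (e.g. for the web handler), it could be layered on top of the blocking call with a short deadline. This is only a sketch: tryGetLedger and the 50ms value are hypothetical, and it reuses the package, imports (plus time), and LedgerBackend interface from the sketch in the first comment.

```go
// tryGetLedger is a hypothetical helper that approximates the old
// non-blocking behaviour by giving the blocking GetLedger a very short
// deadline. The 50ms value is arbitrary and purely illustrative.
func tryGetLedger(ctx context.Context, b LedgerBackend, seq uint32) (xdr.LedgerCloseMeta, bool, error) {
	shortCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()

	meta, err := b.GetLedger(shortCtx, seq)
	if err != nil {
		if ctx.Err() == nil && errors.Is(err, context.DeadlineExceeded) {
			// The short deadline fired but the caller's context is still
			// alive: the ledger simply is not ready yet.
			return xdr.LedgerCloseMeta{}, false, nil
		}
		return xdr.LedgerCloseMeta{}, false, err
	}
	return meta, true, nil
}
```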

@paulbellamy

Thinking out loud

Ok, so in ingest/ledgerbackend.RemoteCaptiveCore, we set an http request timeout of 5s, which seems reasonable.

We can either:

  • set a timeout <5s on the server-side, to make it "non-blocking"
    • But then we need to catch that on the client and retry? In that case it should be an http 408? 504? status, so we could catch and handle that.
    • Means the server timeout must be < the client timeout, which is a bit spooky action...
    • Seems like the best solution, though (rough sketch below).
  • handle client-side timeouts and retry
    • tricky, as it could mask a real failure
  • remove the client-side timeout specifically on the /ledger/{sequence} call, so it becomes long-polling, where the connection stays open until the ledger is ready...
    • Would need to set up a separate http.Client ensuring we still have connection/keep-alive timeouts. Seems like a faff.
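
A rough sketch of the first option, to show the shape of it. The handler, the 2s server-side deadline, and the use of 408 are assumptions for illustration (the real captivecore API parses {sequence} from the route and encodes the response differently); it reuses the LedgerBackend sketch from the first comment plus net/http and time.

```go
// ledgerHandler is a hypothetical handler for /ledger/{sequence}. It bounds
// the blocking GetLedger with a server-side deadline shorter than the
// client's 5s request timeout and answers 408 when the ledger is not ready
// yet, so the client can tell "not ready" apart from a real failure.
func ledgerHandler(backend LedgerBackend, sequence uint32) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		meta, err := backend.GetLedger(ctx, sequence)
		switch {
		case errors.Is(err, context.DeadlineExceeded):
			w.WriteHeader(http.StatusRequestTimeout) // 408: not ready, please retry
			return
		case err != nil:
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		raw, err := meta.MarshalBinary() // raw XDR bytes
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write(raw)
	}
}
```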


tamirms commented Apr 26, 2021

@paulbellamy PrepareRange() is a potentially long running operation. The captive core api has a strategy for making that operation asynchronous by spawning a goroutine. I think you could mimic that strategy for a blocking GetLedger() call as well.


bartekn commented Apr 26, 2021

We could also do something similar to DatabaseBackend: request the value from the remote captive core in a loop and return when it's there (similar to your "handle client-side timeouts and retry" point?).
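
As a sketch of what that loop might look like on the client side, assuming the hypothetical 408 "not ready" convention from above. pollLedger, the one-second retry interval, and the injected request function are all illustrative; imports are the same as the first sketch plus time.

```go
// pollLedger is a hypothetical client-side loop in the spirit of
// DatabaseBackend: keep asking the remote captive core for a ledger and
// return once it is there. doRequest performs one /ledger/{sequence} HTTP
// call and reports (meta, ready, err), mapping a 408 response to ready == false.
func pollLedger(
	ctx context.Context,
	seq uint32,
	doRequest func(context.Context, uint32) (xdr.LedgerCloseMeta, bool, error),
) (xdr.LedgerCloseMeta, error) {
	for {
		meta, ready, err := doRequest(ctx, seq)
		if err != nil {
			return xdr.LedgerCloseMeta{}, err
		}
		if ready {
			return meta, nil
		}

		// Not ready yet: wait a bit before retrying, but stop as soon as
		// the caller's context is cancelled.
		select {
		case <-ctx.Done():
			return xdr.LedgerCloseMeta{}, ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```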


paulbellamy commented Apr 26, 2021

> @paulbellamy PrepareRange() is a potentially long running operation. The captive core api has a strategy for making that operation asynchronous by spawning a goroutine. I think you could mimic that strategy for a blocking GetLedger() call as well.

How does that help? Unlike PrepareRange, GetLedger needs to return data back to the client, so if the client is gone there's no reason to keep the GetLedger call alive.

> We could also do something similar to DatabaseBackend: request the value from the remote captive core in a loop and return when it's there (similar to your "handle client-side timeouts and retry" point?).

Yeah, agreed. But RemoteCaptiveCore needs some way to differentiate "timed out because the data isn't there" (http 408 maybe?) from "timed out because the server is broken" (http 504 maybe?).

Is RemoteCaptiveCore ever used to talk to "captive cores" outside the horizon process? Specifically, could RemoteCaptiveCore ever be talking to a core of a different version from the horizon instance?


tamirms commented Apr 26, 2021

> How does that help? Unlike PrepareRange, GetLedger needs to return data back to the client, so if the client is gone there's no reason to keep the GetLedger call alive.

The remote captive core API does return data back to the client for PrepareRange.

Also, there is a reason to keep the GetLedger call alive even if the client request ends. GetLedger calls are always non-decreasing: the client wants to keep advancing the ledger stream. Once captive core reaches the desired ledger from the GetLedger call, that ledger is cached so that subsequent calls for the same ledger return instantly.
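
A minimal sketch of that server-side strategy, combining the earlier suggestion of mimicking PrepareRange with the caching described here. asyncLedgerFetcher and every name in it are hypothetical, it keeps a single in-flight fetch for simplicity, and it reuses the LedgerBackend sketch from the first comment plus sync.

```go
// asyncLedgerFetcher is a hypothetical helper for the remote captive core
// API: the blocking GetLedger runs in a goroutine detached from any single
// HTTP request, and the last result is cached so that a repeated request
// for the same (non-decreasing) sequence returns instantly.
type asyncLedgerFetcher struct {
	backend LedgerBackend

	mu      sync.Mutex
	seq     uint32
	meta    xdr.LedgerCloseMeta
	err     error
	ready   bool
	running bool
}

// Get returns the ledger if it is already cached; otherwise it makes sure a
// background fetch is running and reports "not ready yet".
func (f *asyncLedgerFetcher) Get(sequence uint32) (xdr.LedgerCloseMeta, bool, error) {
	f.mu.Lock()
	defer f.mu.Unlock()

	if f.ready && f.seq == sequence {
		return f.meta, true, f.err
	}
	if !f.running {
		f.running, f.ready, f.seq = true, false, sequence
		go func() {
			// Detached from the request: use a background context so the
			// fetch keeps advancing even if the client goes away.
			meta, err := f.backend.GetLedger(context.Background(), sequence)
			f.mu.Lock()
			f.meta, f.err, f.ready, f.running = meta, err, true, false
			f.mu.Unlock()
		}()
	}
	return xdr.LedgerCloseMeta{}, false, nil
}
```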
