Change ingest.ledgerbackend.LedgerBackend.GetLedger to be always blocking #3549

bartekn opened this issue Apr 16, 2021 · 7 comments · Closed
Labels: ingest (New ingestion system)

bartekn commented Apr 16, 2021

What problem does your feature solve?

Currently, the LedgerBackend interface has two methods for reading ledgers: a non-blocking GetLedger and a blocking GetLedgerBlocking.

After a short discussion with @tamirms we realized that maybe we don't need a non-blocking version at all.

There are several advantages to a blocking GetLedger:

  • It will return ledger data faster. Currently, when using GetLedger in non-blocking mode, the caller usually calls time.Sleep between consecutive calls to GetLedger. This extends the time between a request for ledger data and the response.
  • Extra code was required in the catch-up code to fast-forward to the latest ledger. Users had to call GetLedger with the latest ledger sequence in the buffer to make Captive-Core proceed; without that, Captive-Core would stop, because it keeps only a small buffer of ledger data it can return to the caller.
  • The usage was confusing because the behaviour depended on the range and required extra caution. This was the primary reason the bug fixed in services/horizon/internal/ingest: Add retries in historyRangeState when ledger is not available #3531 was introduced.
  • In general, GetLedger was made non-blocking because it originally read from the Stellar-Core DB. Now, with Captive-Core, we no longer have this requirement.

What would you like to see?

  • We remove the GetLedgerBlocking method.
  • We change GetLedger (see the sketch below):
    • New definition: GetLedger(ctx context.Context, sequence uint32) (xdr.LedgerCloseMeta, error)
    • Remove the bool return value and add a ctx argument.
    • Make it always blocking.
    • If ctx is cancelled or LedgerBackend.Close() is called, GetLedger returns an empty xdr.LedgerCloseMeta and context.Canceled or context.DeadlineExceeded.
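
To make this concrete, here is a minimal sketch of the proposed signature and a typical caller. It assumes the existing github.com/stellar/go/xdr package; streamLedgers is a hypothetical caller shown only for illustration, not part of the proposal:

```go
package example

import (
	"context"
	"errors"

	"github.com/stellar/go/xdr"
)

// LedgerBackend shows only the proposed blocking GetLedger; the other
// methods of the real interface (PrepareRange, Close, ...) are omitted.
type LedgerBackend interface {
	// GetLedger blocks until the requested ledger is available, the
	// context is cancelled, or the backend is closed.
	GetLedger(ctx context.Context, sequence uint32) (xdr.LedgerCloseMeta, error)
}

// streamLedgers is a hypothetical caller: no time.Sleep polling is needed,
// and it stops cleanly when ctx is cancelled or its deadline passes.
func streamLedgers(ctx context.Context, backend LedgerBackend, from uint32) error {
	for seq := from; ; seq++ {
		meta, err := backend.GetLedger(ctx, seq)
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return nil // shutdown was requested
		}
		if err != nil {
			return err
		}
		_ = meta // process the ledger here
	}
}
```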

This breaking change should be released in accordance with semver.

What alternatives are there?


paulbellamy commented Apr 19, 2021

So, it turns out we do still need a non-blocking version for serving web requests at https://github.com/stellar/go/blob/master/exp/services/captivecore/internal/api.go#L173. Or we need to change how that is handled.

We could either:

  • Rename:
    • GetLedger -> GetLedgerAsync (or some other name. MaybeGetLedger?)
    • GetLedgerBlocking -> GetLedger
  • Leave the names, and just change the places we call it to use GetLedgerBlocking
    • Leaves a bit of legacy stuff, but should we do this anyway?
  • Add the context.Context param into the GetLedger call, so that if the request is cancelled, the blocking goroutine is also cancelled.


bartekn commented Apr 20, 2021

> Add the context.Context param into the GetLedger call, so that if the request is cancelled, the blocking goroutine is also cancelled.

Yeah, this is what I suggested in the first comment. So a blocking GetLedger can be changed to non-blocking if needed.
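
For illustration, if a non-blocking flavour is ever needed again (e.g. for the web handler), it could be layered on top of the blocking call with a short deadline. This is only a sketch: tryGetLedger and the 50ms value are hypothetical, and it reuses the package, imports (plus time), and LedgerBackend interface from the sketch in the first comment.

```go
// tryGetLedger is a hypothetical helper that approximates the old
// non-blocking behaviour by giving the blocking GetLedger a very short
// deadline. The 50ms value is arbitrary and purely illustrative.
func tryGetLedger(ctx context.Context, b LedgerBackend, seq uint32) (xdr.LedgerCloseMeta, bool, error) {
	shortCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()

	meta, err := b.GetLedger(shortCtx, seq)
	if err != nil {
		if ctx.Err() == nil && errors.Is(err, context.DeadlineExceeded) {
			// The short deadline fired but the caller's context is still
			// alive: the ledger simply is not ready yet.
			return xdr.LedgerCloseMeta{}, false, nil
		}
		return xdr.LedgerCloseMeta{}, false, err
	}
	return meta, true, nil
}
```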

@paulbellamy

Thinking out loud

Ok, so in ingest/ledgerbackend.RemoteCaptiveCore, we set an http request timeout of 5s, which seems reasonable.

We can either:

  • set a timeout <5s on the server-side, to make it "non-blocking"
    • But then we need to catch that on the client and retry? In that case it should be an http 408? 504? status, so we could catch and handle that.
    • Means the server timeout must be < the client timeout, which is a bit spooky action...
    • Seems like the best solution, though (rough sketch below).
  • handle client-side timeouts and retry
    • tricky, as it could mask a real failure
  • remove the client-side timeout specifically on the /ledger/{sequence} call, so it becomes long-polling, where the connection stays open until the ledger is ready...
    • Would need to set up a separate http.Client ensuring we still have connection/keep-alive timeouts. Seems like a faff.
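
A rough sketch of the first option, to show the shape of it. The handler, the 2s server-side deadline, and the use of 408 are assumptions for illustration (the real captivecore API parses {sequence} from the route and encodes the response differently); it reuses the LedgerBackend sketch from the first comment plus net/http and time.

```go
// ledgerHandler is a hypothetical handler for /ledger/{sequence}. It bounds
// the blocking GetLedger with a server-side deadline shorter than the
// client's 5s request timeout and answers 408 when the ledger is not ready
// yet, so the client can tell "not ready" apart from a real failure.
func ledgerHandler(backend LedgerBackend, sequence uint32) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		meta, err := backend.GetLedger(ctx, sequence)
		switch {
		case errors.Is(err, context.DeadlineExceeded):
			w.WriteHeader(http.StatusRequestTimeout) // 408: not ready, please retry
			return
		case err != nil:
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		raw, err := meta.MarshalBinary() // raw XDR bytes
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write(raw)
	}
}
```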


tamirms commented Apr 26, 2021

@paulbellamy PrepareRange() is a potentially long running operation. The captive core api has a strategy for making that operation asynchronous by spawning a goroutine. I think you could mimic that strategy for a blocking GetLedger() call as well.


bartekn commented Apr 26, 2021

We could also do something similar to DatabaseBackend: request the value from the remote captive core in a loop and return when it's there (similar to your "handle client-side timeouts and retry" point?).
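
As a sketch of what that loop might look like on the client side, assuming the hypothetical 408 "not ready" convention from above. pollLedger, the one-second retry interval, and the injected request function are all illustrative; imports are the same as the first sketch plus time.

```go
// pollLedger is a hypothetical client-side loop in the spirit of
// DatabaseBackend: keep asking the remote captive core for a ledger and
// return once it is there. doRequest performs one /ledger/{sequence} HTTP
// call and reports (meta, ready, err), mapping a 408 response to ready == false.
func pollLedger(
	ctx context.Context,
	seq uint32,
	doRequest func(context.Context, uint32) (xdr.LedgerCloseMeta, bool, error),
) (xdr.LedgerCloseMeta, error) {
	for {
		meta, ready, err := doRequest(ctx, seq)
		if err != nil {
			return xdr.LedgerCloseMeta{}, err
		}
		if ready {
			return meta, nil
		}

		// Not ready yet: wait a bit before retrying, but stop as soon as
		// the caller's context is cancelled.
		select {
		case <-ctx.Done():
			return xdr.LedgerCloseMeta{}, ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```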


paulbellamy commented Apr 26, 2021

> @paulbellamy PrepareRange() is a potentially long running operation. The captive core api has a strategy for making that operation asynchronous by spawning a goroutine. I think you could mimic that strategy for a blocking GetLedger() call as well.

How does that help? Unlike PrepareRange, GetLedger needs to return data back to the client, so if the client is gone there's no reason to keep the GetLedger call alive.

> We could also do something similar to DatabaseBackend: request the value from the remote captive core in a loop and return when it's there (similar to your "handle client-side timeouts and retry" point?).

Yeah, agreed. But RemoteCaptiveCore needs some way to differentiate "timed out because the data isn't there" (http 408 maybe?) from "timed out because the server is broken" (http 504 maybe?).

Is RemoteCaptiveCore ever used to talk to "captive cores" outside the horizon process? Specifically, could RemoteCaptiveCore ever be talking to a core of a different version from the horizon instance?


tamirms commented Apr 26, 2021

> How does that help? Unlike PrepareRange, GetLedger needs to return data back to the client, so if the client is gone there's no reason to keep the GetLedger call alive.

The remote captive core API does return data back to the client for PrepareRange.

Also, there is a reason to keep the GetLedger call alive even if the client request ends. GetLedger calls are always non-decreasing: the client wants to keep advancing the ledger stream. Once captive core reaches the desired ledger from the GetLedger call, that ledger is cached so that subsequent calls for the same ledger return instantly.
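
A minimal sketch of that server-side strategy, combining the earlier suggestion of mimicking PrepareRange with the caching described here. asyncLedgerFetcher and every name in it are hypothetical, it keeps a single in-flight fetch for simplicity, and it reuses the LedgerBackend sketch from the first comment plus sync.

```go
// asyncLedgerFetcher is a hypothetical helper for the remote captive core
// API: the blocking GetLedger runs in a goroutine detached from any single
// HTTP request, and the last result is cached so that a repeated request
// for the same (non-decreasing) sequence returns instantly.
type asyncLedgerFetcher struct {
	backend LedgerBackend

	mu      sync.Mutex
	seq     uint32
	meta    xdr.LedgerCloseMeta
	err     error
	ready   bool
	running bool
}

// Get returns the ledger if it is already cached; otherwise it makes sure a
// background fetch is running and reports "not ready yet".
func (f *asyncLedgerFetcher) Get(sequence uint32) (xdr.LedgerCloseMeta, bool, error) {
	f.mu.Lock()
	defer f.mu.Unlock()

	if f.ready && f.seq == sequence {
		return f.meta, true, f.err
	}
	if !f.running {
		f.running, f.ready, f.seq = true, false, sequence
		go func() {
			// Detached from the request: use a background context so the
			// fetch keeps advancing even if the client goes away.
			meta, err := f.backend.GetLedger(context.Background(), sequence)
			f.mu.Lock()
			f.meta, f.err, f.ready, f.running = meta, err, true, false
			f.mu.Unlock()
		}()
	}
	return xdr.LedgerCloseMeta{}, false, nil
}
```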
