Pr/2.2.2 rc2 #5106

Conversation
```cpp
    InboundLedger::Reason reason) override
{
    std::unique_lock lock(acquiresMutex_);
    if (pendingAcquires_.contains(hash))
```
In the context of `acquireLedger`, we use ledger ID and ledger hash interchangeably. Is it possible that, in the case of using the ID, we might skip a ledger update that is different from the one already applied, if we use the ID as a differentiator? E.g. `adaptor_.acquireLedger(prevLedgerID)` in Consensus.h.
ledger ID and ledger hash are the same thing.
We discussed this offline with Mark and I don't have any reservations about ledger identity.
```cpp
{
    std::unique_lock lock(acquiresMutex_);
    if (pendingAcquires_.contains(hash))
        return;
```
Can we add a debug trace here to check what is being skipped?
There are already log messages at the caller sites. Is that sufficient?
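For illustration, a minimal sketch of what such a trace could look like in the early-return path, using rippled's `JLOG` macro. The `j_` journal member and the message text are assumptions, not the PR's actual code:

```cpp
std::unique_lock lock(acquiresMutex_);
if (pendingAcquires_.contains(hash))
{
    // Hypothetical debug trace; assumes a beast::Journal member j_
    // as used elsewhere in rippled.
    JLOG(j_.debug()) << "acquireAsync: already acquiring " << hash;
    return;
}
```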
src/ripple/app/misc/NetworkOPs.cpp (outdated)
```cpp
if (pendingValidations_.contains(val->getLedgerHash()))
    return false;
pendingValidations_.insert(val->getLedgerHash());
lock.unlock();
handleNewValidation(app_, val, source);
```
@mtrippled, my understanding is this is meant to prevent the later calls to `checkAccept` in `handleNewValidation` that may end up starting a chain of `InboundLedger` requests here: https://github.com/ximinez/rippled/blob/098546f0b1ba259d144e3e60fc9b3274f85e5d05/src/ripple/app/ledger/impl/LedgerMaster.cpp#L1070-L1071.
Asking to understand whether you instead considered changing the call at this `LedgerMaster` call site to use the new `acquireAsync` method you added, and if so, why you went this way? I'm guessing you are trying to filter out redundant validations at the highest level?
@bachase close. I want to prevent concurrent calls to checkAccept() for the same validation, mainly because of the call to InboundLedgers::acquire(), which then calls InboundLedger::update(). InboundLedger::update() is where I've seen lock contention for duplicates. If it also stops redundantly sending out peer messages, that's fine, too. But the main goal is to not get stuck on locks. InboundLedger::update() gets stuck on a mutex unique to each inbound ledger, so multiple calls to process the same inbound ledger at the same time get stuck if that mutex is locked. That's what saturates the job queue.
In addition to avoiding the mutex on the individual InboundLedger object, this change also avoids the mutex on the shared InboundLedgers object. I have not seen it being contended, so I'm not sure it will help us, but I thought it would make sense to avoid using that lock also.
Yes, I am trying to filter out redundant validations at the highest level once submitted to the job queue. I actually implemented these changes on 2 different days, each as a point solution. They work the same way, but the code paths are distinct. It didn't occur to me to use the same filtering function.
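To make the pattern concrete, here is a minimal, self-contained sketch of the kind of pending-set filter described above. All names and types are stand-ins, not the PR's code:

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

// Stand-in for rippled's uint256 ledger hash type.
using LedgerHash = std::string;

class PendingFilter
{
    std::mutex mutex_;
    std::unordered_set<LedgerHash> pending_;

public:
    // Returns true if the caller should do the work; false if another
    // job is already processing this hash, so the duplicate is dropped
    // instead of queuing up behind the per-ledger mutex.
    bool
    tryBegin(LedgerHash const& hash)
    {
        std::lock_guard lock(mutex_);
        return pending_.insert(hash).second;
    }

    void
    end(LedgerHash const& hash)
    {
        std::lock_guard lock(mutex_);
        pending_.erase(hash);
    }
};
```

Usage would be: call tryBegin(hash) before submitting the work and end(hash) when it completes; a false return means a duplicate was dropped rather than left to block on the per-ledger mutex.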
```diff
@@ -42,6 +42,12 @@ class InboundLedgers
     virtual std::shared_ptr<Ledger const>
     acquire(uint256 const& hash, std::uint32_t seq, InboundLedger::Reason) = 0;

+    virtual void
+    acquireAsync(
```
Do we need a comment on under what circumstances users should call `acquireAsync` vs `acquire`? Or is making all calls go through the async path something to consider post-hotfix?
@mtrippled Yes, I can add comments, though I'm not sure whether this is in a "freeze" state yet. I made the new function so that inbound ledger requests of the "advanceLedger" type are submitted via the job queue. This is actually the job that seems to pile up the worst during events, causing job queue saturation, even more than validations. But validations are also a problem.
It's also OK if that goes in when we fold this into the mainline versus the hotfix.
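As an illustration of the kind of comment discussed, applied to the declaration from the diff above; the wording is an assumption, not the PR's actual comment:

```cpp
/** Fetch a ledger asynchronously, on the job queue.

    Sketch of a possible doc comment (wording assumed): prefer this
    over acquire() on hot paths such as ledger advance, where piling
    up duplicate work can saturate the job queue; requests for a hash
    that is already being acquired are dropped.
*/
virtual void
acquireAsync(
    uint256 const& hash,
    std::uint32_t seq,
    InboundLedger::Reason reason) = 0;
```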
```cpp
if (pendingAcquires_.contains(hash))
    return;
pendingAcquires_.insert(hash);
lock.unlock();
```
Nit: for this and the similar call site for validations, should we use RAII/scoped locks vs. this explicit lock and unlock? I get that it doesn't matter now, but if the code drifts over time, that might be a stronger signal to a future code writer not to break this.
@bachase You're right; it could be in a try .. catch block.
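For illustration, a minimal RAII sketch of the idea, so the erase happens even if the guarded work throws. The class and member names are assumptions, not the PR's code:

```cpp
#include <mutex>

// Inserts a key into a mutex-guarded set on construction and removes
// it on destruction, on every exit path including exceptions.
template <class Set, class Key>
class ScopedInsert
{
    Set& set_;
    std::mutex& mutex_;
    Key key_;

public:
    ScopedInsert(Set& set, std::mutex& mutex, Key key)
        : set_(set), mutex_(mutex), key_(std::move(key))
    {
        std::lock_guard lock(mutex_);
        set_.insert(key_);
    }

    ~ScopedInsert()
    {
        std::lock_guard lock(mutex_);
        set_.erase(key_);
    }
};
```

With a guard like this, the explicit lock.unlock() above becomes unnecessary, and the pending entry is cleaned up no matter how the scope exits.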
src/ripple/app/misc/NetworkOPs.cpp (outdated)
```diff
@@ -2311,13 +2311,14 @@ NetworkOPsImp::recvValidation(
         << "recvValidation " << val->getLedgerHash() << " from " << source;

     std::unique_lock lock(validationsMutex_);
-    if (pendingValidations_.contains(val->getLedgerHash()))
+    if (pendingValidations_.contains(
```
Are there any other fields that might uniquely identify a validation that this would suppress? For example, the sequence number, or the cookie, which might be important for detecting faulty validators.
@bachase I don't know. Also, I don't know how many of the validations saturating the job queue are duplicates from the same validator vs. validations coming from distinct validators. In the latter case, none would be suppressed. I don't think the sequence number is useful if the hashes are distinct; also, I think somewhere else checks that the sequence number matches the ledger we're trying to validate. I don't know anything about the validation cookie either.
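Purely as a hypothetical illustration of the question above, the pending set could be keyed on more than the ledger hash, e.g. a (ledger hash, validator key) pair, so validations of the same ledger from distinct validators are never suppressed. All names here are stand-ins, not the PR's code:

```cpp
#include <set>
#include <string>
#include <utility>

using LedgerHash = std::string;    // stand-in for uint256
using ValidatorKey = std::string;  // stand-in for PublicKey

std::set<std::pair<LedgerHash, ValidatorKey>> pendingValidations;

// insert().second is false when the same (hash, validator) pair is
// already being processed, so only true duplicates are dropped.
bool
shouldProcess(LedgerHash const& hash, ValidatorKey const& validator)
{
    return pendingValidations.insert({hash, validator}).second;
}
```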
1) Refactor filtering of validations to specifically avoid concurrent checkAccept() calls for the same validation hash.
2) Log when duplicate concurrent inbound ledger and validation requests are filtered.
3) Use RAII for the containers that track concurrent inbound ledger and validation requests.
4) Comment on when to asynchronously acquire inbound ledgers; it may be OK to always do so, but that should have further review.
I'm closing this PR in favor of #5115, which is ready to merge once testing, etc. is complete.
High Level Overview of Change

Context of Change

Type of Change

- Chore (no impact to binary, e.g. .gitignore, formatting, dropping support for older tooling)

API Impact

- libxrpl change (any change that may affect libxrpl or dependents of libxrpl)