
Pr/2.2.2 rc2 #5106

Closed
wants to merge 10 commits into from

Conversation

@ximinez (Collaborator) commented Aug 26, 2024

High Level Overview of Change

Context of Change

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Performance (increase or change in throughput and/or latency)
  • Tests (you added tests for code that already exists, or your new feature included in this PR)
  • Documentation update
  • Chore (no impact to binary, e.g. .gitignore, formatting, dropping support for older tooling)
  • Release

API Impact

  • Public API: New feature (new methods and/or new fields)
  • Public API: Breaking change (in general, breaking changes should only impact the next api_version)
  • libxrpl change (any change that may affect libxrpl or dependents of libxrpl)
  • Peer protocol change (must be backward compatible or bump the peer protocol version)

@Bronek Bronek self-requested a review August 28, 2024 11:48
    InboundLedger::Reason reason) override
{
    std::unique_lock lock(acquiresMutex_);
    if (pendingAcquires_.contains(hash))
@vlntb (Collaborator) commented Aug 28, 2024

In the context of acquireLedger, we use ledger ID and ledger hash interchangeably. Is it possible that, when using the ID as the differentiator, we might skip a ledger update that is different from the one already applied?
i.e. adaptor_.acquireLedger(prevLedgerID) in Consensus.h.

Collaborator:

ledger ID and ledger hash are the same thing.

Collaborator:

We discussed this offline with Mark and I don't have any reservations about ledger identity.

{
    std::unique_lock lock(acquiresMutex_);
    if (pendingAcquires_.contains(hash))
        return;
Collaborator:

Can we add a debug trace here to check what is being skipped?

Collaborator Author:

There are already log messages at the caller sites. Is that sufficient?

    if (pendingValidations_.contains(val->getLedgerHash()))
        return false;
    pendingValidations_.insert(val->getLedgerHash());
    lock.unlock();
    handleNewValidation(app_, val, source);
Collaborator:

@mtrippled, my understanding is that this is meant to prevent the later calls to checkAccept in handleNewValidation that may end up starting a chain of InboundLedger requests here: https://github.com/ximinez/rippled/blob/098546f0b1ba259d144e3e60fc9b3274f85e5d05/src/ripple/app/ledger/impl/LedgerMaster.cpp#L1070-L1071.

Asking to understand whether you instead considered changing the call at this LedgerMaster call site to use the new acquireAsync method you added, and if so, why you went this way. I'm guessing you are trying to filter out redundant validations at the highest level?

Collaborator:

@bachase Close. I want to prevent concurrent calls to checkAccept() for the same validation, mainly because of the call to InboundLedgers::acquire(), which then calls InboundLedger::update(). InboundLedger::update() is where I've seen lock contention for duplicates. If it also stops redundantly sending out peer messages, that's fine, too. But the main goal is to not get stuck on locks. InboundLedger::update() gets stuck on a mutex unique to each inbound ledger, so multiple calls to process the same inbound ledger at the same time get stuck if that mutex is locked. That's what saturates the job queue.

In addition to avoiding the mutex on the individual InboundLedger object, this change also avoids the mutex on the shared InboundLedgers object. I have not seen it being contended, so I'm not sure it will help us, but I thought it would make sense to avoid using that lock also.

Yes, I am trying to filter out redundant validations at the highest level once submitted to the job queue. I actually implemented these changes on 2 different days, each as a point solution. They work the same way, but the code paths are distinct. It didn't occur to me to use the same filtering function.
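The filtering pattern described above can be sketched as follows. This is a simplified illustration, not rippled's actual code: std::string stands in for uint256, and DedupFilter is a hypothetical name.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <string>

// Hypothetical sketch of the dedup pattern discussed above: a mutex-guarded
// pending set lets only the first caller process a given hash, while
// concurrent duplicates are dropped instead of queuing up behind the
// per-ledger lock.
class DedupFilter
{
    std::mutex mtx_;
    std::set<std::string> pending_;  // std::string stands in for uint256

public:
    // Returns true if the caller won the right to process `hash`;
    // returns false for duplicates that arrive while it is pending.
    bool
    tryStart(std::string const& hash)
    {
        std::lock_guard lock(mtx_);
        return pending_.insert(hash).second;
    }

    // Called when processing finishes, so the hash can be acquired again.
    void
    finish(std::string const& hash)
    {
        std::lock_guard lock(mtx_);
        pending_.erase(hash);
    }
};
```

Note that the lock is held only around the set operations, mirroring the check/insert/unlock sequence in the diff; the expensive work (checkAccept(), acquire()) runs outside the lock.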

@@ -42,6 +42,12 @@ class InboundLedgers
    virtual std::shared_ptr<Ledger const>
    acquire(uint256 const& hash, std::uint32_t seq, InboundLedger::Reason) = 0;

    virtual void
    acquireAsync(
Collaborator:

Do we need a comment on the circumstances under which callers should use acquireAsync vs. acquire? Or is making all calls go through the async path something to consider post-hotfix?

@mtrippled (Collaborator) commented Aug 29, 2024

Yes, I can add comments, though I'm not sure if this is in a "freeze" state yet. I made the new function so that inbound ledger requests of the "advanceLedger" type are submitted via the job queue. This is actually the job that seems to pile up the worst during events, causing job queue saturation, even more than validations. But validations are also a problem.
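The acquire vs. acquireAsync split discussed here can be sketched roughly as below. LedgerFetcher and the use of std::async in place of rippled's JobQueue are assumptions for illustration only, not the real API.

```cpp
#include <cassert>
#include <future>
#include <string>

// Illustrative stand-in for the sync/async acquire split described above.
struct LedgerFetcher
{
    // Synchronous path: may block the caller while the ledger is fetched.
    std::string
    acquire(std::string const& hash)
    {
        // Stand-in for the real (potentially slow) network fetch.
        return "ledger:" + hash;
    }

    // Asynchronous path: schedule the same work on a background job and
    // return immediately, so the calling thread is never tied up.
    std::future<std::string>
    acquireAsync(std::string const& hash)
    {
        return std::async(
            std::launch::async, [this, hash] { return acquire(hash); });
    }
};
```

The point of the async variant, as described in the comment, is that callers (e.g. jobs handling "advanceLedger" work) hand the fetch off rather than blocking on it, which keeps the job queue from saturating.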

Collaborator:

It's also OK if that goes in when we fold this into the mainline, versus the hotfix.

    if (pendingAcquires_.contains(hash))
        return;
    pendingAcquires_.insert(hash);
    lock.unlock();
Collaborator:

Nit: for this and the similar call site for validations, should we use RAII/scoped guards rather than this explicit lock and unlock? I get that it doesn't matter now, but if the code drifts over time, that might be a stronger signal to a future writer not to break this.

Collaborator:

@bachase You're right, it could be in a try/catch block.
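An RAII guard, as suggested in the nit, avoids the try/catch entirely: registration happens in the constructor and removal in the destructor, so the pending entry is cleaned up even if the processing code throws. A minimal sketch, with PendingGuard as a hypothetical name and std::string standing in for uint256:

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <string>

// Hypothetical RAII guard for the pending-set pattern discussed above.
class PendingGuard
{
    std::mutex& mtx_;
    std::set<std::string>& pending_;
    std::string hash_;
    bool owned_;

public:
    PendingGuard(
        std::mutex& mtx,
        std::set<std::string>& pending,
        std::string hash)
        : mtx_(mtx), pending_(pending), hash_(std::move(hash))
    {
        std::lock_guard lock(mtx_);
        owned_ = pending_.insert(hash_).second;
    }

    // Runs even on exceptions, so the entry can never be leaked.
    ~PendingGuard()
    {
        if (owned_)
        {
            std::lock_guard lock(mtx_);
            pending_.erase(hash_);
        }
    }

    // True if this guard registered the hash (i.e. no duplicate was pending).
    bool
    owned() const
    {
        return owned_;
    }
};
```

The destructor only erases when this guard actually inserted the hash, so a losing duplicate cannot remove the winner's entry.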

@@ -2311,13 +2311,14 @@ NetworkOPsImp::recvValidation(
<< "recvValidation " << val->getLedgerHash() << " from " << source;

std::unique_lock lock(validationsMutex_);
if (pendingValidations_.contains(val->getLedgerHash()))
if (pendingValidations_.contains(
Collaborator:

Are there any other fields that might uniquely identify a validation that this would suppress? For example, the sequence number, or the cookie, which might be important for detecting faulty validators.

Collaborator:

@bachase I don't know. Also, I don't know how many of the validations saturating the job queue are duplicates from the same validator vs. validations coming from distinct validators. In the latter case, none would be suppressed. I don't think the sequence number is useful if the hashes are distinct; also, I think somewhere else checks that the sequence number matches the ledger we're trying to validate. I don't know anything about the validation cookie either.
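The trade-off raised above can be illustrated by keying the pending-validations filter on (ledger hash, validator key) instead of the hash alone, so that validations of the same ledger from distinct validators are not suppressed. The types and names below are assumptions for illustration, not rippled's actual code:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>

// Hypothetical composite key: ledger hash plus validator key.
using ValidationKey =
    std::tuple<std::string /*ledgerHash*/, std::string /*validatorKey*/>;

inline bool
registerValidation(
    std::set<ValidationKey>& pending,
    std::string const& ledgerHash,
    std::string const& validatorKey)
{
    // emplace().second is false only when the same (hash, validator)
    // pair is already being processed, i.e. a true duplicate.
    return pending.emplace(ledgerHash, validatorKey).second;
}
```

Whether the extra key field helps depends on the open question above: if most of the queued validations come from distinct validators, a wider key suppresses even less than the hash-only filter does.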

mtrippled and others added 3 commits August 29, 2024 19:47
1) Refactor filtering of validations to specifically avoid
   concurrent checkAccept() calls for the same validation hash.
2) Log when duplicate concurrent inbound ledger and validation requests
   are filtered.
3) RAII for containers that track concurrent inbound ledger and
   validation requests.
4) Comment on when to asynchronously acquire inbound ledgers, which
   is possible to be always OK, but should have further review.
@ximinez ximinez changed the title Pr/2.2.2 rc1 Pr/2.2.2 rc2 Aug 30, 2024
@ximinez ximinez mentioned this pull request Aug 31, 2024
2 tasks
@ximinez (Collaborator Author) commented Aug 31, 2024

I'm closing this PR in favor of #5115, which is ready to merge, once testing, etc. is complete.

@ximinez ximinez closed this Aug 31, 2024
@ximinez ximinez deleted the pr/2.2.2-rc1 branch September 5, 2024 23:38
Labels: None yet
Projects: None yet
4 participants