
Pr/2.2.2 rc2 #5106

Closed
wants to merge 10 commits into from

Conversation

@ximinez (Collaborator) commented Aug 26, 2024

High Level Overview of Change

Context of Change

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Performance (increase or change in throughput and/or latency)
  • Tests (you added tests for code that already exists, or your new feature included in this PR)
  • Documentation update
  • Chore (no impact to binary, e.g. .gitignore, formatting, dropping support for older tooling)
  • Release

API Impact

  • Public API: New feature (new methods and/or new fields)
  • Public API: Breaking change (in general, breaking changes should only impact the next api_version)
  • libxrpl change (any change that may affect libxrpl or dependents of libxrpl)
  • Peer protocol change (must be backward compatible or bump the peer protocol version)

@Bronek Bronek self-requested a review August 28, 2024 11:48
    InboundLedger::Reason reason) override
{
    std::unique_lock lock(acquiresMutex_);
    if (pendingAcquires_.contains(hash))
@vlntb (Collaborator) commented Aug 28, 2024

In the context of acquireLedger, we use ledger ID and ledger hash interchangeably. Is it possible that, when using the ID as the differentiator, we might skip a ledger update that is different from the one already applied?
i.e. adaptor_.acquireLedger(prevLedgerID) in Consensus.h.

Collaborator:

ledger ID and ledger hash are the same thing.

Collaborator:

We discussed this offline with Mark and I don't have any reservations about ledger identity.

{
    std::unique_lock lock(acquiresMutex_);
    if (pendingAcquires_.contains(hash))
        return;
Collaborator:

Can we add a debug trace here to check what is being skipped?

Collaborator Author:

There are already log messages at the caller sites. Is that sufficient?

    if (pendingValidations_.contains(val->getLedgerHash()))
        return false;
    pendingValidations_.insert(val->getLedgerHash());
    lock.unlock();
    handleNewValidation(app_, val, source);
Collaborator:

@mtrippled, my understanding is that this is meant to prevent the later calls to checkAccept in handleNewValidation that may end up starting a chain of InboundLedger requests here: https://github.com/ximinez/rippled/blob/098546f0b1ba259d144e3e60fc9b3274f85e5d05/src/ripple/app/ledger/impl/LedgerMaster.cpp#L1070-L1071.

Asking to understand whether you instead considered changing the call at this LedgerMaster call site to use the new acquireAsync method you added, and if so, why you went this way. I'm guessing you are trying to filter out redundant validations at the highest level?

Collaborator:

@bachase Close. I want to prevent concurrent calls to checkAccept() for the same validation, mainly because of the call to InboundLedgers::acquire(), which then calls InboundLedger::update(). InboundLedger::update() is where I've seen lock contention for duplicates. If it also stops redundantly sending out peer messages, that's fine, too. But the main goal is to not get stuck on locks. InboundLedger::update() gets stuck on a mutex unique to each inbound ledger, so multiple calls to process the same inbound ledger at the same time get stuck if that mutex is locked. That's what saturates the job queue.

In addition to avoiding the mutex on the individual InboundLedger object, this change also avoids the mutex on the shared InboundLedgers object. I have not seen it being contended, so I'm not sure it will help us, but I thought it would make sense to avoid using that lock also.

Yes, I am trying to filter out redundant validations at the highest level once submitted to the job queue. I actually implemented these changes on 2 different days, each as a point solution. They work the same way, but the code paths are distinct. It didn't occur to me to use the same filtering function.
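The filtering pattern described above can be sketched as follows. This is a simplified illustration, not rippled's actual code: std::string stands in for uint256, and DedupFilter is a hypothetical name.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <string>

// Hypothetical sketch of the dedup pattern discussed above: a mutex-guarded
// pending set lets only the first caller process a given hash, while
// concurrent duplicates are dropped instead of queuing up behind the
// per-ledger lock.
class DedupFilter
{
    std::mutex mtx_;
    std::set<std::string> pending_;  // std::string stands in for uint256

public:
    // Returns true if the caller won the right to process `hash`;
    // returns false for duplicates that arrive while it is pending.
    bool
    tryStart(std::string const& hash)
    {
        std::lock_guard lock(mtx_);
        return pending_.insert(hash).second;
    }

    // Called when processing finishes, so the hash can be acquired again.
    void
    finish(std::string const& hash)
    {
        std::lock_guard lock(mtx_);
        pending_.erase(hash);
    }
};
```

Note that the lock is held only around the set operations, mirroring the check/insert/unlock sequence in the diff; the expensive work (checkAccept(), acquire()) runs outside the lock.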

@@ -42,6 +42,12 @@ class InboundLedgers
    virtual std::shared_ptr<Ledger const>
    acquire(uint256 const& hash, std::uint32_t seq, InboundLedger::Reason) = 0;

    virtual void
    acquireAsync(
Collaborator:

Do we need a comment on the circumstances under which callers should use acquireAsync vs. acquire? Or is making all calls go through the async path something to consider post-hotfix?

@mtrippled (Collaborator) commented Aug 29, 2024

Yes, I can add comments, though I'm not sure if this is in a "freeze" state yet. I made the new function so that inbound ledger requests of the "advanceLedger" type are submitted via the job queue. This is actually the job that seems to pile up the worst during events, causing job queue saturation, even more than validations. But validations are also a problem.
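The acquire vs. acquireAsync split discussed here can be sketched roughly as below. LedgerFetcher and the use of std::async in place of rippled's JobQueue are assumptions for illustration only, not the real API.

```cpp
#include <cassert>
#include <future>
#include <string>

// Illustrative stand-in for the sync/async acquire split described above.
struct LedgerFetcher
{
    // Synchronous path: may block the caller while the ledger is fetched.
    std::string
    acquire(std::string const& hash)
    {
        // Stand-in for the real (potentially slow) network fetch.
        return "ledger:" + hash;
    }

    // Asynchronous path: schedule the same work on a background job and
    // return immediately, so the calling thread is never tied up.
    std::future<std::string>
    acquireAsync(std::string const& hash)
    {
        return std::async(
            std::launch::async, [this, hash] { return acquire(hash); });
    }
};
```

The point of the async variant, as described in the comment, is that callers (e.g. jobs handling "advanceLedger" work) hand the fetch off rather than blocking on it, which keeps the job queue from saturating.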

Collaborator:

It's also OK if that goes in when we fold this into the mainline, versus the hotfix.

    if (pendingAcquires_.contains(hash))
        return;
    pendingAcquires_.insert(hash);
    lock.unlock();
Collaborator:

Nit: for this and the similar call site for validations, should we use RAII/scoped guards rather than this explicit lock and unlock? I get that it doesn't matter now, but if the code drifts over time, that might be a stronger signal to a future writer not to break this.

Collaborator:

@bachase You're right, it could be in a try/catch block.
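An RAII guard, as suggested in the nit, avoids the try/catch entirely: registration happens in the constructor and removal in the destructor, so the pending entry is cleaned up even if the processing code throws. A minimal sketch, with PendingGuard as a hypothetical name and std::string standing in for uint256:

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <string>

// Hypothetical RAII guard for the pending-set pattern discussed above.
class PendingGuard
{
    std::mutex& mtx_;
    std::set<std::string>& pending_;
    std::string hash_;
    bool owned_;

public:
    PendingGuard(
        std::mutex& mtx,
        std::set<std::string>& pending,
        std::string hash)
        : mtx_(mtx), pending_(pending), hash_(std::move(hash))
    {
        std::lock_guard lock(mtx_);
        owned_ = pending_.insert(hash_).second;
    }

    // Runs even on exceptions, so the entry can never be leaked.
    ~PendingGuard()
    {
        if (owned_)
        {
            std::lock_guard lock(mtx_);
            pending_.erase(hash_);
        }
    }

    // True if this guard registered the hash (i.e. no duplicate was pending).
    bool
    owned() const
    {
        return owned_;
    }
};
```

The destructor only erases when this guard actually inserted the hash, so a losing duplicate cannot remove the winner's entry.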

@@ -2311,13 +2311,14 @@ NetworkOPsImp::recvValidation(
<< "recvValidation " << val->getLedgerHash() << " from " << source;

std::unique_lock lock(validationsMutex_);
if (pendingValidations_.contains(val->getLedgerHash()))
if (pendingValidations_.contains(
Collaborator:

Are there any other fields that might uniquely identify a validation that this would suppress? For example, the sequence number, or the cookie, which might be important for detecting faulty validators.

Collaborator:

@bachase I don't know. Also, I don't know how many of the validations saturating the job queue are duplicates from the same validator vs. validations coming from distinct validators. In the latter case, none would be suppressed. I don't think the sequence number is useful if the hashes are distinct; also, I think somewhere else checks that the sequence number matches the ledger we're trying to validate. I don't know anything about the validation cookie either.
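The trade-off raised above can be illustrated by keying the pending-validations filter on (ledger hash, validator key) instead of the hash alone, so that validations of the same ledger from distinct validators are not suppressed. The types and names below are assumptions for illustration, not rippled's actual code:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>

// Hypothetical composite key: ledger hash plus validator key.
using ValidationKey =
    std::tuple<std::string /*ledgerHash*/, std::string /*validatorKey*/>;

inline bool
registerValidation(
    std::set<ValidationKey>& pending,
    std::string const& ledgerHash,
    std::string const& validatorKey)
{
    // emplace().second is false only when the same (hash, validator)
    // pair is already being processed, i.e. a true duplicate.
    return pending.emplace(ledgerHash, validatorKey).second;
}
```

Whether the extra key field helps depends on the open question above: if most of the queued validations come from distinct validators, a wider key suppresses even less than the hash-only filter does.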

mtrippled and others added 3 commits August 29, 2024 19:47
1) Refactor filtering of validations to specifically avoid
   concurrent checkAccept() calls for the same validation hash.
2) Log when duplicate concurrent inbound ledger and validation requests
   are filtered.
3) RAII for containers that track concurrent inbound ledger and
   validation requests.
4) Comment on when to asynchronously acquire inbound ledgers, which
   is possible to be always OK, but should have further review.
@ximinez ximinez changed the title Pr/2.2.2 rc1 Pr/2.2.2 rc2 Aug 30, 2024
@ximinez ximinez mentioned this pull request Aug 31, 2024
2 tasks
@ximinez (Collaborator Author) commented Aug 31, 2024

I'm closing this PR in favor of #5115, which is ready to merge, once testing, etc. is complete.

@ximinez ximinez closed this Aug 31, 2024
@ximinez ximinez deleted the pr/2.2.2-rc1 branch September 5, 2024 23:38
Labels: None yet
Projects: None yet
4 participants