Fix RPC Webhook queue dropping #5163

Open · wants to merge 3 commits into base: develop
Conversation

@WietseWind (Member) commented Oct 24, 2024

Problem:

When using `subscribe` on the admin RPC port to deliver webhooks for the transaction stream to a backend, the endpoint consistently received fewer HTTP POSTs with TX information than there were transactions in larger ledgers.

As a result, some XamanWallet users on larger ledgers did not always receive their transaction push notifications.

Details

The admin-only "RPC Post to URL" command had a hardcoded queue length of 32, which resulted in TX notifications being dropped.

  1. As this is an admin-only command, I stripped out the queue length check entirely (see the sketch after this list). If you are an admin, you should know what you are doing; if your endpoint can't efficiently handle the TPS, that's your problem.

  2. Also: a shorter TTL for outgoing RPC HTTP calls: it was 10 minutes per request and is now 30 seconds (still too long, but 10 minutes is a guaranteed disaster if calls keep hanging and stacking up, especially now that the 32-entry queue limit for HTTP calls has been removed).
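A minimal, self-contained sketch of the pre-PR behaviour described above; only the bound of 32, the `pop_back()` on overflow, and the warning text come from the diff hunks quoted later in this thread, while the class name and surrounding types are illustrative:

```cpp
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <utility>

// Sketch only: a stand-in for the pre-PR event queue. The 32-entry bound and
// the pop_back() drop mirror the hunks quoted later; the rest is scaffolding.
class BoundedEventQueue
{
public:
    void send(std::string const& event)
    {
        std::lock_guard lock(mutex_);

        if (queue_.size() >= eventQueueMax)
        {
            // This is the drop described above: once 32 webhooks are pending,
            // the most recently queued one is discarded with only a warning,
            // so the backend never receives that transaction.
            std::cerr << "RPCCall::fromNetwork drop\n";
            queue_.pop_back();
        }

        queue_.emplace_back(seq_++, event);
    }

private:
    enum { eventQueueMax = 32 };  // the hardcoded limit this PR removes

    std::mutex mutex_;
    std::deque<std::pair<int, std::string>> queue_;
    int seq_ = 0;
};
```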

While dropping the queue length limit on outgoing webhooks could be considered dangerous, the feature is guarded behind the admin RPC port anyway, via a check of the form `if (context.role != Role::ADMIN)`.
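Roughly how that guard reads in the subscribe handler; only the role comparison is quoted from the PR, while the `jss::url` condition and the `rpcNO_PERMISSION` error are assumptions about the handler's shape:

```cpp
// Sketch: attaching a webhook URL to a subscription is rejected unless the
// request arrives over the admin port. Only the Role::ADMIN comparison is
// quoted above; the url check and error code are assumed details.
if (context.params.isMember(jss::url) && context.role != Role::ADMIN)
    return rpcError(rpcNO_PERMISSION);
```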

Finally:

  1. Rename the timeout constant to be less ambiguous (a sketch of the rename follows below).
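A sketch of what the rename amounts to; the two constant names appear in the hunk quoted later in this thread, while the declaration form and the original `10min` value are assumptions based on the description above:

```cpp
#include <chrono>

using namespace std::chrono_literals;

// Before (per the PR description): a hanging webhook call could stay open
// for 10 minutes.
// constexpr auto RPC_NOTIFY = 10min;

// After: give up on an unresponsive endpoint after 30 seconds, under a name
// that says what the timeout is actually for.
constexpr auto RPC_WEBHOOK_TIMEOUT = 30s;
```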


codecov bot commented Oct 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.8%. Comparing base (1fbf8da) to head (4c4527b).
Report is 33 commits behind head on develop.

Additional details and impacted files


```
@@            Coverage Diff            @@
##           develop   #5163     +/-   ##
=========================================
+ Coverage     76.2%   77.8%   +1.7%
=========================================
  Files          760     783     +23
  Lines        61568   66674   +5106
  Branches      8126    8125      -1
=========================================
+ Hits         46909   51902   +4993
- Misses       14659   14772    +113
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/xrpld/net/detail/RPCCall.cpp | 94.2% <ø> (+0.4%) ⬆️ |
| src/xrpld/net/detail/RPCSub.cpp | 45.5% <ø> (+3.1%) ⬆️ |

... and 625 files with indirect coverage changes


Comment on lines 1609 to 1611
```cpp
// Wietse: used to be 10m, but which backend
// ever requires 10 minutes to respond?
// Lower = prevent stacking pending calls
```
Collaborator:

I think this comment is more of a PR comment than something that needs to be in the code

```cpp
    JLOG(j_.warn()) << "RPCCall::fromNetwork drop";
    mDeque.pop_back();
}
// Wietse: we're not going to limit this, this is admin-port only, scale
```
Collaborator:

Same here

```diff
@@ -182,7 +186,11 @@ class RPCSubImp : public RPCSub
     }

 private:
     enum { eventQueueMax = 32 };
     // Wietse: we're not going to limit this, this is admin-port only, scale
```
Collaborator:

Same here

@godexsoft (Collaborator) left a comment:

Agree with @mvadari that the code comments should be removed. Also added a question.

```diff
@@ -78,12 +78,16 @@ class RPCSubImp : public RPCSub
     {
         std::lock_guard sl(mLock);

         if (mDeque.size() >= eventQueueMax)
```
Collaborator:

Unbounded queues are often trouble. In this particular case, because it's protected by the ADMIN role, perhaps this is acceptable.

Collaborator:

What about increasing `eventQueueMax` to a more reasonable number instead of removing this block?

```diff
@@ -1623,7 +1626,7 @@ fromNetwork(
         std::placeholders::_2,
         j),
     RPC_REPLY_MAX_BYTES,
-    RPC_NOTIFY,
+    RPC_WEBHOOK_TIMEOUT,
```
Collaborator:

Does this mean that no activity for 30s means the client gets disconnected? Is there at least a ping-pong going on in the background to avoid disconnecting clients who are waiting for some subscription?

All this looks very different from Clio and I'm not familiar with how this works, so I just wanted to make sure I understand what the side effects of this change can be.

Member (author):

This means it gives up after 30 seconds of trying to deliver the webhook. E.g. the recipient receives the HTTP call but starts some processing and doesn't respond; after 30 seconds, we just give up instead of keeping the outbound HTTP connection open.
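To make that concrete, a generic, self-contained boost::beast sketch of an outbound webhook POST that is abandoned after 30 seconds; this is not rippled's own HTTP client code, and the host, target, and payload are placeholders:

```cpp
#include <boost/asio.hpp>
#include <boost/beast.hpp>
#include <chrono>
#include <iostream>
#include <memory>

namespace beast = boost::beast;
namespace http  = beast::http;
namespace net   = boost::asio;
using tcp       = net::ip::tcp;

// Generic illustration only: POST a webhook payload and give up after 30 s,
// which is the behaviour described in the reply above. Host, target, and body
// are placeholders; rippled's own HTTP client is structured differently.
int main()
{
    net::io_context ioc;
    tcp::resolver resolver{ioc};
    beast::tcp_stream stream{ioc};

    auto const endpoints = resolver.resolve("example.com", "80");

    // Cap the whole exchange (connect + write + read) at 30 seconds: a slow
    // or stalled endpoint makes the pending operation fail with
    // beast::error::timeout instead of holding the connection open.
    stream.expires_after(std::chrono::seconds(30));

    auto req = std::make_shared<http::request<http::string_body>>(
        http::verb::post, "/webhook", 11);
    req->set(http::field::host, "example.com");
    req->set(http::field::content_type, "application/json");
    req->body() = R"({"transaction": "placeholder"})";
    req->prepare_payload();

    stream.async_connect(endpoints, [&, req](beast::error_code ec, tcp::endpoint) {
        if (ec)
        {
            std::cerr << "connect failed: " << ec.message() << '\n';
            return;
        }

        http::async_write(stream, *req, [&, req](beast::error_code ec, std::size_t) {
            if (ec)
            {
                std::cerr << "write failed: " << ec.message() << '\n';
                return;
            }

            auto buffer = std::make_shared<beast::flat_buffer>();
            auto res = std::make_shared<http::response<http::string_body>>();
            http::async_read(
                stream, *buffer, *res,
                [buffer, res](beast::error_code ec, std::size_t) {
                    if (ec)  // beast::error::timeout after 30 s of no response
                        std::cerr << "webhook gave up: " << ec.message() << '\n';
                    else
                        std::cout << "webhook answered: " << res->result_int() << '\n';
                });
        });
    });

    ioc.run();
}
```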

Collaborator:

Aha, got it. Is this code guaranteed to only be used for the webhook stuff, or does it affect any other calls?

@WietseWind (Member, Author):
@mvadari comments removed.
