-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HOLD for payment 2024-08-13] [$250] Delay reconnect callbacks to prevent thundering herd DDOS #46143
Comments
Triggered auto assignment to @sonialiap ( |
|
Triggered auto assignment to Design team member for new feature review - @dannymcclain ( |
Definitely don't need design support on this one |
ProposalPlease re-state the problem that we are trying to solve in this issue.Delay reconnect callbacks when server is back after being down. What is the root cause of that problem?New change. What changes do you think we should make in order to solve the problem?We can update the
We'll create a new variable In the reachabilityTest, when Now, whenever |
@ShridharGoel the problem with making this external is that it's not clear how you'd test the server being unreachable (i.e: |
Can network response mocking help?
|
Can you mock responses in the chrome dev tools network tab? |
Yes. |
Job added to Upwork: https://www.upwork.com/jobs/~012f66d59aecc97ac2 |
Triggered auto assignment to Contributor-plus team member for initial proposal review - @allroundexperts ( |
Current assignee @sonialiap is eligible for the Bug assigner, not assigning anyone new. |
📣 @ShridharGoel 🎉 An offer has been automatically sent to your Upwork account for the Contributor role 🎉 Thanks for contributing to the Expensify app! Offer link |
Random delay of clients' reconnection should not be the perfect solution, imagine that the amount of clients surge in future - then increase the 20 seconds to 40 seconds? Need to update server side code, e.g. add a message queue(simple FIFO should be ok) to cache the callback reconnections, even you are not using any load balancing it's still ok for the server to just store the reconnection meta data in RAM then process them in asynchronous way(e.g. use a thread pool to traverse and handle these callback reconnections later). |
📣 @wfdong! 📣
|
Thanks for your feedback @wfdong. We agree that this solution won't scale forever, and are constantly working to improve the performance and reliability of our back-end. That said, we still think this change will be beneficial to our systems overall and serve us well for the foreseeable future |
|
The solution for this issue has been 🚀 deployed to production 🚀 in version 9.0.16-8 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue: If no regressions arise, payment will be issued on 2024-08-13. 🎊 For reference, here are some details about the assignees on this issue:
|
BugZero Checklist: The PR fixing this issue has been merged! The following checklist (instructions) will need to be completed before the issue can be closed:
|
@roryabraham I'm not sure on how we can write a regression test for this. Do we need it? If so, can you suggest something? Thanks! |
Payment summary:
|
@roryabraham bumping Sibtain's question about whether we need a regression test and if yes, how it should be written for this issue #46143 (comment) |
I don't think so, what would it be anyways?
…On Fri, Aug 16, 2024 at 11:54 AM Sonia Liapounova ***@***.***> wrote:
@roryabraham <https://github.com/roryabraham> bumping Sibtain's question
about whether we need a regression test and if yes, how it should be
written for this issue #46143 (comment)
<#46143 (comment)>
—
Reply to this email directly, view it on GitHub
<#46143 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA7Y2OLQ2SVNRX2PALOSRZDZRW44ZAVCNFSM6AAAAABLNHMBDOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJTGEYTCMBUGI>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
That's the question 😂 Since Flo doesn't think we need one, closing out |
Ah, I forgot to mention that in the most recent downtime, I've observed that this problem is fixed for now 👍 |
$250 approved for @allroundexperts |
Coming from https://expensify.slack.com/archives/C05CBC62HGW/p1721844511149789
Problem
When the site is fully offline and comes back up, App web clients DDOS the API. This "thundering herd" ⚡🦬 problem occurs because (on web) App considers itself offline if Auth is not reachable. When Auth comes back online, all NewDot clients call ReconnectApp at close to the same time, leading to an artificial surge in traffic right when Auth comes back.
Note: this does not affect iOS or Android because on those platforms there are native platform APIs for network connectivity information, and we just trust those without considering whether the server is reachable.
Solution
Update the App network layer to separately track whether internet is reachable, and if the Expensify servers are down. If the server was down and comes back online, add a randomized delay of between 0 and 20 seconds before executing reconnect callbacks to space out the reconnect callbacks across many devices.
To do this, we'll make the following changes:
isInternetReachable
andisServerReachable
. We can do this by updating the reachabilityTest config of the @react-native-community/netinfo package such that:isInternetReachable
should be set to falseisServerReachable
should be set to falseisOffline
(the Onyx field) will be updated to be!isInternetReachable || !isServerReachable
. This is essentially equivalent to what we have todayUpwork Automation - Do Not Edit
Issue Owner
Current Issue Owner: @roryabrahamThe text was updated successfully, but these errors were encountered: