[HOLD for payment 2024-08-13] [$250] Delay reconnect callbacks to prevent thundering herd DDOS #46143

roryabraham · 2024-07-24T20:46:39Z

Coming from https://expensify.slack.com/archives/C05CBC62HGW/p1721844511149789

Problem

When the site is fully offline and comes back up, App web clients DDOS the API. This "thundering herd" ⚡🦬 problem occurs because (on web) App considers itself offline if Auth is not reachable. When Auth comes back online, all NewDot clients call ReconnectApp at close to the same time, leading to an artificial surge in traffic right when Auth comes back.

Note: this does not affect iOS or Android because on those platforms there are native platform APIs for network connectivity information, and we just trust those without considering whether the server is reachable.

Solution

Update the App network layer to track separately whether the internet is reachable and whether the Expensify servers are up. If the server was down and comes back online, add a randomized delay of between 0 and 20 seconds before executing reconnect callbacks, so that they are spread out across many devices.

To do this, we'll make the following changes:

Start separately tracking isInternetReachable and isServerReachable. We can do this by updating the reachabilityTest config of the @react-native-community/netinfo package such that:
- if the HTTP status code is not 200, the internet is down and isInternetReachable should be set to false
- if jsonCode is not 200, Auth is down and isServerReachable should be set to false
isOffline (the Onyx field) will be updated to be !isInternetReachable || !isServerReachable. This is essentially equivalent to what we have today.
Then, we'll address the thundering herd. To do this, we'll update triggerReconnectionCallbacks to add special handling for a new reason, which we can simply call serversStandingUp. If the serversStandingUp reason is provided, we'll add a random delay of between 0 and 20 seconds (inclusive) before executing reconnection callbacks. In aggregate, this spaces out the reconnection callbacks so that not all clients attempt to reconnect at the same time.
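
As a rough illustration of the two reachability flags described above, here is a minimal TypeScript sketch. The PingResult shape and the classifyPing and isOffline helpers are hypothetical names introduced for this example and are not taken from the App codebase. The sketch also assumes that a failed HTTP request leaves the last known isServerReachable value unchanged, which the issue does not specify.

```typescript
// Hypothetical shape of a ping outcome; field names are assumptions for illustration.
type PingResult = {
    httpStatus: number; // HTTP status code of the ping request
    jsonCode?: number; // jsonCode from the response body, if one was parsed
};

type Reachability = {
    isInternetReachable: boolean;
    isServerReachable: boolean;
};

// Derive the next reachability state from the previous state and a ping outcome.
function classifyPing(previous: Reachability, result: PingResult): Reachability {
    if (result.httpStatus !== 200) {
        // Non-200 HTTP status: treat the internet as down.
        // Assumption: the server flag keeps its last known value.
        return {...previous, isInternetReachable: false};
    }
    // The request got through, so the internet is up; jsonCode says whether Auth is up.
    return {isInternetReachable: true, isServerReachable: result.jsonCode === 200};
}

// Mirrors the proposed Onyx field: offline if either flag is false.
function isOffline(state: Reachability): boolean {
    return !state.isInternetReachable || !state.isServerReachable;
}
```

Keeping the two flags separate is what makes it possible to tell "the user's connection dropped" apart from "Auth went down", even though isOffline is true in both cases.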

Upwork Automation - Do Not Edit

Upwork Job URL: https://www.upwork.com/jobs/~012f66d59aecc97ac2
Upwork Job ID: 1816608269060407566
Last Price Increase: 2024-07-25
Automatic offers:

ShridharGoel | Contributor | 103272034

Issue Owner

Current Issue Owner: @roryabraham

melvin-bot · 2024-07-24T20:46:49Z

Triggered auto assignment to @sonialiap (NewFeature), see https://stackoverflowteams.com/c/expensify/questions/14418#:~:text=BugZero%20process%20steps%20for%20feature%20requests for more details. Please add this Feature request to a GH project, as outlined in the SO.

melvin-bot · 2024-07-24T20:46:51Z

⚠️ It looks like this issue is labelled as a New Feature but not tied to any GitHub Project. Keep in mind that all new features should be tied to GitHub Projects in order to properly track external CAP software time ⚠️

melvin-bot · 2024-07-24T20:46:58Z

Triggered auto assignment to Design team member for new feature review - @dannymcclain (NewFeature)

roryabraham · 2024-07-24T20:48:13Z

Definitely don't need design support on this one

ShridharGoel · 2024-07-25T09:51:56Z

Proposal

Please re-state the problem that we are trying to solve in this issue.

Delay reconnect callbacks when server is back after being down.

What is the root cause of that problem?

New change.

What changes do you think we should make in order to solve the problem?

We can update the triggerReconnectionCallbacks method:

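// Assumes throttle is lodash's (import throttle from 'lodash/throttle'); Log and
// reconnectionCallbacks are assumed to come from the surrounding module.
const triggerReconnectionCallbacks = throttle(
    (reason) => {
        let delay = 0;

        if (reason === 'serversStandingUp') {
            // Random integer delay between 0 and 20000 ms (inclusive) to spread clients out
            delay = Math.floor(Math.random() * 20001);
        }
        setTimeout(() => {
            Log.info(`[NetworkConnection] Firing reconnection callbacks because ${reason}`);
            Object.values(reconnectionCallbacks).forEach((callback) => {
                callback();
            });
        }, delay);
    },
    // Ignore repeat triggers within 5 seconds; only the leading call fires
    5000,
    {trailing: false},
);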

We'll create a new variable, isServerReachable.

In the reachabilityTest, when !response.ok is true, we return Promise.resolve(false).
When json.jsonCode !== 200, we set isServerReachable to false and return Promise.resolve(true).

Whenever setOfflineStatus is called, we pass !isInternetReachable || !isServerReachable.
The reason passed is serversStandingUp if isServerReachable was false before and is true now.
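
The transition check and the jittered delay from this proposal could be sketched roughly as follows in TypeScript. The names reasonForServerChange, reconnectDelayMs, and MAX_JITTER_MS are hypothetical and exist only for this example. The random source is injectable so the delay can be tested deterministically, which is a design choice of this sketch rather than something stated in the proposal.

```typescript
// Upper bound of the jitter window in milliseconds (20 seconds, per the issue).
const MAX_JITTER_MS = 20000;

// Returns 'serversStandingUp' only on a down -> up transition of the server flag.
function reasonForServerChange(
    wasServerReachable: boolean,
    isServerReachable: boolean,
): 'serversStandingUp' | undefined {
    return !wasServerReachable && isServerReachable ? 'serversStandingUp' : undefined;
}

// Picks how long to wait before firing reconnect callbacks.
// `random` must return a value in [0, 1), like Math.random.
function reconnectDelayMs(reason: string, random: () => number = Math.random): number {
    if (reason !== 'serversStandingUp') {
        return 0; // every other reason reconnects immediately
    }
    // floor(random * (max + 1)) gives a uniform integer in [0, max], inclusive
    return Math.floor(random() * (MAX_JITTER_MS + 1));
}
```

With a uniform delay over a 20 second window, clients that all notice the server at the same moment would be expected to spread their ReconnectApp calls roughly evenly across that window instead of arriving together.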

roryabraham · 2024-07-25T18:27:27Z

@ShridharGoel the problem with making this external is that it's not clear how you'd test the server being unreachable (i.e., Ping responding with a non-200 jsonCode)

ShridharGoel · 2024-07-25T18:38:04Z

Can network response mocking help?

roryabraham · 2024-07-25T19:20:30Z

Can you mock responses in the chrome dev tools network tab?

ShridharGoel · 2024-07-25T20:03:34Z

Can you mock responses in the chrome dev tools network tab?

Yes.

melvin-bot · 2024-07-25T22:55:39Z

Job added to Upwork: https://www.upwork.com/jobs/~012f66d59aecc97ac2

melvin-bot · 2024-07-25T22:55:44Z

Triggered auto assignment to Contributor-plus team member for initial proposal review - @allroundexperts (External)

melvin-bot · 2024-07-25T22:55:55Z

Current assignee @sonialiap is eligible for the Bug assigner, not assigning anyone new.

melvin-bot · 2024-07-25T22:56:01Z

📣 @ShridharGoel 🎉 An offer has been automatically sent to your Upwork account for the Contributor role 🎉 Thanks for contributing to the Expensify app!

Offer link
Upwork job
Please accept the offer and leave a comment on the Github issue letting us know when we can expect a PR to be ready for review 🧑‍💻
Keep in mind: Code of Conduct | Contributing 📖

wfdong · 2024-07-26T12:03:41Z

A random delay on client reconnection is not a perfect solution. Imagine that the number of clients surges in the future: would you then increase the 20 seconds to 40 seconds? The server-side code needs updating, e.g. add a message queue (a simple FIFO should be fine) to cache the reconnection callbacks. Even if you are not using any load balancing, the server can still store the reconnection metadata in RAM and then process it asynchronously (e.g. use a thread pool to traverse and handle these reconnection callbacks later).
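
To make the FIFO suggestion above concrete, here is a minimal in-memory sketch in TypeScript. It is an illustration only: ReconnectQueue and its methods are hypothetical names, the thread does not say what language or architecture the Expensify back-end uses, and a real server would also need persistence, timeouts, and backpressure, which are omitted here.

```typescript
// A tiny in-memory FIFO that holds reconnect requests and hands them to a
// handler in bounded batches, oldest first.
class ReconnectQueue<T> {
    private readonly pending: T[] = [];

    // Add a request to the back of the queue.
    enqueue(request: T): void {
        this.pending.push(request);
    }

    // Number of requests still waiting.
    get size(): number {
        return this.pending.length;
    }

    // Remove and handle up to `batchSize` of the oldest requests.
    // Returns how many requests were handled.
    drain(batchSize: number, handle: (request: T) => void): number {
        const batch = this.pending.splice(0, batchSize);
        batch.forEach(handle);
        return batch.length;
    }
}
```

Calling drain on a fixed schedule caps how many reconnects the server processes per interval, regardless of how many clients arrive at once.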

melvin-bot · 2024-07-26T12:03:44Z

📣 @wfdong! 📣
Hey, it seems we don’t have your contributor details yet! You'll only have to do this once, and this is how we'll hire you on Upwork.
Please follow these steps:

Make sure you've read and understood the contributing guidelines.
Get the email address used to login to your Expensify account. If you don't already have an Expensify account, create one here. If you have multiple accounts (e.g. one for testing), please use your main account email.
Get the link to your Upwork profile. It's necessary because we only pay via Upwork. You can access it by logging in, and then clicking on your name. It'll look like this. If you don't already have an account, sign up for one here.
Copy the format below and paste it in a comment on this issue. Replace the placeholder text with your actual details.

Format:

Contributor details
Your Expensify account email: <REPLACE EMAIL HERE>
Upwork Profile Link: <REPLACE LINK HERE>

roryabraham · 2024-07-29T17:52:49Z

Thanks for your feedback @wfdong. We agree that this solution won't scale forever, and are constantly working to improve the performance and reliability of our back-end. That said, we still think this change will be beneficial to our systems overall and serve us well for the foreseeable future

melvin-bot · 2024-08-06T03:50:23Z

Reviewing label has been removed, please complete the "BugZero Checklist".

melvin-bot · 2024-08-06T03:50:27Z

The solution for this issue has been 🚀 deployed to production 🚀 in version 9.0.16-8 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue:

Add delay before calling reconnect when server is back up #46399

If no regressions arise, payment will be issued on 2024-08-13. 🎊

For reference, here are some details about the assignees on this issue:

@allroundexperts requires payment through NewDot Manual Requests
@ShridharGoel requires payment automatic offer (Contributor)

melvin-bot · 2024-08-06T03:50:28Z

BugZero Checklist: The PR fixing this issue has been merged! The following checklist (instructions) will need to be completed before the issue can be closed:

[@allroundexperts] The PR that introduced the bug has been identified. Link to the PR:
[@allroundexperts] The offending PR has been commented on, pointing out the bug it caused and why, so the author and reviewers can learn from the mistake. Link to comment:
[@allroundexperts] A discussion in #expensify-bugs has been started about whether any other steps should be taken (e.g. updating the PR review checklist) in order to catch this type of bug sooner. Link to discussion:
[@allroundexperts] Determine if we should create a regression test for this bug.
[@allroundexperts] If we decide to create a regression test for the bug, please propose the regression test steps to ensure the same bug will not reach production again.
[@sonialiap] Link the GH issue for creating/updating the regression test once above steps have been agreed upon:

allroundexperts · 2024-08-11T20:54:55Z

@roryabraham I'm not sure how we can write a regression test for this. Do we need it? If so, can you suggest something? Thanks!

sonialiap · 2024-08-13T08:36:11Z

Payment summary:

@allroundexperts $250 - please request in ND
@ShridharGoel $250 - paid in upwork ✔️

sonialiap · 2024-08-16T08:54:14Z

@roryabraham bumping Sibtain's question about whether we need a regression test and if yes, how it should be written for this issue #46143 (comment)

flodnv · 2024-08-16T09:54:19Z

I don't think so, what would it be anyways?

sonialiap · 2024-08-16T12:11:57Z

That's the question 😂 Since Flo doesn't think we need one, closing out

flodnv · 2024-08-19T08:43:04Z

Ah, I forgot to mention that in the most recent downtime, I've observed that this problem is fixed for now 👍

JmillsExpensify · 2024-10-04T11:11:17Z

$250 approved for @allroundexperts

roryabraham added Weekly KSv2 NewFeature Something to build that is a new item. labels Jul 24, 2024

melvin-bot bot assigned sonialiap Jul 24, 2024

roryabraham changed the title ~~Delay reconnect callbacks to prevent thundering herd ddos~~ Delay reconnect callbacks to prevent thundering herd DDOS Jul 24, 2024

melvin-bot bot assigned dannymcclain Jul 24, 2024

roryabraham unassigned dannymcclain Jul 24, 2024

melvin-bot bot added Daily KSv2 and removed Weekly KSv2 labels Jul 25, 2024

roryabraham self-assigned this Jul 25, 2024

roryabraham added the External Added to denote the issue can be worked on by a contributor label Jul 25, 2024

melvin-bot bot changed the title ~~Delay reconnect callbacks to prevent thundering herd DDOS~~ [$250] Delay reconnect callbacks to prevent thundering herd DDOS Jul 25, 2024

melvin-bot bot added the Help Wanted Apply this label when an issue is open to proposals by contributors label Jul 25, 2024

melvin-bot bot assigned allroundexperts Jul 25, 2024

roryabraham added the Bug Something is broken. Auto assigns a BugZero manager. label Jul 25, 2024

roryabraham assigned ShridharGoel and unassigned allroundexperts Jul 25, 2024

melvin-bot bot removed the Help Wanted Apply this label when an issue is open to proposals by contributors label Jul 25, 2024

roryabraham assigned allroundexperts Jul 25, 2024

ShridharGoel mentioned this issue Jul 29, 2024

Add delay before calling reconnect when server is back up #46399

Merged

48 tasks

melvin-bot bot added Reviewing Has a PR in review Weekly KSv2 and removed Daily KSv2 Weekly KSv2 labels Jul 29, 2024

melvin-bot bot added Weekly KSv2 Awaiting Payment Auto-added when associated PR is deployed to production and removed Weekly KSv2 labels Aug 6, 2024

melvin-bot bot changed the title ~~[$250] Delay reconnect callbacks to prevent thundering herd DDOS~~ [HOLD for payment 2024-08-13] [$250] Delay reconnect callbacks to prevent thundering herd DDOS Aug 6, 2024

melvin-bot bot removed the Reviewing Has a PR in review label Aug 6, 2024

melvin-bot bot added Daily KSv2 and removed Weekly KSv2 labels Aug 12, 2024

melvin-bot bot added the Overdue label Aug 15, 2024

sonialiap closed this as completed Aug 16, 2024

melvin-bot bot removed the Overdue label Aug 16, 2024
