[HOLD for payment 2024-08-13] [$250] Delay reconnect callbacks to prevent thundering herd DDOS #46143

Closed
roryabraham opened this issue Jul 24, 2024 · 26 comments
Assignees
Labels
Awaiting Payment Auto-added when associated PR is deployed to production Bug Something is broken. Auto assigns a BugZero manager. Daily KSv2 External Added to denote the issue can be worked on by a contributor NewFeature Something to build that is a new item.

Comments

@roryabraham
Contributor

roryabraham commented Jul 24, 2024

Coming from https://expensify.slack.com/archives/C05CBC62HGW/p1721844511149789

Problem

When the site is fully offline and comes back up, App web clients DDOS the API. This "thundering herd" ⚡🦬 problem occurs because (on web) App considers itself offline if Auth is not reachable. When Auth comes back online, all NewDot clients call ReconnectApp at close to the same time, leading to an artificial surge in traffic right when Auth comes back.

Note: this does not affect iOS or Android because on those platforms there are native platform APIs for network connectivity information, and we just trust those without considering whether the server is reachable.

Solution

Update the App network layer to track separately whether the internet is reachable and whether the Expensify servers are reachable. If the server was down and comes back online, add a randomized delay of between 0 and 20 seconds before executing reconnect callbacks, so that reconnects are spread out across many devices.

To do this, we'll make the following changes:

  • Start separately tracking isInternetReachable and isServerReachable. We can do this by updating the reachabilityTest config of the @react-native-community/netinfo package such that:
    • if the HTTP status code is != 200, the internet is down and isInternetReachable should be set to false
    • if jsonCode is != 200, Auth is down and isServerReachable should be set to false
  • isOffline (the Onyx field) will be updated to be !isInternetReachable || !isServerReachable. This is essentially equivalent to what we have today
  • Then, we'll address the thundering herd. To do this, we'll update triggerReconnectionCallbacks to add some special handling for a new reason which we can simply call serversStandingUp. If the serversStandingUp reason is provided, then we'll add a random delay of between 0 and 20 seconds (inclusive) before executing reconnection callbacks. This means that in aggregate we'll space out the reconnection callbacks and not all clients will attempt to reconnect at the same time.
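A minimal sketch of the two-flag tracking described above. The function names, the state shape, and the choice to keep the previous isServerReachable value when the HTTP request fails are assumptions for illustration; the issue does not specify them, and this is not the App's actual implementation.

```javascript
// Hypothetical helper: derive the two reachability flags from one ping result.
// `previous` is the last known state, httpStatus is the HTTP status code,
// and jsonCode is the jsonCode field of the parsed response body (if any).
function updateReachability(previous, httpStatus, jsonCode) {
    if (httpStatus !== 200) {
        // No usable HTTP response: treat the internet as down.
        // Assumption: we learn nothing about Auth, so keep the previous value.
        return {isInternetReachable: false, isServerReachable: previous.isServerReachable};
    }
    // HTTP succeeded, so the internet is up; jsonCode says whether Auth is up.
    return {isInternetReachable: true, isServerReachable: jsonCode === 200};
}

// isOffline is true when either the internet or the server is unreachable.
function isOffline(state) {
    return !state.isInternetReachable || !state.isServerReachable;
}
```

Keeping the flags separate is what later lets the client tell "my network came back" apart from "the servers came back", and only the second case needs the randomized delay.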
Upwork Automation - Do Not Edit
  • Upwork Job URL: https://www.upwork.com/jobs/~012f66d59aecc97ac2
  • Upwork Job ID: 1816608269060407566
  • Last Price Increase: 2024-07-25
  • Automatic offers:
    • ShridharGoel | Contributor | 103272034
Current Issue Owner: @roryabraham
@roryabraham roryabraham added Weekly KSv2 NewFeature Something to build that is a new item. labels Jul 24, 2024

melvin-bot bot commented Jul 24, 2024

Triggered auto assignment to @sonialiap (NewFeature), see https://stackoverflowteams.com/c/expensify/questions/14418#:~:text=BugZero%20process%20steps%20for%20feature%20requests for more details. Please add this Feature request to a GH project, as outlined in the SO.


melvin-bot bot commented Jul 24, 2024

⚠️ It looks like this issue is labelled as a New Feature but not tied to any GitHub Project. Keep in mind that all new features should be tied to GitHub Projects in order to properly track external CAP software time ⚠️

@roryabraham roryabraham changed the title from "Delay reconnect callbacks to prevent thundering herd ddos" to "Delay reconnect callbacks to prevent thundering herd DDOS" Jul 24, 2024

melvin-bot bot commented Jul 24, 2024

Triggered auto assignment to Design team member for new feature review - @dannymcclain (NewFeature)

@roryabraham
Contributor Author

Definitely don't need design support on this one

@ShridharGoel
Contributor

ShridharGoel commented Jul 25, 2024

Proposal

Please re-state the problem that we are trying to solve in this issue.

Delay reconnect callbacks when server is back after being down.

What is the root cause of that problem?

New change.

What changes do you think we should make in order to solve the problem?

We can update the triggerReconnectionCallbacks method:

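import throttle from 'lodash/throttle';

// Upper bound (in ms) for the randomized delay applied when the servers come back up
const MAX_RECONNECT_DELAY_MS = 20000;

// Log and reconnectionCallbacks are assumed to be defined elsewhere in this module
const triggerReconnectionCallbacks = throttle(
    (reason) => {
        let delay = 0;

        if (reason === 'serversStandingUp') {
            // Random whole-millisecond delay between 0 and 20 seconds (inclusive)
            delay = Math.floor(Math.random() * (MAX_RECONNECT_DELAY_MS + 1));
        }
        setTimeout(() => {
            Log.info(`[NetworkConnection] Firing reconnection callbacks because ${reason}`);
            Object.values(reconnectionCallbacks).forEach((callback) => {
                callback();
            });
        }, delay);
    },
    5000,
    {trailing: false},
);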

We'll create a new variable, isServerReachable.

In the reachabilityTest, when !response.ok is true, we return Promise.resolve(false).
When json.jsonCode !== 200, we set isServerReachable to false and return Promise.resolve(true).

Whenever setOfflineStatus is called, we pass !isInternetReachable || !isServerReachable.
The reason passed is serversStandingUp if isServerReachable was false before and is true now.
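The transition check and the delay choice can be sketched as two small pure functions. The names getReconnectReason and getReconnectDelay, the fallbackReason parameter, and the injectable random argument are assumptions made for illustration and testability; they are not taken from the App's code.

```javascript
// Upper bound (in ms) for the randomized delay; 20 seconds per the issue description.
const MAX_DELAY_MS = 20000;

// Hypothetical helper: use 'serversStandingUp' only on a false -> true transition
// of isServerReachable; otherwise keep whatever reason the caller already had.
function getReconnectReason(wasServerReachable, isServerReachable, fallbackReason) {
    return !wasServerReachable && isServerReachable ? 'serversStandingUp' : fallbackReason;
}

// Hypothetical helper: whole-millisecond delay in [0, MAX_DELAY_MS] for the
// 'serversStandingUp' reason, and no delay for any other reason.
// `random` is injectable so the bounds can be checked deterministically.
function getReconnectDelay(reason, random = Math.random) {
    if (reason !== 'serversStandingUp') {
        return 0;
    }
    return Math.floor(random() * (MAX_DELAY_MS + 1));
}
```

Because Math.random() returns a value in [0, 1), multiplying by MAX_DELAY_MS + 1 before flooring is what makes the 20-second upper bound reachable without exceeding it.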

@melvin-bot melvin-bot bot added Daily KSv2 and removed Weekly KSv2 labels Jul 25, 2024
@roryabraham
Contributor Author

@ShridharGoel the problem with making this external is that it's not clear how you'd test the server being unreachable (i.e. Ping responding with a non-200 jsonCode)

@ShridharGoel
Contributor

ShridharGoel commented Jul 25, 2024 via email

@roryabraham
Contributor Author

Can you mock responses in the chrome dev tools network tab?

@ShridharGoel
Contributor

Can you mock responses in the chrome dev tools network tab?

Yes.

@roryabraham roryabraham self-assigned this Jul 25, 2024
@roryabraham roryabraham added the External Added to denote the issue can be worked on by a contributor label Jul 25, 2024
@melvin-bot melvin-bot bot changed the title from "Delay reconnect callbacks to prevent thundering herd DDOS" to "[$250] Delay reconnect callbacks to prevent thundering herd DDOS" Jul 25, 2024

melvin-bot bot commented Jul 25, 2024

Job added to Upwork: https://www.upwork.com/jobs/~012f66d59aecc97ac2

@melvin-bot melvin-bot bot added the Help Wanted Apply this label when an issue is open to proposals by contributors label Jul 25, 2024

melvin-bot bot commented Jul 25, 2024

Triggered auto assignment to Contributor-plus team member for initial proposal review - @allroundexperts (External)

@roryabraham roryabraham added the Bug Something is broken. Auto assigns a BugZero manager. label Jul 25, 2024

melvin-bot bot commented Jul 25, 2024

Current assignee @sonialiap is eligible for the Bug assigner, not assigning anyone new.

@melvin-bot melvin-bot bot removed the Help Wanted Apply this label when an issue is open to proposals by contributors label Jul 25, 2024

melvin-bot bot commented Jul 25, 2024

📣 @ShridharGoel 🎉 An offer has been automatically sent to your Upwork account for the Contributor role 🎉 Thanks for contributing to the Expensify app!

Offer link
Upwork job
Please accept the offer and leave a comment on the Github issue letting us know when we can expect a PR to be ready for review 🧑‍💻
Keep in mind: Code of Conduct | Contributing 📖

@wfdong

wfdong commented Jul 26, 2024

A random delay on client reconnection is not an ideal solution. Imagine the number of clients surges in the future: would you then increase the 20 seconds to 40 seconds? This needs a server-side change, e.g. adding a message queue (a simple FIFO should be fine) to buffer the reconnection requests. Even if you are not using any load balancing, the server can simply store the reconnection metadata in RAM and process it asynchronously (e.g. use a thread pool to traverse and handle the queued reconnections later).
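As a rough illustration of the FIFO idea in this comment (not a description of Expensify's back-end, and not something the maintainers adopted), an in-memory queue that is drained in bounded batches might look like the sketch below. The class name, the clientID argument, and the batch size are all hypothetical.

```javascript
// Hypothetical in-memory FIFO of pending reconnect requests.
class ReconnectQueue {
    constructor() {
        this.pending = [];
    }

    // Record a reconnect request instead of handling it immediately.
    enqueue(clientID) {
        this.pending.push(clientID);
    }

    // Remove and return up to batchSize requests, oldest first, so a worker
    // can process them at a rate the server chooses.
    drain(batchSize) {
        return this.pending.splice(0, batchSize);
    }

    get size() {
        return this.pending.length;
    }
}
```

A worker would call drain on a timer or from a pool of handlers, which caps the processing rate regardless of how many clients reconnect at once.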


melvin-bot bot commented Jul 26, 2024

📣 @wfdong! 📣
Hey, it seems we don’t have your contributor details yet! You'll only have to do this once, and this is how we'll hire you on Upwork.
Please follow these steps:

  1. Make sure you've read and understood the contributing guidelines.
  2. Get the email address used to login to your Expensify account. If you don't already have an Expensify account, create one here. If you have multiple accounts (e.g. one for testing), please use your main account email.
  3. Get the link to your Upwork profile. It's necessary because we only pay via Upwork. You can access it by logging in, and then clicking on your name. It'll look like this. If you don't already have an account, sign up for one here.
  4. Copy the format below and paste it in a comment on this issue. Replace the placeholder text with your actual details.
    Format:
Contributor details
Your Expensify account email: <REPLACE EMAIL HERE>
Upwork Profile Link: <REPLACE LINK HERE>

@roryabraham
Contributor Author

Thanks for your feedback @wfdong. We agree that this solution won't scale forever, and are constantly working to improve the performance and reliability of our back-end. That said, we still think this change will be beneficial to our systems overall and serve us well for the foreseeable future

@melvin-bot melvin-bot bot added Weekly KSv2 Awaiting Payment Auto-added when associated PR is deployed to production and removed Weekly KSv2 labels Aug 6, 2024
@melvin-bot melvin-bot bot changed the title from "[$250] Delay reconnect callbacks to prevent thundering herd DDOS" to "[HOLD for payment 2024-08-13] [$250] Delay reconnect callbacks to prevent thundering herd DDOS" Aug 6, 2024

melvin-bot bot commented Aug 6, 2024

Reviewing label has been removed, please complete the "BugZero Checklist".

@melvin-bot melvin-bot bot removed the Reviewing Has a PR in review label Aug 6, 2024

melvin-bot bot commented Aug 6, 2024

The solution for this issue has been 🚀 deployed to production 🚀 in version 9.0.16-8 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue:

If no regressions arise, payment will be issued on 2024-08-13. 🎊

For reference, here are some details about the assignees on this issue:


melvin-bot bot commented Aug 6, 2024

BugZero Checklist: The PR fixing this issue has been merged! The following checklist (instructions) will need to be completed before the issue can be closed:

  • [@allroundexperts] The PR that introduced the bug has been identified. Link to the PR:
  • [@allroundexperts] The offending PR has been commented on, pointing out the bug it caused and why, so the author and reviewers can learn from the mistake. Link to comment:
  • [@allroundexperts] A discussion in #expensify-bugs has been started about whether any other steps should be taken (e.g. updating the PR review checklist) in order to catch this type of bug sooner. Link to discussion:
  • [@allroundexperts] Determine if we should create a regression test for this bug.
  • [@allroundexperts] If we decide to create a regression test for the bug, please propose the regression test steps to ensure the same bug will not reach production again.
  • [@sonialiap] Link the GH issue for creating/updating the regression test once above steps have been agreed upon:

@allroundexperts
Contributor

@roryabraham I'm not sure how we can write a regression test for this. Do we need one? If so, can you suggest something? Thanks!

@melvin-bot melvin-bot bot added Daily KSv2 and removed Weekly KSv2 labels Aug 12, 2024
@sonialiap
Contributor

Payment summary:

@melvin-bot melvin-bot bot added the Overdue label Aug 15, 2024
@sonialiap
Contributor

@roryabraham bumping Sibtain's question about whether we need a regression test and if yes, how it should be written for this issue #46143 (comment)

@flodnv
Contributor

flodnv commented Aug 16, 2024 via email

@sonialiap
Contributor

That's the question 😂 Since Flo doesn't think we need one, closing out

@melvin-bot melvin-bot bot removed the Overdue label Aug 16, 2024
@flodnv
Contributor

flodnv commented Aug 19, 2024

Ah, I forgot to mention that in the most recent downtime, I've observed that this problem is fixed for now 👍

@JmillsExpensify

$250 approved for @allroundexperts

8 participants