
Suspending accounts causes huge delays in federation #9377

Closed
ChatonneLibertaire opened this issue Nov 27, 2018 · 18 comments
Labels: partially a bug (Architecture or design-imposed shortcomings) · performance (Runtime performance)

Comments

@ChatonneLibertaire

Expected behaviour

Suspending accounts should be smooth and shouldn't create any delays in the federation

Actual behaviour

anarchism.space is currently (like many other instances) dealing with a spambot surge. However, suspending the bots causes my instance to have huge delays in posting things to the federation (not in receiving). The CPU is fine (8 cores at about 5% each), but the nginx log is filled with attempted accesses to the suspended accounts from all the other instances I'm federated with (I'm not sure whether that's the cause, but maybe).

Steps to reproduce the problem

  1. Have a decently federated instance
  2. Suspend >20 accounts
  3. Send a DM to someone on another instance
  4. Wait 30 minutes for the DM to be received

Specifications

Mastodon: 2.6.1

@Gargron
Member

Gargron commented Nov 27, 2018

Would I be correct to guess that you participate in a relay, and do not have proxy caching configured in nginx?

@ChatonneLibertaire
Author

@Gargron what do you mean by relay? Also, I have exactly the same nginx configuration as noted in the old Production guide (https://github.com/tootsuite/documentation/blob/master/Running-Mastodon/Production-guide.md#nginx-configuration) and the default configuration otherwise, which I guess doesn't have proxy caching?

@Gargron
Member

Gargron commented Nov 27, 2018

Compare your configuration with: https://github.com/tootsuite/mastodon/blob/master/dist/nginx.conf

I don't see a way for new accounts to cause requests from other servers unless they get followers from those servers or you are using a relay service which broadcasts their posts. So new accounts are likely unrelated to your issues.
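
For reference, the proxy caching mentioned here looks roughly like this in nginx (a minimal sketch; the cache path, zone name, sizes, and backend address are illustrative, and the dist/nginx.conf linked above is the canonical reference):

# In the http {} context: define an on-disk cache zone.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=CACHE:10m inactive=7d max_size=1g;

server {
    # ...
    # In the location that proxies to the Rails backend
    # (Mastodon web listens on 127.0.0.1:3000 by default):
    location @proxy {
        proxy_pass http://127.0.0.1:3000;

        proxy_cache CACHE;
        proxy_cache_valid 200 7d;
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;

        # Handy for checking whether a response was served from the cache.
        add_header X-Cached $upstream_cache_status;
    }
}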

@ChatonneLibertaire
Author

So, indeed, there was no proxy caching. I patched the nginx config, restarted nginx, and tried suspending just one ad-bot (no followers, no toots, nothing), and I still get flooded with requests from across the federation. This is an example of the lines I get in the nginx log:

[27/Nov/2018:22:43:37 +0000] "GET /users/<bot name> HTTP/1.1" 410 36 "-" "http.rb/3.3.0 (Mastodon/2.6.2; +<mastodon instance>)"

And the same behaviour persists when I send a message to another instance (i.e. a very, very long delay).

@Gargron
Member

Gargron commented Nov 27, 2018

Ah, I know what's happening. You're right after all. When an account is deleted, we want to make sure that everyone deletes it. So, we forward the delete to every known server. Ironically, for new accounts, most servers don't know them, and have to look them up to get the public key to even read the delete message.

That is a consequence of #8305

@nightpool
Member

nightpool commented Nov 27, 2018 via email

@nightpool
Member

nightpool commented Nov 27, 2018 via email

@ClearlyClaire
Contributor

ClearlyClaire commented Nov 27, 2018

I guess the issue isn't so much the fact that the remote servers do a useless query, but that mass-suspending users queues a hell of a lot of Delete delivery jobs (one per suspended account per unique known remote inbox).

@nightpool
Member

I guess? I wouldn't expect Delete jobs to be particularly more expensive or numerous than other types of jobs. Maybe anarchism.space has a lot of known remote inboxes but comparatively few remote followers?

@ClearlyClaire
Contributor

On my single-user instance, Account.inboxes returns 1802 entries. I assume it's slightly higher for anarchism.space, so suspending 20 accounts means tens of thousands of network-bound jobs (on the order of 1,800 × 20 ≈ 36,000), which can definitely cause some delay in job processing.

I'm not too sure how we can make this more efficient, as we have no way to track who has seen (and copied) accounts. We have the same issue with toot deletion to be honest.

@ChatonneLibertaire
Author

ChatonneLibertaire commented Feb 11, 2019

Can someone take care of this? I can't do this anymore... I had more than a hundred spam bots to remove this morning, and now my instance is cut off from federation for probably the whole day because of this issue.

Please do something. Who do I have to implore to do something about either this issue or the spam bot problem in general?

@ClearlyClaire
Contributor

At the very least, I think we should move Deletes sent to instances without followers to the lowest-priority queue (i.e., pull at the time of writing).

@penartur

penartur commented Mar 20, 2019

So what should we do when we're flooded with these GET requests for suspended users? Just wait it out?

An odd thing is that I see some instances doing several requests for a single suspension, e.g. toot.cafe, masto.themimitoof.fr and mastodon.huloop.com on this screenshot:

[screenshot: access log entries]

(Also, it's been about 4 hours since I suspended some dozens of bots on my small, unpopular instance, and the torrent does not seem to fade... if only Cloudflare supported caching 410 Gone responses and spared me some CPU load...)

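When nginx itself sits in front of Mastodon with proxy caching enabled, those repeated lookups can be absorbed there instead; a hedged sketch (assuming the caching location block sketched earlier, and that the upstream 410 response doesn't forbid caching; the 24h lifetime is illustrative):

# Cache 410 Gone responses so repeated lookups of a suspended account
# are answered from the nginx cache instead of hitting the Rails backend.
proxy_cache_valid 410 24h;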

@ClearlyClaire
Contributor

I guess we could add a shortcut to not try fetching the key for an account deletion when we don't know the account… it would be a bit awkward to add such special casing at this point in the flow, but it would definitely make sense

@penartur

penartur commented Mar 20, 2019

So the wave has finally abated, after ~5 hours of flooding my instance with requests at 100x (!) the average rate and pushing my CPU to 100% load. Usually I get that number of requests (170k) over a good month...

And I've only suspended ~100 bots with zero followers and zero posts.


Thankfully my instance runs on decent hardware and was able to survive this without significant disruption of service. However, I believe that for some other instances this could feel more like a DDoS attack.

So the fix is definitely warranted IMHO, even if it will look a bit awkward.

@ClearlyClaire
Contributor

The awkward fix is in master: #10326
It won't stop other software from performing such requests though.

@angristan
Contributor

Getting hit with the issue:

root@mstdn ~# grep -c "/users/Eagleeyeadventures" /var/log/nginx/mstdn-access.log
22290

My instance became unavailable because of one account.

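As a purely local stopgap while such a flood lasts, nginx can also answer for the gone account directly so the requests never reach Rails at all (an illustrative sketch only, reusing the account path from the grep above; a cached 410 as sketched earlier achieves much the same with less manual work):

# Hypothetical emergency measure: serve 410 Gone for this one suspended
# account straight from nginx, bypassing the backend entirely.
location = /users/Eagleeyeadventures {
    return 410;
}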

ClearlyClaire added a commit to ClearlyClaire/mastodon that referenced this issue Mar 21, 2019:
This will help a great deal with mastodon#9377 when a caching reverse proxy is configured.

Gargron pushed a commit that referenced this issue Mar 21, 2019:
This will help a great deal with #9377 when a caching reverse proxy is configured.
Gargron added the "partially a bug" label May 1, 2019.
@mjankowski
Contributor

Last update here was ~6 years ago... I'm going to close this on the assumption that the previously referenced commit, while not a full or exhaustive solution, is an "as good as we can come up with" solution. Please reopen or comment if there's something specific that still exists on current versions and could use renewed, robust, respectful contemplation.

trwnh added the "performance" label Nov 27, 2024.

8 participants