
Search rate limits #33

Merged: 2 commits merged into canonical:main from search-rate-limits on Feb 21, 2023
Conversation

@carkod (Contributor) commented on Feb 8, 2023:

After doing some tests, I realised that a per-minute limit only blocks requests for that minute (after 1 minute, the client is allowed to send requests again). So I will change it to a per-day basis, e.g. 500/day, which means that after 500 requests that specific user is blocked from that endpoint for the rest of the day; this adheres to the quota on Google's side.

Writing a test also made me think that we should pass the rate limit as an argument, so that different endpoints can optionally change the limit.
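The diff itself isn't shown in this thread, but assuming the rate limiting is done with Flask-Limiter (the fixed-window and in-memory wording later in the thread points that way), a per-day limit passed in as an argument could look roughly like the sketch below; `register_search` and the `/search` route are illustrative names, not the repository's actual code:

```python
# Sketch only; assumes Flask-Limiter 3.x, whose constructor takes the key
# function as the first positional argument. Names below are illustrative.
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)

# Identify clients by remote address; counts live in process memory.
limiter = Limiter(get_remote_address, app=app, storage_uri="memory://")


def register_search(app, limit="500/day"):
    """Register the search view with a caller-supplied rate limit."""

    @app.route("/search")
    @limiter.limit(limit)  # e.g. "500/day": the 501st request in the window gets a 429
    def search():
        return "results"


register_search(app, limit="500/day")
```

Passing the limit string into the registration helper is one way each endpoint could override the default without touching the shared limiter setup.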

QA

Use canonical/ubuntu.com#12514

Fixes https://warthogs.atlassian.net/browse/WD-1859?atlOrigin=eyJpIjoiMTYxMzI5NGRlODFjNGIxY2E5NWZiOGYzZmFlMjRiNGMiLCJwIjoiaiJ9

@carkod changed the title from Search rate limits to WIP Search rate limits on Feb 8, 2023
@carkod force-pushed the search-rate-limits branch 3 times, most recently from 71e8dff to 886fc6e, on February 8, 2023 at 14:34
@nottrobin (Contributor) commented:
Do you think maybe we should be adding this site-wide instead? I know the problem is with search right now, but maybe we should be providing this sort of protection to all HTTP endpoints? So maybe it should be an after_request function that checks the mime type and is applied to anything of type text/html (meaning it's presumably not an API endpoint). What do you think?
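A rough sketch of that site-wide idea, using plain Flask hooks; the counter and window handling here are purely illustrative of the suggestion, not the approach this PR took:

```python
# Hypothetical site-wide limiter: count only text/html responses in an
# after_request hook, reject over-limit clients in before_request.
import time
from collections import defaultdict

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW = 24 * 60 * 60   # one day, in seconds (illustrative)
LIMIT = 500             # requests per window per client (illustrative)

# client IP -> (window start, request count); in-memory, per-process only
_counters = defaultdict(lambda: (0.0, 0))


@app.before_request
def reject_over_limit():
    start, count = _counters[request.remote_addr]
    if time.time() - start < WINDOW and count >= LIMIT:
        abort(429)


@app.after_request
def count_html_responses(response):
    # Only count responses rendered as HTML pages, i.e. presumably not API endpoints.
    if response.mimetype == "text/html":
        now = time.time()
        start, count = _counters[request.remote_addr]
        if now - start >= WINDOW:
            _counters[request.remote_addr] = (now, 1)
        else:
            _counters[request.remote_addr] = (start, count + 1)
    return response
```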

@carkod (Contributor, Author) commented on Feb 8, 2023:

> Do you think maybe we should be adding this site-wide instead? I know the problem is with search right now, but maybe we should be providing this sort of protection to all HTTP endpoints? So maybe it should be an after_request function that checks the mime type and is applied to anything of type text/html (meaning it's presumably not an API endpoint). What do you think?

Ohh, that's a very different approach... Do you mean on flask_base?

@carkod (Contributor, Author) commented on Feb 8, 2023:

I'd rather do it for search first and then we can do a phase 2 with all the endpoints? It seems like that covers too much and we need to think it through. I mean, what would be the purpose of rate-limiting 404 pages? Or do we have endpoints where we actually want to allow people a higher request limit than search? I am not sure the idea is mature enough; there could be unforeseen issues affecting endpoints and it could be hard to control.

Also, doing this small PR can be a kind of test, to see if it actually works at stopping spam. Maybe it's not as efficient, or the in-memory storage proves not to be enough, and we need the IS approach, or Redis or some other sort of storage?

@carkod force-pushed the search-rate-limits branch 5 times, most recently from 7a63853 to c0b893a, on February 9, 2023 at 13:18
@carkod force-pushed the search-rate-limits branch 2 times, most recently from 5eb749c to 6740195, on February 9, 2023 at 13:24
setup.py: review thread resolved
@carkod changed the title from WIP Search rate limits to Search rate limits on Feb 9, 2023
@nottrobin (Contributor) commented:
> I'd rather do it for search first and then we can do a phase 2 with all the endpoints? It seems like that covers too much and we need to think it through. I mean, what would be the purpose of rate-limiting 404 pages? Or do we have endpoints where we actually want to allow people a higher request limit than search? I am not sure the idea is mature enough; there could be unforeseen issues affecting endpoints and it could be hard to control.
>
> Also, doing this small PR can be a kind of test, to see if it actually works at stopping spam. Maybe it's not as efficient, or the in-memory storage proves not to be enough, and we need the IS approach, or Redis or some other sort of storage?

👍 I agree. Thanks.

@carkod force-pushed the search-rate-limits branch 2 times, most recently from 34d9b8e to c86d7b9, on February 16, 2023 at 16:10
@nottrobin (Contributor) left a review:
LGTM. You just need to fix the linter.

Looking at graylog, a lot of the traffic comes from Googlebot and binbot, both in the "Most common user agents" data and in the most common IPs (the top 5 are all Mountain View IPs). This is also confirmed by looking at Google Search Console.

Due to excess spamming, we are applying rate limits to the search endpoints. This means, for instance, that a user who requests the same endpoint 200 times in one minute will get a 429 response (200/minute). The limit is applied globally to all search, because otherwise the same IP hitting a different search endpoint could still consume Search API quota and thus take down search.

In terms of rate limit strategy, fixed window was chosen mainly because it consumes the least memory. Other strategies may be needed if this is not effective.

At the moment, we don't have storage infrastructure, so we are using in-memory storage to track requests. Ideally, some backend storage service (Redis, Memcache) would be used.
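As a sketch of how the setup described above could be wired together, again assuming Flask-Limiter: only the 200/minute figure, the fixed-window strategy, the in-memory storage and the shared scope across search come from the text above; the route names and the second search endpoint are made up for illustration.

```python
# Illustrative sketch, assuming Flask-Limiter 3.x; not the repository's actual code.
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)

limiter = Limiter(
    get_remote_address,
    app=app,
    strategy="fixed-window",  # cheapest on memory, as noted above
    storage_uri="memory://",  # could become redis://... once backend storage exists
)

# One shared counter for every search view, so the same IP cannot dodge the
# limit by rotating between search endpoints.
search_limit = limiter.shared_limit("200/minute", scope="search")


@app.route("/search")
@search_limit
def search():
    return "results"


@app.route("/docs/search")
@search_limit
def docs_search():
    return "docs results"
```

With a shared scope, requests to either route draw from the same counter, so Flask-Limiter starts returning 429 once the combined total passes 200 within the minute.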
@carkod merged commit 41670f7 into canonical:main on Feb 21, 2023
@carkod deleted the search-rate-limits branch on February 21, 2023 at 16:18