Search rate limits #33
Conversation
Do you think maybe we should be adding this site-wide instead? I know the problem is with search right now, but maybe we should be providing this sort of protection to all HTTP endpoints? So maybe it should be an
Ohh, that's a very different approach... Do you mean on flask_base?
I'd rather do it for search first and then do a phase 2 with all the endpoints. Applying it site-wide seems to cover too much, and we need to think it through: what would be the purpose of rate-limiting 404 pages? And do we have endpoints where we actually want to allow a higher limit than search? I'm not sure the idea is mature enough; there could be unforeseen issues affecting other endpoints, and it could be hard to control. Also, keeping this PR small makes it a kind of test: we can see if it actually works at stopping spam. Maybe it's not as effective as hoped, or the in-memory storage proves insufficient, and we need the IS approach, or Redis, or some other sort of storage.
👍 I agree. Thanks.
LGTM. You just need to fix the linter.
Looking at Graylog, a lot of the traffic comes from Googlebot and binbot, both in the most common user agents and the most common IPs (the top 5 are all Mountain View IPs). This is also confirmed by Google Search Console.
Due to excessive spamming, we are applying rate limits to the search endpoints. For instance, a user that requests the same endpoint 200 times in one minute will get a 429 response (a 200/minute limit). The limit is applied globally to all search endpoints, because otherwise the same IP hitting a different endpoint could still consume Search API quota and take down search. For the rate-limit strategy, fixed window was chosen mainly because it consumes the least memory; other strategies may be needed if this one is not effective. At the moment we don't have storage infrastructure, so we are using in-memory storage to track requests. Ideally, a backend storage service (Redis, Memcached) would be used.
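The fixed-window strategy described above can be illustrated with a minimal in-memory sketch (this is not the library code used in the PR; the class name and numbers are illustrative):

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Minimal fixed-window rate limiter sketch (in-memory)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit                    # e.g. 200 requests...
        self.window_seconds = window_seconds  # ...per 60-second window
        self.counters = defaultdict(int)      # (client, window index) -> hits

    def allow(self, client, now=None):
        """Return True if the request is within the limit; False means 429."""
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        self.counters[(client, window)] += 1
        # The counter resets at each window boundary, which is why this
        # strategy needs so little memory: one integer per active client.
        return self.counters[(client, window)] <= self.limit
```

A sliding-window strategy would smooth out the burst a client can fit around a window boundary, but it has to remember per-request timestamps instead of a single counter, which is the memory trade-off mentioned above.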
After doing some tests, I realised that a per-minute limit only blocks requests for that minute (after one minute, the client is allowed to send requests again). So I will change it to a per-day basis, e.g. 500/day, which means that after 500 requests that specific user is blocked from that endpoint for the rest of the day, in line with the quota on Google's side.
Writing a test also made me think that we should pass the rate limit as an argument, so that different endpoints can optionally use different limits.
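Passing the limit as an argument could look something like the following decorator-factory sketch (a hypothetical illustration, not the PR's actual code; the `search` view, the key function, and the `"500/day"` string are made up for the example):

```python
import time
from collections import defaultdict
from functools import wraps

# Window lengths for rate strings like "200/minute" or "500/day".
PERIODS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def rate_limited(rate, key_func):
    """Decorator factory: each endpoint picks its own limit, e.g. "500/day"."""
    count, period = rate.split("/")
    limit, window = int(count), PERIODS[period]
    counters = defaultdict(int)  # (client, window index) -> hits

    def decorator(view):
        @wraps(view)
        def wrapped(*args, **kwargs):
            key = (key_func(), int(time.time() // window))
            counters[key] += 1
            if counters[key] > limit:
                return "Too Many Requests", 429
            return view(*args, **kwargs)
        return wrapped
    return decorator


# Hypothetical usage: a search view limited to 500 requests per client per day.
@rate_limited("500/day", key_func=lambda: "203.0.113.9")
def search():
    return "results", 200
```

The same decorator can then be reused on other endpoints with a different rate string, which is the flexibility the test suggested.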
QA
Use canonical/ubuntu.com#12514
Fixes https://warthogs.atlassian.net/browse/WD-1859?atlOrigin=eyJpIjoiMTYxMzI5NGRlODFjNGIxY2E5NWZiOGYzZmFlMjRiNGMiLCJwIjoiaiJ9