Search rate limits #33
Conversation
Do you think maybe we should be adding this site-wide instead? I know the problem is with search right now, but maybe we should be providing this sort of protection to all HTTP endpoints? So maybe it should be an
Ohh, that's a very different approach... Do you mean on flask_base?
I'd rather do it for search first and then do a phase 2 with all the endpoints. Applying it site-wide seems to cover too much, and we need to think it through: what would be the purpose of rate-limiting 404 pages? And do we have endpoints where we actually want to allow a higher limit than search? I'm not sure the idea is mature enough; there could be unforeseen issues affecting other endpoints, and it could be hard to control. Also, keeping this PR small makes it a kind of test: we can see if it actually works at stopping spam. Maybe it's not as effective as hoped, or the in-memory storage proves insufficient, and we need the IS approach, or Redis, or some other sort of storage.
👍 I agree. Thanks.
LGTM. You just need to fix the linter.
Looking at Graylog, a lot of the traffic comes from Googlebot and binbot, both in the most common user agents and the most common IPs (the top 5 are all Mountain View IPs). This is also confirmed by Google Search Console.
Due to excessive spamming, we are applying rate limits to the search endpoints. For instance, a user that requests the same endpoint 200 times in one minute will get a 429 response (a 200/minute limit). The limit is applied globally to all search endpoints, because otherwise the same IP hitting a different endpoint could still consume Search API quota and take down search. For the rate-limit strategy, fixed window was chosen mainly because it consumes the least memory; other strategies may be needed if this one is not effective. At the moment we don't have storage infrastructure, so we are using in-memory storage to track requests. Ideally, a backend storage service (Redis, Memcached) would be used.
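The fixed-window strategy described above can be illustrated with a minimal in-memory sketch (this is not the library code used in the PR; the class name and numbers are illustrative):

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Minimal fixed-window rate limiter sketch (in-memory)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit                    # e.g. 200 requests...
        self.window_seconds = window_seconds  # ...per 60-second window
        self.counters = defaultdict(int)      # (client, window index) -> hits

    def allow(self, client, now=None):
        """Return True if the request is within the limit; False means 429."""
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        self.counters[(client, window)] += 1
        # The counter resets at each window boundary, which is why this
        # strategy needs so little memory: one integer per active client.
        return self.counters[(client, window)] <= self.limit
```

A sliding-window strategy would smooth out the burst a client can fit around a window boundary, but it has to remember per-request timestamps instead of a single counter, which is the memory trade-off mentioned above.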
After doing some tests, I realised that a per-minute limit only blocks requests for that minute (after one minute, the client is allowed to send requests again). So I will change it to a per-day basis, e.g. 500/day, which means that after 500 requests that specific user is blocked from that endpoint for the rest of the day, in line with the quota on Google's side.
Writing a test also made me think that we should pass the rate limit as an argument, so that different endpoints can optionally use different limits.
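Passing the limit as an argument could look something like the following decorator-factory sketch (a hypothetical illustration, not the PR's actual code; the `search` view, the key function, and the `"500/day"` string are made up for the example):

```python
import time
from collections import defaultdict
from functools import wraps

# Window lengths for rate strings like "200/minute" or "500/day".
PERIODS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def rate_limited(rate, key_func):
    """Decorator factory: each endpoint picks its own limit, e.g. "500/day"."""
    count, period = rate.split("/")
    limit, window = int(count), PERIODS[period]
    counters = defaultdict(int)  # (client, window index) -> hits

    def decorator(view):
        @wraps(view)
        def wrapped(*args, **kwargs):
            key = (key_func(), int(time.time() // window))
            counters[key] += 1
            if counters[key] > limit:
                return "Too Many Requests", 429
            return view(*args, **kwargs)
        return wrapped
    return decorator


# Hypothetical usage: a search view limited to 500 requests per client per day.
@rate_limited("500/day", key_func=lambda: "203.0.113.9")
def search():
    return "results", 200
```

The same decorator can then be reused on other endpoints with a different rate string, which is the flexibility the test suggested.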
QA
Use canonical/ubuntu.com#12514
Fixes https://warthogs.atlassian.net/browse/WD-1859?atlOrigin=eyJpIjoiMTYxMzI5NGRlODFjNGIxY2E5NWZiOGYzZmFlMjRiNGMiLCJwIjoiaiJ9