‘Rate-Limit’ FediFetcher #123
Comments
It is probably best to add rate limits, and to work on respecting robots.txt as a courtesy (#84). There is clearly a need for this option now, so that FediFetcher can be stopped and the instances that run it are not IP-blocked. |
Agreed. I’ve begun work on this one. #84 is proving tricky, as parsing robots.txt is very hard and I haven’t yet found a good Python package that simplifies it. (Nor do I think that fetching robots.txt over and over again would be a great solution, so we’d need to cache that too…) |
Agreed, this sounds like a great first step. Would there also be a way to make this the default behavior, but allow additional parameters to adjust the frequency for each of these three manually? And/or to adjust the hour ranges as well? |
Thanks @colinstu12. I'm not sure about configurability: it seems we'd need a lot of configuration options to make this work, which could get very confusing. Can you think of a way of implementing configurability without it being confusing or overwhelming? FYI: I've just merged #124, which implements a first stab at this. It would be great if people could try it out and give feedback. |
This will at least allow adjusting the number of times for each of these already committed time ranges. As for adjusting the time ranges themselves, I worry about folks potentially ending up with overlaps or gaps between them, causing god-knows-what kind of issues, so maybe it's best to skip that (unless the error checking isn't too difficult to implement, or it can fall back to the default behavior when there are gaps, or something). Actually, maybe it could be something like this, which would eliminate the possibility of gaps or overlap.
I guess there isn't any need to adjust the "long" one, because it will always be whatever is older / falls out of the range of the "medium" time period. |
Thanks @colinstu12. I must say that I'm still not convinced this adds much value, or that it is easy to understand. Will give it some more thought. |
FYI, I am paul on Mastodon. Regarding my high traffic and number of users: you are probably going to like what I have to say. You can disregard the stats I posted in that thread about my instance. After the recent update to FediFetcher, where the server name is added to the User-Agent, it turns out the crawls are coming from within. It also appears that each run is being counted as a unique visitor. I don't know what is going on with leah at chaos.social; maybe their users have it set up via api dev and the high number of crawls is coming from their own users rather than from outside. Only time will tell, once the server address shows up next to the fetcher in their logs, if users update. You probably need to change the User-Agent to include "run by user from +servername", so people don't think it is masking their server's name, especially since it can be run as a GitHub Action. Something like, |
This is extremely helpful, @p37307! Thanks for that! I was actually almost expecting this to be the case, but I'm relieved to have it confirmed. I guess this at least means instance admins can make informed decisions to disallow their own users' use of FediFetcher if they feel that's needed, or potentially prod those users into collaborating. (Running one FediFetcher instance with multiple API keys is much more efficient than every user running their own instance.)
Valid point. I was actually thinking of including the user's name. But then that raises potentially serious privacy concerns. I don't like your suggestion too much though - it's just so verbose. 🤔 |
...especially since the IP and the username would then be paired. That's not an issue if the pair only appears in the logs of the user's own server, since that server already sees the logged-in user's IP. But writing it as a pair into access logs at other instances would give the other side, especially a hostile one, the connection between the user's account and their IP. That could be bad, especially if the FediFetcher user is running it from their home computer, e.g. on Windows 10 via Python. Not optimal.
I can be a dull person. :) |
Hi, leah from chaos.social here. These are the numbers of requests per endpoint over 19 hours on chaos.social using the FediFetcher UA:
|
Thanks @leahoswald. Again, I appreciate you taking the time to respond! |
So, I thought I'd mention a couple of stats here as well: I'm currently testing this, and it had the following impact:
|
I’m closing this issue now, as I’ve just released v7.1.0, which includes this. @leahoswald @p37307 I’d love to have some feedback from you in a few days' or weeks' time, to see what difference (if any) these changes have made. |
This is essentially a follow-up to #122.
During that discussion the question was raised whether FediFetcher needs to re-fetch context for the same post over and over again.
Since context can change when a new reply is added, we can’t simply never re-request context. But we also don’t need to re-check for new replies 100 times per hour when a year-old post has been boosted into someone’s timeline.
So, in the ideal world we’ll have some sort of back off mechanism where
I have no idea how feasible this is (some architectural decisions taken early on in FediFetcher’s development do make this challenging), but clearly the current situation cannot continue.
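To make the idea concrete, here is a rough sketch of one possible age-based backoff, where the wait between context checks grows with the age of the post. It is not a description of what FediFetcher does or will do: the thresholds, `AGE_FRACTION`, and both function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tuning values; not FediFetcher's actual settings.
MIN_INTERVAL = timedelta(minutes=5)  # never re-check more often than this
MAX_INTERVAL = timedelta(days=1)     # never wait longer than this
AGE_FRACTION = 0.1                   # wait roughly 10% of the post's age


def recheck_interval(post_age):
    """Return how long to wait between context checks for a post of this age."""
    interval = post_age * AGE_FRACTION
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))


def should_refetch_context(created_at, last_checked_at, now):
    """Decide whether a post's context is due for another check."""
    if last_checked_at is None:
        return True  # never checked before
    return now - last_checked_at >= recheck_interval(now - created_at)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    year_old = now - timedelta(days=365)
    # A year-old post that was checked an hour ago is not due again yet.
    print(should_refetch_context(year_old, now - timedelta(hours=1), now))
```

This requires remembering a last-checked timestamp per post, which is probably where the architectural difficulty mentioned above comes in.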
I’d appreciate any input, both from users and from admins who may be fed up with FediFetcher querying their instance too much.
(Mentioning @colinstu12 and @p37307 as you’ve both shown interest in this discussion previously)