Suggestion: add a library identifier to the default user-agent header #1219
Comments
Conceptually, I think that makes a lot of sense: if a website doesn't want to be scraped, the library should respect that. From a selfish standpoint, we aren't using the network-enabled version. cc: @vabene1111
Interesting discussion; I remember talking about this a while ago. My personal opinion, although probably not officially the right thing, is that this library is mostly used for small-scale, personal and selective (manual) downloads of recipes. This means that I probably would not consider it scraping in the traditional sense of automatically taking everything. Thus one could argue that there is not really a difference from browsing the page in a normal browser. Any "malicious" actor (trying to download/steal recipes) will just circumvent any restrictive header filtering by using a generic user agent or request library. So a change like this will only impact those mostly manual users, even though I agree that probably no pages, or only a very small number, will implement any filtering. In the end I think both ways make no significant difference. Typing on my mobile, so I hope this ramble makes sense.
I think this makes sense. Theoretically, if we changed our mind, we could always override the header anyway, so I don't see an issue with making the default more neighborly.
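As a rough illustration of the "override the header" point, here is a minimal sketch using only the Python standard library. It is not the recipe-scrapers API; the `CUSTOM_UA` value and the URL are hypothetical, and no network request is made.

```python
from urllib.request import Request

# Hypothetical values, for illustration only.
CUSTOM_UA = "my-recipe-app/1.0"
url = "https://example.com/recipes/1"

# A caller-supplied User-Agent header replaces whatever default
# the HTTP client would otherwise send for this request.
request = Request(url, headers={"User-Agent": CUSTOM_UA})

# urllib stores header names in capitalized form ("User-agent").
print(request.get_header("User-agent"))
```

Whatever default the library ships, a downstream project that builds its own requests along these lines stays in control of the header it sends.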
Thanks all for your feedback; the next step here is for me to evaluate the effects of a possible adjusted user-agent string format (see #1221) - after finding a more accurately descriptive one that also isn't egregiously blocked by recipe sites, I'll update to use that instead.
Suggestion
In the README.rst file, we advise users of this library to be respectful of upstream robots.txt rules - and to be responsible and careful about usage in general (in other words: to follow good netiquette). However: the user-agent string that this library sends is fairly generic. To make it easier for recipe websites to apply selective rules for it (something that could be reasonable for those sites to choose to do, even if it could also be considered unfair), I think we should include a library identifier within the user-agent string.
Implications
User-agent strings already typically include various items of information about the browser, OS and platform environment -- so I think that a reference to the library name could be added in a format-compliant manner.
I'd expect (but cannot guarantee) that such an identifier would not significantly alter the way that most recipe websites treat this client library.
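To make the two points above concrete, here is a minimal sketch, assuming a hypothetical base string, token and version (none of these are the library's actual defaults). It appends a `name/version` product token to an existing user-agent string, then uses the standard library's `urllib.robotparser` to show how a site could write a robots.txt rule aimed at that token while leaving other clients unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical values, for illustration only.
GENERIC_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
LIBRARY_TOKEN = "recipe-scrapers"
LIBRARY_VERSION = "0.0.0"

# Hypothetical robots.txt that singles out the library token.
ROBOTS_TXT = """\
User-agent: recipe-scrapers
Disallow: /recipes/

User-agent: *
Allow: /
"""


def build_user_agent(base: str, token: str, version: str) -> str:
    """Append a 'token/version' product identifier to an existing UA string."""
    return f"{base} {token}/{version}"


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules for the given agent token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


user_agent = build_user_agent(GENERIC_UA, LIBRARY_TOKEN, LIBRARY_VERSION)
print(user_agent)

recipe_url = "https://example.com/recipes/1"
# The rule that names the library token applies only to that token.
print(is_allowed(ROBOTS_TXT, LIBRARY_TOKEN, recipe_url))
print(is_allowed(ROBOTS_TXT, "SomeBrowser", recipe_url))
```

Note that `urllib.robotparser` matches on the agent name passed to `can_fetch`, so the sketch passes the bare token rather than the full header value. How real sites would match the header is up to them.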
Additional reasoning
Some recent bug reports here that have made me consider such a thing as useful are #1214 and #1206 -- as a result I've been considering updating our README.rst examples to use the default headers -- but that doesn't feel great to me if the headers themselves could be considered as evasive due to being generic. Adding recipe-scrapers in there somewhere would make me feel more comfortable about it.
When using recipe-scrapers in network-enabled mode, I also think it's possible to consider it as a form of domain-specific microbrowser: enter a URL (similar to typing a web address into a browser address bar), and, provided that a suitable network response is received, you are able to read a recipe. That's probably debatable to some extent, but I think there are similarities - and if viewed that way I think it also fits that the library should mention itself in the user-agent string.
cc @hay-kot @smilerz @michael-genson as downstream consumers of the library who might encounter user feedback about this if we change it