Suggestion: add a library identifier to the default user-agent header #1219

jayaddison · 2024-08-15T13:50:11Z

Suggestion

In the README.rst file, we advise users of this library to be respectful of upstream robots.txt rules - and to be responsible and careful about usage in general (in other words: to follow good netiquette). However: the user-agent string that this library sends is fairly generic. To make it easier for recipe websites to apply selective rules for it (something that could be reasonable for those sites to choose to do, even if it could also be considered unfair), I think we should include a library identifier within the user-agent string.
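As a rough illustration of the "selective rules" idea, here is a minimal sketch of a robots.txt group keyed on a library token, checked with Python's standard `urllib.robotparser`. The robots.txt content, the `recipe-scrapers` token, and the example.com URL are assumptions made for the example. They are not taken from any real site or from this library's behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a recipe site could publish once clients
# identify themselves with a recognisable token.
ROBOTS_TXT = """\
User-agent: recipe-scrapers
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/recipe"

# A client that announces the library token matches the first group.
library_allowed = parser.can_fetch("recipe-scrapers", url)

# Any other client falls through to the wildcard group.
other_allowed = parser.can_fetch("some-other-client", url)

print(library_allowed, other_allowed)
```

With a fully generic user-agent, a site has no token to write such a group against, which is the gap the suggestion is trying to close.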

Implications

User-agent strings already typically include various items of information about the browser, OS and platform environment -- so I think that a reference to the library name could be added in a format-compliant manner.

I'd expect (but cannot guarantee) that such an identifier would not significantly alter the way that most recipe websites treat this client library.
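To make the format point concrete, here is a small sketch of appending a `name/version` product token to an existing user-agent string, which is the general shape that user-agent values already follow. The helper name `build_user_agent`, the base string, and the `0.0.0` version are placeholders for illustration, not the format that was eventually adopted.

```python
import re

# Characters permitted in an HTTP "token" (the product name and version parts).
_TOKEN_RE = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")


def build_user_agent(base: str, library: str, version: str) -> str:
    """Append a 'library/version' product token to an existing user-agent."""
    for part in (library, version):
        if not _TOKEN_RE.match(part):
            raise ValueError(f"not a valid product token part: {part!r}")
    return f"{base} {library}/{version}"


# Placeholder base string and version, chosen only for the example.
BASE_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
user_agent = build_user_agent(BASE_UA, "recipe-scrapers", "0.0.0")
print(user_agent)
```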

Additional reasoning

Some recent bug reports here that made me consider this useful are #1214 and #1206 -- as a result I've been considering updating our README.rst examples to use the default headers -- but that doesn't feel great to me if the headers themselves could be considered evasive due to being generic. Adding recipe-scrapers in there somewhere would make me feel more comfortable about it.

When using recipe-scrapers in network-enabled mode, I also think it's possible to consider it as a form of domain-specific microbrowser: enter a URL (similar to typing a web address into a browser address bar), and, provided that a suitable network response is received, you are able to read a recipe. That's probably debatable to some extent, but I think there are similarities - and if viewed that way I think it also fits that the library should mention itself in the user-agent string.

cc @hay-kot @smilerz @michael-genson as downstream consumers of the library who might encounter user feedback about this if we change it


smilerz · 2024-08-15T14:03:55Z

Conceptually, I think that makes a lot of sense - if a website doesn't want to be scraped, the library should respect that.

From a selfish standpoint, we aren't using the network-enabled version of recipe-scrapers any longer, so any such change wouldn't affect us. Though if you implemented a standard header that includes the library, we might consider using it.

cc: @vabene1111

vabene1111 · 2024-08-15T14:37:03Z

Interesting discussion, I remember talking about this a while ago. My personal opinion, although probably not officially the right thing, is that this library is mostly used for small-scale, personal, and selective (manual) downloads of recipes. This means that I probably would not consider it scraping in the traditional sense of automatically taking everything. Thus one could argue that there is not really a difference from browsing the page in a normal browser.

Any "malicious" actor (trying to download or steal recipes) will just circumvent any restrictive header filtering by using a generic user agent or request library. So a change like this will only impact those mostly manual users, even though I agree that probably no pages, or only a very small number, will implement any filtering.

In the end I think both ways make no significant difference. Typing on my mobile so I hope this ramble makes sense.

michael-genson · 2024-08-15T14:42:05Z

I think this makes sense. Theoretically, if we changed our mind we could always override the header anyway, so I don't see an issue with making the default more neighborly.
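A minimal sketch of the "override the header anyway" point, using only the standard library. `DEFAULT_HEADERS` and `make_request` are hypothetical names for this example and do not describe recipe-scrapers' actual internals. The idea is only that caller-supplied headers take precedence over an identifying default.

```python
from urllib.request import Request

# Hypothetical identifying default; the real value is up to the library.
DEFAULT_HEADERS = {"User-Agent": "example-browser/1.0 recipe-scrapers/0.0.0"}


def make_request(url, headers=None):
    """Build a request, letting caller-supplied headers override the defaults."""
    merged = {**DEFAULT_HEADERS, **(headers or {})}
    return Request(url, headers=merged)


# Default behaviour: the identifying user-agent is sent.
default_req = make_request("https://example.com/recipe")

# A downstream application can still supply its own user-agent.
custom_req = make_request(
    "https://example.com/recipe",
    headers={"User-Agent": "my-recipe-app/2.0"},
)

# urllib normalises header names to "User-agent" internally.
print(default_req.get_header("User-agent"))
print(custom_req.get_header("User-agent"))
```

No network request is made here. The snippet only constructs request objects so the header precedence can be inspected.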

jayaddison · 2024-08-17T15:22:05Z

Thanks all for your feedback; the next step here is for me to evaluate the effects of a possible adjusted user-agent string format (see #1221) - after finding a more-accurately-descriptive one that also isn't egregiously blocked by recipe sites, I'll update to use that instead.

jayaddison added the enhancement label Aug 15, 2024

jayaddison mentioned this issue Aug 15, 2024

internals: add identifying library-plus-version string to HEADERS #1221

Merged

jayaddison closed this as completed in #1221 Aug 26, 2024

jayaddison mentioned this issue Sep 4, 2024

https://akispetretzikis.com stopped working #1235

Open
