
[SCRAPER] - foodnetwork.com returns 403 due to user-agent string #4024

Open
williamkray opened this issue Aug 12, 2024 · 9 comments
Labels: bug, scraper, triage

Comments

@williamkray

First Check

  • I used the GitHub search to find a similar issue and didn't find it.

  • I have verified that this issue is not related to the underlying library
    hhursev/recipe-scrapers by 1) checking in the debugger that data is
    returned, 2) verifying that the errors in the log relate to
    application-level code, or 3) verifying that the site provides recipe
    data or is otherwise supported by hhursev/recipe-scrapers.

  • This issue can be replicated on the demo site (https://demo.mealie.io/)

Please provide 1-5 example URLs that are having errors

While troubleshooting the underlying recipe-scrapers library in a Docker container with an interactive Python shell, I identified that the HTTP request itself is being denied (tested with the plain Python requests library). Using the user-agent string from the recipe-scrapers README documentation, I received a 403, while the same GET request with the following user-agent string succeeded (copied from my Firefox session):

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0

URLs that I've tested (I have tested more, but these are the last couple):

https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067
https://www.foodnetwork.com/recipes/ina-garten/1770-house-meatloaf-recipe-2109034

Please provide your logs for the Mealie container (docker logs <container-id> > mealie.logs)

apps-mealie-1  | INFO     2024-08-12T13:41:56 - HTTP Request: GET https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067 "HTTP/1.1 403 Forbidden"
apps-mealie-1  | INFO     2024-08-12T13:41:57 - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
apps-mealie-1  | INFO     2024-08-12T13:41:57 - [192.168.11.1:0] 400 Bad Request "POST /api/recipes/create-url HTTP/1.1"

In an interactive Python shell:

>>> requests.get("https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067", headers={"User-Agent": "Mealie API request for import"})
<Response [403]>
>>> requests.get("https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067", headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"})
<Response [200]>
>>> 

Deployment

Docker (Linux)

@williamkray added the bug, scraper, triage labels on Aug 12, 2024
@williamkray (Author)

I scoured the code a bit and found the Firefox user-agent string that Mealie is using, and confirmed that it is also the one triggering the 403 error, whereas the user-agent string from my Linux Firefox install gets a 200:

root@0f0411e5a883:/# python
Python 3.12.5 (main, Aug  7 2024, 19:13:43) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get("https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067", headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0"})
<Response [403]>
>>> requests.get("https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067", headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"})
<Response [200]>
>>> 

@williamkray (Author)

It seems I still run into the problem even after dropping in my updated user-agent string, presumably because the headers imported from recipe_scrapers are also problematic. If I remove the logic that imports those headers and force Mealie to use my updated Firefox UA, I can successfully scrape the recipe (see the sketch below); however, it still gets 403s on the image fetches, which must go through a different code path with different UA headers.

I don't have time to keep digging into this, but hopefully I've given enough info to get someone on their way to a fix. Food Network is a major source of recipes, and it's a real drag not being able to import them!
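
For anyone who wants to try the workaround, here is a minimal sketch (not Mealie's actual code, and assuming a recipe-scrapers version that exposes scrape_html): fetch the page with requests using a browser-like user-agent, then hand the HTML to recipe_scrapers to parse. The URL and UA string are the ones from the report above.

import requests
from recipe_scrapers import scrape_html

url = "https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067"
# The UA string that returned a 200 in the tests above (Linux Firefox 128)
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"}

# Fetch the page ourselves, then let recipe_scrapers parse the HTML
html = requests.get(url, headers=headers, timeout=30).text
scraper = scrape_html(html, org_url=url)
print(scraper.title())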

@jayaddison (Contributor)

It's really tricky to figure out what a solution to these kinds of HTTP 403 Forbidden responses could be, because we can't determine the logic the host sites use to decide that one client (not necessarily one person!) is worth blocking, while another is deemed worth serving a (potentially dynamic, personalized) response.

@JeffCrum1 commented Aug 17, 2024

I have the same issue.

URL:
https://www.foodnetwork.com/recipes/food-network-kitchen/air-fryer-crab-cakes-with-chipotle-sauce-9429207

Duplicated on demo site.

Docker on Synology

Docker log:

INFO: 16-Aug-24 23:14:29        HTTP Request: GET https://www.foodnetwork.com/recipes/food-network-kitchen/air-fryer-crab-cakes-with-chipotle-sauce-9429207 "HTTP/1.1 403 Forbidden"
ERROR: 16-Aug-24 23:14:29       Recipe Scraper was unable to extract a recipe from https://www.foodnetwork.com/recipes/food-network-kitchen/air-fryer-crab-cakes-with-chipotle-sauce-9429207

Let me know if you need anything else to help.

@ryanwohara

I faced this, too, so I swapped out the user agent and that solved my issue.

I replaced it in both these files:

./scraper/scraper_strategies.py
./recipe/recipe_data_service.py
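
For reference, the change amounts to swapping out the hard-coded UA string. A hypothetical before/after (the actual variable names in those files may differ):

# Hypothetical illustration of the swap; the real variable names in
# scraper_strategies.py and recipe_data_service.py may differ.
# Before: the Firefox 123 UA that foodnetwork.com answers with a 403
_FIREFOX_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0"
# After: a current Linux Firefox UA that returned a 200 in the tests above
_FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"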

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot added the stale label on Sep 22, 2024
@ryanwohara

Bumping this as it's a genuine issue

@github-actions bot removed the stale label on Sep 24, 2024
@ant385525

Would using something like fake-useragent be a solution here? That could then be wrapped in a try-except block, like so:

for _ in range(3):
    try:
        ua = UserAgent()
        user_agent = ua.random
        break
    except Exception:
        continue

I'm not sure whether faking user agents is considered bad practice.
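
A self-contained sketch of that idea, assuming the fake-useragent package (pip install fake-useragent); the fallback string is the Linux Firefox UA that returned a 200 earlier in this thread:

from fake_useragent import UserAgent

FALLBACK_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"

def pick_user_agent(retries: int = 3) -> str:
    # Ask fake-useragent for a random real-browser UA string; if its
    # data source is unavailable, fall back to a fixed browser UA.
    for _ in range(retries):
        try:
            return UserAgent().random
        except Exception:
            continue
    return FALLBACK_UA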

@Starminder

following
