several 4chan archives appear to have blocked gallery-dl #5399

Open
stubkan opened this issue Mar 29, 2024 · 28 comments

Comments

@stubkan

stubkan commented Mar 29, 2024

A few weeks ago I noticed that gallery-dl could no longer scrape some archive sites, such as thebarchive and archived.moe; its requests were blocked. I decided to wait a while to see if the issue went away, but it still appears to be present. Also, since it is occurring with more than one archive site, I think it may be a new security update of some kind that blocks robots?

Accessing the thread normally in a browser works, but attempting to use gallery-dl to collect images fails with the following error message:

Scraping thread 916074222... 1/1
[archivedmoe][warning] Cloudflare challenge
[archivedmoe][error] HttpError: '403 Forbidden' for 'https://archived.moe/_/api/chan/thread/'

I tested multiple 4chan archive sites to see which ones work and which ones throw the Cloudflare challenge and block:

boards.4chan.org - WORKS
archive.4plebs.org - WORKS
archived.moe - BLOCKED
thebarchive.com - BLOCKED
desuarchive.org - WORKS
archive.palanq.win - WORKS

arch.b4k.co - BLOCKED
However, the Cloudflare warning is absent for arch.b4k.co, so I am not sure whether it is the same block:

Scraping thread 671665397... 1/1
[b4k][error] HttpError: '403 Forbidden' for 'https://arch.b4k.co/_/api/chan/thread/'

@arisboch

I take it a spoofed user agent didn't help?

@stubkan
Author

stubkan commented Mar 29, 2024

I am not sure what that is; I don't see any mention of it in the documentation. I did have to set the referer to blank to get one of the sites working before this block. I had been using gallery-dl without issue for half a year before this.

I tried adding gallery-dl --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0", but it hits the same Cloudflare block.

@mikf
Owner

mikf commented Mar 29, 2024

You need to provide cookies and user agent of the browser that can access these blocked sites:
#4844 (comment)

@stubkan
Author

stubkan commented Mar 29, 2024

Can you generate a success on your end? I do not seem to be able to.

   gallery-dl --user-agent browser --cookies-from-browser chromium https://thebarchive.com/b/thread/916060069
   [cookies][error] Failed to read from GNOME keyring
   [cookies][info] Extracted 3091 cookies from Chromium
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

I tried firefox instead;

   gallery-dl --user-agent browser --cookies-from-browser firefox https://thebarchive.com/b/thread/916060069
   [cookies][info] Extracted 375 cookies from Firefox
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

I thought maybe I had to refresh the cookies myself by visiting the site. I did so and retried; it imported 377 cookies instead of 375, but still failed:

   gallery-dl --user-agent browser --cookies-from-browser firefox https://thebarchive.com/b/thread/916060069
   [cookies][info] Extracted 377 cookies from Firefox
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

@mikf
Owner

mikf commented Mar 29, 2024

"When there's no Cloudflare challenge for your browser and/or there's no cf_clearance cookie, you are out of luck." ... and there isn't one for thebarchive.com

It doesn't seem to work for archived.moe either, even though there is a cf_clearance cookie present.

This does work for sites with a "Verifying you are human" check like nhentai, but apparently not here.

@stubkan
Author

stubkan commented Mar 29, 2024

I got a cloudflare human check on thebarchive, and then retried - and it still did not pass.

It seems likely to me that these changes will eventually propagate to all the 4chan archives if they prove successful.

Here's a post by the maintainer of archived.moe;

[image: post by the archived.moe maintainer]

@stubkan
Author

stubkan commented Mar 29, 2024

"When there's no Cloudflare challenge for your browser and/or there's no cf_clearance cookie, you are out of luck." ... and there isn't one for thebarchive.com

I checked my cookies, and there is a cf_clearance cookie for thebarchive - as well as the site requesting I pass a human check, which I clicked on.

But, gallery-dl is still blocked, unfortunately.

@cheese529

Do you think there might be a possible workaround for this, or are we out of luck here, @mikf?
I know yt-dlp uses user agent + cookies to bypass Cloudflare issues, so maybe we could look at how they're able to scrape sites to see if we can get any info on how to deal with this.

@Hrxn
Contributor

Hrxn commented Apr 1, 2024

Not sure. I tried the thebarchive example link from above; here's my log content:

[2024-04-01T09:38:51][info] Extracted 3140 cookies from Chrome
[2024-04-01T09:38:52][warning] Cloudflare challenge [Source URL: https://thebarchive.com/b/thread/916060069]
[2024-04-01T09:38:52][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/' [Source URL: https://thebarchive.com/b/thread/916060069]

But looking in Chrome DevTools with this thread, I have these cookies (among some others);

__cf_bm
cf_clearance
csrftoken
foolframe_KmD_csrf_token

And I think cookie extraction from the browser should work... at least that's what my log says?

@mikf
Owner

mikf commented Apr 1, 2024

I know yt-dlp uses user agent + cookies to bypass Cloudflare issues

That's what I recommended doing #5399 (comment), but it doesn't work in this situation #5399 (comment).

@Hrxn
Contributor

Hrxn commented Apr 1, 2024

Maybe the API is completely blocked, i.e. always returning a 403 no matter what?

@mikf
Owner

mikf commented Apr 1, 2024

It is still accessible with a browser. It does, however, show an actual "Verify you are human" check. Solving it and using cookies and user agent afterwards does allow gallery-dl to access it, at least it did for me.

So go to https://thebarchive.com/_/api/chan/thread/, let your browser solve the challenge, and then do the cookie/user-agent thingy. This should allow using the API and downloading from thebarchive and archivedmoe.
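
To see what such a request carries, here is a minimal Python sketch that builds a request for the API URL with a browser's user agent and cookies, using only the standard library. The helper names `build_probe` and `probe_status`, the user-agent string, and the cookie value are placeholders made up for illustration; nothing here is gallery-dl code, and whether Cloudflare accepts a request from a non-browser client is not something this sketch can promise.

```python
import urllib.error
import urllib.request

# API endpoint mentioned in this thread.
API_URL = "https://thebarchive.com/_/api/chan/thread/"


def build_probe(api_url, user_agent, cookies):
    """Build a request carrying a User-Agent header and a Cookie header.

    `cookies` is a plain dict of cookie name -> value (hypothetical helper).
    """
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(
        api_url,
        headers={"User-Agent": user_agent, "Cookie": cookie_header},
    )


def probe_status(request, timeout=10):
    """Send the request and return the HTTP status code (403 when blocked)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


# Placeholder values: substitute your browser's real user agent and cookie.
request = build_probe(
    API_URL,
    "PLACEHOLDER-USER-AGENT",
    {"cf_clearance": "PLACEHOLDER"},
)
print(request.get_header("User-agent"))
print(request.get_header("Cookie"))
```

Calling `probe_status(request)` performs the actual network request; it is left out of the example so the block runs offline.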

@Hrxn
Contributor

Hrxn commented Apr 1, 2024

Right, maybe I should have "used" the site more, because the "verify human" check did not pop up for me in the browser.

I still saw the cf_clearance cookie in my browser devtools, but maybe it was simply a default or randomized nonsensical value. Should have considered the possibility of it being invalid, true.

Opening https://thebarchive.com/_/api/chan/thread/ in the browser immediately opens the "verify human" check, I toggled the mark, closed the browser and immediately tried the thread URL from earlier again (https://thebarchive.com/b/thread/916060069), and it was actually working this time.

It downloaded the grand total of one image, but I also only see one image in this thread when viewing in my browser, so this is probably correct, don't know what you guys see here..

So, @stubkan , given that you actually have the correct cookie in your browser, extraction (at least from thebarchive.com) seems to work - to answer the issue in this thread here.

@mikf
Owner

mikf commented Apr 1, 2024

maybe I should've "used" the site more basically, because the "verify human" check did not pop for me in the browser.

The problem here is that the "verify human" check does not appear when just regularly browsing the site in question, but only when explicitly visiting an API URL.

@stubkan
Author

stubkan commented Apr 2, 2024

Unfortunately, I have been following these steps. I have passed the Cloudflare human check in the browser (both Firefox and Chromium), for both archivedmoe and thebarchive. I also tried a few different threads, and I get the Cloudflare challenge block every time.

Can you please outline the command sequence that leads to success? I am doing it the way shown in my previous comment.

@mikf
Owner

mikf commented Apr 2, 2024

  • I visit https://thebarchive.com/_/api/chan/thread/ and let my browser solve the CF challenge
  • I visit the thread page I want to download
  • Export Cookies (--cookies-from-browser works as well)
  • Use cookies and browser's user agent
$ gallery-dl -C cookies-thebarchive-com.txt --user-agent browser https://thebarchive.com/b/thread/739772332/
thebarchive/b/739772332 Won't you look …gy face.../1500782886885 Smug face 0.png
thebarchive/b/739772332 Won't you look …r smuggy face.../1500783064989 image.jpg
...
$ gallery-dl --cookies-from firefox:/tmp/.firefox --user-agent browser https://archived.moe/gd/thread/309639/
archivedmoe/gd/309639 Which Adobe progr… Adobe_Systems_logo_and_wordmark.svg.png
archivedmoe/gd/309639 Which Adobe progr…hic design?/1495922648056 Sans titre.png
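
The last step above can also be scripted. The following is a minimal Python sketch that assembles the same kind of gallery-dl command line, assuming the `--cookies`, `--cookies-from-browser`, and `--user-agent` options used in this thread. The function name `build_gallery_dl_command` is hypothetical, and the sketch only builds the argument list; solving the Cloudflare challenge in the browser still has to happen first.

```python
import shlex


def build_gallery_dl_command(url, cookies_file=None, browser=None, user_agent="browser"):
    """Return a gallery-dl argument list using either a cookie file or a browser.

    Exactly one of `cookies_file` or `browser` must be given (hypothetical helper).
    """
    if (cookies_file is None) == (browser is None):
        raise ValueError("pass exactly one of cookies_file or browser")

    args = ["gallery-dl"]
    if cookies_file is not None:
        # Cookies exported to a Netscape-format text file.
        args += ["--cookies", cookies_file]
    else:
        # Cookies read directly from a browser profile.
        args += ["--cookies-from-browser", browser]
    # "browser" asks gallery-dl to use the default browser's user agent.
    args += ["--user-agent", user_agent, url]
    return args


cmd = build_gallery_dl_command(
    "https://thebarchive.com/b/thread/739772332/",
    cookies_file="cookies-thebarchive-com.txt",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`, which avoids shell quoting problems with long user-agent strings.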

@Hrxn
Contributor

Hrxn commented Apr 2, 2024

Yep, simply extracting the cookies from the browser definitely works.
So, either --cookies-from-browser or via config, like this, for example.

            "cookies": ["chrome", "Profile 4"]

You only have to make sure that the cookies from the browser match the "user-agent" setting used by gallery-dl, like this for example:

            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

Or, like mikf suggested, by using "browser" as the string value for "user-agent", which will try to automatically use the UA information from your system's default browser. Also fine, if that browser is the one which has the correct cookies.
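
As an illustration of how the two settings fit together, here is a small Python sketch that generates a config fragment containing both. It assumes the options live under the top-level "extractor" section of the gallery-dl config file; the function name `make_config` is hypothetical, and the browser, profile, and user-agent values are the ones quoted above and must be replaced with your own.

```python
import json


def make_config(browser, profile, user_agent):
    """Build a config fragment pairing browser cookies with a matching user agent."""
    return {
        "extractor": {
            # Browser name and profile to extract cookies from.
            "cookies": [browser, profile],
            # Must describe the same browser that produced the cookies.
            "user-agent": user_agent,
        }
    }


config = make_config(
    "chrome",
    "Profile 4",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
)
print(json.dumps(config, indent=4))
```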

The problem here is that the "verify human" check does not appear when just regularly browsing the site in question, but only when explicitly visiting an API URL.

I see, but honestly, I don't think this is fundamentally different to how cookie-based auth works for any other site, yet.
It might be that the problem here is quick expiration of the cookies, but I fail to understand how visiting this special API URL in your browser and making sure it succeeds, so that you get the correct cookies, is more of a problem than the usual steps of visiting the site in your browser and signing in with your credentials to get the correct cookies.

@stubkan
Author

stubkan commented Apr 2, 2024

export cookies in netscape.txt and add with gallery-dl -C

I tried this method and got the same Cloudflare block. Perhaps the cookies are different for us, or something is not matching up somewhere?

You only have to make sure that the cookies from the browser match the "user-agent" setting used by gallery-dl, like this for example:

My browser cookies for the API site, in Netscape format, appear to consist of three cookies. Is there no user-agent equivalent?

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by Cookie-Editor
thebarchive.com	FALSE	/	FALSE	1714316255	foolframe_KmD_search_latest_5	<something containing - board talk text scrape start 2024-1-1 board talk>
thebarchive.com	FALSE	/	FALSE	1712495038	foolframe_KmD_csrf_token	<short string of letters/numbs>
#HttpOnly_.thebarchive.com	TRUE	/	TRUE	1743587157	cf_clearance	<long string of letters and numbers>
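
A quick way to check such an export is to look for a cf_clearance entry and its expiry time. The following Python sketch does that, assuming the seven tab-separated fields of the Netscape cookie format and the `#HttpOnly_` prefix seen above. The helpers `parse_netscape_cookies` and `find_clearance` are hypothetical and not part of gallery-dl, and the sample cookie values are placeholders.

```python
import time

HTTPONLY_PREFIX = "#HttpOnly_"


def parse_netscape_cookies(text):
    """Parse Netscape cookie file text into a list of dicts."""
    cookies = []
    for line in text.splitlines():
        if line.startswith(HTTPONLY_PREFIX):
            # HttpOnly cookies are stored behind a comment-like prefix.
            line = line[len(HTTPONLY_PREFIX):]
        elif not line.strip() or line.startswith("#"):
            continue  # blank line or ordinary comment
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # not a cookie line
        domain, _subdomains, path, secure, expires, name, value = fields
        cookies.append({
            "domain": domain,
            "path": path,
            "secure": secure == "TRUE",
            "expires": int(expires),
            "name": name,
            "value": value,
        })
    return cookies


def find_clearance(cookies, host, now=None):
    """Return an unexpired cf_clearance cookie for `host`, or None."""
    now = time.time() if now is None else now
    for cookie in cookies:
        domain = cookie["domain"].lstrip(".")
        matches = host == domain or host.endswith("." + domain)
        alive = cookie["expires"] == 0 or cookie["expires"] > now
        if cookie["name"] == "cf_clearance" and matches and alive:
            return cookie
    return None


SAMPLE = (
    "# Netscape HTTP Cookie File\n"
    "thebarchive.com\tFALSE\t/\tFALSE\t1712495038\tfoolframe_KmD_csrf_token\tabc\n"
    "#HttpOnly_.thebarchive.com\tTRUE\t/\tTRUE\t1743587157\tcf_clearance\txyz\n"
)
clearance = find_clearance(parse_netscape_cookies(SAMPLE), "thebarchive.com")
print("cf_clearance present and unexpired:", clearance is not None)
```

A valid cookie in the file is only half of the requirement discussed in this thread: the user agent sent by gallery-dl still has to match the browser that obtained the cookie.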

I have done some brief testing, and it appears I can use other scraping methods, such as ripme, wget, and Python requests, to scrape thebarchive without needing any cookies. This may be unique to the way gallery-dl uses the API?

@Hrxn
Contributor

Hrxn commented Apr 2, 2024

No, the "user-agent" setting of gallery-dl must match the user-agent info of the browser with the cookies.

If you haven't set "user-agent" in your gallery-dl config, make sure to do so, because otherwise you'd be using the built-in default of gallery-dl, which is a recent version of Firefox ESR, but with a linux style. Unless you're using the exact same browser, the exact same version, on the exact same platform, you've got to change the "user-agent" setting.

Edit

if useragent is None:
    useragent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                 "rv:109.0) Gecko/20100101 Firefox/115.0")
elif useragent == "browser":

@stubkan
Author

stubkan commented Apr 2, 2024

Thank you for your patience with me. I have successfully got gallery-dl to download a thread now...

gallery-dl -C cookies-thebarchive-com.txt -o "user-agent=Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Firefox/123.0" firefox https://thebarchive.com/b/thread/739772332/
[1/2] firefox
[gallery-dl][error] Unsupported URL 'firefox'
[2/2] https://thebarchive.com/b/thread/739772332/
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500782886885 Smug face 0.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500783064989 image.jpg
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783269175 Smug face 2.png
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783400678 Smug face 3.png
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783588112 Smug face 4.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500783596174 image.jpg
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783824410 Smug face 6.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500784125831 image.jpg

I did accidentally leave in 'firefox' but still...

@stubkan
Author

stubkan commented Apr 2, 2024

Since one must now watch for cookie expiration and manually create cookie files as well as clicking the human verification button... It is kind of a troublesome solution... Should I leave this issue open?

I think it is possible to come up with a solution that does not require cookies, since alternative downloaders do work without requiring them?

@stubkan
Author

stubkan commented Jun 23, 2024

@mikf @Hrxn - Hope it's OK to @ you; I thought that since this issue is old, a comment might be missed.

It seems that the situation has changed again. 4plebs used to not require any authentication, but difficulties began last month, with some requests randomly getting blocked. It appears worse this month: some form of Cloudflare protection now blocks gallery-dl while still letting browsers through without a challenge.

I can access other FoolFuuka archive sites by visiting _/api/chan/thread/ in Firefox, letting the browser create the Cloudflare cookie, extracting it, and then using it on the command line.

However, trying this method for 4plebs in the browser does not seem to create any cloudflare cookies or validation requests, all browser visits pass successfully without invoking cloudflare. It creates a foolframe_5SU_csrf_token cookie and if you click on 'accept' the cookie conditions on the non api site - it will create a second cookie called foolframe_5SU_cookie_hasConsent. These cookies do not appear to allow gallery-dl access to scraping unfortunately.

Deleting or resetting all cookies to make 4plebs re-create fresh Cloudflare cookies doesn't seem to do much, as the site does not seem to require Cloudflare authentication at all when using a browser.

I tried combining cookies from another FoolFuuka site (cross-site cookies do seem to work for other archive sites, e.g. using thebarchive cookie to download from archived.moe), but that doesn't work for 4plebs.

At this point I am stumped. It seems like 4plebs has a hard block on gallery-dl's API access, even though it surfaces as a 403 Cloudflare error and the API is free to use via a browser.

[4plebs][warning] Cloudflare challenge
[4plebs][error] HttpError: '403 Forbidden' for 'https://archive.4plebs.org/_/api/chan/thread/'

@zoekobii

Did anyone find a fix for this?

@UtopianElectronics

It works for me for "archived.moe". Use --cookies-from-browser and --user-agent (get your user agent string from here. Make sure you use the same user agent as your browser's) and after solving the captchas and refreshing both "archived.moe" and 4chan, it should work.

@stubkan
Author

stubkan commented Sep 28, 2024

It works for me for "archived.moe". Use --cookies-from-browser and --user-agent (get your user agent string from here. Make sure you use the same user agent as your browser's) and after solving the captchas and refreshing both "archived.moe" and 4chan, it should work.

Please try other archive sites, you may notice a different result. (4plebs in particular)

@mikf
Owner

mikf commented Sep 29, 2024

4plebs in particular

4plebs has been blocking access to its API for I believe ~80% of the time it has been supported by gallery-dl, so for this site it has always been like that. They recommend downloading their dumps instead of scraping the site.

@stubkan
Author

stubkan commented Oct 3, 2024

I was using gallery-dl for approximately 6 months to download from 4plebs, and it was working. They changed the API recently, along with a few other archive sites, causing gallery-dl to stop working entirely.

@wankio
Contributor

wankio commented Nov 29, 2024

It works for me for "archived.moe". Use --cookies-from-browser and --user-agent (get your user agent string from here. Make sure you use the same user agent as your browser's) and after solving the captchas and refreshing both "archived.moe" and 4chan, it should work.

Mine didn't work :) It always leads to a Cloudflare challenge 403, even though my cookies in Firefox still work there, without any Cloudflare check on archived.moe.
