several 4chan archives appear to have blocked gallery-dl #5399
Comments
I take it a spoofed user agent didn't help? |
I am not sure what that is; I don't see any mention of it in the documentation. I did have to set referer to blank to get one of the sites working prior to this block. I had been using gallery-dl without issue for half a year before this. I tried adding |
You need to provide cookies and user agent of the browser that can access these blocked sites: |
Can you generate a success on your end? I do not seem to be able to.
I tried firefox instead;
Thought, maybe I have to manually update the cookies myself, by visiting the site... did so... retried, and it imported 377 cookies instead of 375, but still failed
|
"When there's no Cloudflare challenge for your browser and/or there's no It doesn't seem to work for archived.moe either, even though there is a This does work for sites with a "Verifying you are human" check like nhentai, but apparently not here. |
I checked my cookies, and there is a cf_clearance cookie for thebarchive - as well as the site requesting I pass a human check, which I clicked on. But, gallery-dl is still blocked, unfortunately. |
Do you think there might be a possible work around for this or are we out of luck here @mikf ? |
Not sure, I tried the barchives example link from above, here's my log content:
But looking in Chrome DevTools with this thread, I have these cookies (among some others);
And I think cookie extraction from the browser should work... at least that's what my log says? |
That's what I recommended doing #5399 (comment), but it doesn't work in this situation #5399 (comment). |
Maybe the API is completely blocked, i.e. always returning a |
It is still accessible with a browser. It does, however, show an actual "Verify you are human" check. Solving it and using cookies and user agent afterwards does allow gallery-dl to access it, at least it did for me. So go to |
Right, maybe I should've "used" the site more, basically, because the "verify human" check did not pop up for me in the browser. I still saw the Opening It downloaded a grand total of one image, but I also only see one image in this thread when viewing it in my browser, so this is probably correct; I don't know what you guys see here. So, @stubkan, given that you actually have the correct cookie in your browser, extraction (at least from |
The problem here is that the "verify human" check does not appear when just regularly browsing the site in question, but only when explicitly visiting an API URL. |
Unfortunately, I have been doing these steps. I have passed cloudflare human check in browser (both firefox and chromium) and checking with both archivedmoe and thebarchive. I also tried a few different threads, and it gives me the cloudflare challenge block every time. Can you please outline your command process that leads to success? I am doing it in this way in my previous comment; |
|
Yep, simply extracting the cookies from the browser definitely works.
"cookies": ["chrome", "Profile 4"]
You only have to make sure that the cookies from the browser match the
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
Or, like mikf suggested, by using
I see, but honestly, I don't think this is fundamentally different to how cookie-based auth works for any other site, yet. |
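For anyone following along, here is a minimal sketch of how the two settings quoted above could be combined into one config fragment. It assumes both options sit under the top-level "extractor" block of gallery-dl's config.json; the helper name build_cookie_config is hypothetical and only there for illustration, and the browser profile and user agent are the ones quoted in this thread, so replace them with your own.

```python
import json


def build_cookie_config(browser, profile, user_agent):
    """Build a gallery-dl style config fragment (hypothetical helper).

    Assumption: "cookies" and "user-agent" both live under the
    top-level "extractor" block of config.json.
    """
    return {
        "extractor": {
            # Extract cookies from this browser and profile.
            "cookies": [browser, profile],
            # Must match the browser that solved the Cloudflare check.
            "user-agent": user_agent,
        }
    }


if __name__ == "__main__":
    config = build_cookie_config(
        "chrome",
        "Profile 4",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    )
    # Print the fragment so it can be merged into an existing config file.
    print(json.dumps(config, indent=4))
```

The point is simply that the cookies and the user agent have to come from the same browser, so it helps to keep them next to each other in the config.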
I tried this method, same cloudflare block. Perhaps the cookies are different for us, or is not matching up somewhere?
my browser cookies for the API site in netscape format appear to consist of three cookies. There is no user-agent equivalent?
I have done some brief testing, and it appears I can use other scraping methods, such as ripme, wget and python requests etc to scrape thebarchive without requiring any cookie nonsense. This may be unique to gallery-dl's method of using API? |
No, the If you haven't set Edit gallery-dl/gallery_dl/extractor/common.py Lines 328 to 331 in ef0c904
|
Thank you for your patience with me. I have successfully got gallery-dl to download a thread now...
I did accidentally leave in 'firefox' but still... |
Since one must now watch for cookie expiration and manually create cookie files as well as clicking the human verification button... It is kind of a troublesome solution... Should I leave this issue open? I think it is possible to come up with a solution that does not require cookies, since alternative downloaders do work without requiring them? |
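Since cookie expiration now has to be watched by hand, a small check like the following might help. It is only a sketch using Python's standard http.cookiejar module: it loads a Netscape-format cookie file and reports, per domain, whether a cf_clearance cookie is still valid. The function name clearance_status and the file name in the usage example are made up for illustration.

```python
import time
from http.cookiejar import MozillaCookieJar


def clearance_status(cookie_file, now=None):
    """Map each domain with a cf_clearance cookie to True if still valid.

    Hypothetical helper: cookie_file is a Netscape-format cookies file,
    now is a Unix timestamp (defaults to the current time).
    """
    now = time.time() if now is None else now
    jar = MozillaCookieJar(cookie_file)
    # Keep expired and session cookies so they can be reported too.
    jar.load(ignore_discard=True, ignore_expires=True)

    status = {}
    for cookie in jar:
        if cookie.name == "cf_clearance":
            status[cookie.domain] = not cookie.is_expired(now)
    return status


if __name__ == "__main__":
    import sys

    # Usage: python check_cookies.py cookies.txt
    for domain, valid in clearance_status(sys.argv[1]).items():
        print(domain, "valid" if valid else "expired")
```

If a domain shows up as expired (or not at all), that would be the cue to revisit the API URL in the browser, solve the check again, and re-export the cookies.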
@mikf @Hrxn - Hope it's OK to @ you; I thought that since this issue is old, a comment might be missed. It seems that the situation has changed again.
While 4plebs used to not require any authentication, difficulties began last month, with some requests randomly getting blocked. It appears worse this month: some form of Cloudflare protection is blocking gallery-dl while still allowing browsers through without a Cloudflare check.
I can access other FoolFuuka archive sites by visiting the _/api/chan/thread/ URL in Firefox, letting the browser create the Cloudflare cookie, extracting it, and then using it on the command line. However, trying this method for 4plebs does not seem to create any Cloudflare cookies or validation requests; all browser visits pass without invoking Cloudflare. The site creates a foolframe_5SU_csrf_token cookie, and if you click 'accept' on the cookie conditions on the non-API site, it creates a second cookie called foolframe_5SU_cookie_hasConsent. Unfortunately, these cookies do not appear to give gallery-dl access for scraping.
Deleting or resetting all cookies to make 4plebs re-create fresh Cloudflare cookies doesn't do much, as the site does not seem to require Cloudflare authentication at all when using a browser. I tried using cookies from another FoolFuuka site (cross-site cookies do seem to work for other archive sites, e.g. using the thebarchive cookie to download from archived.moe), but that doesn't work for 4plebs.
At this point I am stumped. It seems like 4plebs has a hard block on gallery-dl's API access, even though it returns a 403 Cloudflare error and the API is free to use via a browser.
|
Did anyone find a fix for this? |
It works for me for "archived.moe". Use |
Please try other archive sites, you may notice a different result. (4plebs in particular) |
4plebs has been blocking access to its API for I believe ~80% of the time it has been supported by gallery-dl, so for this site it has always been like that. They recommend downloading their dumps instead of scraping the site. |
I was using gallery-dl for approximately 6 months to download from 4plebs, and it was working. They changed the API recently, along with a few other archival sites, causing gallery-dl to stop working entirely. |
Mine didn't work :) It always leads to a Cloudflare challenge 403, even though my cookies still work in Firefox, where archived.moe loads without any Cloudflare check. |
A few weeks ago I noticed that some archival sites, such as thebarchive and archived.moe, had started blocking gallery-dl and could no longer be scraped. I decided to wait a while to see if the issue went away, but it appears to still be present. Also, since it is occurring with more than one archival site, I think it may be a new security update of some kind that blocks robots?
Accessing the thread normally in a browser works. But attempting to use gallery-dl to collect images fails with the following error message:
Scraping thread 916074222... 1/1
[archivedmoe][warning] Cloudflare challenge
[archivedmoe][error] HttpError: '403 Forbidden' for 'https://archived.moe/_/api/chan/thread/'
I tested multiple 4chan archival sites to see which are working and which throw the Cloudflare challenge and block:
boards.4chan.org - WORKS
archive.4plebs.org - WORKS
archived.moe - BLOCKED
thebarchive.com - BLOCKED
desuarchive.org - WORKS
archive.palanq.win - WORKS
arch.b4k.co - BLOCKED
However, the Cloudflare notification is absent from the arch.b4k.co output, so I am not sure whether it is the same block:
Scraping thread 671665397... 1/1
[b4k][error] HttpError: '403 Forbidden' for 'https://arch.b4k.co/_/api/chan/thread/'
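To tell the two kinds of failure apart when testing several sites, the log output can be sorted mechanically. The sketch below assumes the message format shown in the two logs above (a "[site][warning] Cloudflare challenge" line and a "[site][error] HttpError: '403 Forbidden'" line); the function name classify_blocks and the two category labels are invented for illustration.

```python
import re

# Patterns based on the log lines quoted in this issue.
ERROR_RE = re.compile(
    r"\[(?P<site>[^\]]+)\]\[error\] HttpError: '403 Forbidden'"
)
CHALLENGE_RE = re.compile(
    r"\[(?P<site>[^\]]+)\]\[warning\] Cloudflare challenge"
)


def classify_blocks(log_text):
    """Label each site that returned 403 by whether a challenge was logged."""
    challenged = {m.group("site") for m in CHALLENGE_RE.finditer(log_text)}
    result = {}
    for match in ERROR_RE.finditer(log_text):
        site = match.group("site")
        if site in challenged:
            result[site] = "cloudflare-challenge"
        else:
            result[site] = "403-without-challenge"
    return result


if __name__ == "__main__":
    import sys

    # Usage: gallery-dl <url> 2>&1 | python classify.py
    for site, kind in classify_blocks(sys.stdin.read()).items():
        print(site, kind)
```

With the two logs from this report, archivedmoe would be labelled as a Cloudflare challenge and b4k as a 403 without one, which matches the difference noted above.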