several 4chan archives appear to have blocked gallery-dl #5399

stubkan · 2024-03-29T11:00:48Z

A few weeks ago I noticed that gallery-dl could no longer scrape some archival sites, such as thebarchive and archived.moe, because they block it. I decided to wait a while to see if the issue went away, but it still appears to be present. Also, since it is occurring with more than one archival site, I think it may be a new security update of some kind that blocks robots?

Accessing the thread normally in a browser works, but attempting to use gallery-dl to collect images fails with the following error message:

   Scraping thread 916074222... 1/1
   [archivedmoe][warning] Cloudflare challenge
   [archivedmoe][error] HttpError: '403 Forbidden' for 'https://archived.moe/_/api/chan/thread/'

I tested multiple 4chan archival sites to see which are working and which throw the Cloudflare challenge and block:

boards.4chan.org - WORKS
archive.4plebs.org - WORKS
archived.moe - BLOCKED
thebarchive.com - BLOCKED
desuarchive.org - WORKS
archive.palanq.win - WORKS

arch.b4k.co - BLOCKED
However, the Cloudflare warning is absent for arch.b4k.co, so I am not sure whether it is the same kind of block:

   Scraping thread 671665397... 1/1
   [b4k][error] HttpError: '403 Forbidden' for 'https://arch.b4k.co/_/api/chan/thread/'

arisboch · 2024-03-29T11:59:25Z

I take it a spoofed user agent didn't help?

stubkan · 2024-03-29T12:56:32Z

I am not sure what that is; I don't see any mention of it in the documentation. I did have to set the referer to blank to get one of the sites working prior to this block. I had been using gallery-dl without issue for half a year before this.

I tried adding gallery-dl --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0", but it throws the same Cloudflare block.

mikf · 2024-03-29T13:34:29Z

You need to provide cookies and user agent of the browser that can access these blocked sites:
#4844 (comment)

stubkan · 2024-03-29T14:04:25Z

Can you generate a success on your end? I do not seem to be able to.

   gallery-dl --user-agent browser --cookies-from-browser chromium https://thebarchive.com/b/thread/916060069
   [cookies][error] Failed to read from GNOME keyring
   [cookies][info] Extracted 3091 cookies from Chromium
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

I tried Firefox instead:

   gallery-dl --user-agent browser --cookies-from-browser firefox https://thebarchive.com/b/thread/916060069
   [cookies][info] Extracted 375 cookies from Firefox
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

I thought maybe I had to refresh the cookies myself by visiting the site. I did so and retried; it imported 377 cookies instead of 375, but still failed:

   gallery-dl --user-agent browser --cookies-from-browser firefox https://thebarchive.com/b/thread/916060069
   [cookies][info] Extracted 377 cookies from Firefox
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

mikf · 2024-03-29T14:22:19Z

"When there's no Cloudflare challenge for your browser and/or there's no cf_clearance cookie, you are out of luck." ... and there isn't one for thebarchive.com

It doesn't seem to work for archived.moe either, even though there is a cf_clearance cookie present.

This does work for sites with a "Verifying you are human" check like nhentai, but apparently not here.

stubkan · 2024-03-29T15:00:52Z

I got a cloudflare human check on thebarchive, and then retried - and it still did not pass.

It seems likely to me that these changes will eventually propagate to all the 4chan archives if they prove successful.

Here's a post by the maintainer of archived.moe;

stubkan · 2024-03-29T15:33:15Z

"When there's no Cloudflare challenge for your browser and/or there's no cf_clearance cookie, you are out of luck." ... and there isn't one for thebarchive.com

I checked my cookies, and there is a cf_clearance cookie for thebarchive - as well as the site requesting I pass a human check, which I clicked on.

But, gallery-dl is still blocked, unfortunately.

cheese529 · 2024-03-31T13:41:12Z

Do you think there might be a possible workaround for this, or are we out of luck here @mikf?
I know yt-dlp uses user agent + cookies to bypass Cloudflare issues, so maybe we could look at how they're able to scrape sites to see if we can get any info on how to deal with this.

Hrxn · 2024-04-01T07:43:56Z

Not sure. I tried the thebarchive example link from above; here's my log content:

[2024-04-01T09:38:51][info] Extracted 3140 cookies from Chrome
[2024-04-01T09:38:52][warning] Cloudflare challenge [Source URL: https://thebarchive.com/b/thread/916060069]
[2024-04-01T09:38:52][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/' [Source URL: https://thebarchive.com/b/thread/916060069]

But looking in Chrome DevTools with this thread, I have these cookies (among some others);

__cf_bm
cf_clearance
csrftoken
foolframe_KmD_csrf_token

And I think cookie extraction from the browser should work... at least that's what my log says?

mikf · 2024-04-01T12:56:53Z

I know yt-dlp uses user agent + cookies to bypass Cloudflare issues

That's what I recommended doing in #5399 (comment), but it doesn't work in this situation, see #5399 (comment).

Hrxn · 2024-04-01T13:26:00Z

Maybe the API is completely blocked, i.e. always returning a 403 no matter what?

mikf · 2024-04-01T14:54:08Z

It is still accessible with a browser. It does, however, show an actual "Verify you are human" check. Solving it and using cookies and user agent afterwards does allow gallery-dl to access it, at least it did for me.

So go to https://thebarchive.com/_/api/chan/thread/, let your browser solve the challenge, and then do the cookie/user-agent thingy. This should allow using the API and downloading from thebarchive and archivedmoe.

Hrxn · 2024-04-01T15:23:58Z

Right, maybe I should've "used" the site more, basically, because the "verify human" check did not pop up for me in the browser.

I still saw the cf_clearance cookie in my browser devtools, but maybe it was simply a default or randomized nonsensical value. Should have considered the possibility of it being invalid, true.

Opening https://thebarchive.com/_/api/chan/thread/ in the browser immediately opens the "verify human" check, I toggled the mark, closed the browser and immediately tried the thread URL from earlier again (https://thebarchive.com/b/thread/916060069), and it was actually working this time.

It downloaded the grand total of one image, but I also only see one image in this thread when viewing in my browser, so this is probably correct, don't know what you guys see here..

So, @stubkan , given that you actually have the correct cookie in your browser, extraction (at least from thebarchive.com) seems to work - to answer the issue in this thread here.

mikf · 2024-04-01T15:48:10Z

maybe I should've "used" the site more, basically, because the "verify human" check did not pop up for me in the browser.

The problem here is that the "verify human" check does not appear when just regularly browsing the site in question, but only when explicitly visiting an API URL.
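As a rough illustration of the point above, the following Python sketch labels a response by its status code and headers. It assumes that Cloudflare marks challenge pages with a `cf-mitigated: challenge` response header; that header is not mentioned anywhere in this thread, and the function names `classify_response` and `probe` are hypothetical. It is not how gallery-dl itself detects the challenge.

```python
import urllib.error
import urllib.request


def classify_response(status, headers):
    """Return a coarse label for an HTTP response.

    Assumption: a Cloudflare challenge page carries the response header
    'cf-mitigated: challenge'. Any other 403/503 is only labelled 'blocked'.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    if lowered.get("cf-mitigated", "").lower() == "challenge":
        return "cloudflare-challenge"
    if status in (403, 503):
        return "blocked"
    return "ok"


def probe(url, user_agent, timeout=10.0):
    """Fetch `url` once with the given User-Agent and classify the result."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_response(response.status, dict(response.headers.items()))
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx; the error object still carries the headers.
        return classify_response(error.code, dict(error.headers.items()))
```

Calling `probe("https://thebarchive.com/_/api/chan/thread/", user_agent)` once for the API URL and once for a regular thread page would show whether only the API path is challenged, as described above.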

stubkan · 2024-04-02T09:58:17Z

Unfortunately, I have been doing these steps. I have passed the Cloudflare human check in the browser (both Firefox and Chromium) for both archivedmoe and thebarchive. I also tried a few different threads, and it gives me the Cloudflare challenge block every time.

Can you please outline the command process that leads to success for you? I am doing it the way shown in my previous comment.

mikf · 2024-04-02T11:46:01Z

1. I visit https://thebarchive.com/_/api/chan/thread/ and let my browser solve the CF challenge
2. I visit the thread page I want to download
3. Export Cookies (--cookies-from-browser works as well)
4. Use cookies and browser's user agent

$ gallery-dl -C cookies-thebarchive-com.txt --user-agent browser https://thebarchive.com/b/thread/739772332/
thebarchive/b/739772332 Won't you look …gy face.../1500782886885 Smug face 0.png
thebarchive/b/739772332 Won't you look …r smuggy face.../1500783064989 image.jpg
...

$ gallery-dl --cookies-from firefox:/tmp/.firefox --user-agent browser https://archived.moe/gd/thread/309639/
archivedmoe/gd/309639 Which Adobe progr… Adobe_Systems_logo_and_wordmark.svg.png
archivedmoe/gd/309639 Which Adobe progr…hic design?/1495922648056 Sans titre.png
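For anyone scripting these steps, here is a minimal Python sketch that assembles the same command lines. It only uses flags that appear in this thread (`-C`, `--cookies-from-browser`, `--user-agent`); the helper name `build_command` is hypothetical, and the browser still has to solve the challenge first.

```python
import shlex


def build_command(url, user_agent="browser", cookies_file=None, browser=None):
    """Build a gallery-dl argument list using either a cookie file or a browser."""
    if (cookies_file is None) == (browser is None):
        raise ValueError("pass exactly one of cookies_file or browser")
    command = ["gallery-dl", "--user-agent", user_agent]
    if cookies_file is not None:
        command += ["-C", cookies_file]  # exported Netscape-format cookie file
    else:
        command += ["--cookies-from-browser", browser]
    command.append(url)
    return command


command = build_command(
    "https://thebarchive.com/b/thread/739772332/",
    cookies_file="cookies-thebarchive-com.txt",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` as is. Keeping the arguments in a list avoids quoting problems when a full user agent string with spaces is used instead of `browser`.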

Hrxn · 2024-04-02T13:03:45Z

Yep, simply extracting the cookies from the browser definitely works.
So, either --cookies-from-browser or via config, like this, for example.

            "cookies": ["chrome", "Profile 4"]

You only have to make sure that the cookies from the browser match the "user-agent" setting used by gallery-dl, like this for example:

            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

Or, like mikf suggested, by using "browser" as the string value for "user-agent", which will try to automatically use the UA information from your system's default browser. Also fine, if that browser is the one which has the correct cookies.
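To show how the two settings might sit together in a config file, here is a small Python sketch that prints a JSON fragment. It assumes both options are placed under the top-level "extractor" object of the gallery-dl config; the function name `make_config` is hypothetical, and the profile name and user agent are simply the example values from this comment.

```python
import json


def make_config(browser, profile, user_agent):
    """Return a config fragment pairing browser cookies with a matching user agent."""
    return {
        "extractor": {
            # Assumption: [browser, profile] selects the cookie source.
            "cookies": [browser, profile],
            # Must be the user agent of the browser the cookies come from.
            "user-agent": user_agent,
        }
    }


config = make_config(
    "chrome",
    "Profile 4",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
)
print(json.dumps(config, indent=4))
```

The point of generating both values in one place is that they cannot drift apart: a cookie from one browser combined with the user agent of another is the mismatch discussed in this thread.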

The problem here is that the "verify human" check does not appear when just regularly browsing the site in question, but only when explicitly visiting an API URL.

I see, but honestly, I don't think this is fundamentally different from how cookie-based auth works for any other site, yet.
It might be possible that the problem here is quick expiration of the cookies, but I fail to understand how visiting this special API URL in your browser, and making sure it succeeds so that you've got the correct cookies, is more of a problem than the usual steps of visiting the site in your browser and signing in with your credentials to get the correct cookies.

stubkan · 2024-04-02T13:18:41Z

export cookies in netscape.txt and add with gallery-dl -C

I tried this method and got the same Cloudflare block. Perhaps the cookies are different for us, or something is not matching up somewhere?

You only have to make sure that the cookies from the browser match the "user-agent" setting used by gallery-dl, like this for example:

My browser cookies for the API site, in Netscape format, appear to consist of three cookies. There is no user-agent equivalent in them?

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by Cookie-Editor
thebarchive.com	FALSE	/	FALSE	1714316255	foolframe_KmD_search_latest_5	<something containing - board talk text scrape start 2024-1-1 board talk>
thebarchive.com	FALSE	/	FALSE	1712495038	foolframe_KmD_csrf_token	<short string of letters/numbs>
#HttpOnly_.thebarchive.com	TRUE	/	TRUE	1743587157	cf_clearance	<long string of letters and numbers>
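A quick way to check an exported file like the one above is to parse it and look for an unexpired cf_clearance entry. The Python sketch below does that with a small hand-written parser; the function names are hypothetical and the cookie values are placeholders. It can only tell whether the cookie is present and not yet expired, not whether Cloudflare will still accept it.

```python
import time

HTTPONLY_PREFIX = "#HttpOnly_"


def parse_netscape_cookies(text):
    """Parse Netscape-format cookie lines into a list of dicts."""
    cookies = []
    for line in text.splitlines():
        if line.startswith(HTTPONLY_PREFIX):
            # HttpOnly cookies are written as a "commented" line; keep them.
            line = line[len(HTTPONLY_PREFIX):]
        elif not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 7 or not fields[4].isdigit():
            continue
        domain, _subdomains, path, secure, expires, name, value = fields
        cookies.append({
            "domain": domain,
            "path": path,
            "secure": secure == "TRUE",
            "expires": int(expires),
            "name": name,
            "value": value,
        })
    return cookies


def has_valid_clearance(cookies, domain, now=None):
    """True if an unexpired cf_clearance cookie covers `domain`."""
    now = time.time() if now is None else now
    for cookie in cookies:
        if cookie["name"] != "cf_clearance":
            continue
        cookie_domain = cookie["domain"].lstrip(".")
        covers = domain == cookie_domain or domain.endswith("." + cookie_domain)
        # An expiry of 0 marks a session cookie, which has no fixed end date.
        if covers and (cookie["expires"] == 0 or cookie["expires"] > now):
            return True
    return False


SAMPLE = "\n".join([
    "# Netscape HTTP Cookie File",
    "\t".join(["thebarchive.com", "FALSE", "/", "FALSE", "1712495038",
               "foolframe_KmD_csrf_token", "placeholder"]),
    "\t".join(["#HttpOnly_.thebarchive.com", "TRUE", "/", "TRUE", "1743587157",
               "cf_clearance", "placeholder"]),
])
cookies = parse_netscape_cookies(SAMPLE)
print(has_valid_clearance(cookies, "thebarchive.com"))
```

As the file shows, the user agent is not stored with the cookies, so it has to be passed to gallery-dl separately.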

I have done some brief testing, and it appears I can use other scraping methods, such as ripme, wget, and Python requests, to scrape thebarchive without requiring any cookie nonsense. This may be unique to gallery-dl's method of using the API?

Hrxn · 2024-04-02T13:32:52Z

No, the "user-agent" setting of gallery-dl must match the user-agent info of the browser with the cookies.

If you haven't set "user-agent" in your gallery-dl config, make sure to do so, because otherwise you'd be using the built-in default of gallery-dl, which is a recent version of Firefox ESR. Unless you're using the exact same browser, the exact same version, on the exact same platform, you've got to change the "user-agent" setting.

Edit: see gallery-dl/gallery_dl/extractor/common.py, lines 328 to 331 in ef0c904:

    if useragent is None:
        useragent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                     "rv:109.0) Gecko/20100101 Firefox/115.0")
    elif useragent == "browser":
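The consequence of that fallback can be sketched in a few lines of Python. The default string below is copied from the excerpt above and may differ in other gallery-dl versions; the helper names are hypothetical, and the special value "browser" is deliberately not handled because the excerpt stops before that branch.

```python
# Default user agent as quoted from commit ef0c904; other versions may differ.
DEFAULT_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
              "rv:109.0) Gecko/20100101 Firefox/115.0")


def effective_user_agent(configured):
    """Return the user agent that would be sent for a given config value."""
    return DEFAULT_UA if configured is None else configured


def matches_browser(configured, browser_ua):
    """True only if the effective user agent equals the browser's, exactly."""
    return effective_user_agent(configured).strip() == browser_ua.strip()


linux_firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Firefox/123.0"
print(matches_browser(None, linux_firefox))
print(matches_browser(linux_firefox, linux_firefox))
```

With no "user-agent" configured, cookies taken from a browser that does not report exactly the default string will fail the comparison, which is the mismatch described in this comment.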

stubkan · 2024-04-02T13:40:08Z

Thank you for your patience with me. I have successfully got gallery-dl to download a thread now...

gallery-dl -C cookies-thebarchive-com.txt -o "user-agent=Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Firefox/123.0" firefox https://thebarchive.com/b/thread/739772332/
[1/2] firefox
[gallery-dl][error] Unsupported URL 'firefox'
[2/2] https://thebarchive.com/b/thread/739772332/
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500782886885 Smug face 0.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500783064989 image.jpg
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783269175 Smug face 2.png
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783400678 Smug face 3.png
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783588112 Smug face 4.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500783596174 image.jpg
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783824410 Smug face 6.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500784125831 image.jpg

I did accidentally leave in 'firefox' but still...

stubkan · 2024-04-02T16:52:40Z

Since one must now watch for cookie expiration and manually create cookie files as well as clicking the human verification button... It is kind of a troublesome solution... Should I leave this issue open?

I think it is possible to come up with a solution that does not require cookies, since alternative downloaders do work without requiring them?

stubkan · 2024-06-23T08:55:07Z

@mikf @Hrxn - Hope it's OK to @; since this issue is old, I thought a comment might be missed.

It seems that the situation has changed again. 4plebs used to not require any authentication, but difficulties began last month, with some requests randomly getting blocked. It appears worse this month: some form of Cloudflare protection is now blocking gallery-dl while still letting browsers through without a challenge.

I can access other FoolFuuka archive sites by creating a Cloudflare cookie in Firefox: visit _/api/chan/thread/, let the browser create the cookie, extract it, and then use it on the command line.

However, trying this method for 4plebs in the browser does not seem to create any cloudflare cookies or validation requests, all browser visits pass successfully without invoking cloudflare. It creates a foolframe_5SU_csrf_token cookie and if you click on 'accept' the cookie conditions on the non api site - it will create a second cookie called foolframe_5SU_cookie_hasConsent. These cookies do not appear to allow gallery-dl access to scraping unfortunately.

Attempts to delete or reset all cookies, to get 4plebs to re-create fresh Cloudflare cookies, don't seem to do much, as the site does not seem to require Cloudflare authentication at all when using a browser.

I tried combining cookies from another FoolFuuka site (cross-site cookies do seem to work for other archive sites, i.e. using a thebarchive cookie to download from archived.moe, for example), but that doesn't work for 4plebs.

At this point, I am stumped. It seems like there is a hard block on gallery-dl's API access, even though it returns a 403 Cloudflare error and the API is free to use via a browser.

   [4plebs][warning] Cloudflare challenge
   [4plebs][error] HttpError: '403 Forbidden' for 'https://archive.4plebs.org/_/api/chan/thread/'

zoekobii · 2024-08-27T14:05:55Z

Did anyone find a fix for this?

UtopianElectronics · 2024-09-25T16:07:12Z

It works for me for "archived.moe". Use --cookies-from-browser and --user-agent (get your user agent string from here. Make sure you use the same user agent as your browser's) and after solving the captchas and refreshing both "archived.moe" and 4chan, it should work.

stubkan · 2024-09-28T20:02:44Z

It works for me for "archived.moe". Use --cookies-from-browser and --user-agent (get your user agent string from here. Make sure you use the same user agent as your browser's) and after solving the captchas and refreshing both "archived.moe" and 4chan, it should work.

Please try other archive sites, you may notice a different result. (4plebs in particular)

mikf · 2024-09-29T06:05:59Z

4plebs in particular

4plebs has been blocking access to its API for I believe ~80% of the time it has been supported by gallery-dl, so for this site it has always been like that. They recommend downloading their dumps instead of scraping the site.

stubkan · 2024-10-03T13:45:54Z

I was using gallery-dl for approximately 6 months to download from 4plebs, and it was working. They changed the API recently, along with a few other archival sites, causing gallery-dl to stop working entirely.

wankio · 2024-11-29T11:40:23Z

It works for me for "archived.moe". Use --cookies-from-browser and --user-agent (get your user agent string from here. Make sure you use the same user agent as your browser's) and after solving the captchas and refreshing both "archived.moe" and 4chan, it should work.

Mine didn't work :) It always leads to a Cloudflare challenge 403, even though my cookies in Firefox still work and the browser gets no Cloudflare check on archived.moe.

mikf added the cloudflare label Mar 29, 2024

cheese529 mentioned this issue Apr 1, 2024

[Site Request] archive.4plebs #5408

Closed
