[Site Request] archive.4plebs #5408

cheese529 · 2024-03-31T13:51:23Z

Hello there Mikf, hope you are doing well my man! It's been a while since I've posted here; I've been super busy with school and stuff. Anyway, I was hoping this site, https://archive.4plebs.org/, could be added in the same format as the 4chan archives you added last year (#4012).

It was super useful to just paste the link of a board like https://archive.4plebs.org/hr/page/1/ for example and have it download every single page from the board in order. Would also be nice to have post processors as well to get metadata and such.

Hopefully we can get this done soon, before the owner of the site decides to add hard anti-scraping measures here as well.

mikf · 2024-03-31T14:04:09Z

All 4chan archive sites based on Fuuka/Foolfuuka are already supported:
https://github.com/mikf/gallery-dl/blob/master/docs/supportedsites.md

cheese529 · 2024-03-31T14:18:53Z

All 4chan archive sites based on Fuuka/Foolfuuka are already supported: https://github.com/mikf/gallery-dl/blob/master/docs/supportedsites.md

Ah so that must mean there is no support for board extractors for those sites, hence why it keeps saying unsupported URL when I try to do https://archive.4plebs.org/hr/page/1

Hrxn · 2024-04-01T06:21:31Z

@cheese529 But thread extraction works, right?
Why would you want to grab entire boards, though? Some boards have threads which are, let's say, thematically very different..

cheese529 · 2024-04-01T07:10:15Z

But thread extraction works, right?

It does, but some boards have thousands and thousands of threads; it would take hours for me to manually go through each page of threads and get the URL for each one to download. The reason for wanting to grab an entire board is mainly for boards such as /hr/ or /s/ or /aco/, where the themes are pretty similar.

Hrxn · 2024-04-01T07:23:33Z

I fail to see how /hr/ is thematically similar, to be honest.

cheese529 · 2024-04-01T07:39:48Z

I believe the majority of the threads are made about celebs on /hr/ so I guess you could count that.

Hrxn · 2024-04-01T10:19:42Z

Let's see..

Based on recently updated threads on https://boards.4chan.org/hr/, as of right now, the first 15 threads include..

Tarot Cards
A South Korean Girl K-Pop group (although I admit that could count as celeb pics)
Pictures about plants in spring
Pictures about angels in art
Dragon Ball
An Anime
Spiritual Infographics (yes, lol)

So that's about a half. Which would not fit automatically, content-wise

mikf · 2024-04-01T12:37:21Z

Boards are supported (#1044, fb64183). That's why it lists "Boards" in docs/supportedsites for Foolfuuka sites.

Their accepted URL pattern is currently rather picky, though. It needs to be https://domain/hr/ or https://domain/hr/N. Anything else won't get matched.
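To make the two accepted shapes concrete, here is a minimal sketch of a matcher that accepts only https://domain/board/ and https://domain/board/N. The regex and the name `match_board` are hypothetical and written for illustration; this is not gallery-dl's actual pattern, and N is assumed to be a number.

```python
import re

# Hypothetical pattern for illustration only, NOT gallery-dl's real regex.
# Accepts "https://domain/board/" and "https://domain/board/N" (N assumed numeric).
BOARD_PATTERN = re.compile(r"https?://[^/?#]+/([^/?#]+)/(\d+)?")


def match_board(url):
    """Return (board, N) if the whole URL has one of the two shapes, else None.

    N is returned as a string, or None when the URL ends right after the board.
    """
    m = BOARD_PATTERN.fullmatch(url)
    if m is None:
        return None
    return m.group(1), m.group(2)


if __name__ == "__main__":
    for url in (
        "https://archive.4plebs.org/hr/",
        "https://archive.4plebs.org/hr/2",
        "https://archive.4plebs.org/hr/page/1/",
    ):
        print(url, "->", match_board(url))
```

Under these assumptions the /hr/page/1/ form from the original request is rejected, which would be consistent with the "unsupported URL" message reported earlier in the thread.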

cheese529 · 2024-04-01T19:43:53Z

So that's about a half. Which would not fit automatically, content-wise

Those are pretty random haha but I believe it's still the minority. If you go here https://boards.4chan.org/hr/catalog and https://boards.4chan.org/hr/archive you will see it's mostly just celebs posted. I will let you know exactly how much was unrelated stuff once I download the entire board

cheese529 · 2024-04-01T19:47:30Z

Their accepted URL pattern is currently rather picky

Tried the different URL pattern, but unfortunately it just gives me a Cloudflare 403 error. Opening the board in a browser works fine, with no captcha or verification required to browse. Visiting an API URL such as https://archive.4plebs.org/_/api/chan/index/?board=hr&page=1 does not trigger a verification check either. My user agent is added to the config and I am using cookies. I guess this is related to #5399.

D:\4plebs> gallery-new.exe --cookies-from-browser firefox https://archive.4plebs.org/hr/
[cookies][info] Extracted 734 cookies from Firefox
[4plebs][warning] Cloudflare challenge
[4plebs][error] HttpError: '403 Forbidden' for 'https://archive.4plebs.org/_/api/chan/index/?board=hr&page=1'
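For reference, the index URL in the error message is just a board name and a page number appended as query parameters. The sketch below rebuilds it with the standard library; the helper name `index_url` is hypothetical, and only the `board` and `page` parameters visible in the log are assumed. It performs no network request.

```python
from urllib.parse import urlencode

# Endpoint copied from the URL shown in the log above.
API_INDEX = "https://archive.4plebs.org/_/api/chan/index/"


def index_url(board, page):
    """Build the index API URL for a board and page (hypothetical helper)."""
    return API_INDEX + "?" + urlencode({"board": board, "page": page})


if __name__ == "__main__":
    print(index_url("hr", 1))
```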

stubkan · 2024-04-02T08:02:13Z

once I download the entire board

Do you intend to scrape absolutely everything? That kind of stuff is why they have been blocking robots, since people aggressively scrape vast amounts of data. 4plebs is one of the few archives that has a generous API call limit compared to the others; if this is abused, that can change quickly.

Instead of getting the entire board by hammering 4plebs, why not get it from the provided dumps?

If you just want the recent stuff - everything in the last 3 days is still on 4chan, and you can scrape 4chan itself which has a very high bandwidth and is more able to handle scraping.

https://boards.4chan.org/hr/archive

it would take hours for me to manually go through each page of threads and get the URL for each one to download.

If you want specific things, there is a search function provided by 4plebs and other archives, so you can search only for keywords such as "celebrity name" and parse just those results for relevant threads, which is much better for the archive site. Grabbing everything doesn't make sense, because you still have to search "everything" you grabbed, don't you? Do the search first, before you grab. It's not hard to scan a search result page for threads and grab the thread numbers. This is how I did it.

# $1: file containing the saved search result page
echo "Extracting threads from $1 ..."
grep -Po '(?<=No\.).+?(?=Replies)' "$1" > extracted-threads.txt
echo "Written output to extracted-threads.txt."
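For anyone without GNU grep's -P option (for example on Windows, as in the log earlier in the thread), the same extraction can be sketched in Python. This is an illustrative equivalent, not stubkan's script: the function name `extract_threads` is hypothetical, the dot after "No" is treated as a literal period, and surrounding whitespace is stripped, which the grep call does not do.

```python
import re

# Same idea as the grep call above: capture the text between "No." and "Replies".
THREAD_RE = re.compile(r"(?<=No\.).+?(?=Replies)")


def extract_threads(text):
    """Return the stripped text found between each 'No.' and the next 'Replies'."""
    return [match.strip() for match in THREAD_RE.findall(text)]


if __name__ == "__main__":
    import sys

    # Usage: python extract_threads.py saved-search-page.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        threads = extract_threads(handle.read())
    with open("extracted-threads.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(threads) + "\n")
```

The exact text layout of a saved search page is not shown in the thread, so the pattern may need adjusting to whatever actually sits between "No." and "Replies" in the saved file.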


cheese529 mentioned this issue Mar 31, 2024

[Site Request] archiveofsins.com #5409

Closed

mikf added the site:enhancement label Mar 31, 2024

mikf added a commit that referenced this issue Apr 2, 2024

[foolfuuka] improve 'board' pattern & support pages (#5408)

64948f2

mikf closed this as completed Jun 1, 2024
