
Create generic archiver for all valid youtube-dl URLs, add truthsocial extractor, unit tests for twitter_api extractor, utility methods for cleaning HTML and traversing objects #175

Merged
pjrobertson merged 22 commits into main from youtubedlp-rewrite on Jan 21, 2025

Conversation

pjrobertson
Collaborator

This is a proof of concept, but it should hopefully do away with needing separate archivers for things that ytdlp already does (e.g. tiktok, bluesky).

Currently: it fails for posts that don't have videos (e.g. bluesky, twitter) because youtubedlp is specifically for videos.

I put out a PR on ytdlp just now to separate the extract_status logic and download_logic in the ytdlp bluesky class, and got a very quick reply from a maintainer:

> This is not a supported use case. These are private methods, and compat is not guaranteed. In fact, compat is almost certainly guaranteed to be broken with extractor code

And the PR got marked as 'do not merge'. Something to discuss, but it seems like we can only do away with separate archivers for sites that only do video (youtube, tiktok...)

Contributor
@msramalho left a comment


looking good, minor changes.
are you adding the platform specific changes to this PR? (tiktok...)

d.raise_for_status()

# Peek at the first 256 bytes
first_256 = d.raw.read(256)
Contributor

is this a standard size for extension guessing?

Collaborator Author

This was a typo; it should be 261 bytes, which is all that's required for detection. But yes, that's all that's needed to guess the filetype. No point loading more than that (and wasting memory) just to determine the filetype!
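
For reference, a minimal sketch of header-based detection with the filetype package (assuming that's the dependency in question; the surrounding names follow the snippet above):

import filetype

first_261 = d.raw.read(261)  # 261 bytes covers the largest signature filetype checks
kind = filetype.guess(first_261)  # returns None when the signature isn't recognised
if kind:
    to_filename += f".{kind.extension}"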

Contributor

Interesting!
I haven't tested, but did you check if mimetypes can achieve the same thing with file chunks? Seeing this method https://docs.python.org/3/library/mimetypes.html#mimetypes.MimeTypes.readfp, it may be worth quickly testing to avoid adding a new dependency.

Collaborator Author

Good thinking. I looked into it and it doesn't seem easy with readfp, but there is actually a better way: the response object has Content-Type set in its headers, which most servers will (hopefully) be setting, so we can just use that :)

import mimetypes
from pathlib import Path

# get mimetype from the response headers
if not Path(to_filename).suffix:
    # strip any parameters, e.g. "video/mp4; charset=binary" -> "video/mp4"
    content_type = d.headers.get('Content-Type', '').split(';')[0].strip()
    extension = mimetypes.guess_extension(content_type) if content_type else None
    if extension:
        to_filename += extension

'resolution', 'dynamic_range', 'aspect_ratio', 'cookies', 'format', 'quality', 'preference', 'artists',
'channel_id', 'subtitles', 'tbr', 'url', 'original_url', 'automatic_captions', 'playable_in_embed', 'live_status',
'_format_sort_fields', 'chapters', 'uploader_id', 'uploader_url', 'requested_formats', 'format_note',
'audio_channels', 'asr', 'fps', 'was_live', 'is_live', 'heatmap', 'age_limit', 'stretched_ratio']
Contributor

From these I'd keep any that could describe the who/when. From what I see, that's uploader_id and uploader_url, if they're not saved elsewhere and are in fact IDs/usernames/public links.

Collaborator Author

They're not stored anywhere else. I'll remove them from here.
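
A guess at how a drop-list like the one above might be applied; EXCLUDED_FIELDS and the surrounding names are assumptions for illustration, not the PR's actual code:

# assumed usage: strip noisy yt-dlp fields, then store the rest as metadata
for key, value in video_data.items():
    if key not in EXCLUDED_FIELDS and value is not None:
        result.set(key, value)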

@pjrobertson
Collaborator Author

> are you adding the platform specific changes to this PR? (tiktok...)

Yes, that was the thinking. If the general direction looks good to you, I'll also include the tiktok one (it's actually already here/working) and perhaps the Twitter one (?)

@pjrobertson pjrobertson requested a review from msramalho January 16, 2025 14:00
@pjrobertson
Collaborator Author

OK, I've changed those items you mentioned, and have also started refactoring the code (quite a lot of changes). In summary:

  1. Renamed youtubedl_archiver to base_archiver - happy to modify as you see fit
  2. Moved the new base_archiver into its own folder (module) to show how modules might look in the future (this is just illustrative for now; the code all still works and the orchestrator remains untouched)
  3. Removed the tiktok_archiver file - the base_archiver can do it all
  4. Removed the twitter_archiver file - the base_archiver can do it all
  5. Removed the twitter_api_archiver file - the base_archiver can do it all
  6. Removed the bluesky_archiver - the base_archiver can do it all (once my upstream here is merged)

I think we're moving in the right direction: basically using youtubedlp as the main archiver, and only requiring additional archivers for sites where youtubedlp doesn't work.

In the future, we may also be able to remove:

  1. vk archiver
  2. Telegram archiver
  3. facebook
  4. Instagram

But I haven't touched those for now.

Contributor
@msramalho left a comment


The main change is to not remove the official Twitter API logic; otherwise it looks good and we can iterate improvements over this demo module.

* Keep twitter_api_archiver
* Remove unit tests for obsolete archivers
* Guess filename of media using the 'Content-Type' header
* Add mechanism to run 'expensive' tests last (see conftest.py; sketched below) and also flag expensive tests to fail straight off (pytest.mark.incremental)
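
A minimal sketch of how the conftest.py ordering mentioned in the last point might work; the hook is real pytest API, but the "expensive" marker name is an assumption, not the PR's actual implementation:

# conftest.py (illustrative sketch only)
def pytest_collection_modifyitems(config, items):
    # stable sort: tests carrying the (assumed) "expensive" marker run last
    items.sort(key=lambda item: item.get_closest_marker("expensive") is not None)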
@pjrobertson pjrobertson changed the title [WIP] Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs Jan 17, 2025
@pjrobertson pjrobertson changed the title Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs [WIP] Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs Jan 17, 2025
yield info_extractor

def suitable(self, url: str) -> bool:
"""
Collaborator
@erinhmclark Jan 17, 2025

This looks good. I was thinking in future we could bring it up the chain of processing, or even add an optional codelist into the source for known inputs, e.g. 'youtube', 'twitter'

Collaborator
@erinhmclark left a comment

It's looking good, it's great to get so much pulled out into one place.

Were you changing anything in the orchestration.yaml? Step wasn't picking up the youtubedl_archiver as a subclass, I think due to the 'TODO: cannot find subclasses of child.subclasses'.
Not sure if the plan is to change to process 'base_archiver' in the orchestrator or something?

extractor_name += f"_{info_extractor.ie_key()}"

if self.end_means_success:
    result.success(extractor_name)
Collaborator

It's not related to this PR, but currently the output is extractor_name:success_status. We could update the database with just extractor_name (as success is implied), but log/store on debug a metadata dict of {"extractor_name": "status"}, which will show us what was attempted and what failed.

Collaborator Author

There's actually a subtle difference between "extractor_name: success" and "extractor_name". Miguel explained it in the AA chat a few days back:

> this should have been better documented as I can't say 100%; to understand it would require testing with some real cases.
> The plausible reason for not being .success("twitter-ytl") is that this method was not working as expected to return the media, and this would defer the logic to a "retrying" of fetching the video with the ytdlp archiver.

But I do think we should standardise things better. Using an unknown "{extractor_name}" format for things that don't succeed could likely cause confusion. I created a note in the project on this.

# then add the platform specific additional metadata
for key, mapping in self.video_data_metadata_mapping(extractor_key, video_data).items():
    if isinstance(mapping, str):
        result.set(key, eval(f"video_data{mapping}"))
Collaborator

I know it's unlikely, but having eval can be a bit of a security risk, especially with external data involved.
How nested is the data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about this, but the mapping being used is hard-coded in the code itself; it's not taking user input, so I think it's OK. Some of the fields are doubly nested, like ['post']['media'], and I thought this was the easiest way to do it. But maybe there's a better way?
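
One eval-free alternative, sketched under the assumption that the mappings could be expressed as key tuples instead of strings (get_nested is a hypothetical helper, not the PR's actual API):

from functools import reduce

def get_nested(obj: dict, path: tuple):
    # walk nested keys: get_nested(data, ('post', 'media')) -> data['post']['media']
    return reduce(lambda acc, key: acc[key], path, obj)

# mapping values become tuples like ('post', 'media') rather than eval strings
for key, mapping in self.video_data_metadata_mapping(extractor_key, video_data).items():
    if isinstance(mapping, tuple):
        result.set(key, get_nested(video_data, mapping))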

@pjrobertson
Collaborator Author

> It's looking good, it's great to get so much pulled out into one place.
>
> Were you changing anything in the orchestration.yaml? Step wasn't picking up the youtubedl_archiver as a subclass, I think due to the 'TODO: cannot find subclasses of child.subclasses'. Not sure if the plan is to change to process 'base_archiver' in the orchestrator or something?

Nothing changed. I monkey-patched the youtubedl_archiver here, which means that currently nothing is changed for orchestration etc.; the orchestrator should run just fine without any changes to the settings. But let me add an integration test for running the entire orchestrator, to make sure it works.

if info_extractor.ie_key() == 'Bluesky':
    return bluesky.create_metadata(video_data, self, url)
if info_extractor.ie_key() == 'Twitter':
    return twitter.create_metadata(video_data, self, url)
Contributor

tiktok missing here?

Collaborator Author

This is a good question. So actually it's not needed for tiktok, since all tiktok pages have valid videos, and the 'standard' video_data is returned by the ytdl tiktok extractor. So no further processing is required.

But you've made me realise I can abstract this out and make the code clearer. I will do that.
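
For instance, a minimal sketch of such an abstraction, assuming each platform-specific module exposes the same create_metadata signature (the table name is hypothetical, not the code the PR ended up with):

# hypothetical dispatch table keyed by yt-dlp extractor key; platforms like
# tiktok that need no extra processing simply aren't listed here
PLATFORM_MODULES = {'Bluesky': bluesky, 'Twitter': twitter}

dropin = PLATFORM_MODULES.get(info_extractor.ie_key())
if dropin:
    return dropin.create_metadata(video_data, self, url)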

msramalho previously approved these changes Jan 20, 2025
Contributor
@msramalho left a comment

Happy with how it's looking and OK to merge if you are not planning to add any more features here (e.g. the tiktok comment I left).

@pjrobertson
Collaborator Author

> Happy with how it's looking and OK to merge if you are not planning to add any more features here (e.g. the tiktok comment I left).

The only question mark here is that my upstream ytdlp PR (needed for the bluesky changes here) hasn't been merged yet. I've just pushed them. Worst case, I copy over the single method with a TODO for now, since this is all changing fairly quickly.

@msramalho
Contributor

Use your own judgement over how long you expect the PR to take vs. how much code you'd have to port here / additional work later, and also whether this is blocking/creating headaches for the next tasks.

Contributor
@msramalho left a comment

minor changes only.


def create_metadata(post: dict, archiver: Archiver, url: str) -> Metadata:
"""
Creates metaata from a truth social post
Contributor

Suggested change:
-    Creates metaata from a truth social post
+    Creates metadata from a truth social post

Comment on lines 19 to 20
timestamp = post['created_at'] # format is 2022-12-29T19:51:18.161Z
result.set_timestamp(datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"))
Contributor

Suggested change:
-    timestamp = post['created_at'] # format is 2022-12-29T19:51:18.161Z
-    result.set_timestamp(datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"))
+    result.set_timestamp(post['created_at'])

parse_dt handles this format.

>>> from dateutil.parser import parse as parse_dt
>>> parse_dt("2022-12-29T19:51:18.161Z")
datetime.datetime(2022, 12, 29, 19, 51, 18, 161000, tzinfo=tzutc())


for key in ['replies_count', 'reblogs_count', 'favourites_count', ('account', 'followers_count'), ('account', 'following_count'), ('account', 'statuses_count'), ('account', 'display_name'), 'language', 'in_reply_to_account', 'replies_count']:
    if isinstance(key, tuple):
        store_key = u" ".join(key)
Contributor

Suggested change:
-        store_key = u" ".join(key)
+        store_key = " ".join(key)

The unicode prefix is only needed for Python 2; Python 3 strings are unicode by default.

assert len(result.media) == 1
assert result is not False

@pytest.mark.skip("Currently failing, multiple images are not being downloaded - this is due to an issue with ytdlp extractor")
Contributor

is there an issue on their repo we can link to here?

msramalho previously approved these changes Jan 20, 2025
@pjrobertson pjrobertson changed the title [WIP] Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs Create generic archiver for all valid youtube-dl URLs, add truthsocial extractor, unit tests for twitter_api extractor, utility methods for cleaning HTML and traversing objects Jan 21, 2025
Co-authored-by: Miguel Sozinho Ramalho <19508417+msramalho@users.noreply.github.com>
@pjrobertson pjrobertson merged commit d4fff0b into main Jan 21, 2025
3 of 4 checks passed
@pjrobertson pjrobertson deleted the youtubedlp-rewrite branch January 21, 2025 16:33