
Create generic archiver for all valid youtube-dl URLs, add truthsocial extractor, unit tests for twitter_api extractor, utility methods for cleaning HTML and traversing objects #175

Merged
pjrobertson merged 22 commits into main from youtubedlp-rewrite on Jan 21, 2025

Conversation

pjrobertson
Collaborator

This is a proof of concept, but it should hopefully do away with needing separate archivers for things that ytdlp already does (e.g. tiktok, bluesky).

Currently: it fails for posts that don't have videos (e.g. bluesky, twitter) because youtubedlp is specifically for videos.

I put out a PR on ytdlp just now to separate the extract_status logic and download_logic in the ytdlp bluesky class, and got a very quick reply from a maintainer:

> This is not a supported use case. These are private methods, and compat is not guaranteed. In fact, compat is almost certainly guaranteed to be broken with extractor code

And the PR got marked as 'do not merge'. Something to discuss, but it seems like we can only do away with separate archivers for sites that only do video (youtube, tiktok...)

Contributor
@msramalho left a comment


looking good, minor changes.
are you adding the platform specific changes to this PR? (tiktok...)

d.raise_for_status()

# Peek at the first 256 bytes
first_256 = d.raw.read(256)
Contributor

is this a standard size for extension guessing?

Collaborator Author

This was a typo; it should be 261 bytes, which is all that's required for detection. But yes, that's all that's needed to guess the filetype. No point loading more than that (and wasting memory) just to determine the filetype!
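
For reference, a minimal sketch of header-based detection with the filetype package (assuming that's the dependency in question; the surrounding names follow the snippet above):

import filetype

first_261 = d.raw.read(261)  # 261 bytes covers the largest signature filetype checks
kind = filetype.guess(first_261)  # returns None when the signature isn't recognised
if kind:
    to_filename += f".{kind.extension}"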

Contributor

Interesting!
I haven't tested, but did you check if mimetypes can achieve the same thing with file chunks? Seeing this method https://docs.python.org/3/library/mimetypes.html#mimetypes.MimeTypes.readfp, it may be worth quickly testing to avoid adding a new dependency.

Collaborator Author

Good thinking. I looked into it and it doesn't seem easy with readfp, but there is actually a better way: the response object has Content-Type set in its headers, which most servers will (hopefully) be setting, so we can just use that :)

import mimetypes
from pathlib import Path

# get mimetype from the response headers
if not Path(to_filename).suffix:
    # strip any parameters, e.g. "video/mp4; charset=binary" -> "video/mp4"
    content_type = d.headers.get('Content-Type', '').split(';')[0].strip()
    extension = mimetypes.guess_extension(content_type) if content_type else None
    if extension:
        to_filename += extension

'resolution', 'dynamic_range', 'aspect_ratio', 'cookies', 'format', 'quality', 'preference', 'artists',
'channel_id', 'subtitles', 'tbr', 'url', 'original_url', 'automatic_captions', 'playable_in_embed', 'live_status',
'_format_sort_fields', 'chapters', 'uploader_id', 'uploader_url', 'requested_formats', 'format_note',
'audio_channels', 'asr', 'fps', 'was_live', 'is_live', 'heatmap', 'age_limit', 'stretched_ratio']
Contributor

From these I'd keep any that could describe the who/when. From what I see, that's uploader_id and uploader_url, if they're not saved elsewhere and are in fact IDs/usernames/public links.

Collaborator Author

They're not stored anywhere else. I'll remove them from here.
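
A guess at how a drop-list like the one above might be applied; EXCLUDED_FIELDS and the surrounding names are assumptions for illustration, not the PR's actual code:

# assumed usage: strip noisy yt-dlp fields, then store the rest as metadata
for key, value in video_data.items():
    if key not in EXCLUDED_FIELDS and value is not None:
        result.set(key, value)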

@pjrobertson
Collaborator Author

> are you adding the platform specific changes to this PR? (tiktok...)

Yes, that was the thinking. If the general direction looks good to you, I'll also include the tiktok one (it's actually already here/working) and perhaps the Twitter one (?)

@pjrobertson pjrobertson requested a review from msramalho January 16, 2025 14:00
@pjrobertson
Collaborator Author

OK, I've changed those items you mentioned, and have also started refactoring the code (quite a lot of changes). In summary:

  1. Renamed youtubedl_archiver to base_archiver - happy to modify as you see fit
  2. Moved the new base_archiver into its own folder (module) to show how modules might look in the future (this is just illustrative for now; the code all still works and the orchestrator remains untouched)
  3. Removed the tiktok_archiver file - the base_archiver can do it all
  4. Removed the twitter_archiver file - the base_archiver can do it all
  5. Removed the twitter_api_archiver file - the base_archiver can do it all
  6. Removed the bluesky_archiver - the base_archiver can do it all (once my upstream here is merged)

I think we're moving in the right direction: basically using youtubedlp as the main archiver, and only requiring additional archivers for sites where youtubedlp doesn't work.

In the future, we may also be able to remove:

  1. vk archiver
  2. Telegram archiver
  3. facebook
  4. Instagram

But I haven't touched those for now.

Contributor
@msramalho left a comment


The main change is to not remove the official Twitter API logic; otherwise it looks good and we can iterate improvements over this demo module.

* Keep twitter_api_archiver
* Remove unit tests for obsolete archivers
* Guess filename of media using the 'Content-Type' header
* Add mechanism to run 'expensive' tests last (see conftest.py; sketched below) and also flag expensive tests to fail straight off (pytest.mark.incremental)
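
A minimal sketch of how the conftest.py ordering mentioned in the last point might work; the hook is real pytest API, but the "expensive" marker name is an assumption, not the PR's actual implementation:

# conftest.py (illustrative sketch only)
def pytest_collection_modifyitems(config, items):
    # stable sort: tests carrying the (assumed) "expensive" marker run last
    items.sort(key=lambda item: item.get_closest_marker("expensive") is not None)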
@pjrobertson pjrobertson changed the title [WIP] Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs Jan 17, 2025
@pjrobertson pjrobertson changed the title Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs [WIP] Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs Jan 17, 2025
yield info_extractor

def suitable(self, url: str) -> bool:
"""
Collaborator
@erinhmclark Jan 17, 2025

This looks good. I was thinking in future we could bring it up the chain of processing, or even add an optional codelist into the source for known inputs, e.g. 'youtube', 'twitter'

Collaborator
@erinhmclark left a comment

It's looking good, it's great to get so much pulled out into one place.

Were you changing anything in the orchestration.yaml? Step wasn't picking up the youtubedl_archiver as a subclass, I think due to the 'TODO: cannot find subclasses of child.subclasses'.
Not sure if the plan is to change to process 'base_archiver' in the orchestrator or something?

extractor_name += f"_{info_extractor.ie_key()}"

if self.end_means_success:
    result.success(extractor_name)
Collaborator

It's not related to this PR, but currently the output is extractor_name:success_status. We could update the database with just extractor_name (as success is implied), but log/store on debug a metadata dict of {"extractor_name": "status"}, which will show us what was attempted and what failed.

Collaborator Author

There's actually a subtle difference between "extractor_name: success" and "extractor_name". Miguel explained it in the AA chat a few days back:

> this should have been better documented as I can't say 100%; to understand it would require testing with some real cases.
> The plausible reason for not being .success("twitter-ytl") is that this method was not working as expected to return the media, and this would defer the logic to a "retrying" of fetching the video with the ytdlp archiver.

But I do think we should standardise things better. Using an unknown "{extractor_name}" format for things that don't succeed could likely cause confusion. I created a note in the project on this.

# then add the platform specific additional metadata
for key, mapping in self.video_data_metadata_mapping(extractor_key, video_data).items():
    if isinstance(mapping, str):
        result.set(key, eval(f"video_data{mapping}"))
Collaborator

I know it's unlikely, but having eval can be a bit of a security risk, especially with external data involved.
How nested is the data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about this, but the mapping being used is hard-coded in the code itself; it's not taking user input, so I think it's OK. Some of the fields are doubly nested, like ['post']['media'], and I thought this was the easiest way to do it. But maybe there's a better way?
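
One eval-free alternative, sketched under the assumption that the mappings could be expressed as key tuples instead of strings (get_nested is a hypothetical helper, not the PR's actual API):

from functools import reduce

def get_nested(obj: dict, path: tuple):
    # walk nested keys: get_nested(data, ('post', 'media')) -> data['post']['media']
    return reduce(lambda acc, key: acc[key], path, obj)

# mapping values become tuples like ('post', 'media') rather than eval strings
for key, mapping in self.video_data_metadata_mapping(extractor_key, video_data).items():
    if isinstance(mapping, tuple):
        result.set(key, get_nested(video_data, mapping))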

@pjrobertson
Collaborator Author

> It's looking good, it's great to get so much pulled out into one place.
>
> Were you changing anything in the orchestration.yaml? Step wasn't picking up the youtubedl_archiver as a subclass, I think due to the 'TODO: cannot find subclasses of child.subclasses'. Not sure if the plan is to change to process 'base_archiver' in the orchestrator or something?

Nothing changed. I monkey-patched the youtubedl_archiver here, which means that currently nothing is changed for orchestration etc.; the orchestrator should run just fine without any changes to the settings. But let me add an integration test for running the entire orchestrator, to make sure it works.

if info_extractor.ie_key() == 'Bluesky':
    return bluesky.create_metadata(video_data, self, url)
if info_extractor.ie_key() == 'Twitter':
    return twitter.create_metadata(video_data, self, url)
Contributor

tiktok missing here?

Collaborator Author

This is a good question. So actually it's not needed for tiktok, since all tiktok pages have valid videos, and the 'standard' video_data is returned by the ytdl tiktok extractor. So no further processing is required.

But you've made me realise I can abstract this out and make the code clearer. I will do that.
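
For instance, a minimal sketch of such an abstraction, assuming each platform-specific module exposes the same create_metadata signature (the table name is hypothetical, not the code the PR ended up with):

# hypothetical dispatch table keyed by yt-dlp extractor key; platforms like
# tiktok that need no extra processing simply aren't listed here
PLATFORM_MODULES = {'Bluesky': bluesky, 'Twitter': twitter}

dropin = PLATFORM_MODULES.get(info_extractor.ie_key())
if dropin:
    return dropin.create_metadata(video_data, self, url)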

msramalho previously approved these changes Jan 20, 2025
Contributor
@msramalho left a comment

Happy with how it's looking and OK to merge if you are not planning to add any more features here (e.g. the tiktok comment I left).

@pjrobertson
Collaborator Author

> Happy with how it's looking and OK to merge if you are not planning to add any more features here (e.g. the tiktok comment I left).

The only question mark here is that my upstream ytdlp PR (needed for the bluesky changes here) hasn't been merged yet. I've just pushed them. Worst case, I copy over the single method with a TODO for now, since this is all changing fairly quickly.

@msramalho
Contributor

Use your own judgement over how long you expect the PR to take vs. how much code you'd have to port here / additional work later, and also whether this is blocking/creating headaches for the next tasks.

Contributor
@msramalho left a comment

minor changes only.


def create_metadata(post: dict, archiver: Archiver, url: str) -> Metadata:
"""
Creates metaata from a truth social post
Contributor

Suggested change:
-    Creates metaata from a truth social post
+    Creates metadata from a truth social post

Comment on lines 19 to 20
timestamp = post['created_at'] # format is 2022-12-29T19:51:18.161Z
result.set_timestamp(datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"))
Contributor

Suggested change:
-    timestamp = post['created_at'] # format is 2022-12-29T19:51:18.161Z
-    result.set_timestamp(datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"))
+    result.set_timestamp(post['created_at'])

parse_dt handles this format.

>>> from dateutil.parser import parse as parse_dt
>>> parse_dt("2022-12-29T19:51:18.161Z")
datetime.datetime(2022, 12, 29, 19, 51, 18, 161000, tzinfo=tzutc())


for key in ['replies_count', 'reblogs_count', 'favourites_count', ('account', 'followers_count'), ('account', 'following_count'), ('account', 'statuses_count'), ('account', 'display_name'), 'language', 'in_reply_to_account', 'replies_count']:
    if isinstance(key, tuple):
        store_key = u" ".join(key)
Contributor

Suggested change:
-        store_key = u" ".join(key)
+        store_key = " ".join(key)

The unicode prefix is only needed for Python 2; Python 3 strings are unicode by default.

assert len(result.media) == 1
assert result is not False

@pytest.mark.skip("Currently failing, multiple images are not being downloaded - this is due to an issue with ytdlp extractor")
Contributor

is there an issue on their repo we can link to here?

msramalho previously approved these changes Jan 20, 2025
@pjrobertson pjrobertson changed the title [WIP] Rewrite youtubedl_archiver to be more generic and work for all valid ytdlp URLs Create generic archiver for all valid youtube-dl URLs, add truthsocial extractor, unit tests for twitter_api extractor, utility methods for cleaning HTML and traversing objects Jan 21, 2025
Co-authored-by: Miguel Sozinho Ramalho <19508417+msramalho@users.noreply.github.com>
@pjrobertson pjrobertson merged commit d4fff0b into main Jan 21, 2025
3 of 4 checks passed
@pjrobertson pjrobertson deleted the youtubedlp-rewrite branch January 21, 2025 16:33