Create generic archiver for all valid youtube-dl URLs, add truthsocial extractor, unit tests for twitter_api extractor, utility methods for cleaning HTML and traversing objects #175
Conversation
Also downloads in chunks - fixes 2 TODOs:
1. Extract more metadata
2. Better extract the thumbnail
3. Set up a framework for specific sites to provide more granular metadata processing
Looking good, minor changes.
Are you adding the platform-specific changes to this PR? (tiktok...)
```python
d.raise_for_status()

# Peek at the first 256 bytes
first_256 = d.raw.read(256)
```
is this a standard size for extension guessing?
This was a typo; it should be 261, which is all that's required for detection. But yes, that's all that's needed to guess the filetype. No point loading more than that (and wasting memory) to determine it!
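For reference, a minimal sketch of that detection step, assuming the third-party `filetype` package and the streamed requests response `d` from the snippet above:

```python
import filetype  # third-party: pip install filetype

# filetype matches magic-number signatures against at most the
# first 261 bytes of the payload, so reading more gains nothing
first_bytes = d.raw.read(261)
kind = filetype.guess(first_bytes)
if kind is not None:
    extension = "." + kind.extension  # e.g. ".mp4"
```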
interesting!
I haven't tested it, but did you check whether mimetypes can achieve the same thing with file chunks? Seeing this method https://docs.python.org/3/library/mimetypes.html#mimetypes.MimeTypes.readfp it may be worth quickly testing, to avoid adding a new dependency.
Good thinking. I looked into it and it doesn't seem easy with readfp (it parses a mime.types-style mapping file rather than sniffing file content). But there is actually a better way: the response object has Content-Type set in its headers, which (hopefully) most servers will be setting, so we can just use that :)
```python
import mimetypes
from pathlib import Path

# get the mimetype from the response headers
if not Path(to_filename).suffix:
    content_type = d.headers.get('Content-Type')
    extension = mimetypes.guess_extension(content_type) if content_type else None
    if extension:
        to_filename += extension
```
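One caveat worth noting (my addition, not from the PR): Content-Type can carry parameters such as `video/mp4; charset=binary`, which `guess_extension` won't match, so stripping them first is safer:

```python
# strip any "; charset=..." parameters before guessing the extension
content_type = d.headers.get('Content-Type', '').split(';')[0].strip()
extension = mimetypes.guess_extension(content_type) if content_type else None
```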
```python
    'resolution', 'dynamic_range', 'aspect_ratio', 'cookies', 'format', 'quality', 'preference', 'artists',
    'channel_id', 'subtitles', 'tbr', 'url', 'original_url', 'automatic_captions', 'playable_in_embed', 'live_status',
    '_format_sort_fields', 'chapters', 'uploader_id', 'uploader_url', 'requested_formats', 'format_note',
    'audio_channels', 'asr', 'fps', 'was_live', 'is_live', 'heatmap', 'age_limit', 'stretched_ratio']
```
From these I'd keep any that could describe the who/when. From what I can see, that's `uploader_id` and `uploader_url`, if they aren't saved elsewhere and are in fact ids/usernames/public links.
They're not stored anywhere else. I'll remove them from here
Yes, that was the thinking. If the general direction looks good to you, I'll also include the tiktok one (it's actually already here/working) and perhaps the Twitter one (?)
OK, I've changed those items you mentioned, and have also started refactoring the code (quite a lot of changes). In summary:
I think we're moving in the right direction, basically using youtubedlp as the main archiver, and only requiring additional archivers where youtubedlp doesn't work. In the future, we may also be able to remove:
But I haven't touched those for now.
Force-pushed from bb0ea67 to 31dbbd2
Main change is not removing the official twitter api logic; otherwise looks good, and we can iterate improvements over this demo module.
* Keep twitter_api_archiver
* Remove unit tests for obsolete archivers
* Guess filename of media using the 'Content-Type' header
* Add mechanism to run 'expensive' tests last (see conftest.py) and also flag expensive tests to fail straight off (pytest.mark.incremental)
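The last bullet above describes reordering test collection in conftest.py. A sketch of one common way to do that with a pytest hook (the `expensive` marker name is my assumption, not the PR's actual implementation):

```python
# conftest.py: push 'expensive' tests to the end of the run
def pytest_collection_modifyitems(items):
    # list.sort is stable, so non-expensive tests keep their relative order
    items.sort(key=lambda item: bool(item.get_closest_marker("expensive")))
```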
Force-pushed from 31dbbd2 to 394bcd8
```python
yield info_extractor

def suitable(self, url: str) -> bool:
    """
```
This looks good. I was thinking that in future we could bring it up the chain of processing, or even add an optional codelist into the source for known inputs, e.g. 'youtube', 'twitter'.
It's looking good; it's great to get so much pulled out into one place.
Were you changing anything in the orchestration.yaml? Step wasn't picking up the youtubedl_archiver as a subclass, I think due to the
`TODO: cannot find subclasses of child.subclasses`
Not sure if the plan is to change to process 'base_archiver' in the orchestrator or something?
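For what it's worth, a common cause of this is that `cls.__subclasses__()` only returns direct children, so grandchild classes get missed. A minimal sketch of a recursive lookup (my illustration, not the project's code):

```python
def all_subclasses(cls) -> set:
    """Recursively collect subclasses; cls.__subclasses__() alone only
    returns direct children, so deeper levels of the hierarchy are missed."""
    direct = set(cls.__subclasses__())
    return direct.union(s for c in direct for s in all_subclasses(c))
```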
```python
extractor_name += f"_{info_extractor.ie_key()}"

if self.end_means_success:
    result.success(extractor_name)
```
It's not related to this PR, but currently the output is extractor_name:success_status. We could update the database with just extractor_name (as success is implied), but log / store on debug a metadata dict of {"extractor_name": "status"}, which will show us what was attempted and what failed.
There's actually a subtle difference between "extractor_name: success" and "extractor_name". Miguel explained it in the AA chat a few days back:
> this should have been better documented as I can't say 100%; to understand it would require testing with some real cases.
> The plausible reason for not being .success("twitter-ytl") is that this method was not working as expected to return the media, and this would defer the logic to "retrying" the fetch of the video with the ytdlp archiver.

But I do think we should standardise things better. Using an unknown "{extractor_name}" format for things that don't succeed could likely cause confusion. I've created a note in the project on this.
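A rough sketch of the standardisation being discussed (hypothetical names throughout: `archivers`, `item`, `.download`, and `.name` are stand-ins, not the project's actual API): record every attempt in one dict, store only the winning extractor in the database, and keep the full map in the debug log:

```python
import logging

attempts = {}
for archiver in archivers:
    result = archiver.download(item)
    attempts[archiver.name] = "success" if result else "failed"
    if result:
        break

# the database row stores only the winning extractor_name (success implied);
# the debug log keeps the whole attempt map for diagnosis
logging.debug("archive attempts: %s", attempts)
```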
```python
# then add the platform specific additional metadata
for key, mapping in self.video_data_metadata_mapping(extractor_key, video_data).items():
    if isinstance(mapping, str):
        result.set(key, eval(f"video_data{mapping}"))
```
I know it's unlikely, but having eval can be a bit of a security risk, especially with external data involved.
How nested is the data?
Yeah, I thought about this, but the `mapping` being used is hard-coded in the code itself. It's not taking user input, so I think it's ok. Some of the fields are double-nested, like `['post']['media']`, and I thought this was the easiest way to do it. But maybe there's a better way?
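One eval-free alternative (a minimal sketch I'm adding for illustration; the `traverse` helper is hypothetical, not part of the PR) is to express each mapping as a tuple of keys and fold over it:

```python
from functools import reduce

def traverse(data, path):
    """Walk a nested dict safely, e.g. traverse(video_data, ('post', 'media'))
    rather than eval("video_data['post']['media']")."""
    return reduce(lambda d, k: d.get(k) if isinstance(d, dict) else None, path, data)
```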
Nothing changed. I monkey-patched the
```python
if info_extractor.ie_key() == 'Bluesky':
    return bluesky.create_metadata(video_data, self, url)
if info_extractor.ie_key() == 'Twitter':
    return twitter.create_metadata(video_data, self, url)
```
tiktok missing here?
This is a good question. Actually it's not needed for tiktok, since all tiktok pages have valid videos and the 'standard' video_data is returned by the ytdl tiktok extractor, so no further processing is required.
But you've made me realise I can abstract this out and make the code clearer. I'll do that (see the sketch below).
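One possible shape for that abstraction (my sketch only; the `SITE_HANDLERS` table is hypothetical and not in the PR): replace the if-chain with a dispatch dict keyed on `ie_key()`, with absent sites falling through to the standard video_data handling:

```python
# hypothetical dispatch table from yt-dlp extractor keys to
# site-specific metadata builders; TikTok needs no entry because
# its standard video_data is already sufficient
SITE_HANDLERS = {
    'Bluesky': bluesky.create_metadata,
    'Twitter': twitter.create_metadata,
}

handler = SITE_HANDLERS.get(info_extractor.ie_key())
if handler:
    return handler(video_data, self, url)
```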
Happy with how it's looking, and okay to merge if you are not planning to add any more features here (e.g. the tiktok comment I left).
Only question mark here is that my upstream ytdlp merge PR (needed for the bluesky changes here) hasn't been merged yet. I've just pushed them. Worst case is I copy over the single method with a TODO for now, since this is all changing fairly quickly.
Use your own judgement over how long you expect the PR to take vs how much code you have to port here / additional work later, and also whether this is blocking or creating headaches for the next tasks.
minor changes only.
```python
def create_metadata(post: dict, archiver: Archiver, url: str) -> Metadata:
    """
    Creates metaata from a truth social post
```
```diff
-    Creates metaata from a truth social post
+    Creates metadata from a truth social post
```
```python
timestamp = post['created_at']  # format is 2022-12-29T19:51:18.161Z
result.set_timestamp(datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"))
```
```diff
-timestamp = post['created_at']  # format is 2022-12-29T19:51:18.161Z
-result.set_timestamp(datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"))
+result.set_timestamp(post['created_at'])
```
parse_dt handles this format.

```python
>>> from dateutil.parser import parse as parse_dt
>>> parse_dt("2022-12-29T19:51:18.161Z")
datetime.datetime(2022, 12, 29, 19, 51, 18, 161000, tzinfo=tzutc())
```
```python
for key in ['replies_count', 'reblogs_count', 'favourites_count', ('account', 'followers_count'),
            ('account', 'following_count'), ('account', 'statuses_count'), ('account', 'display_name'),
            'language', 'in_reply_to_account', 'replies_count']:
    if isinstance(key, tuple):
        store_key = u" ".join(key)
```
store_key = u" ".join(key) | |
store_key = " ".join(key) |
The unicode prefix is only needed for Python 2; Python 3 strings are unicode by default.
```python
assert len(result.media) == 1
assert result is not False

@pytest.mark.skip("Currently failing, multiple images are not being downloaded - this is due to an issue with ytdlp extractor")
```
is there an issue on their repo we can link to here?
(renamed from youtube_archiver)
Seems like ytdlp may not merge yt-dlp/yt-dlp#12098 anytime soon
Force-pushed from 6814205 to d3e3eb7
Three resolved (outdated) review threads on src/auto_archiver/archivers/generic_archiver/generic_archiver.py
Co-authored-by: Miguel Sozinho Ramalho <19508417+msramalho@users.noreply.github.com>
This is a proof of concept, but it should hopefully do away with having separate archivers for things that ytdlp already does (e.g. tiktok, bluesky).
Currently it fails for posts that don't have videos (e.g. bluesky, twitter), because youtubedlp is specifically for videos.
I put out a PR on ytdlp just now to separate the extract_status logic and the download logic in the ytdlp bluesky class, and got a very quick reply from a maintainer.
And the PR got marked as 'do not merge'. Something to discuss, but it seems like we can only do away with separate archivers for sites that only do video (youtube, tiktok...).