Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds an unauthenticated Bluesky archiver #160

Merged
merged 5 commits into from
Jan 7, 2025
Merged

Conversation

msramalho
Copy link
Contributor

  • avoids ATProto
  • leaves a few TODOs in reusable download_from_url method

@msramalho msramalho added enhancement New feature or request archiver labels Dec 23, 2024
from .instagram_api_archiver import InstagramAPIArchiver
from .bluesky_archiver import BlueskyArchiver
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general comment here (not relevant to this PR) but I don't like how we import all archivers in the init.py file. Ideally, we should only load dependencies (archivers, feeders etc.) that are enabled in the orchestration.

Reason: I came across a bug whereby an archiver was broken (snscrape was broken), and even though I hadn't enabled it, it was still being loaded and throwing an error. Imagine if an archiver breaks for whatever reason (or more likely, an archiver's dependencies break), it would be much easier to just disable the archiver than telling the user to find an __init__.py file and 'change the python code'.

This is essentially how django works. You create a list of INSTALLED_APPS and then the archiver tries to install them one by one. We already have this list of 'installed apps' (it's called steps in our orchestration). So it's just a matter of on startup, trying to load each of them. Then when they can't be loaded throwing an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, please open an issue for those changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, under 'backlog'

(Images are stored in media/images)
Use 'python -m unittest' from the project root to run
@pjrobertson
Copy link
Collaborator

LGTM. Tested and works.

One issue I found was the media wasn't getting downloaded for this url: https://bsky.app/profile/colborne.bsky.social/post/3lec2bqjc5s2y

For some reason, bsky uses 2 different places to store images. Under images and also under media/images (I think the latter is if there are multiple images)

(Pdb) post.get("record", {}).get("embed", {})
{'$type': 'app.bsky.embed.recordWithMedia', 'media': {'$type': 'app.bsky.embed.images', 'images': [{'alt': '', 'aspectRatio': {'height': 1119, 'width': 922}, 'image': {'$type': 'blob', 'ref': {'$link': 'bafkreiflrkfihcvwlhka5tb2opw2qog6gfvywsdzdlibveys2acozh75tq'}, 'mimeType': 'image/jpeg', 'size': 479309}}, {'alt': '', 'aspectRatio': {'height': 1333, 'width': 2000}, 'image': {'$type': 'blob', 'ref': {'$link': 'bafkreibsprmwchf7r6xcstqkdvvuj3ijw7efciw7l3y4crxr4cmynseo7u'}, 'mimeType': 'image/jpeg', 'size': 800522}}]}, 'record': {'$type': 'app.bsky.embed.record', 'record': {'cid': 'bafyreiavgtdninmcqeoiwrw7pgjubgu7rjfdd7jfb7dwvbvj7aavppwkxm', 'uri': 'at://did:plc:pr46lkz3rkp2yvxnvwe47ib2/app.bsky.feed.post/3ldpxkzv53k2q'}}}

Looking at the bksy source, it seems that embed can be one of:

  embed?:
    | AppBskyEmbedImages.View
    | AppBskyEmbedVideo.View
    | AppBskyEmbedExternal.View
    | AppBskyEmbedRecord.View
    | AppBskyEmbedRecordWithMedia.View
    | { $type: string; [k: string]: unknown }

I've modified the code to look at both these paths for images. I also took the liberty of adding in some unit tests to test the 3 different types of media (single image, multiple image, video).

The unit tests might want to be abstracted out to another PR. But it's a framework on how unit testing may work. One big question we need to discuss: do we want to the unit tests to make live requests to our online platforms?

Advantages: allows us to test the platforms as well, and to make sure the APIs/webpages haven't broken or changed
Disadvantages: unit tests will run more slowly, since they need to actually download from the internet, our tests will break even if we did nothing wrong.

@msramalho
Copy link
Contributor Author

I agree with merging this unit tests changes even if we refactor how those are setup etc, thanks for adding.
I cannot approve my own PR so leaving that to @erinhmclark on this one.

@msramalho msramalho requested a review from erinhmclark January 6, 2025 16:47
Copy link
Collaborator

@erinhmclark erinhmclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

@msramalho msramalho merged commit a697f0a into main Jan 7, 2025
@msramalho msramalho deleted the feat/bluesky-archiver branch January 7, 2025 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
archiver enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants