adds an unauthenticated Bluesky archiver #160

msramalho · 2024-12-23T17:45:46Z

avoids ATProto
leaves a few TODOs in reusable download_from_url method

example.orchestration.yaml

pjrobertson · 2024-12-31T08:24:18Z

src/auto_archiver/archivers/__init__.py

+from .instagram_api_archiver import InstagramAPIArchiver
+from .bluesky_archiver import BlueskyArchiver


Just a general comment here (not relevant to this PR) but I don't like how we import all archivers in the init.py file. Ideally, we should only load dependencies (archivers, feeders etc.) that are enabled in the orchestration.

Reason: I came across a bug whereby an archiver was broken (snscrape was broken), and even though I hadn't enabled it, it was still being loaded and throwing an error. Imagine if an archiver breaks for whatever reason (or more likely, an archiver's dependencies break), it would be much easier to just disable the archiver than telling the user to find an __init__.py file and 'change the python code'.

This is essentially how django works. You create a list of INSTALLED_APPS and then the archiver tries to install them one by one. We already have this list of 'installed apps' (it's called steps in our orchestration). So it's just a matter of on startup, trying to load each of them. Then when they can't be loaded throwing an error.

good point, please open an issue for those changes.

Done, under 'backlog'

(Images are stored in media/images)

Use 'python -m unittest' from the project root to run

pjrobertson · 2024-12-31T10:57:32Z

LGTM. Tested and works.

One issue I found was the media wasn't getting downloaded for this url: https://bsky.app/profile/colborne.bsky.social/post/3lec2bqjc5s2y

For some reason, bsky uses 2 different places to store images. Under images and also under media/images (I think the latter is if there are multiple images)

(Pdb) post.get("record", {}).get("embed", {})
{'$type': 'app.bsky.embed.recordWithMedia', 'media': {'$type': 'app.bsky.embed.images', 'images': [{'alt': '', 'aspectRatio': {'height': 1119, 'width': 922}, 'image': {'$type': 'blob', 'ref': {'$link': 'bafkreiflrkfihcvwlhka5tb2opw2qog6gfvywsdzdlibveys2acozh75tq'}, 'mimeType': 'image/jpeg', 'size': 479309}}, {'alt': '', 'aspectRatio': {'height': 1333, 'width': 2000}, 'image': {'$type': 'blob', 'ref': {'$link': 'bafkreibsprmwchf7r6xcstqkdvvuj3ijw7efciw7l3y4crxr4cmynseo7u'}, 'mimeType': 'image/jpeg', 'size': 800522}}]}, 'record': {'$type': 'app.bsky.embed.record', 'record': {'cid': 'bafyreiavgtdninmcqeoiwrw7pgjubgu7rjfdd7jfb7dwvbvj7aavppwkxm', 'uri': 'at://did:plc:pr46lkz3rkp2yvxnvwe47ib2/app.bsky.feed.post/3ldpxkzv53k2q'}}}

Looking at the bksy source, it seems that embed can be one of:

  embed?:
    | AppBskyEmbedImages.View
    | AppBskyEmbedVideo.View
    | AppBskyEmbedExternal.View
    | AppBskyEmbedRecord.View
    | AppBskyEmbedRecordWithMedia.View
    | { $type: string; [k: string]: unknown }

I've modified the code to look at both these paths for images. I also took the liberty of adding in some unit tests to test the 3 different types of media (single image, multiple image, video).

The unit tests might want to be abstracted out to another PR. But it's a framework on how unit testing may work. One big question we need to discuss: do we want to the unit tests to make live requests to our online platforms?

Advantages: allows us to test the platforms as well, and to make sure the APIs/webpages haven't broken or changed
Disadvantages: unit tests will run more slowly, since they need to actually download from the internet, our tests will break even if we did nothing wrong.

msramalho · 2025-01-06T16:47:45Z

I agree with merging this unit tests changes even if we refactor how those are setup etc, thanks for adding.
I cannot approve my own PR so leaving that to @erinhmclark on this one.

erinhmclark

Approved.

msramalho added 2 commits December 23, 2024 17:15

adds a TODO for nex code iterations

0dfdb86

implements bsky archiver

b8af64c

msramalho added enhancement New feature or request archiver labels Dec 23, 2024

adds new archiver to example orchestration file

19dc3ae

erinhmclark reviewed Dec 26, 2024

View reviewed changes

example.orchestration.yaml Show resolved Hide resolved

pjrobertson reviewed Dec 31, 2024

View reviewed changes

pjrobertson added 2 commits December 31, 2024 11:56

Fix downloading media for posts with multiple images

b38a2b7

(Images are stored in media/images)

Setup a basic framework for unit tests

3a93e9a

Use 'python -m unittest' from the project root to run

msramalho requested a review from erinhmclark January 6, 2025 16:47

erinhmclark approved these changes Jan 6, 2025

View reviewed changes

msramalho merged commit a697f0a into main Jan 7, 2025

msramalho deleted the feat/bluesky-archiver branch January 7, 2025 10:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adds an unauthenticated Bluesky archiver #160

adds an unauthenticated Bluesky archiver #160

msramalho commented Dec 23, 2024

pjrobertson Dec 31, 2024

msramalho Jan 6, 2025

pjrobertson Jan 6, 2025

pjrobertson commented Dec 31, 2024

msramalho commented Jan 6, 2025

erinhmclark left a comment

		from .instagram_api_archiver import InstagramAPIArchiver
		from .bluesky_archiver import BlueskyArchiver

adds an unauthenticated Bluesky archiver #160

adds an unauthenticated Bluesky archiver #160

Conversation

msramalho commented Dec 23, 2024

pjrobertson Dec 31, 2024

Choose a reason for hiding this comment

msramalho Jan 6, 2025

Choose a reason for hiding this comment

pjrobertson Jan 6, 2025

Choose a reason for hiding this comment

pjrobertson commented Dec 31, 2024

msramalho commented Jan 6, 2025

erinhmclark left a comment

Choose a reason for hiding this comment