A Ruby gem for streaming data from the Bluesky/AtProto firehose 🦋
Note
ATProto Ruby gems collection: skyfall | blue_factory | minisky | didkit
Skyfall is a Ruby library for connecting to the "firehose" of the Bluesky social network, i.e. a websocket which streams all new posts and everything else happening on the Bluesky network in real time. The code connects to the websocket endpoint, decodes the messages which are encoded in some binary formats like DAG-CBOR, and returns the data as Ruby objects, which you can filter and save to some kind of database (e.g. in order to create a custom feed).
gem install skyfall
Start a connection to the firehose by creating a Skyfall::Stream
object, passing the server hostname and endpoint name:
require 'skyfall'
sky = Skyfall::Stream.new('bsky.network', :subscribe_repos)
Add event listeners to handle incoming messages and get notified of errors:
sky.on_connect { puts "Connected" }
sky.on_disconnect { puts "Disconnected" }
sky.on_message { |m| p m }
sky.on_error { |e| puts "ERROR: #{e}" }
When you're ready, open the connection by calling connect
:
sky.connect
Each message passed to on_message
is an instance of a subclass of WebsocketMessage
, depending on the message type. The supported message types are:
CommitMessage
(#commit
) - represents a change in a user's repo; most messages are of this typeHandleMessage
(#handle
) - when a different handle is assigned to a user's DIDTombstoneMessage
(#tombstone
) - when an account is deletedInfoMessage
(#info
) - a protocol error message, e.g. about an invalid cursor parameterUnknownMessage
is used for other unrecognized message types
All message objects have the following properties:
type
(symbol) - the message type identifier, e.g.:commit
seq
(integer) - a sequential index of the messagerepo
ordid
(string) - DID of the repository (user account)time
(Time) - timestamp of the described action
All properties except type
may be nil for some message types that aren't related to a specific user, like #info
.
Commit messages additionally have:
commit
- CID of the commitprev
- CID of the previous commit in that repooperations
- list of operations (usually one)
Handle messages additionally have:
handle
- the new handle assigned to the DID
Info messages additionally have:
name
- identifier of the message/errormessage
- a human-readable description
Operations are objects of type Operation
and have such properties:
repo
ordid
(string) - DID of the repository (user account)collection
(string) - name of the relevant collection in the repository, e.g.app.bsky.feed.post
for poststype
(symbol) - short name of the collection, e.g.:bsky_post
rkey
(string) - identifier of a record in a collectionpath
(string) - the path part of the at:// URI - collection name + ID (rkey) of the itemuri
(string) - the complete at:// URIaction
(symbol) -:create
,:update
or:delete
cid
- CID of the operation/record (nil
for delete operations)
Create and update operations will also have an attached record (JSON object) with details of the post, like etc. The record data is currently available as a Ruby hash via raw_record
property (custom types will be added in future).
So for example, in order to filter only "create post" operations and print their details, you can do something like this:
sky.on_message do |m|
next if m.type != :commit
m.operations.each do |op|
next unless op.action == :create && op.type == :bsky_post
puts "#{op.repo}:"
puts op.raw_record['text']
puts
end
end
For more examples, see the example folder or the bluesky-feeds-rb project, which implements a feed generator service.
A note on custom lexicons: the Skyfall::Operation
objects have two properties that tell you the kind of record they're about: #collection
, which is a string containing the official name of the collection/lexicon, e.g. "app.bsky.feed.post"
; and #type
, which is a symbol meant to save you some typing, e.g. :bsky_post
.
When Skyfall receives a message about a record type that's not on the list, whether in the app.bsky
namespace or not, the operation type
will be :unknown
, while the collection
will be the original string. So if an app like e.g. "Skygram" appears with a zz.skygram.*
namespace that lets you share photos on ATProto, the operations will have a type :unknown
and collection names like zz.skygram.feed.photo
, and you can check the collection
field for record types known to you and process them in some appropriate way, even if Skyfall doesn't recognize the record type.
Do not however check if such operations have a type
equal to :unknown
first - just ignore the type and only check the collection
string. The reason is that some next version of Skyfall might start recognizing those records and add a new type
value for them like e.g. :skygram_photo
, and then they won't match your condition anymore.
Skyfall::Stream
sends a user agent header when making a connection. This is set by default to "Skyfall/0.x.y"
, but it's recommended that you override it using the user_agent
field to something that identifies your app and its author – this will let the owner of the server you're connecting to know who to contact in case the client is causing some problems.
You can also append your user agent info to the default value like this:
sky.user_agent = "NewsBot (@news.bot) #{sky.default_user_agent}"
Occasionally, especially during times of very heavy traffic, the websocket can get into a stuck state where it stops receiving any data, but doesn't disconnect and just hangs like this forever. To work around this, there is a "heartbeat" feature which starts a background timer, which periodically checks how much time has passed since the last received event, and if the time exceeds a set limit, it manually disconnects and reconnects the stream.
The option is not enabled by default, because there are some firehoses which will not be sending events often, possibly only once in a while – e.g. labellers and independent PDS firehoses – and in this case we don't want any heartbeat since it will be completely normal not to have any events for a long time. It's not really possible to detect easily if we're connecting to a full network relay or one of those, so in order to avoid false alarms, you need to enable this manually using the check_heartbeat
property.
You can also change the heartbeat_interval
, i.e. how often the timer is triggered (default: 10s), and the heartbeat_timeout
, i.e. the amount of time passed without events when it reconnects (default: 5 min):
sky.check_heartbeat = true
sky.heartbeat_interval = 5
sky.heartbeat_timeout = 120
Copyright © 2024 Kuba Suder (@mackuba.eu).
The code is available under the terms of the zlib license (permissive, similar to MIT).
Bug reports and pull requests are welcome 😎