RFC: Shard refactoring and user control #1639

zeylahellyer · 2022-03-27T14:03:37Z

zeylahellyer
Mar 27, 2022
Maintainer

What?

Refactor the gateway shard implementation to remove background tasks and bring control points directly to the user. Rather than spawn a processor and its tasks for shards in the background and use channels to communicate back and forth with the "shard" interface we provide users, we can instead bring the processing directly into the user's hands by merging the shard processor with the shard interface users control.

Why?

Users currently don't programmatically receive many of the errors the shard processor may encounter: failure to deserialize events, abnormal WebSocket closures, channels closing, events that are out of sequence, and so on. These are logged (which has a history of causing grievances due to liberal logging levels), but can't cleanly be sent to users. Meanwhile, all that the shard processor can send users are the properly deserialized events it receives.

Users have little control over this: a shard can be used to tell the shard processor to shut down (with no real-time notification of when this happens, if it even does!), to request that gateway commands be sent over the WebSocket, and to request that the WebSocket connection be started, but there's little or no feedback on whether and when these take place. By moving this logic directly into the shard, we can support these operations with real-time feedback and errors and open up new possibilities of control, such as pausing processing and allowing shard restarts.

There are three forms of operations that can be done with shards: status, sending, and receiving. Status operations have an API like so:

pub async fn start(&self) -> Result<(), ShardStartError>;

pub fn shutdown(&self);

Meanwhile, sending looks like this:

pub async fn command(&self, value: &impl Command) -> Result<(), CommandError>;

and receiving like so:

let (_shard, mut events) = Shard::new(token, intents);

while let Some(event) = events.next().await {
    // work with event..
}

This is mostly "fine": start can provide the user an error if the shard couldn't be started; command can tell the user if it couldn't serialize the command or the shard wasn't started; and receiving events can tell the user when the event stream ends due to the shard processor no longer processing events. However, where we run into problems is that start only handles some errors, while most of the work that can error is passed off into the background by the shard processor. The story is similar with command and receiving events: you can infer some basic information, but you can't possibly know the full story.
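To make the limitation concrete, here is a minimal std-only sketch of the current shape. It assumes a hypothetical `spawn_processor` that stands in for the shard processor, and it uses a thread and `std::sync::mpsc` in place of the real async tasks and channels. `Raw`, `spawn_processor`, and `collect_events` are illustrative names, not items from the gateway crate.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical stand-in for a raw gateway payload.
pub enum Raw {
    Valid(&'static str),
    Malformed,
    AbnormalClose,
}

/// Hypothetical stand-in for the background shard processor: it forwards
/// only successfully "deserialized" events and has nowhere to send errors.
pub fn spawn_processor(input: Vec<Raw>) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        for raw in input {
            match raw {
                // The only thing that ever reaches the user.
                Raw::Valid(name) => {
                    if tx.send(name.to_owned()).is_err() {
                        return;
                    }
                }
                // Would be logged in the background; the user never sees it.
                Raw::Malformed => continue,
                // The processor stops, which closes the channel.
                Raw::AbnormalClose => return,
            }
        }
    });

    rx
}

/// Drain the event stream the way a user would today.
pub fn collect_events(input: Vec<Raw>) -> Vec<String> {
    // `iter` ends once the sender is dropped, with no reason attached.
    spawn_processor(input).iter().collect()
}
```

Feeding this a valid event, a malformed payload, another valid event, and then an abnormal close yields only the two valid events. The consumer sees the stream end, but cannot tell that a payload was skipped or why the stream stopped.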

How?

Much of the infrastructure supporting the separation of shards and their processors can be removed by consolidating shards. Here's what such an API may look like:

/// Create a new shard and connect to the gateway.
///
/// Returns errors for failure to retrieve or parse the gateway URL, TLS errors,
/// WebSocket handshake failures, and so on: everything needed to go from
/// nothing to "running."
async fn new(token: String, intents: Intents) -> Result<Self, ShardEstablishError>;

/// Wait for the next Discord event over the WebSocket.
///
/// Returns errors for WebSocket issues, Gateway session issues (e.g. token
/// invalidation, session invalidation, network drops), deserialization errors
/// or unknown events, and so on.
async fn next_event(&mut self) -> Option<Result<Event, ShardEventError>>;

/// Wait for the next raw WebSocket message.
///
/// Returns errors for WebSocket issues, Gateway session issues (e.g. token
/// invalidation, session invalidation, network drops), and so on.
async fn next_message(&mut self) -> Option<Result<Message, ShardMessageError>>;

/// Like the current `Shard::command` method, but returns all errors,
/// including WebSocket send errors.
async fn command(&mut self, command: &impl Command) -> Result<(), ShardCommandError>;

/// Create a sendable sink to pass around to other tasks.
async fn sink(&self) -> ShardSink;

/// Shut down the shard's connection to the gateway.
async fn shutdown(&mut self);

// so on...

At its core, this isn't fundamentally different in usage:

-let (shard, mut events) = Shard::new(token, intents);
+let mut shard = Shard::new(token, intents).await?;

-shard.start().await?;
-
-while let Some(event) = events.next().await {
+while let Some(maybe_event) = shard.next_event().await {
+    let event = match maybe_event {
+        Ok(event) => event,
+        Err(source) => {
+            // handle error...
+            continue;
+        }
+    };
+
    // work with the event...
}

Notably, this will make users handle a Result. We don't need to offload the decision of when to shut down or reconnect a shard to users; we can still make it ourselves. This Result is simply a way to present deserialization and processing errors to users, who can then decide to take their own action if they want to.
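The split of responsibilities described here can be sketched with a mock. This assumes a synchronous `next_event` (the proposal's is `async`), a pre-recorded queue instead of a WebSocket, and invented `ShardEventError` variants, since the RFC names the type but doesn't define it. The `reconnects` counter is likewise hypothetical and only marks where the shard would act on its own.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a deserialized Discord event.
#[derive(Debug, PartialEq)]
pub struct Event(pub String);

/// Hypothetical variants; the RFC does not define them.
#[derive(Debug, PartialEq)]
pub enum ShardEventError {
    Deserializing,
    SessionInvalidated,
}

/// Mock shard that yields pre-recorded results instead of reading a socket.
pub struct Shard {
    queue: VecDeque<Result<Event, ShardEventError>>,
    /// Counts the recovery decisions the shard made by itself.
    pub reconnects: u32,
}

impl Shard {
    pub fn new(items: Vec<Result<Event, ShardEventError>>) -> Self {
        Self { queue: items.into(), reconnects: 0 }
    }

    /// Synchronous stand-in for the proposed `async fn next_event`.
    pub fn next_event(&mut self) -> Option<Result<Event, ShardEventError>> {
        let item = self.queue.pop_front()?;

        // The shard still decides to reconnect by itself; the error is
        // surfaced to the caller for information, not delegated to them.
        if matches!(item, Err(ShardEventError::SessionInvalidated)) {
            self.reconnects += 1;
        }

        Some(item)
    }
}

/// User-side loop: work with events, observe errors, keep going.
pub fn run(shard: &mut Shard) -> (Vec<Event>, Vec<ShardEventError>) {
    let mut events = Vec::new();
    let mut errors = Vec::new();

    while let Some(maybe_event) = shard.next_event() {
        match maybe_event {
            Ok(event) => events.push(event),
            Err(source) => errors.push(source),
        }
    }

    (events, errors)
}
```

In this sketch the caller sees every error but is not required to react to any of them: the loop continues until the shard itself ends the stream by returning `None`.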

This has the benefit of removing a significant amount of what would now be cruft in the gateway crate: the Heartbeater can now be inlined into event calls; the Socket Forwarder and Emitter no longer need to exist as intermediary messaging layers; Sessions can now be cleaned up and no longer need to use channels and cells for propagation; and the implementations that do remain will end up being simpler to maintain and read.
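As a rough illustration of what "inlining the Heartbeater" could mean, the sketch below keeps the heartbeat deadline as plain state that the shard checks at the top of each event call, rather than in a separate background task. `Heartbeat`, `Action`, and `poll` are hypothetical names, and time is passed in explicitly to keep the example deterministic. A real async shard would more likely race a timer against the socket read so that heartbeats still fire while no messages arrive; this sketch only shows who owns the state.

```rust
use std::time::{Duration, Instant};

/// What the shard should do next on a given call.
#[derive(Debug, PartialEq)]
pub enum Action {
    /// A heartbeat is due and should be sent before reading further.
    SendHeartbeat,
    /// Nothing is due; carry on reading from the socket.
    Read,
}

/// Hypothetical heartbeat state held directly by the shard.
pub struct Heartbeat {
    interval: Duration,
    next_due: Instant,
}

impl Heartbeat {
    /// The first heartbeat is due one full interval after `now`.
    pub fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, next_due: now + interval }
    }

    /// Called at the top of every `next_event`-style call.
    pub fn poll(&mut self, now: Instant) -> Action {
        if now >= self.next_due {
            // Schedule the following heartbeat relative to this one.
            self.next_due = now + self.interval;
            Action::SendHeartbeat
        } else {
            Action::Read
        }
    }
}
```

Because the state lives in one struct behind `&mut self`, no channel or shared cell is needed to tell another task when the last heartbeat went out.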

One important detail is that we can increase our bus factor on the gateway. The gateway has historically been difficult to understand, read, and document, due to the layers of separation between its individual components. It's difficult to understand where and how the Socket Forwarder communicates with the Shard Processor, and how the chain of logic propagates down to end users. By implementing this change, we can remove all of these layers and end up with only the Shard implementation.

When?

There is nothing blocking this. I have already started on an implementation. Refactoring the gateway and its internal components, like the socket forwarder and the heartbeater, has been a long-term goal of mine as a way to clean up and document the shard processor as a whole. This is a way to do all of that in one fell swoop.

7596ff · 2022-03-28T22:09:18Z

7596ff
Mar 28, 2022

I like all of this except for next_message. I think it should be up to users to consume events as events and determine whether an event is a message. Using both next_event and next_message in a project would create confusion as to whether message events are in both queues, cause race conditions when updating the cache, and so on.

1 reply

zeylahellyer Mar 28, 2022
Maintainer Author

next_message refers to a WebSocket message here, while next_event refers to a message containing a Discord event. I'll update the documentation to be more precise.

Erk- · 2022-03-29T15:29:38Z

Erk-
Mar 29, 2022
Maintainer

One issue I can see is that command taking &mut self will make it hard to pass the shard around to other threads or similar, which would mean that the sink must be used in most cases.

0 replies
