| status | flip | authors | sponsor | updated |
| --- | --- | --- | --- | --- |
| Released (most of the important Engines have been updated) | 343 | Alex Hentschel (alex.hentschel@dapperlabs.com) | Alex Hentschel (alex.hentschel@dapperlabs.com) | 2023-06-06 |
FLIP to generalize the API through which the Networking Layer hands messages to Engines.
An Engine subscribes to the incoming messages from a channel by registering itself with the networking layer:
```go
type Network interface {
	Register(channel network.Channel, engine network.Engine) (network.Conduit, error)
}
```
The networking layer serves messages from the channel directly to the engine (-> code):
```go
// eng is the Engine for the respective channel
err = eng.Process(qm.SenderID, qm.Payload)
if err != nil {
	n.logger.Error().
		...
		Msg("failed to process message")
}
```
Note that we feed the Engine using the `Process` method:
```go
// Process processes the given event from the node with the given origin ID
// in a blocking manner. It returns the potential processing error when done.
Process(originID flow.Identifier, event interface{}) error
```
which is blocking by API specification.
Currently, the networking layer has a single priority queue for all engines it serves. It has a limited number of workers (e.g. 5) which it uses to serve the engines.
As the networking layer has no detailed context about the messages it transmits to the application layer, it cannot effectively decide which messages to drop, or which messages can be processed concurrently and which have to be processed in a blocking manner.
If one of the engines (e.g. Engine C) gets overwhelmed with a large number of messages, it will likely delay messages for all other engines, as the workers will soon all be blocked by the overwhelmed engine.
Furthermore, the overwhelmed engine cannot drop messages or change its behaviour depending on its backlog, as the message queue is not within its component.
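For illustration, here is a simplified sketch of this dispatch pattern. All names (`queuedMessage`, `serveEngines`, the `engines` map) are invented for this example and do not reflect the actual implementation:

```go
// queuedMessage is a hypothetical representation of an inbound message
// waiting in the networking layer's shared priority queue.
type queuedMessage struct {
	channel  string
	originID [32]byte
	payload  interface{}
}

// serveEngines spawns a fixed number of workers that all drain the same
// shared queue. Because each delivery is a blocking Process call, a single
// overwhelmed engine can eventually occupy every worker and thereby delay
// messages for all other channels.
func (n *Network) serveEngines(queue <-chan queuedMessage, workers int) {
	for i := 0; i < workers; i++ {
		go func() {
			for qm := range queue {
				eng := n.engines[qm.channel] // engine registered for the channel
				// blocking call: this worker is unavailable until Process returns
				err := eng.Process(qm.originID, qm.payload)
				if err != nil {
					n.logger.Error().Err(err).Msg("failed to process message")
				}
			}
		}()
	}
}
```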
The detailed semantics of the message types and their processing modalities are only known to the application layer. Overall, I think it would be beneficial to generalize the API through which the networking layer hands messages to the application layer. For each channel, the application layer should inject the desired message consumer.
Thereby, we could support the following desired use-cases:
- The application layer should have access to the queued messages, so it can drop stale messages and protect itself from getting overwhelmed.
- It should be the application layer's decision whether to process messages concurrently or one-by-one (depending on the message type).
- As engines might re-request entities, it would be beneficial if re-requesting could be temporarily suspended, if the engine has queued up a lot of messages.
- Generalize the API for the message sink from the viewpoint of the networking layer:

  ```go
  // MessageConsumer represents a consumer of messages for a particular channel from
  // the viewpoint of the networking layer. MessageConsumer implementations must handle
  // large message volumes and consume messages in a non-blocking manner.
  type MessageConsumer interface {
  	// Consume submits the given event from the node with the given origin ID
  	// to the MessageConsumer. Implementation must be non-blocking and
  	// internally queue (or drop) messages.
  	Consume(originID [32]byte, event interface{})
  }
  ```

  Note that the `MessageConsumer` has no error return, as opposed to the Engine's `Process` method. Motivation:
  - What is the networking layer supposed to do with an error? Currently, the networking layer just logs the error (-> code).
  - Best suited to handle occurring errors is the application layer (i.e. the `MessageConsumer`), which has the necessary context about message semantics and potential errors.
- Change the `Network`'s `Register` method to:

  ```go
  Register(channel network.Channel, consumer MessageConsumer) (network.Conduit, error)
  ```
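A hypothetical usage sketch of the changed API: an engine constructor registers the engine itself as the `MessageConsumer` for its channel. The constructor shape, field names, and error wrapping are illustrative, not the actual flow-go code:

```go
// Engine is a hypothetical engine implementing MessageConsumer.
type Engine struct {
	con network.Conduit // conduit is still used for outbound messages
}

// Consume must be non-blocking per the MessageConsumer spec
// (implementation elided here).
func (e *Engine) Consume(originID [32]byte, event interface{}) { /* enqueue or drop */ }

// NewEngine registers the engine as the MessageConsumer for its channel;
// the networking layer will call e.Consume for every inbound message.
func NewEngine(net network.Network, channel network.Channel) (*Engine, error) {
	e := &Engine{}
	conduit, err := net.Register(channel, e)
	if err != nil {
		return nil, fmt.Errorf("could not register message consumer: %w", err)
	}
	e.con = conduit
	return e, nil
}
```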
The goal of the proposed implementation is to transition to the new API in iterative steps.
There are numerous `Engine` implementations. For some, it might be a good opportunity to implement message queuing tailored to the specific message type(s) the Engine processes. For other Engines, a generic queueing implementation might be sufficient.
1. In the first step, we will only change the interface, but the implementation will remain functionally unchanged. Specifically, we:
   - Add the `MessageConsumer` interface to the `network` package.
   - Move the `Engine` interface out of the networking layer into the `module` package.
   - Keep the priority queue in the networking layer and the workers feeding the messages to the engines.
   - To each engine, we add the `Consume` method, which directly calls into the engine's `Process` method. The only other function of `Consume` would be to log potential error returns from `Process`.

   This first step would create technical debt, as the `MessageConsumer` implementations do not conform to the API specification: they block on `Consume`!
2. We implement a generic priority queue (maybe reusing the existing implementation from the networking layer).
   - We could consider adding a generic inbound queue implementation to `engine/Unit`, as most engines have a `Unit` instance.
3. Now, each `Engine` can be independently refactored to include its own inbound message queue(s), drop messages, or whatever fits best for its specific message type.
4. After all engines are compliant with the `MessageConsumer` interface (i.e. their `Consume` implementation is non-blocking), we can remove the priority queue from the networking layer.
**Step (1):** The implementation will remain functionally unchanged. The tech debt is cleaned up in subsequent steps.
- Add the `MessageConsumer` interface to the `network` package.
- Move the `Engine` interface out of the networking layer into the `module` package.
- To each `Engine` implementation, add the following `Consume` method:

  ```go
  // MessageConsumer represents a consumer of messages for a particular channel from
  // the viewpoint of the networking layer. MessageConsumer implementations must
  // handle large message volumes and consume messages in a non-blocking manner.
  //
  // Consume is called by the networking layer to hand a message from (the node with
  // the given origin ID) to the engine. It conforms to the MessageConsumer
  // interface. However, the current implementation violates the
  // MessageConsumer's API spec as it uses the networking layer's routine to
  // process the inbound message.
  // TODO: change implementation to non-blocking
  func (e *Engine) Consume(originID flow.Identifier, event interface{}) {
  	err := e.Process(originID, event)
  	if err != nil {
  		e.log.Error().
  			Err(err).
  			Hex("sender_id", logging.ID(originID)).
  			Msg("failed to process inbound message")
  	}
  }
  ```

  Note: The `MessageConsumer` implementations do not conform to the API specification, because they block on `Consume`! While the tech debt created here is comparatively small, it allows us to decouple and potentially parallelize updating the different engine implementations.
This would be a single issue.
**Step (2):** Implementation of a generic priority queue (maybe reusing the existing implementation from the networking layer). We could consider adding a generic inbound queue implementation to `engine/Unit`, as most engines have a `Unit` instance.
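As a sketch of what such a generic component could look like (all names and the drop-on-full policy are assumptions, not a decided design):

```go
// inboundMessage pairs an event with its sender.
type inboundMessage struct {
	originID [32]byte
	event    interface{}
}

// MessageQueue is a hypothetical bounded FIFO backing a non-blocking Consume.
type MessageQueue struct {
	messages chan inboundMessage
}

func NewMessageQueue(capacity int) *MessageQueue {
	return &MessageQueue{messages: make(chan inboundMessage, capacity)}
}

// Push never blocks: when the queue is full, the message is rejected and the
// caller (the engine's Consume) can decide to drop or log it.
func (q *MessageQueue) Push(originID [32]byte, event interface{}) bool {
	select {
	case q.messages <- inboundMessage{originID: originID, event: event}:
		return true
	default:
		return false
	}
}

// Run drains the queue on the engine's own worker routine, so slow message
// processing never blocks the networking layer.
func (q *MessageQueue) Run(process func(originID [32]byte, event interface{}) error, onError func(error)) {
	for m := range q.messages {
		if err := process(m.originID, m.event); err != nil {
			onError(err)
		}
	}
}
```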
This would be a single issue and could go on in parallel to step (1).
**Step (3):** Each `Engine` is refactored to include its own inbound message queue(s), drop messages, or whatever fits best for its specific message type (see the sketch below). For each individual `Engine` implementation, we have a dedicated issue:

- determine which message processing strategy (queue, drop, etc.) fits the engine's use case
- implementation

This step is blocked until steps (1) and (2) are completed. All issues from this step could theoretically be worked on in parallel.
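For example, a consensus-style engine might route different message types to different bounded queues. The message types, queue fields, and drop policies here are purely illustrative:

```go
// Consume is non-blocking: it only classifies the inbound message and pushes
// it into the engine-internal queue that fits the message type.
func (e *Engine) Consume(originID flow.Identifier, event interface{}) {
	switch ev := event.(type) {
	case *messages.BlockProposal:
		// proposals are valuable; use a larger queue and log drops loudly
		if !e.proposalQueue.Push(originID, ev) {
			e.log.Warn().Msg("dropping block proposal: inbound queue full")
		}
	case *messages.BlockVote:
		// votes become stale quickly; dropping under load is acceptable
		if !e.voteQueue.Push(originID, ev) {
			e.log.Debug().Msg("dropping vote: inbound queue full")
		}
	default:
		e.log.Error().Msgf("rejecting message with unexpected type %T", event)
	}
}
```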
**Step (4):** By completing step (3), all `Engine` implementations are compliant with the `MessageConsumer` interface (i.e. their `Consume` implementation is non-blocking).
We remove the priority queue from the networking layer.
This would be a single issue.
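Conceptually, inbound dispatch in the networking layer could then shrink to a direct hand-off. This is a sketch with invented names (`processInbound`, the `consumers` map), not the actual implementation:

```go
// processInbound hands a decoded message straight to the registered consumer.
// Since Consume is non-blocking by API contract, no shared priority queue or
// worker pool is needed anymore.
func (n *Network) processInbound(channel network.Channel, originID [32]byte, event interface{}) {
	consumer, ok := n.consumers[channel] // registered via Register(channel, consumer)
	if !ok {
		n.logger.Error().Msg("no message consumer registered for channel")
		return
	}
	consumer.Consume(originID, event) // returns immediately
}
```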
**Step (5):** Some buffer for technical complications and additional cleanup work.
We could remove the `network.Engine` interface and the `SubmitLocal` and `Process` functions that are copied in every engine to satisfy it. These are rarely used and only exist because `network.Engine` was originally created with more methods than it needed. The needed implementations of `SubmitLocal` or `Process` methods could be renamed with context-specific names (e.g. `engine.HandleBlock(block)`).
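As an illustrative sketch of such a renaming (the method name and internal queue are hypothetical):

```go
// HandleBlock replaces a generic Process implementation with a typed,
// context-specific entry point: no interface{} unpacking is needed, and the
// method name documents what the engine accepts.
func (e *Engine) HandleBlock(block *flow.Block) {
	e.pendingBlocks.Push(block) // hypothetical engine-internal inbound queue
}
```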