EnTK messaging system #527

iparask · 2020-12-22T21:46:02Z

EnTK currently uses RabbitMQ, through pika, to move messages such as state transitions and heartbeats between its components. RabbitMQ is a messaging system built around a broker server: all producers and consumers connect to that server to communicate. A major feature of RabbitMQ is message persistence. It can guarantee that no message is lost because it stores messages on disk. Relying on a single server introduces some limitations. First, it requires administering a secure server for the messages. Second, it depends heavily on the network capabilities of that server.

ZeroMQ is a serverless (brokerless) messaging system. Endpoints connect to each other directly through TCP connections. Because there is no server, there is no guarantee about message delivery: if one of the endpoints fails, messages are lost.

Apart from the main differences in how messages are handled, the two systems also differ in performance, with ZeroMQ generally being the faster of the two.

All EnTK components are running on the same machine, and if one fails due to issues from the host, all will fail. Furthermore, the acknowledgment capabilities of RabbitMQ are currently disabled. As a result, we are not fully utilizing the capabilities of RabbitMQ, and we are paying a maintenance cost and probably a performance cost. The performance cost may matter more when the VM hosting RabbitMQ has network issues or when the resource executing EnTK is not very close to it. Also, EnTK does not use any of the scaling capabilities of RabbitMQ, where multiple consumers are connected to a queue and share the work on its messages.
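To make the acknowledgment point concrete, here is a minimal sketch of what disabling acknowledgments means for a consumer that crashes mid-message. It is a toy in-memory model, not the RabbitMQ or pika API: `ToyQueue`, `run`, and the state names are hypothetical and only illustrate the delivery semantics.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a broker queue (hypothetical, not the RabbitMQ API)."""

    def __init__(self, auto_ack):
        self.auto_ack = auto_ack
        self.ready = deque()   # messages waiting for delivery
        self.unacked = {}      # delivery tag -> message awaiting an ack
        self._next_tag = 0

    def publish(self, message):
        self.ready.append(message)

    def deliver(self):
        """Hand the next message to a consumer and return (tag, message)."""
        message = self.ready.popleft()
        self._next_tag += 1
        if not self.auto_ack:
            # Keep a copy until the consumer confirms it finished the work.
            self.unacked[self._next_tag] = message
        return self._next_tag, message

    def ack(self, tag):
        self.unacked.pop(tag, None)

    def consumer_died(self):
        """Requeue, in original order, whatever the dead consumer never acked."""
        for tag in sorted(self.unacked, reverse=True):
            self.ready.appendleft(self.unacked.pop(tag))


def run(auto_ack):
    """Deliver three state updates; the consumer crashes once, on the second."""
    queue = ToyQueue(auto_ack=auto_ack)
    for state in ("SCHEDULED", "EXECUTING", "DONE"):
        queue.publish(state)

    processed = []
    crashed = False
    while queue.ready:
        tag, state = queue.deliver()
        if state == "EXECUTING" and not crashed:
            crashed = True
            queue.consumer_died()  # crash before processing or acking
            continue
        processed.append(state)
        queue.ack(tag)
    return processed


if __name__ == "__main__":
    print("acks disabled:", run(auto_ack=True))
    print("acks enabled: ", run(auto_ack=False))
```

With acknowledgments disabled, the update that was in flight during the crash is gone. With them enabled, the queue hands it out again. That redelivery is the part of RabbitMQ we are currently not using.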

I propose to change EnTK's messaging system from RabbitMQ to ZeroMQ. This change will remove the dependency on a single endpoint and the need to maintain a RabbitMQ server. It will also bring EnTK closer to RP's codebase, where ZeroMQ is already used.

Interesting paper comparing ZMQ and RMQ (here in Spanish)

mturilli · 2020-12-23T00:59:15Z

Maintainer

> All EnTK components are running on the same machine,

They might be, but not necessarily. The separation of EnTK's design into stateless components and the use of RabbitMQ were dictated by the following requirements:

We may have to execute multiple instances of the same component for scaling/performance reasons.
Components may have to execute on different machines.
RabbitMQ must not execute on a login node.

We should probably discuss what requirements we should relax and what design changes we may derive from relaxing them.

> and if one fails due to issues from the host, all will fail.

Just to make sure we agree: the above is only one of the many ways in which EnTK's components can fail. The initial requirements covered managing the following:

1. Failure of one of the stateless components (all but AppManager) should be recoverable, and it should be guaranteed (as in, by design) that no task state is lost due to the failure.
2. Failure of the stateful component (AppManager) should be fatal, but restarting the application should guarantee that no task state is lost.

Note that the above does not mandate recovering resource states, something we may now want to consider. Also, the second requirement was never satisfied.
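As an illustration of what "no task state is lost after a restart" could look like without a broker, here is a minimal sketch of an append-only journal that is replayed on startup. This is an assumption made for discussion, not how EnTK works today: `StateJournal`, the file name, and the task IDs are all hypothetical.

```python
import json
import os
import tempfile


class StateJournal:
    """Append-only log of task state transitions (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def record(self, task_id, state):
        """Persist one transition before it is announced to other components."""
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps({"task": task_id, "state": state}) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # force the entry to disk before returning

    def replay(self):
        """Rebuild the latest known state of every task from the log."""
        states = {}
        if not os.path.exists(self.path):
            return states
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                entry = json.loads(line)
                states[entry["task"]] = entry["state"]  # later entries win
        return states


def demo():
    """Record a few transitions, then recover them as a restarted process would."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "task_states.jsonl")
        journal = StateJournal(path)
        journal.record("task.0000", "SCHEDULED")
        journal.record("task.0001", "SCHEDULED")
        journal.record("task.0000", "DONE")
        # Simulated restart: a fresh object with no in-memory state.
        return StateJournal(path).replay()


if __name__ == "__main__":
    print(demo())
```

The trade-off is the same one RabbitMQ makes with persistence: every transition costs a disk write, in exchange for being able to rebuild state after a failure.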

> Furthermore, the acknowledgment capabilities of RabbitMQ are currently disabled. As a result, we are not fully utilizing the capabilities of RabbitMQ, and we are paying a maintenance cost and probably a performance cost. The performance cost may matter more when the VM hosting RabbitMQ has network issues or when the resource executing EnTK is not very close to it. Also, EnTK does not use any of the scaling capabilities of RabbitMQ, where multiple consumers are connected to a queue and share the work on its messages.

There was another benefit of using RabbitMQ, one that ultimately determined the choice over ZeroMQ (which was indeed considered in the initial design phase): coding ease and time to market. ZeroMQ requires lower-level programming than RabbitMQ, with many more protocol details to code and many more pitfalls in terms of race conditions and protocol implementation. Given the time and resources, I would agree that ZeroMQ would be preferable to RabbitMQ because of the limitations you mentioned above.

> I propose to change EnTK's messaging system from RabbitMQ to ZeroMQ. This change will remove the dependency on a single endpoint and the need to maintain a RabbitMQ server. It will also bring EnTK closer to RP's codebase, where ZeroMQ is already used.

In principle I would agree, but we need an estimate of the effort, a rapid preliminary prototype of some of the key features outside EnTK, and a plausible roadmap. I would also weigh all of that against the need to (in order of importance):

1. Implement multi-level fault tolerance and resilience.
2. Separate workflow manager and workflow engine, moving the latter to the target resources (i.e., HPC compute nodes but also cloud resources).
3. Implement a connector subsystem to enable using arbitrary runtime systems.

> Interesting paper comparing ZMQ and RMQ (here in Spanish)
