Handling (Network) Failures in LF Federation #1191

suyourice · 2022-05-24T07:05:36Z

Motivation

In current LF federated programs, when one federate disconnects or exits the federation unexpectedly, the entire federation stops working correctly. For example, the federation may stop advancing logical time because of the failed federate, or it may freeze entirely.
Thus, we want to make federated programs more resilient. Specifically, we want the federation to either keep advancing time or terminate gracefully when there is a network failure.
We believe there are many ways to achieve this. For example, we can wait for the disconnected federate to rejoin the federation, or we can make the RTI send stop request messages to the federates when there is an unexpected failure in one of the federates.

Goals

To make an LF federation continue to work or terminate gracefully even when there is a network failure.
Specifically, we want to add advanced handling for network failures in the RTI and the federates.
To explore various error handling options for dealing with network failures.
If a connection is lost, we can establish a new connection to repair it, for example by using another federate or a mutation function.
If we want to allow disconnected federates to rejoin the federation, we will also need to determine the correct logical time for the rejoining federates.

Non-Goals

We do not plan to redesign the current RTI-federate protocols.
We will not address authentication issues for federates.

Approach

We can use the ChatApplication example (https://github.com/lf-lang/examples-lingua-franca/blob/main/C/src/ChatApplication/SimpleChat.lf) to test the connection and reconnection cases.

List of Different Failures

1. A federate crashes (unexpected program termination).
2. A federate loses its network connection permanently.
3. A federate loses its network connection temporarily (while still executing) and later recovers it.
4. A federate experiences a significantly slower network connection (much higher latency).
   - We will probably need a threshold to decide when a connection counts as slow.
If you find any more cases, please leave a comment.
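To make the threshold idea concrete, here is a minimal sketch of how the RTI might classify a federate's connection. Everything in it is an assumption for illustration: the function name, the two measurements (time since the last reply and the latest round-trip time), and the default values of 0.5 s and 5 s are hypothetical and not part of the current RTI.

```python
from enum import Enum


class LinkState(Enum):
    HEALTHY = "healthy"
    SLOW = "slow"
    UNRESPONSIVE = "unresponsive"


def classify_link(last_reply_age_s, round_trip_s,
                  slow_threshold_s=0.5, timeout_s=5.0):
    """Classify a federate's connection from two hypothetical measurements.

    last_reply_age_s: seconds since the RTI last heard from the federate.
    round_trip_s: most recent measured round-trip time, or None if unknown.
    Both thresholds are placeholder values, not tuned numbers.
    """
    if last_reply_age_s >= timeout_s:
        # A crash, a permanent loss, and a temporary loss are
        # indistinguishable at this point: the federate is simply silent.
        return LinkState.UNRESPONSIVE
    if round_trip_s is not None and round_trip_s >= slow_threshold_s:
        return LinkState.SLOW
    return LinkState.HEALTHY
```

With these placeholder thresholds, `classify_link(0.1, 0.8)` reports a slow link and `classify_link(6.0, 0.02)` reports an unresponsive one. Choosing real values would require measurements from an actual federation.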

Implementation Details

We are considering the following possible error handling mechanisms (we will probably need an additional target property to select the error handling mechanism).

Stop Request Approach

target C {
    federate-failure-handling : StopRequest
}

Add advanced error handling code to the RTI so that, when a federate fails, the RTI sends stop request messages to the entire federation for graceful termination.
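A rough sketch of that behavior is shown below. It is a toy model, not the real RTI code: the class `RtiSketch`, the method `on_federate_failure`, and the `"STOP_REQUEST"` string are hypothetical names chosen for illustration, and messages are only recorded in a list rather than sent over a socket.

```python
class RtiSketch:
    """Toy model of the RTI's view of the federation (not the real RTI API)."""

    def __init__(self, federate_ids):
        self.connected = set(federate_ids)
        self.outbox = []  # (federate_id, message) pairs that would be sent

    def on_federate_failure(self, failed_id):
        """Drop the failed federate and ask every remaining one to stop."""
        if failed_id not in self.connected:
            return  # Already handled; avoid sending duplicate stop requests.
        self.connected.remove(failed_id)
        for fed_id in sorted(self.connected):
            self.outbox.append((fed_id, "STOP_REQUEST"))
```

For example, with federates 0, 1, and 2, a failure of federate 1 queues one stop request each for federates 0 and 2. A second failure report for the same federate is ignored.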

Waiting for Recovery Approach

target C {
    federate-failure-handling : WaitForRecovery
}

Make the RTI allow a disconnected federate to rejoin the federation. The RTI determines the proper logical time for the rejoining federate by asking the existing federates for their current logical time and asking the rejoining federate for its current physical time.
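How these times should be combined is still open. One possible rule, shown here purely as an assumption, is to take the latest of all reported times so that the rejoining federate never starts in the past of any other federate. The function name and the nanosecond unit are hypothetical.

```python
def rejoin_logical_time(existing_logical_times_ns, rejoining_physical_time_ns):
    """Pick a starting logical time (ns) for a rejoining federate.

    Assumption (not settled in this proposal): use the latest of all
    reported times so the rejoining federate never starts in the past
    of any federate that kept running.
    """
    candidates = list(existing_logical_times_ns) + [rejoining_physical_time_ns]
    return max(candidates)
```

Other rules may turn out to be more appropriate, for example adding a safety margin to cover the time the rejoin handshake itself takes.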

Hot Spare Approach

target C {
    federate-failure-handling : HotSpare
}

We are thinking of keeping a federate as a hot spare and using it as a replacement when a failure occurs. A hot spare could be more useful than a cold spare because we can keep it synchronized with the federation before the failure, so it can join the federation right away (without additional synchronization). A cold spare, by contrast, would need to synchronize when it joins the federation.
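The difference can be illustrated with a small toy model in which the spare mirrors the federation's logical time while the primary is healthy. The class and method names are hypothetical, and a real hot spare would also need to mirror reactor state, which this sketch leaves out.

```python
class HotSpareSketch:
    """Toy model: a spare tracks the federation's time so it can take over."""

    def __init__(self):
        self.active = "primary"
        self.spare_time_ns = 0  # Last logical time mirrored to the hot spare.

    def advance(self, logical_time_ns):
        # The spare is kept in step while the primary is healthy.
        self.spare_time_ns = logical_time_ns

    def on_primary_failure(self):
        """Promote the spare; return the logical time at which it resumes."""
        self.active = "spare"
        return self.spare_time_ns
```

Because the spare already holds the latest mirrored time, promotion needs no extra synchronization round. A cold spare would instead have to obtain that time after the failure.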

suyourice · 2022-05-24T07:06:42Z

The contents may be updated after more discussion.

hokeun · 2022-05-31T07:23:09Z

@Soroosh129, @lhstrh, when you get a chance, could you take a look at this issue by @suyourice, one of my grad students? It describes approaches for making the federation more resilient against (network) failures. Any feedback will be greatly appreciated!

Soroosh129 · 2022-05-31T15:53:18Z

I think this proposal is a great start! All the listed recovery methods sound good to me.

An important question that I feel is still unanswered is how failures are detected. I think detecting a failure might be as hard as recovering from it, if not harder. Note that failures 1 to 4 could all look the same to the RTI and/or other federates.

Also, it seems like the failure recovery methods are all concentrated on centralized coordination. What should happen in the case of decentralized coordination?

suyourice changed the title ~~Improving Resilience of RTI-federate connection~~ Handling Failures in LF Federation May 24, 2022

hokeun changed the title ~~Handling Failures in LF Federation~~ Handling (Network) Failures in LF Federation May 31, 2022
