Handling (Network) Failures in LF Federation #1191

Open
suyourice opened this issue May 24, 2022 · 3 comments

@suyourice
Contributor

suyourice commented May 24, 2022

Motivation

  • In current LF federated programs, when one federate is disconnected or exits the federation unexpectedly, the entire federation stops working correctly. For example, the federation stops advancing time because of the failed federate, or the whole federation simply freezes.
  • Thus, we want to make federated programs more resilient. Specifically, we want the federation to either keep advancing time or terminate gracefully when there is a network failure.
  • We believe there are many ways to achieve this. For example, we can wait for the disconnected federate to rejoin the federation, or we can make the RTI send stop request messages to the federates when there is an unexpected failure in one of the federates.

Goals

  • To make an LF federation continue to work or terminate gracefully even when there is a network failure.
  • Specifically, we want to add advanced handling for network failures in the RTI and the federates.
  • To explore various error handling options for dealing with network failures.
  • If a connection is lost, we can establish a new one to recover, for example by using another federate or a mutation function.
  • If we want to allow a disconnected federate to rejoin the federation, we will also need to determine the correct logical time for the rejoining federate.

Non-Goals

  • We do not plan to redesign the current RTI-federate protocols.
  • We will not address authentication issues for federates.

Approach

List of Different Failures

  1. A federate crashes (unexpected program termination).
  2. A federate loses its network connection permanently.
  3. A federate loses its network connection temporarily (while still executing) and later recovers it.
  4. A federate experiences a significantly slower network connection (much higher latency).
    • We will probably need a threshold to decide when a connection counts as slow.
  5. If you know of any other cases, please leave a comment.
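
For case 4, one way to apply a threshold is to smooth the measured round-trip times and compare the smoothed value against a configured limit, so that a single slow message does not trigger failure handling. The following is a minimal sketch under that assumption. The type `latency_tracker_t`, both function names, and the 1/8 smoothing weight are hypothetical and are not part of the LF runtime or the RTI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical per-connection latency tracker (not part of the LF runtime).
typedef struct {
    int64_t smoothed_rtt_ns;  // Smoothed round-trip time in nanoseconds.
    bool has_sample;          // False until the first sample arrives.
} latency_tracker_t;

// Fold a new round-trip sample into the smoothed estimate.
// The estimate moves 1/8 of the way toward each new sample
// (integer arithmetic, so the result is truncated).
void latency_tracker_update(latency_tracker_t* t, int64_t sample_ns) {
    assert(sample_ns >= 0);
    if (!t->has_sample) {
        t->smoothed_rtt_ns = sample_ns;
        t->has_sample = true;
    } else {
        t->smoothed_rtt_ns += (sample_ns - t->smoothed_rtt_ns) / 8;
    }
}

// Report whether the smoothed latency exceeds the configured threshold.
// A connection with no samples yet is never reported as slow.
bool latency_tracker_is_slow(const latency_tracker_t* t, int64_t threshold_ns) {
    return t->has_sample && t->smoothed_rtt_ns > threshold_ns;
}
```

How the round-trip samples would be collected (for example, from timestamps on existing RTI messages) is left open here, and the threshold itself would presumably need to be configurable per federation.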

Implementation Details

  • We are considering the following possible error handling mechanisms. We will probably need an additional target property to select the mechanism.
  1. Stop Request Approach
target C {
    federate-failure-handling : StopRequest
}
  • Add advanced error handling code to the RTI so that, when a federate fails, the RTI sends stop request messages to the entire federation for graceful termination.
  2. Waiting for Recovery Approach
target C {
    federate-failure-handling : WaitForRecovery
}
  • Make the RTI allow a disconnected federate to rejoin the federation. The RTI would determine the proper logical time for the rejoining federate by asking the existing federates for their current logical times and asking the rejoining federate for its current physical time.
  3. Hot Spare Approach
target C {
    federate-failure-handling : HotSpare
}
  • We are thinking of keeping a federate as a hot spare and using it as a replacement when a failure occurs. A hot spare could be more useful than a cold spare because it can be kept synchronized with the federation before the failure, so it can take over right away without additional synchronization. A cold spare, in contrast, would need to synchronize when it joins the federation.
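
To make the options above concrete, here is a rough sketch of how an RTI might map the proposed `federate-failure-handling` value to an action, and one possible rule for choosing the logical time of a rejoining federate. Everything in it is an assumption for discussion: the enums, the function names, the fallback to a stop request when no spare is available, and the rule "take the latest of the existing federates' logical times and the rejoiner's physical time" are hypothetical and do not exist in the current RTI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical mirror of the proposed federate-failure-handling values.
typedef enum { STOP_REQUEST, WAIT_FOR_RECOVERY, HOT_SPARE } failure_policy_t;

// Hypothetical actions the RTI could take once a failure is detected.
typedef enum {
    ACTION_BROADCAST_STOP,  // Send stop requests to the remaining federates.
    ACTION_AWAIT_REJOIN,    // Keep the slot open for the failed federate.
    ACTION_ACTIVATE_SPARE   // Swap in the already synchronized hot spare.
} rti_action_t;

// Choose an action for a detected failure under the configured policy.
rti_action_t action_for_failure(failure_policy_t policy, int spare_available) {
    switch (policy) {
        case WAIT_FOR_RECOVERY:
            return ACTION_AWAIT_REJOIN;
        case HOT_SPARE:
            // Assumption: without a spare, fall back to graceful termination.
            return spare_available ? ACTION_ACTIVATE_SPARE : ACTION_BROADCAST_STOP;
        case STOP_REQUEST:
        default:
            return ACTION_BROADCAST_STOP;
    }
}

// Assumed rule (one of several possible): the rejoining federate starts at
// the latest of the existing federates' current logical times and its own
// current physical time, so it never joins behind the federation.
int64_t rejoin_logical_time(const int64_t* federate_times, size_t n,
                            int64_t rejoiner_physical_time) {
    assert(federate_times != NULL || n == 0);
    int64_t result = rejoiner_physical_time;
    for (size_t i = 0; i < n; i++) {
        if (federate_times[i] > result) {
            result = federate_times[i];
        }
    }
    return result;
}
```

Whether taking the maximum is the right rule depends on the coordination semantics. It is shown only to illustrate which inputs the RTI would need to collect.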
suyourice changed the title from "Improving Resilience of RTI-federate connection" to "Handling Failures in LF Federation" on May 24, 2022
@suyourice
Contributor Author

The contents may be updated after more discussion.

hokeun changed the title from "Handling Failures in LF Federation" to "Handling (Network) Failures in LF Federation" on May 31, 2022
@hokeun
Member

hokeun commented May 31, 2022

@Soroosh129, @lhstrh, could you take a look at this issue when you get a chance? It was written by @suyourice, one of my grad students, and describes approaches for making the federation more resilient against (network) failures. Any feedback will be greatly appreciated!

@Soroosh129
Contributor

Soroosh129 commented May 31, 2022

I think this proposal is a great start! All the listed recovery methods sound good to me.

An important question that I feel is still unanswered is how failures are detected. I think detecting a failure might be as hard as, if not harder than, recovering from it. Note that failures 1 to 4 could all look the same to the RTI and/or other federates.

Also, it seems like the failure recovery methods are all concentrated on centralized coordination. What should happen in the case of decentralized coordination?
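
On the detection question, a common starting point is a timeout on the last message heard from each federate. The sketch below is illustrative only and assumes the RTI sees some message (or an explicit heartbeat) from every live federate at least once per timeout period. The names `failure_detector_t`, `detector_init`, `detector_record_message`, `detector_is_suspected`, and `MAX_FEDERATES` are hypothetical. As noted above, a timeout like this can only mark a federate as suspected: it cannot tell a crash from a permanent disconnect, a temporary disconnect, or a slow link.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_FEDERATES 16  // Hypothetical upper bound for this sketch.

// Hypothetical timeout-based failure detector kept by the RTI.
typedef struct {
    int64_t last_heard_ns[MAX_FEDERATES];  // Physical time of last message.
    int64_t timeout_ns;                    // Silence allowed before suspicion.
    int num_federates;
} failure_detector_t;

// Start the detector, treating every federate as heard from at now_ns.
void detector_init(failure_detector_t* d, int num_federates,
                   int64_t timeout_ns, int64_t now_ns) {
    assert(num_federates > 0 && num_federates <= MAX_FEDERATES);
    d->num_federates = num_federates;
    d->timeout_ns = timeout_ns;
    for (int i = 0; i < num_federates; i++) {
        d->last_heard_ns[i] = now_ns;
    }
}

// Record that any message (or heartbeat) arrived from a federate.
void detector_record_message(failure_detector_t* d, int fed_id, int64_t now_ns) {
    assert(fed_id >= 0 && fed_id < d->num_federates);
    d->last_heard_ns[fed_id] = now_ns;
}

// A federate is suspected once it has been silent for longer than the timeout.
bool detector_is_suspected(const failure_detector_t* d, int fed_id, int64_t now_ns) {
    assert(fed_id >= 0 && fed_id < d->num_federates);
    return now_ns - d->last_heard_ns[fed_id] > d->timeout_ns;
}
```

Choosing the timeout is the hard part: too short and a slow federate (case 4) is treated as failed, too long and the federation stalls before any recovery starts.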
