RLY fails if one node is unavailable #1268
Comments
Thanks for opening this issue. I agree that restarting because one node is unreachable doesn't seem like desirable behavior. I'll discuss this internally and see how the team wants to prioritize it; I may be able to take this on in our next sprint.
Is there any update on this? Or is there any config that can be set to prevent it from restarting?
I started working on a PoC for this a while back but got pulled away to work on other things. Recently one of the engineers on our team revisited this issue but struggled to get the rly process to crash due to one node being unavailable. He said, "I am having a very difficult time figuring out how to make the Chain Processor error out and crash the application. Even if the chain is configured with an invalid node endpoint, it will just keep trying and trying; it never crashes. I've looked into the code, and as of right now, the only time it will fully error out is when there is a stuck packet that doesn't get resolved: relayer/relayer/chains/cosmos/cosmos_chain_processor.go, lines 486 to 496 in df42391."
@joelsmith-2019 does this sound correct? If we can confirm that this behavior is still present and we can replicate it locally in testing, then we should be able to find someone who can take this on sooner rather than later to get things refactored into a state that results in more desirable behavior.
@jtieri - Yes, that does sound correct.
This could be related to the number of chains and paths we relay. We faced the issue with 25 chains and 88 paths.
When a node is unreachable, the entire rly process restarts even though there are other channels being covered. It seems like that channel should be passed over, rather than the entire service being ended.