Fix muxer that can be stuck in an infinite loop #2681

Merged 11 commits into main on Mar 21, 2023

Conversation

elmattic (Contributor) commented Mar 17, 2023

Summary of changes

Changes introduced in this pull request:

  • Fix muxer that can go into an infinite loop during shutdown

Reference issue to close (if applicable)

Closes #2672

Commit 82ca23c runs the Muxer future separately to reproduce the issue.

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

elmattic marked this pull request as ready for review March 17, 2023 15:47
lemmih (Contributor) commented Mar 17, 2023

Could you explain the problem and how this fixes it?

elmattic (Contributor, Author) commented Mar 17, 2023

When creating the daemon's chain_muxer task, we give it the network_rx channel created just before.

The root cause of the problem lies in the order in which those tasks are cancelled during tokio runtime shutdown.

If the chain_muxer is cancelled first, i.e. before p2p_service, everything is fine. Otherwise, the network_rx channel is closed and an error can be received in the ChainMuxer future.

My fix consists of catching this error and returning Poll::Pending to avoid looping continuously. Maybe we should also check for the string "receiving on a closed channel", just to be sure.
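
A minimal sketch of the idea, with hypothetical names and a plain tokio mpsc channel standing in for the channel type actually used in Forest (the real fix inspects the error instead): when polling the network receiver reports that the channel has been closed, the future parks itself with Poll::Pending rather than treating the error as recoverable and polling again.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

use tokio::sync::mpsc;

// Hypothetical event type standing in for the real network events.
struct NetworkEvent;

// Hypothetical stand-in for the ChainMuxer future.
struct ChainMuxer {
    network_rx: mpsc::Receiver<NetworkEvent>,
}

impl Future for ChainMuxer {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        loop {
            match self.network_rx.poll_recv(cx) {
                // Normal path: handle the event, then poll for the next one.
                Poll::Ready(Some(_event)) => continue,
                // The sending side (owned by p2p_service) was dropped, e.g.
                // because it was cancelled first during runtime shutdown.
                // Looping on this "error" would spin forever; returning
                // Pending parks the task until tokio cancels it.
                Poll::Ready(None) => return Poll::Pending,
                // No event yet; the waker is registered, so just yield.
                Poll::Pending => return Poll::Pending,
            }
        }
    }
}
```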

LesnyRumcajs (Member) commented

So this won't loop at all now and will exit? Or will it continue running, just without the error logs?

elmattic (Contributor, Author) commented

> So this won't loop at all now and will exit? Or will it continue running, just without the error logs?

This hands control back to the tokio runtime, and since the task has been cancelled it will then exit.

lemmih (Contributor) commented Mar 20, 2023

I still have a hard time understanding both the problem and the solution. Could you explain it in a simpler way and then add that explanation to the code?

lemmih (Contributor) commented Mar 20, 2023

This seems rather fragile to me. What if something fails with a different message? Can't we fix the underlying issue? Why can any error throw us into an infinite loop?

elmattic (Contributor, Author) commented

> I still have a hard time understanding both the problem and the solution. Could you explain it in a simpler way and then add that explanation to the code?

I've added a comment and a check for the specific error string as well. I'm not sure it's the best approach, but I'm open to a better fix.

lemmih (Contributor) commented Mar 20, 2023

> > I still have a hard time understanding both the problem and the solution. Could you explain it in a simpler way and then add that explanation to the code?
>
> I've added a comment and a check for the specific error string as well. I'm not sure it's the best approach, but I'm open to a better fix.

It looks to me that the root of the problem is that we cannot presently tell the difference between incorrect input (such as malformed tipsets) and fatal errors (such as a DB error or a closed channel). Right now, we treat all errors as recoverable and enter a tight, infinite loop if the errors are persistent.

I propose we bail on errors by default and then keep a white-list of recoverable errors (such as receiving bad data from peers). This would be a band-aid solution until we can re-write the muxer code.
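
One possible sketch of that proposal (hypothetical names, not the actual Forest code): make the muxer's error type distinguish recoverable from fatal failures, and keep looping only for the whitelisted recoverable cases.

```rust
use std::ops::ControlFlow;

// Hypothetical error classification for the muxer's event loop.
enum MuxerError {
    // Bad data received from a peer (e.g. a malformed tipset): safe to skip.
    BadPeerInput(String),
    // Everything else (closed channel, database failure, ...) is fatal.
    Fatal(String),
}

impl MuxerError {
    fn is_recoverable(&self) -> bool {
        matches!(self, MuxerError::BadPeerInput(_))
    }
}

// Called from the event loop: recoverable errors are logged and skipped,
// while anything else breaks out of the loop so the task terminates instead
// of spinning forever on a persistent failure.
fn on_error(err: MuxerError) -> ControlFlow<()> {
    if err.is_recoverable() {
        eprintln!("ignoring recoverable muxer error");
        ControlFlow::Continue(())
    } else {
        ControlFlow::Break(())
    }
}
```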

LesnyRumcajs (Member) left a comment


I need this.

lemmih (Contributor) left a comment


Let's see if it causes any problems. It might bail too eagerly on transient errors.

elmattic merged commit 0ccda6d into main Mar 21, 2023
elmattic deleted the elmattic/fix-muxer-future branch March 21, 2023 16:49
aatifsyed mentioned this pull request Oct 11, 2023
Successfully merging this pull request may close these issues: Forest chain muxer error.