PVF validation host livelock #909

Open
pepyakin opened this issue Nov 16, 2021 · 0 comments
Labels
I2-bug The node fails to follow expected behavior.

Comments

@pepyakin
Contributor

There is a potential livelock in the PVF validation host code.

The livelock is triggered by the following sequence of events:

  1. There is a request to prepare a certain PVF.
  2. The preparation worker process dies.
  3. At approximately the same time, the pool receives a message that leads to a call of the purge_dead clean-up routine.
  4. This leads to a race between purge_dead and the I/O error originating from a read call on the Unix domain socket that connects the worker and the validation host. (Note that the race itself is not unforeseen and was an accepted part of the design.)
  5. If purge_dead wins, a rip message is sent back to the queue.
  6. The queue reacts by re-adding the message to the execution queue, optionally spawning an additional worker.
  7. When a worker is spawned or freed and picks up that job, the cycle starts all over again.
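
The cycle in the steps above can be sketched as a small deterministic model. This is not the validation host code: `RaceWinner`, `FromPool`, `Outcome` and `run_model` are hypothetical names made up for illustration. The model also assumes that when the I/O error wins the race the job is concluded with an error instead of being re-queued, which the description above does not state explicitly.

```rust
use std::collections::VecDeque;

/// Which side notices the dead worker first (hypothetical model).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RaceWinner {
    PurgeDead,
    IoError,
}

/// Messages the pool sends to the queue in this model (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum FromPool {
    /// The worker was reaped; nothing is said about the job it was running.
    Rip,
    /// ASSUMPTION: the I/O error path concludes the job with an error.
    ConcludedWithError,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The job left the queue after this many attempts.
    Concluded { attempts: u32 },
    /// The job was still being re-queued when the attempt budget ran out.
    StillCycling { attempts: u32 },
}

/// Runs one job through the model. `winner(n)` decides who wins the race
/// on the n-th attempt (1-based); in every attempt the worker dies.
pub fn run_model(winner: impl Fn(u32) -> RaceWinner, max_attempts: u32) -> Outcome {
    let mut queue = VecDeque::from(["prepare-pvf"]);
    let mut attempts = 0;
    while let Some(job) = queue.pop_front() {
        if attempts == max_attempts {
            return Outcome::StillCycling { attempts };
        }
        attempts += 1;
        // Steps 1-4: a worker picks up the job, dies, and the race happens.
        let msg = match winner(attempts) {
            RaceWinner::PurgeDead => FromPool::Rip,
            RaceWinner::IoError => FromPool::ConcludedWithError,
        };
        match msg {
            // Steps 5-7: only `Rip` arrives, so the job is re-added.
            FromPool::Rip => queue.push_back(job),
            // The job is dropped and the queue drains.
            FromPool::ConcludedWithError => {}
        }
    }
    Outcome::Concluded { attempts }
}
```

In this model, `run_model(|_| RaceWinner::PurgeDead, 1000)` never concludes the job within the budget, while a single attempt in which the I/O error wins breaks the cycle. The attempt budget exists only so that the model terminates.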

In short, three things need to happen to trigger it:

  1. The preparation worker dies.
  2. The preparation pool receives a message that triggers purge_dead within a narrow time window.
  3. On top of that, purge_dead wins the race against the read I/O error.

The first condition may not be easy to trigger, but it is possible: either the node itself is under heavy load (especially memory-wise), or an attacker has crafted a PVF that leads to panics in the preparation process.

The second and third conditions seem very unlikely. The preparation needs to be requested just in time: after the exploited worker dies, but before the kernel has notified the polkadot process that the pipe is closed and the async runtime has picked up that change.
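
The timing window can be written down as a simple predicate. This is a minimal sketch under assumptions: `Attempt` and `purge_dead_wins` are hypothetical names, and the timestamps are modelled as offsets from the moment the job was handed to the worker. It only restates the conditions above and is not taken from the host implementation.

```rust
use std::time::Duration;

/// Timestamps of one preparation attempt, as offsets from the moment
/// the job was handed to the worker (hypothetical model).
#[derive(Clone, Copy, Debug)]
pub struct Attempt {
    /// When the worker process died; `None` if it did not die.
    pub worker_died_at: Option<Duration>,
    /// When the pool handled a message that triggers `purge_dead`;
    /// `None` if no such message arrived.
    pub purge_triggered_at: Option<Duration>,
    /// When the async runtime observed the read I/O error on the socket.
    pub io_error_seen_at: Duration,
}

/// Returns true if all three conditions hold: the worker died, a
/// `purge_dead`-triggering message arrived after the death, and it was
/// handled before the I/O error was observed.
pub fn purge_dead_wins(a: &Attempt) -> bool {
    match (a.worker_died_at, a.purge_triggered_at) {
        (Some(died), Some(purge)) => died <= purge && purge < a.io_error_seen_at,
        _ => false,
    }
}
```

The window is the interval between `worker_died_at` and `io_error_seen_at`; the shorter the delay between the worker's death and the runtime noticing the closed pipe, the less likely a message is to land inside it.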

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added I2-bug The node fails to follow expected behavior. and removed I3-bug labels Aug 25, 2023
helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024