-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Fix ProcessNewBlock vs EnforceBestChainLock deadlocks in ActivateBestChain #3492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
PastaPastaPasta
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK, looks good to me... Would like some others to look at this too, @codablock @xdustinface
|
Looks like this is not enough unfortunately, see #3476 (comment). Need some better solution. Closing. |
|
can you rebase on develop so travis is happy? https://travis-ci.com/github/PastaPastaPasta/dash/builds/167488704 |
|
Resurrecting this. The latest commit fixed the second deadlock it seems, my nodes are running fine for 4+ days now. |
|
Looks also good so far here on my testnet node.. it had it running into the deadlock with the latest |
src/validation.cpp
Outdated
| boost::this_thread::interruption_point(); | ||
|
|
||
| if (GetMainSignals().CallbacksPending() > 10) { | ||
| if (fSyncQueue && GetMainSignals().CallbacksPending() > 10) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also codewise it looks overall good to me. Though i wonder if it now even makes any sense to skip the queue sync if ActivateBestChain gets called by CL/IS with CL having its own scheduler? Wasn't the issue that CL put ActivateBestChain (over edges) in the validation interface scheduler's queue and that basically blocked itself by waiting for the queue it runs from to be done with all tasks?
@UdjinM6 please correct me if im wrong and explain me your full understanding of the deadlock as i was investigating a lot here how exactly it happens and i just want to fully get it 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are two different cases:
1.
a) CL: LOCK->SyncWithValidationInterfaceQueue
b) PeerLogicValidation: (dead)LOCK
(b) (which is executed as a part of the queue flushing) is waiting for (a) to release the lock which will never happen because (a) is waiting for (b). Even if (a) would be in another scheduler it wouldn't matter, the lock would be held anyway because queue processing is blocked. Skipping SyncWithValidationInterfaceQueue lets (a) to finish and release the lock.
a) PeerLogicValidation: LOCK->SyncWithValidationInterfaceQueue
b) CL: (dead)LOCK
In this case (b) (executed as a regular scheduler job) blocks the scheduler itself not just the queue, it simply can't process anything at this point. Using another scheduler for (b) fixes this because now (b) blocks some another thread and the original scheduler can process all the tasks and release the lock which in its turn lets (b) to continue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. soo.. if i look at the two cases you commented in #3476 it looks for me like the queue skipping just fixed one deadlock and introduced a new one.
Lets assume we have:
- Thread 1 >
A> CConnman::ThreadMessageHandler - Thread 2 >
B> Validation Interface Scheduler Thread - Thread 3 >
C> New ChainLocks Scheduler Thread - Thread 4 >
D> CInstantSendManager::WorkThreadMain - Thread 5 >
E> CSigSharesManager::WorkThreadMain
This are the two wait conditions involved here
Line 2948 in 0662f17
| LOCK(cs_activateBestChain); |
dash/src/validationinterface.cpp
Line 165 in 0662f17
| promise.get_future().wait(); |
Case 1
Relates to #3476 (comment)
1a) B runs ActivateBestChain it locks cs_activateBestChain and then since there are more than 10 tasks in its queue it adds another one and waits for this other one to finish with: promise.get_future().wait(). Which can never happen because it waits for itself.
1b) Now A gets a new block message, wants to process it and therefor also calls ActivateBestChain and waits for cs_activateBestChain to release which won’t happen because of the above..
Case 2 / Skipping the sync for CL code
Relates to #3476 (comment)
If we skip the sync now Case 1 can’t longer happen because in 1a) B will lock cs_activateBestChain, skips the sync (and with this it skips the wait on yourself), runs the rest of ActivateBestChain and then it releases cs_activateBestChain which means A will not wait forever for cs_activateBestChain.. This gives us just the new deadlock below:
2a) A gets a new block, wants to process it and calls ActivateBestChain. It locks cs_activateBestChain and since there are more than 10 tasks in the queue of B it adds another task to the queue of B and waits for this other task to finish with promise.get_future().wait().
2b) B which has at this point more than 10 tasks (one of them is ActivateBestChain which it got from the CL code with sync skipping) processes its tasks until it comes to the call of ActivateBestChain where it then waits for cs_activateBestChain to release whihc won't happen because A in 2a) locked it.
Fix the deadlock / Adding Thread C
Soo.. the actual reason for the deadlock is IMO only calling ActivateBestChain from B. Means just adding C should be the fix for the initial deadlock and then deadlock 2 will never be introduced.
And this is a problem now because we have the “Let Thread B” catch up thing here introduced in #3376
Lines 2956 to 2961 in 0662f17
| if (GetMainSignals().CallbacksPending() > 10) { | |
| // Block until the validation queue drains. This should largely | |
| // never happen in normal operation, however may happen during | |
| // reindex, causing memory blowup if we run too far ahead. | |
| SyncWithValidationInterfaceQueue(); | |
| } |
which leads to
X waiting for B if X calls ActivateBestChain and there are more than 10 queue entries (which is why the deadlock happened randomly and more often on the mainnet it seems) and therefor B waiting for itself if ActivateBestChain gets called by B .
So i would say having this "skip sync" code included makes no sense anymore because if ActivateBestChain gets called from instantsend code its called by D or E and if its called by CL its called by C. So there should be no more deadlock, not?
Does this make sense? Note: I'm tired now so i hope i didn't miss any other crazy magic 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand now, thanks 👍 Reverted, let's see :)
|
ran |
|
two things:
(Maybe, CLion suggests this, but I'm not sure if this is something we want / follow) prefer lambda over std::bind PastaPastaPasta@dffb262 |
|
@PastaPastaPasta I was just following the surrounding code and usage instructions in https://github.com/dashpay/dash/blob/develop/src/scheduler.h#L23-L34. Scheduler was updated to using |
|
Okay, understand |
|
Possible problem here? https://travis-ci.com/github/PastaPastaPasta/dash/jobs/339530126 Maybe this is a different problem / ci issue edit: and another https://travis-ci.com/github/PastaPastaPasta/dash/jobs/339530126 |
…hen called from IS or CL threads" This reverts commit 1c9f6da.
PastaPastaPasta
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK, ci still fails rarely, however, this is a definite improvement and appears to fix the deadlock, even if simplepose still fails sometimes.
|
re: llmq-simplepose.py fails about 1 in 100 times for me locally |
…Chain (dashpay#3492) * Drop dead code in DoInvalidateBlock * Let ActivateBestChain skip SyncWithValidationInterfaceQueue when called from IS or CL threads * Use CL's own scheduler instead of a global one * Revert "Let ActivateBestChain skip SyncWithValidationInterfaceQueue when called from IS or CL threads" This reverts commit 1c9f6da.
It should be ok to skip
SyncWithValidationInterfaceQueuewhen called from CL/IS cause it will be triggered wheneverActivateBestChainis called by smth else (e.g. by a block received from p2p) and we don't expect CL/IS to reorg the chain this much anyway. Not a very elegant solution but should do the job I think.Note: 0b3183c?w=1 is just a cleanup to avoid fixing smth that is never really used and 2e136a14e32b4c3898d30de2cb4de915c27d7f34 also unifies the way we handle
ActivateBestChainfailures in IS/CL.Fixes #3476