Fix ProcessNewBlock vs EnforceBestChainLock deadlocks in ActivateBestChain #3492

UdjinM6 · 2020-05-17T08:51:28Z

It should be ok to skip SyncWithValidationInterfaceQueue when called from CL/IS cause it will be triggered whenever ActivateBestChain is called by smth else (e.g. by a block received from p2p) and we don't expect CL/IS to reorg the chain this much anyway. Not a very elegant solution but should do the job I think.

Note: 0b3183c?w=1 is just a cleanup to avoid fixing smth that is never really used and 2e136a14e32b4c3898d30de2cb4de915c27d7f34 also unifies the way we handle ActivateBestChain failures in IS/CL.

Fixes #3476

PastaPastaPasta

utACK, looks good to me... Would like some others to look at this too, @codablock @xdustinface

UdjinM6 · 2020-05-18T21:08:35Z

Looks like this is not enough unfortunately, see #3476 (comment). Need some better solution. Closing.

PastaPastaPasta · 2020-05-22T06:22:46Z

can you rebase on develop so travis is happy? https://travis-ci.com/github/PastaPastaPasta/dash/builds/167488704

…ed from IS or CL threads

UdjinM6 · 2020-05-23T13:42:12Z

Resurrecting this. The latest commit fixed the second deadlock it seems, my nodes are running fine for 4+ days now.

xdustinface · 2020-05-23T15:16:46Z

Looks also good so far here on my testnet node.. it had it running into the deadlock with the latest develop and now it runs on e279acf for 2+ days 👍

xdustinface · 2020-05-23T15:52:12Z

src/validation.cpp

        boost::this_thread::interruption_point();

-        if (GetMainSignals().CallbacksPending() > 10) {
+        if (fSyncQueue && GetMainSignals().CallbacksPending() > 10) {


Also codewise it looks overall good to me. Though i wonder if it now even makes any sense to skip the queue sync if ActivateBestChain gets called by CL/IS with CL having its own scheduler? Wasn't the issue that CL put ActivateBestChain (over edges) in the validation interface scheduler's queue and that basically blocked itself by waiting for the queue it runs from to be done with all tasks?

@UdjinM6 please correct me if im wrong and explain me your full understanding of the deadlock as i was investigating a lot here how exactly it happens and i just want to fully get it 🙂

These are two different cases:
1.
a) CL: LOCK->SyncWithValidationInterfaceQueue
b) PeerLogicValidation: (dead)LOCK

(b) (which is executed as a part of the queue flushing) is waiting for (a) to release the lock which will never happen because (a) is waiting for (b). Even if (a) would be in another scheduler it wouldn't matter, the lock would be held anyway because queue processing is blocked. Skipping SyncWithValidationInterfaceQueue lets (a) to finish and release the lock.

a) PeerLogicValidation: LOCK->SyncWithValidationInterfaceQueue
b) CL: (dead)LOCK

In this case (b) (executed as a regular scheduler job) blocks the scheduler itself not just the queue, it simply can't process anything at this point. Using another scheduler for (b) fixes this because now (b) blocks some another thread and the original scheduler can process all the tasks and release the lock which in its turn lets (b) to continue.

Hmm.. soo.. if i look at the two cases you commented in #3476 it looks for me like the queue skipping just fixed one deadlock and introduced a new one.

Lets assume we have:

Thread 1 > A > CConnman::ThreadMessageHandler

Thread 2 > B > Validation Interface Scheduler Thread

Thread 3 > C > New ChainLocks Scheduler Thread

Thread 4 > D > CInstantSendManager::WorkThreadMain

Thread 5 > E > CSigSharesManager::WorkThreadMain

This are the two wait conditions involved here

dash/src/validation.cpp

Line 2948 in 0662f17

LOCK(cs_activateBestChain);

dash/src/validationinterface.cpp

Line 165 in 0662f17

promise.get_future().wait();

Case 1

Relates to #3476 (comment)

1a) B runs ActivateBestChain it locks cs_activateBestChain and then since there are more than 10 tasks in its queue it adds another one and waits for this other one to finish with: promise.get_future().wait(). Which can never happen because it waits for itself.

1b) Now A gets a new block message, wants to process it and therefor also calls ActivateBestChain and waits for cs_activateBestChain to release which won’t happen because of the above..

Case 2 / Skipping the sync for CL code

Relates to #3476 (comment)

If we skip the sync now Case 1 can’t longer happen because in 1a) B will lock cs_activateBestChain, skips the sync (and with this it skips the wait on yourself), runs the rest of ActivateBestChain and then it releases cs_activateBestChain which means A will not wait forever for cs_activateBestChain.. This gives us just the new deadlock below:

2a) A gets a new block, wants to process it and calls ActivateBestChain. It locks cs_activateBestChain and since there are more than 10 tasks in the queue of B it adds another task to the queue of B and waits for this other task to finish with promise.get_future().wait().

2b) B which has at this point more than 10 tasks (one of them is ActivateBestChain which it got from the CL code with sync skipping) processes its tasks until it comes to the call of ActivateBestChain where it then waits for cs_activateBestChain to release whihc won't happen because A in 2a) locked it.

Fix the deadlock / Adding Thread C

Soo.. the actual reason for the deadlock is IMO only calling ActivateBestChain from B. Means just adding C should be the fix for the initial deadlock and then deadlock 2 will never be introduced.

And this is a problem now because we have the “Let Thread B” catch up thing here introduced in #3376

dash/src/validation.cpp

Lines 2956 to 2961 in 0662f17

if (GetMainSignals().CallbacksPending() > 10) {

// Block until the validation queue drains. This should largely

// never happen in normal operation, however may happen during

// reindex, causing memory blowup if we run too far ahead.

SyncWithValidationInterfaceQueue();

}

which leads to X waiting for B if X calls ActivateBestChain and there are more than 10 queue entries (which is why the deadlock happened randomly and more often on the mainnet it seems) and therefor B waiting for itself if ActivateBestChain gets called by B .

So i would say having this "skip sync" code included makes no sense anymore because if ActivateBestChain gets called from instantsend code its called by D or E and if its called by CL its called by C. So there should be no more deadlock, not?

Does this make sense? Note: I'm tired now so i hope i didn't miss any other crazy magic 🤣

I think I understand now, thanks 👍 Reverted, let's see :)

PastaPastaPasta · 2020-05-23T20:16:10Z

ran ubuntu@ubuntu-Ryzen:~/workspace/dash/test/functional$ for n in {1..10}; do python3 test_runner.py -j10 llmq-*; done
So running llmq tests 10 times, w/ no test failures, which we had seen in some PRs, I'm gonna try to do this again on travis running tests multiple times.

PastaPastaPasta · 2020-05-23T20:58:51Z

two things:

Use std::bind and std::thread instead of boost::* PastaPastaPasta@211af7f
Hmm, need to think about std::thread and interupting, cause with this patch the thread will never join...
Use std::bind not boost::bind, but still use boost::thread PastaPastaPasta@88f17f7

(Maybe, CLion suggests this, but I'm not sure if this is something we want / follow) prefer lambda over std::bind PastaPastaPasta@dffb262

UdjinM6 · 2020-05-23T22:09:12Z

@PastaPastaPasta I was just following the surrounding code and usage instructions in https://github.com/dashpay/dash/blob/develop/src/scheduler.h#L23-L34. Scheduler was updated to using std over boost much later in bitcoin#18234

PastaPastaPasta · 2020-05-23T22:09:53Z

Okay, understand

PastaPastaPasta · 2020-05-25T04:30:09Z

Possible problem here? https://travis-ci.com/github/PastaPastaPasta/dash/jobs/339530126

Maybe this is a different problem / ci issue

edit: and another https://travis-ci.com/github/PastaPastaPasta/dash/jobs/339530126

…hen called from IS or CL threads" This reverts commit 1c9f6da.

PastaPastaPasta

ACK, ci still fails rarely, however, this is a definite improvement and appears to fix the deadlock, even if simplepose still fails sometimes.

PastaPastaPasta · 2020-06-08T20:05:31Z

re: llmq-simplepose.py fails about 1 in 100 times for me locally

…Chain (dashpay#3492) * Drop dead code in DoInvalidateBlock * Let ActivateBestChain skip SyncWithValidationInterfaceQueue when called from IS or CL threads * Use CL's own scheduler instead of a global one * Revert "Let ActivateBestChain skip SyncWithValidationInterfaceQueue when called from IS or CL threads" This reverts commit 1c9f6da.

UdjinM6 added this to the 16 milestone May 17, 2020

PastaPastaPasta previously approved these changes May 17, 2020

View reviewed changes

UdjinM6 closed this May 18, 2020

UdjinM6 reopened this May 19, 2020

UdjinM6 marked this pull request as draft May 19, 2020 01:41

UdjinM6 dismissed PastaPastaPasta’s stale review via 9017be5 May 19, 2020 02:07

UdjinM6 force-pushed the fixabcdeadlock3 branch from 9017be5 to 6fd5973 Compare May 19, 2020 02:26

UdjinM6 changed the title ~~Let ActivateBestChain skip SyncWithValidationInterfaceQueue when called from IS or CL threads~~ Fix ProcessNewBlock vs EnforceBestChainLock deadlocks in ActivateBestChain May 19, 2020

UdjinM6 added 3 commits May 22, 2020 11:40

Drop dead code in DoInvalidateBlock

4a13084

Let ActivateBestChain skip SyncWithValidationInterfaceQueue when call…

1c9f6da

…ed from IS or CL threads

Use CL's own scheduler instead of a global one

e279acf

UdjinM6 force-pushed the fixabcdeadlock3 branch from 6fd5973 to e279acf Compare May 22, 2020 08:40

UdjinM6 marked this pull request as ready for review May 23, 2020 13:38

xdustinface reviewed May 23, 2020

View reviewed changes

Revert "Let ActivateBestChain skip SyncWithValidationInterfaceQueue w…

509bb94

…hen called from IS or CL threads" This reverts commit 1c9f6da.

PastaPastaPasta approved these changes Jun 8, 2020

View reviewed changes

UdjinM6 merged commit 232430f into dashpay:develop Jun 9, 2020

PastaPastaPasta mentioned this pull request Jun 9, 2020

[Bug/Instability] Nodes running on develop keep getting stuck randomly every day or two. #3476

Closed

UdjinM6 deleted the fixabcdeadlock3 branch November 26, 2020 13:26

	if (GetMainSignals().CallbacksPending() > 10) {
	// Block until the validation queue drains. This should largely
	// never happen in normal operation, however may happen during
	// reindex, causing memory blowup if we run too far ahead.
	SyncWithValidationInterfaceQueue();
	}

Fix ProcessNewBlock vs EnforceBestChainLock deadlocks in ActivateBestChain #3492

Fix ProcessNewBlock vs EnforceBestChainLock deadlocks in ActivateBestChain #3492

Uh oh!

Conversation

UdjinM6 commented May 17, 2020

Uh oh!

PastaPastaPasta left a comment

Choose a reason for hiding this comment

Uh oh!

UdjinM6 commented May 18, 2020

Uh oh!

PastaPastaPasta commented May 22, 2020

Uh oh!

UdjinM6 commented May 23, 2020

Uh oh!

xdustinface commented May 23, 2020

Uh oh!

xdustinface May 23, 2020

Choose a reason for hiding this comment

Uh oh!

UdjinM6 May 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xdustinface May 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Case 1

Case 2 / Skipping the sync for CL code

Fix the deadlock / Adding Thread C

Uh oh!

UdjinM6 May 26, 2020

Choose a reason for hiding this comment

Uh oh!

PastaPastaPasta commented May 23, 2020

Uh oh!

PastaPastaPasta commented May 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

UdjinM6 commented May 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PastaPastaPasta commented May 23, 2020

Uh oh!

PastaPastaPasta commented May 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PastaPastaPasta left a comment

Choose a reason for hiding this comment

Uh oh!

PastaPastaPasta commented Jun 8, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

UdjinM6 May 23, 2020 •

edited

Loading

xdustinface May 26, 2020 •

edited

Loading

PastaPastaPasta commented May 23, 2020 •

edited

Loading

UdjinM6 commented May 23, 2020 •

edited

Loading

PastaPastaPasta commented May 25, 2020 •

edited

Loading