Skip to content

Conversation

@UdjinM6
Copy link

@UdjinM6 UdjinM6 commented May 17, 2020

It should be ok to skip SyncWithValidationInterfaceQueue when called from CL/IS cause it will be triggered whenever ActivateBestChain is called by smth else (e.g. by a block received from p2p) and we don't expect CL/IS to reorg the chain this much anyway. Not a very elegant solution but should do the job I think.

Note: 0b3183c?w=1 is just a cleanup to avoid fixing smth that is never really used and 2e136a14e32b4c3898d30de2cb4de915c27d7f34 also unifies the way we handle ActivateBestChain failures in IS/CL.

Fixes #3476

@UdjinM6 UdjinM6 added this to the 16 milestone May 17, 2020
Copy link
Member

@PastaPastaPasta PastaPastaPasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK, looks good to me... Would like some others to look at this too, @codablock @xdustinface

@UdjinM6
Copy link
Author

UdjinM6 commented May 18, 2020

Looks like this is not enough unfortunately, see #3476 (comment). Need some better solution. Closing.

@UdjinM6 UdjinM6 closed this May 18, 2020
@UdjinM6 UdjinM6 reopened this May 19, 2020
@UdjinM6 UdjinM6 marked this pull request as draft May 19, 2020 01:41
@UdjinM6 UdjinM6 force-pushed the fixabcdeadlock3 branch from 9017be5 to 6fd5973 Compare May 19, 2020 02:26
@UdjinM6 UdjinM6 changed the title Let ActivateBestChain skip SyncWithValidationInterfaceQueue when called from IS or CL threads Fix ProcessNewBlock vs EnforceBestChainLock deadlocks in ActivateBestChain May 19, 2020
@PastaPastaPasta
Copy link
Member

can you rebase on develop so travis is happy? https://travis-ci.com/github/PastaPastaPasta/dash/builds/167488704

@UdjinM6 UdjinM6 force-pushed the fixabcdeadlock3 branch from 6fd5973 to e279acf Compare May 22, 2020 08:40
@UdjinM6 UdjinM6 marked this pull request as ready for review May 23, 2020 13:38
@UdjinM6
Copy link
Author

UdjinM6 commented May 23, 2020

Resurrecting this. The latest commit fixed the second deadlock it seems, my nodes are running fine for 4+ days now.

@xdustinface
Copy link

Looks also good so far here on my testnet node.. it had it running into the deadlock with the latest develop and now it runs on e279acf for 2+ days 👍

boost::this_thread::interruption_point();

if (GetMainSignals().CallbacksPending() > 10) {
if (fSyncQueue && GetMainSignals().CallbacksPending() > 10) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also codewise it looks overall good to me. Though i wonder if it now even makes any sense to skip the queue sync if ActivateBestChain gets called by CL/IS with CL having its own scheduler? Wasn't the issue that CL put ActivateBestChain (over edges) in the validation interface scheduler's queue and that basically blocked itself by waiting for the queue it runs from to be done with all tasks?

@UdjinM6 please correct me if im wrong and explain me your full understanding of the deadlock as i was investigating a lot here how exactly it happens and i just want to fully get it 🙂

Copy link
Author

@UdjinM6 UdjinM6 May 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are two different cases:
1.
a) CL: LOCK->SyncWithValidationInterfaceQueue
b) PeerLogicValidation: (dead)LOCK

(b) (which is executed as a part of the queue flushing) is waiting for (a) to release the lock which will never happen because (a) is waiting for (b). Even if (a) would be in another scheduler it wouldn't matter, the lock would be held anyway because queue processing is blocked. Skipping SyncWithValidationInterfaceQueue lets (a) to finish and release the lock.

a) PeerLogicValidation: LOCK->SyncWithValidationInterfaceQueue
b) CL: (dead)LOCK

In this case (b) (executed as a regular scheduler job) blocks the scheduler itself not just the queue, it simply can't process anything at this point. Using another scheduler for (b) fixes this because now (b) blocks some another thread and the original scheduler can process all the tasks and release the lock which in its turn lets (b) to continue.

Copy link

@xdustinface xdustinface May 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. soo.. if i look at the two cases you commented in #3476 it looks for me like the queue skipping just fixed one deadlock and introduced a new one.

Lets assume we have:

  • Thread 1 > A > CConnman::ThreadMessageHandler
  • Thread 2 > B > Validation Interface Scheduler Thread
  • Thread 3 > C > New ChainLocks Scheduler Thread
  • Thread 4 > D > CInstantSendManager::WorkThreadMain
  • Thread 5 > E > CSigSharesManager::WorkThreadMain

This are the two wait conditions involved here

LOCK(cs_activateBestChain);

promise.get_future().wait();

Case 1

Relates to #3476 (comment)

1a) B runs ActivateBestChain it locks cs_activateBestChain and then since there are more than 10 tasks in its queue it adds another one and waits for this other one to finish with: promise.get_future().wait(). Which can never happen because it waits for itself.

1b) Now A gets a new block message, wants to process it and therefor also calls ActivateBestChain and waits for cs_activateBestChain to release which won’t happen because of the above..

Case 2 / Skipping the sync for CL code

Relates to #3476 (comment)

If we skip the sync now Case 1 can’t longer happen because in 1a) B will lock cs_activateBestChain, skips the sync (and with this it skips the wait on yourself), runs the rest of ActivateBestChain and then it releases cs_activateBestChain which means A will not wait forever for cs_activateBestChain.. This gives us just the new deadlock below:

2a) A gets a new block, wants to process it and calls ActivateBestChain. It locks cs_activateBestChain and since there are more than 10 tasks in the queue of B it adds another task to the queue of B and waits for this other task to finish with promise.get_future().wait().

2b) B which has at this point more than 10 tasks (one of them is ActivateBestChain which it got from the CL code with sync skipping) processes its tasks until it comes to the call of ActivateBestChain where it then waits for cs_activateBestChain to release whihc won't happen because A in 2a) locked it.

Fix the deadlock / Adding Thread C

Soo.. the actual reason for the deadlock is IMO only calling ActivateBestChain from B. Means just adding C should be the fix for the initial deadlock and then deadlock 2 will never be introduced.

And this is a problem now because we have the “Let Thread B” catch up thing here introduced in #3376

dash/src/validation.cpp

Lines 2956 to 2961 in 0662f17

if (GetMainSignals().CallbacksPending() > 10) {
// Block until the validation queue drains. This should largely
// never happen in normal operation, however may happen during
// reindex, causing memory blowup if we run too far ahead.
SyncWithValidationInterfaceQueue();
}

which leads to X waiting for B if X calls ActivateBestChain and there are more than 10 queue entries (which is why the deadlock happened randomly and more often on the mainnet it seems) and therefor B waiting for itself if ActivateBestChain gets called by B .

So i would say having this "skip sync" code included makes no sense anymore because if ActivateBestChain gets called from instantsend code its called by D or E and if its called by CL its called by C. So there should be no more deadlock, not?

Does this make sense? Note: I'm tired now so i hope i didn't miss any other crazy magic 🤣

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand now, thanks 👍 Reverted, let's see :)

@PastaPastaPasta
Copy link
Member

ran ubuntu@ubuntu-Ryzen:~/workspace/dash/test/functional$ for n in {1..10}; do python3 test_runner.py -j10 llmq-*; done
So running llmq tests 10 times, w/ no test failures, which we had seen in some PRs, I'm gonna try to do this again on travis running tests multiple times.

@PastaPastaPasta
Copy link
Member

PastaPastaPasta commented May 23, 2020

two things:

Use std::bind and std::thread instead of boost::* PastaPastaPasta@211af7f
Hmm, need to think about std::thread and interupting, cause with this patch the thread will never join...

Use std::bind not boost::bind, but still use boost::thread PastaPastaPasta@88f17f7

(Maybe, CLion suggests this, but I'm not sure if this is something we want / follow) prefer lambda over std::bind PastaPastaPasta@dffb262

@UdjinM6
Copy link
Author

UdjinM6 commented May 23, 2020

@PastaPastaPasta I was just following the surrounding code and usage instructions in https://github.com/dashpay/dash/blob/develop/src/scheduler.h#L23-L34. Scheduler was updated to using std over boost much later in bitcoin#18234

@PastaPastaPasta
Copy link
Member

Okay, understand

@PastaPastaPasta
Copy link
Member

PastaPastaPasta commented May 25, 2020

Possible problem here? https://travis-ci.com/github/PastaPastaPasta/dash/jobs/339530126

Maybe this is a different problem / ci issue

edit: and another https://travis-ci.com/github/PastaPastaPasta/dash/jobs/339530126

…hen called from IS or CL threads"

This reverts commit 1c9f6da.
Copy link
Member

@PastaPastaPasta PastaPastaPasta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK, ci still fails rarely, however, this is a definite improvement and appears to fix the deadlock, even if simplepose still fails sometimes.

@PastaPastaPasta
Copy link
Member

re: llmq-simplepose.py fails about 1 in 100 times for me locally

@UdjinM6 UdjinM6 merged commit 232430f into dashpay:develop Jun 9, 2020
@UdjinM6 UdjinM6 deleted the fixabcdeadlock3 branch November 26, 2020 13:26
gades pushed a commit to cosanta/cosanta-core that referenced this pull request Mar 11, 2022
…Chain (dashpay#3492)

* Drop dead code in DoInvalidateBlock

* Let ActivateBestChain skip SyncWithValidationInterfaceQueue when called from IS or CL threads

* Use CL's own scheduler instead of a global one

* Revert "Let ActivateBestChain skip SyncWithValidationInterfaceQueue when called from IS or CL threads"

This reverts commit 1c9f6da.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants