Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover from Service-Factory deadlock #87

Closed

Conversation

HannesWell
Copy link
Member

@HannesWell HannesWell commented Jul 24, 2022

Follow up of #68, with a first attempt to implement a recovery strategy to resolve an encountered deadlock during the use of service-factories.

In case of an encountered Deadlock between Service-Factories, this approach simply checks if the current service-factory invocation is the first one of this thread and then makes a new attempt to call the factory.
In order to make a new deadlock more unlikely the thread sleeps for a random period that becomes even longer with further attempts, hoping that the other threads involved the deadlock sleep for another time that is different enough so that the race-condition does not occur anymore.

This PR is a first draft with a relatively naive approach, but at least it worked in the test-cases added for #68:

  • Assumes it is possible without restrictions to
  • Currently requires Java-9's StackWalker, but this can be reworked to use means available in Java-8.

Alternatively creates of Service-factories could be adjusted to handle encountered deadlocks, but this would likely require changes outside of Equinox and maybe some spec-work in OSGi.

I'm on vacation for the next two weeks, but wanted to share this current state/the basic idea already.

@HannesWell HannesWell force-pushed the recoverFromDeadlock branch 2 times, most recently from 86c1e52 to 22f3da3 Compare August 11, 2022 23:21
@HannesWell HannesWell marked this pull request as ready for review August 11, 2022 23:21
@github-actions
Copy link

github-actions bot commented Aug 11, 2022

Unit Test Results

     24 files       24 suites   11m 31s ⏱️
2 135 tests 2 089 ✔️ 44 💤 1  1 🔥
2 179 runs  2 133 ✔️ 44 💤 1  1 🔥

For more details on these failures and errors, see this check.

Results for commit 13a7615.

♻️ This comment has been updated with latest results.

@bjhargrave
Copy link
Member

I am unclear how waiting will solve the deadlock. The previous fix can detect the deadlock and then throw an exception to alert that there was a deadlock (versus hanging). But just catching the deadlock exception and then blocking the thread for some random time before retrying does not seem to be able to "solve" the deadlock. If A->B->C->A is the deadlock, then no amount of time waiting for A will fix anything.

@tjwatson
Copy link
Contributor

I am unclear how waiting will solve the deadlock. The previous fix can detect the deadlock and then throw an exception to alert that there was a deadlock (versus hanging). But just catching the deadlock exception and then blocking the thread for some random time before retrying does not seem to be able to "solve" the deadlock. If A->B->C->A is the deadlock, then no amount of time waiting for A will fix anything.

IIRC the idea is that If Thread 1 is starting at C with: C->A->B->C and Thread 2 is starting at A with A->B->C->A and we detect a deadlock then if we are lucky enough to where Thread 1 is at C or Thread 2 is at A (their starting point) then we force them to wait without the lock on their starting point to allow the other thread to continue. But it seems like a narrow gain that still can allow deadlock.

@bjhargrave
Copy link
Member

IIRC the idea is that If Thread 1 is starting at C with: C->A->B->C and Thread 2 is starting at A with A->B->C->A and we detect a deadlock then if we are lucky enough to where Thread 1 is at C or Thread 2 is at A (their starting point) then we force them to wait without the lock on their starting point to allow the other thread to continue. But it seems like a narrow gain that still can allow deadlock.

This can't work since neither C nor A can be available to another thread until it is completely constructed (that is, the service factory returns an object) and neither can completely construct because of the cycle.

@HannesWell HannesWell force-pushed the recoverFromDeadlock branch from 22f3da3 to 13a7615 Compare August 12, 2022 20:25
@HannesWell
Copy link
Member Author

I am unclear how waiting will solve the deadlock. The previous fix can detect the deadlock and then throw an exception to alert that there was a deadlock (versus hanging). But just catching the deadlock exception and then blocking the thread for some random time before retrying does not seem to be able to "solve" the deadlock. If A->B->C->A is the deadlock, then no amount of time waiting for A will fix anything.

IIRC the idea is that If Thread 1 is starting at C with: C->A->B->C and Thread 2 is starting at A with A->B->C->A and we detect a deadlock then if we are lucky enough to where Thread 1 is at C or Thread 2 is at A (their starting point) then we force them to wait without the lock on their starting point to allow the other thread to continue. But it seems like a narrow gain that still can allow deadlock.

That is the intention yes. Of course it is not a perfect solution that works for all cases, but it works for cases like the test case, where the cycle contains optional links. Of course a cycle of required links can never be resolved. But IIRC Felix warns/errors about that if one models that. But for optional-links this is a possible way as you can see in the adjusted test-cases testLateBindingInSameBundleDeadLock, where all components gets their optional links set eventually.

@HannesWell HannesWell force-pushed the recoverFromDeadlock branch from 13a7615 to 4c32551 Compare October 1, 2022 19:12
This assumes that factories are stateless.

Signed-off-by: Hannes Wellmann <wellmann.hannes1@gmx.net>
@HannesWell
Copy link
Member Author

Do you have further remarks on this one?
For the case of deadlocks introduced by cyclic optional/multi links I think this can be beneficial.

@github-actions
Copy link

Test Results

     24 files       24 suites   11m 29s ⏱️
2 138 tests 2 093 ✔️ 44 💤 0  1 🔥
2 182 runs  2 137 ✔️ 44 💤 0  1 🔥

For more details on these errors, see this check.

Results for commit fa69761.

@laeubi
Copy link
Member

laeubi commented Apr 30, 2024

@HannesWell as we have not seen any deadlock problems in the day passed and it seems to be there is not full agreement how to handle this maybe just postpone this?

@HannesWell
Copy link
Member Author

@HannesWell as we have not seen any deadlock problems in the day passed and it seems to be there is not full agreement how to handle this maybe just postpone this?

Yes, I'll close this for now but will leave the branch intact, just in case this becomes relevant again and somebody remembers this.

@HannesWell HannesWell closed this Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants