-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address encountered deadlock issues with service use. #68
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some requested changes.
...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java
Outdated
Show resolved
Hide resolved
...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java
Outdated
Show resolved
Hide resolved
...se.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceRegistrationImpl.java
Outdated
Show resolved
Hide resolved
...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java
Outdated
Show resolved
Hide resolved
...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java
Outdated
Show resolved
Hide resolved
...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java
Outdated
Show resolved
Hide resolved
...se.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceRegistrationImpl.java
Outdated
Show resolved
Hide resolved
...se.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceRegistrationImpl.java
Outdated
Show resolved
Hide resolved
...g.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ReentrantLockExt.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you create this PR in order to address #15 respectively #31? Because it sounds very related.
I didn't, as I said for us this was about 10 months ago, but it does look very very similar. Call that a coincidence.
Looks like a coincidence, yes. :)
Great, I just thought that we should soon work to fix that. Didn't expect to have a PR that fast. So thanks already.
Can you fetch #31 and report if this fixes it?
You solve the deadlock by throwing a ServiceException if the lock cannot be obtained after a fixed timeout has elapsed. IIRC this was also an approach discussed in #15.
So does the start-up then simply fail or will the SCR-actor perform another attempt to active a component later?
@tjwatson just in case it is not yet implemented: Do you think Felix-SCR could handle a Service-Exception caused by a dead-lock specifically by just re-attempting to activate the root component?
Besides that I think it would be even better if we had a 'real' dead-lock detection, meaning that the chain of uses in this thread is tracked and if one lock cannot be obtained it is checked if there is another thread with a 'crossing' chain in the opposite direction.
I think this could be achieved by maintaining a set of locked uses per thread that is updated when the lock is acquired and released. When a lock cannot be acquired it can then be checked if the thread holding the desired lock is waiting for a lock held by this thread. The latter would also require a Mapping lock->owner, but since you already subclass ReentrantLock
the protected method can be made visible.
Having a 'real' dead-lock detection would be more reliable (maybe some components take just very long to activate) and often faster (for actually quickly activating components waiting for 10s is just waste of time).
if (debug.DEBUG_SERVICES) { | ||
Debug.println("getServiceObject[factory=" + registration.getBundle() + "](" + context.getBundleImpl() + "," + registration + ")"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$ | ||
Debug.println('[' + Thread.currentThread().getName() + "] getServiceObject[PSfactory=" //$NON-NLS-1$ | ||
+ registration.getBundle() + "](" + context.getBundleImpl() + "," + registration + ")"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log messages seem to be very similar.
I think it would make the code more readable to exact them into a dedicated message wo which the required objects and the specific message part is passed.
/** lock for this service */ | ||
private final ReentrantLock lock = new ReentrantLockExt(); | ||
|
||
private Duration lockTimeout = Duration.ofSeconds(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private Duration lockTimeout = Duration.ofSeconds(10); | |
private static long lockTimeout = Duration.ofSeconds(10).toMillis(); |
I agree that it makes the code more clear when using Duration, but this requires repetitive conversions to Milli-seconds that is not necessary. Therefore I suggest to convert it immediately (and make it static).
This way it should be readable and performant.
void runWithLock(Runnable runnable) { | ||
callWithLock(Executors.callable(runnable)); | ||
} | ||
|
||
<T> T callWithLock(Callable<T> callable) { | ||
try { | ||
if (lock.tryLock(getLockTimeout().toMillis(), TimeUnit.MILLISECONDS)) { | ||
try { | ||
return callable.call(); | ||
} catch (InterruptedException e) { | ||
Thread.currentThread().interrupt(); // To appease sonar | ||
return null; | ||
} catch (Exception e) { | ||
throw new RuntimeException("Callable threw exception", e); //$NON-NLS-1$ | ||
} finally { | ||
lock.unlock(); | ||
} | ||
} | ||
throw new ServiceException( | ||
"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$ | ||
+ lock.toString(), | ||
ServiceException.DEADLOCK); | ||
} catch (InterruptedException e) { | ||
Thread.currentThread().interrupt(); // To appease sonar | ||
return null; | ||
} | ||
} | ||
|
||
Duration getLockTimeout() { | ||
return lockTimeout; | ||
} | ||
|
||
ReentrantLock getLock() { | ||
return lock; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void runWithLock(Runnable runnable) { | |
callWithLock(Executors.callable(runnable)); | |
} | |
<T> T callWithLock(Callable<T> callable) { | |
try { | |
if (lock.tryLock(getLockTimeout().toMillis(), TimeUnit.MILLISECONDS)) { | |
try { | |
return callable.call(); | |
} catch (InterruptedException e) { | |
Thread.currentThread().interrupt(); // To appease sonar | |
return null; | |
} catch (Exception e) { | |
throw new RuntimeException("Callable threw exception", e); //$NON-NLS-1$ | |
} finally { | |
lock.unlock(); | |
} | |
} | |
throw new ServiceException( | |
"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$ | |
+ lock.toString(), | |
ServiceException.DEADLOCK); | |
} catch (InterruptedException e) { | |
Thread.currentThread().interrupt(); // To appease sonar | |
return null; | |
} | |
} | |
Duration getLockTimeout() { | |
return lockTimeout; | |
} | |
ReentrantLock getLock() { | |
return lock; | |
} | |
interface Locked extends AutoCloseable { | |
@Override | |
void close(); | |
} | |
Locked lock() throws InterruptedException { | |
if (lock.tryLock(lockTimeout, TimeUnit.MILLISECONDS)) { | |
if (debug.DEBUG_SERVICES) { | |
Debug.println('[' + Thread.currentThread().getName() + "] getServiceLock[" //$NON-NLS-1$ | |
+ context.getBundle() + "](" + this + "), id:" + System.identityHashCode(lock) //$NON-NLS-1$ //$NON-NLS-2$ | |
+ ". holdCount:" + lock.getHoldCount() //$NON-NLS-1$ | |
+ ", queued:" + lock.getQueueLength()); //$NON-NLS-1$ | |
} | |
return lock::unlock; | |
} | |
throw new ServiceException( | |
"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$ | |
+ lock.toString(), | |
ServiceException.DEADLOCK); | |
} |
What do you and the others think about using the lock()
method proposed above. It would allow to use the 'locked-state' as resource of a try-block in a safe manner, at the three-locations where use.getLock()
is now used:
try (Locked locked = use.lock()) {
// perform the protected action
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // To appease sonar
// if necessary return a reasonable default
}
More or less try (Locked locked = use.lock()) {
would simply replace synchronized (use) {
.
This would make the runWithLock()
methods obsolete since other exceptions can simply be propagated outside the try-block and would also simplify the lock's use in ServiceRegistrationImpl.getService()
. Furthermore code duplications would be avoided (e.g. the 'DeadLock'-ServiceException
).
bundles/org.eclipse.osgi/osgi/src/org/osgi/framework/ServiceException.java
Outdated
Show resolved
Hide resolved
It is not tested and maybe I have missed something, but I think the following code in
|
I have forced pushed to this PR with a number of updates and improvements. |
FYI, I authored all this ServiceUse code, so I am very familiar with it. |
I don't understand enough to know what the build error means and if it related to changes in this PR.
|
@bjhargrave you need to increment the version of org.eclipse.equinox.supplement in the maninfest (and probably pom if there is any) from 1.10.500 -> 1.10.600 |
This test failure may be the result of the PR changing to break the deadlock using the timeout on the ReentrantLock. |
This test seems to fail only in the github actions, no one has been able to determine why. The test run on eclipse CI looks more suspicious https://ci.eclipse.org/equinox/job/equinox/job/PR-68/7/ But I don't think it is related. It has this deadlock caused by some issue with closing the framework storage:
This has been a known issue from the past, but I've not seen it happen on the eclipse CI for a long time. |
I worry about the performance penalty of this change for such a high use code path. Particularly the move away from simple synchronization and the use of many new lambdas. We will need to do some performance analysis on this change @jhanders34 |
If we use the AutoClosable-lock approach suggested above (#68 (comment)) there is only one location with a lambda, which can be easily replaced by a corresponding anonymous class. |
As mentioned in the other change, I think it would be better trying to move to a more lock free approach than using time-outs. |
Regarding the timeout, I can only say that when we captured debugging and analyzed the results, we could never see any delay during normal use and even cases that would produce deadlock would have very low delay. The original synchronized would obviously have delays of its own. IMHO the biggest question is the choice of the 10 secs timeout, which was arbitrary and might be better at a much lower value. |
I am not sure this is wise either. But calling ServiceFactories has always been an issue. We must prevent multiple concurrent calls to a factory (so I am not sure how a lock free design can handle that), so we hold a mutex lock. Using intrinsic locks is simple but provides no way to back off when a timeout elapses which can be evidence of a deadlock. Using a ReentrantLock allows a timeout so we can back off and throw an exception indication the ServiceFactory could not be called.
There are now 3 lambda expressions. They can be replaced with anonymous classes if necessary.
ServiceException can always be thrown. I think returning
The challenge is to distinguish between a real deadlock rather than a very slow ServiceFactory implementation which is running on another thread. For the latter, we want to wait for it. For the former, the deadlock will never resolve (or it wouldn't be a deadlock), so waiting a long time is of no great harm. I would be more concerned that 10 seconds is not long enough to be convinced that it is not just a very slow ServiceFactory implementation. |
As mentioned before I share this concern, therefore I suggested to use a 'real' deadlock detection, for which I described a prototype above: #68 (comment) |
The main issue is that it is a big hammer to hit an very rare problem. It uses N+1 maps (N being the number of threads which call this code) and the maps are pinned to the thread via ThreadLocals. With changes coming to Java for virtual threads, we should steer clear of adding more ThreadLocals. |
@HannesWell has made some experiments, and the deadlock could be prevented by putting a break point at the right place, so it is not really an unavoidable dead-lock but somewhere in the locking-chain there is an incomplete locking of required objects. From the last investigations, I found that the sync is just done for too many things (e.g. service use increments) and also for factories and for service, while e.g. for a @bjhargrave I wonder if you as the author of this code could tell how well it is covered by test, e.g. if one wants to adjust the locking model used here, do we need manual tests or can we rely on the automatic ones so if build is okay we are quite certain nothing breaks here? |
I don't know how extensive the build tests are. The code has been used for many years in many production environments. So deep care must be taken with any changes here.
I assume you mean the user's bundle context. That is too coarse. That means a bundle can only get one service at a time (SCR might be upset). ServiceUse is the combination of the user of the service and the service which is why it is the lock object. It provides the finest grain for locking and, in general, should not be contended. |
Better slow/upset than dead-locking ;-) |
It is indeed a big hammer. Do you know if there is a better approach? I mean deadlock detection is actually a quite generic programming problem (not even Java specific), so I assume that some smart people already got better ideas than I did when thinking ~15min about the problem. IIRC Eclipse Jobs can detect deadlocks? But probably only between jobs-rules? Or is this something we can inherit.
I want to clarify that I had to put the break-point and with that hold back thread B back before it obtained the lock that is later wanted by thread A. So basically I manually scheduled the threads so that one can complete its work before the other ones starts it. EDIT: But of course it would be even better to resolve the dead-lock than just aborting. |
I don't think that this is a performance problem
|
That cannot be known until we run performance measurements. We do this regularly in OpenLiberty when we upgrade the framework.
You cannot make that assumption. There are uses of Equinox that enable parallel bundle activation during framework startup where bundles within the same start-level are activated in parallel.
My comment about performance was more about the sheer number of additional methods being called here vs a simple intrinsic lock and the use of a lambda which has proven to have difficulties to optimize with JIT (at least on some versions of Java) when used in high traffic code paths. All of that on a code path that is highly used in systems that have 1000s of service components.
I'm not saying we have a better option. Only that we need to understand the cost of this for something that is arguably a bug in the design of service components involved in the deadlock and perhaps should be addressed by SCR itself that allows its own service factories to trigger the deadlock in the first place. On the other hand I see the appeal to having the deadlock break out from the framework level. But the result isn't going to fix the components involved in the deadlock. They will still be in a broken state, correct? |
...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java
Show resolved
Hide resolved
The tests are now complete and can be reviewed as well. |
Using a reentrant lock on Service use instead of simple synchronized blocks. Also added quite a bit of logging to help tracing. Signed-off-by: Alain Picard <picard@castortech.com> Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Signed-off-by: Alain Picard <picard@castortech.com> Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Numerous comments and asserts needed updating. All uses under lock are run through withLock which has cleaned up the exception handling to properly manage interrupt state. NLS is used for exception messages. Lock subclass toString is cleaned up to avoid string concatenation when we are already using a StringBuilder. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
The ServiceUse.lock() method returns an AutoCloseable so the caller can use try-with-resources to obtain and then close the lock. This is much closer to intrinsic locking in coding practice. Instead of using a long-ish timeout for deadlock detection, which could be wrong on slow systems, we now use a map to hold the thread to lock parings and examine the lock chain to see if the current thread is in the chain. Signed-off-by: Hannes Wellmann <wellmann.hannes1@gmx.net> Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
When throwing the deadlock exception we also need to remove the lock from the awaiting map. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
This is close to behavior of intrinsic locking which does not alter the thread interrupt status. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Remove unnecessary elements and dead-lock detection logic into dedicated isDeadLocked() method to make the code more readable. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Prevent one interrupted check on each call to lock(). From the start of lock() to useLock.tryLock(... everything should be as performant as possible. Afterwards we are usually waiting so it doesn't really matter since we cannot wait faster. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Reproducer test for: https://github.com/eclipse-equinox/equinox.framework/issues/55 Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
We move the concurrent map to be a instance field of ServiceRegistry instead of a static. We want to avoid static mutable state to support multiple framework instances in the same JVM. We also include the DEADLOCK type in the exception. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
I ran the proposed changes through an Open Liberty build and it passed all but one of our 86246 tests (the one failure appears to be an infrastructure issue). This helps confirm the fix does not have any functional issues, although we should get a run against the OSGi core TCK also. I still need to get startup performance measurements done on a configuration that enables a very large set of service components during startup. I hope to have something in the next day or so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance run across a wide range of configurations in Open Liberty showed no measurably significant slow down. In some cases we saw a small performance improvement.
I suspect the improvements observed are due to other improvements that we have yet to consume in Open Liberty though. For example eclipse-equinox/equinox.framework#54 and eclipse-equinox/equinox.framework#46. It is still possible that this change has some measurable cost but it could be getting offset by the other improvements. Regardless, it seems that the cost is not significant enough to be concerned about at this point.
Another observation is that the measurable CPU time did see a slight increase in some configurations, but that increased CPU utilization did not seem to impact the overall startup time of that configuration when 4 cores are assigned to the test run, implying that a bit more work is happening in the background. When restricting the run to a single core this did convert to a slightly slower startup time for these configurations. Regardless, that still seems to be a small cost to the overall startup time performance.
I should also state that in Open Liberty we use |
@tjwatson with this fix, can you activate equinox.start.level.thread.count by default (setting it to > 1)? |
@vogella AFAIK this is non standard behavior and should best be activated on the EPP if desired. |
Recall my comments from https://bugs.eclipse.org/bugs/show_bug.cgi?id=540507#c17
I see very little advantage to using this option by default for any Eclipse RCP usage as long as most all bundles are lazy activated.
There is nothing non-standard here. The OSGi specification does not mandate that bundles within the same start-level be started by a single thread or in any particular order within the start-level. The implementation is free to start in parallel or not. |
Can't you simply test with the latest I-build if a value greater one for that flag speeds up the Start-up? |
Yes, that is what I did when I first put this function into Equinox. I saw no improvement at all with it enabled. After some investigation it was because most everything is lazy-activated so all the real work only happens when classes are loaded. The majority of class loading does not happen until all the bundles have been started and control has been returned to the thread that runs the SWT display thread. Because DS is lazy also, even the registration of the DS components do not load any classes until the service is actually used (unless the component is immediate). Let me clarify the term "started". That is really just Bundle.start. But for lazy activated bundles that does not mean the I personally do not plan to retest if starting in parallel helps for Eclipse RCP/IDE bring up again because I don't think anything has changed that would show it helping. |
This code was added at our end over 10 months ago, and has had 1 minor fix 5 months ago and been used in production, as we wanted to make sure that no issue were seen over time. Now that we have a new release to sync with, it's time to create a formal PR.
Using a reentrant lock (extended default only for special toString) on ServiceUse instead of simple synchronized blocks.
Also added quite a bit of logging to help tracing.