Address encountered deadlock issues with service use. #68

castortech · 2022-06-30T14:27:07Z

This code was added on our end over 10 months ago, received one minor fix 5 months ago, and has been used in production since, as we wanted to make sure that no issues were seen over time. Now that we have a new release to sync with, it's time to create a formal PR.

Using a reentrant lock (extended default only for special toString) on ServiceUse instead of simple synchronized blocks.

Also added quite a bit of logging to help tracing.

HannesWell · 2022-06-30T14:56:22Z

Did you create this PR in order to address #15 respectively #31?
Because it sounds very related.

castortech · 2022-06-30T15:04:53Z

Did you create this PR in order to address #15 respectively #31? Because it sounds very related.

I didn't, as I said for us this was about 10 months ago, but it does look very very similar. Call that a coincidence.

bjhargrave

Some requested changes.

...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java

...se.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceRegistrationImpl.java

...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java

...se.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceRegistrationImpl.java

...g.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ReentrantLockExt.java

HannesWell

Did you create this PR in order to address #15 respectively #31? Because it sounds very related.

I didn't, as I said for us this was about 10 months ago, but it does look very very similar. Call that a coincidence.

Looks like a coincidence, yes. :)
Great, I just thought that we should soon work to fix that. Didn't expect to have a PR that fast. So thanks already.

Can you fetch #31 and report if this fixes it?

You solve the deadlock by throwing a ServiceException if the lock cannot be obtained after a fixed timeout has elapsed. IIRC this was also an approach discussed in #15.
So does the start-up then simply fail, or will the SCR-actor perform another attempt to activate a component later?
@tjwatson just in case it is not yet implemented: Do you think Felix-SCR could handle a Service-Exception caused by a dead-lock specifically by just re-attempting to activate the root component?

Besides that I think it would be even better if we had a 'real' dead-lock detection, meaning that the chain of uses in this thread is tracked and if one lock cannot be obtained it is checked if there is another thread with a 'crossing' chain in the opposite direction.
I think this could be achieved by maintaining a set of locked uses per thread that is updated when the lock is acquired and released. When a lock cannot be acquired it can then be checked if the thread holding the desired lock is waiting for a lock held by this thread. The latter would also require a Mapping lock->owner, but since you already subclass ReentrantLock the protected method can be made visible.

Having a 'real' dead-lock detection would be more reliable (maybe some components just take very long to activate) and often faster (for components that actually activate quickly, waiting for 10s is just a waste of time).

HannesWell · 2022-06-30T17:08:07Z

...osgi/container/src/org/eclipse/osgi/internal/serviceregistry/PrototypeServiceFactoryUse.java

 		if (debug.DEBUG_SERVICES) {
-			Debug.println("getServiceObject[factory=" + registration.getBundle() + "](" + context.getBundleImpl() + "," + registration + ")"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
+			Debug.println('[' + Thread.currentThread().getName() + "] getServiceObject[PSfactory=" //$NON-NLS-1$
+					+ registration.getBundle() + "](" + context.getBundleImpl() + "," + registration + ")"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
 		}


The log messages seem to be very similar.
I think it would make the code more readable to extract them into a dedicated method to which the required objects and the specific message part are passed.
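A minimal sketch of such an extraction, assuming a hypothetical helper named traceMessage (not code from the PR) and plain Object parameters in place of the real bundle, context, and registration types:

```java
// Hypothetical helper; the class name, method name, and signature are
// illustrative assumptions and do not come from the PR.
final class ServiceTrace {
    private ServiceTrace() {
    }

    /** Builds "[thread] operation[label=factoryBundle](userBundle,registration)". */
    static String traceMessage(String operation, String label, Object factoryBundle,
            Object userBundle, Object registration) {
        return '[' + Thread.currentThread().getName() + "] " + operation
                + '[' + label + '=' + factoryBundle + "](" + userBundle + ',' + registration + ')';
    }
}
```

Each call site would then shrink to something like Debug.println(ServiceTrace.traceMessage("getServiceObject", "PSfactory", ...)), leaving only the operation name and label to vary.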

HannesWell · 2022-06-30T20:55:12Z

...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java

+	/** lock for this service */
+	private final ReentrantLock lock = new ReentrantLockExt();
+
+	private Duration lockTimeout = Duration.ofSeconds(10);


Suggested change

private Duration lockTimeout = Duration.ofSeconds(10);

private static long lockTimeout = Duration.ofSeconds(10).toMillis();

I agree that using Duration makes the code clearer, but it requires repeated conversions to milliseconds that are not necessary. Therefore I suggest converting it immediately (and making it static).
This way it should be readable and performant.

HannesWell · 2022-06-30T21:04:35Z

...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java

+	void runWithLock(Runnable runnable) {
+		callWithLock(Executors.callable(runnable));
+	}
+
+	<T> T callWithLock(Callable<T> callable) {
+		try {
+			if (lock.tryLock(getLockTimeout().toMillis(), TimeUnit.MILLISECONDS)) {
+				try {
+					return callable.call();
+				} catch (InterruptedException e) {
+					Thread.currentThread().interrupt(); // To appease sonar
+					return null;
+				} catch (Exception e) {
+					throw new RuntimeException("Callable threw exception", e); //$NON-NLS-1$
+				} finally {
+					lock.unlock();
+				}
+			}
+			throw new ServiceException(
+					"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$
+							+ lock.toString(),
+					ServiceException.DEADLOCK);
+		} catch (InterruptedException e) {
+			Thread.currentThread().interrupt(); // To appease sonar
+			return null;
+		}
+	}
+
+	Duration getLockTimeout() {
+		return lockTimeout;
+	}
+
+	ReentrantLock getLock() {
+		return lock;
 	}


Suggested change

-	void runWithLock(Runnable runnable) {
-		callWithLock(Executors.callable(runnable));
-	}
-
-	<T> T callWithLock(Callable<T> callable) {
-		try {
-			if (lock.tryLock(getLockTimeout().toMillis(), TimeUnit.MILLISECONDS)) {
-				try {
-					return callable.call();
-				} catch (InterruptedException e) {
-					Thread.currentThread().interrupt(); // To appease sonar
-					return null;
-				} catch (Exception e) {
-					throw new RuntimeException("Callable threw exception", e); //$NON-NLS-1$
-				} finally {
-					lock.unlock();
-				}
-			}
-			throw new ServiceException(
-					"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$
-							+ lock.toString(),
-					ServiceException.DEADLOCK);
-		} catch (InterruptedException e) {
-			Thread.currentThread().interrupt(); // To appease sonar
-			return null;
-		}
-	}
-
-	Duration getLockTimeout() {
-		return lockTimeout;
-	}
-
-	ReentrantLock getLock() {
-		return lock;
-	}
+	interface Locked extends AutoCloseable {
+		@Override
+		void close();
+	}
+
+	Locked lock() throws InterruptedException {
+		if (lock.tryLock(lockTimeout, TimeUnit.MILLISECONDS)) {
+			if (debug.DEBUG_SERVICES) {
+				Debug.println('[' + Thread.currentThread().getName() + "] getServiceLock[" //$NON-NLS-1$
+						+ context.getBundle() + "](" + this + "), id:" + System.identityHashCode(lock) //$NON-NLS-1$ //$NON-NLS-2$
+						+ ". holdCount:" + lock.getHoldCount() //$NON-NLS-1$
+						+ ", queued:" + lock.getQueueLength()); //$NON-NLS-1$
+			}
+			return lock::unlock;
+		}
+		throw new ServiceException(
+				"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$
+						+ lock.toString(),
+				ServiceException.DEADLOCK);
+	}

What do you and the others think about using the lock() method proposed above? It would allow the 'locked state' to be used as the resource of a try block in a safe manner, at the three locations where use.getLock() is now used:

	try (Locked locked = use.lock()) {
		// perform the protected action
	} catch (InterruptedException e) {
		Thread.currentThread().interrupt(); // To appease sonar
		// if necessary return a reasonable default
	}

More or less, try (Locked locked = use.lock()) { would simply replace synchronized (use) {.
This would make the runWithLock() methods obsolete, since other exceptions can simply be propagated out of the try block, and would also simplify the lock's use in ServiceRegistrationImpl.getService(). Furthermore, code duplication would be avoided (e.g. the 'DeadLock' ServiceException).
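To make the pattern concrete outside the framework, here is a self-contained sketch under stated assumptions: GuardedUse is a hypothetical stand-in for ServiceUse, and IllegalStateException replaces ServiceException so that the example compiles without the OSGi API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for ServiceUse; names are assumptions for illustration.
final class GuardedUse {
    /** AutoCloseable narrowed so that close() throws no checked exception. */
    interface Locked extends AutoCloseable {
        @Override
        void close();
    }

    private final ReentrantLock lock = new ReentrantLock();
    private final long timeoutMillis;
    private int useCount;

    GuardedUse(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Acquires the lock or fails after the timeout; the result unlocks on close(). */
    Locked lock() throws InterruptedException {
        if (lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return lock::unlock;
        }
        // Stand-in for the ServiceException with type DEADLOCK used in the PR.
        throw new IllegalStateException("Failed to acquire lock within " + timeoutMillis + " ms");
    }

    /** Shows try-with-resources taking the place of a synchronized block. */
    int increment() {
        try (Locked locked = lock()) {
            return ++useCount;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1; // arbitrary default chosen for this sketch
        }
    }

    boolean isLocked() {
        return lock.isLocked();
    }
}
```

Because close() runs on every exit path of the try block, the lock is released even when the protected action throws, which is the property the synchronized block provided.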

bundles/org.eclipse.osgi/osgi/src/org/osgi/framework/ServiceException.java

HannesWell · 2022-06-30T22:01:41Z

It is not tested and maybe I have missed something, but I think the following code in ServiceUse could be a real deadlock detection:

	private final ReentrantLockExt lock = new ReentrantLockExt(); // needs an additional override of getOwner() to make it visible

	interface Locked extends AutoCloseable {
		@Override
		void close();
	}

	private static final ConcurrentMap<Thread, ReentrantLock> AWAITING_LOCKS = new ConcurrentHashMap<>();

	private final ThreadLocal<Set<ReentrantLock>> ownedLocks = ThreadLocal.withInitial(HashSet::new);

	Locked lock() throws InterruptedException {
		boolean setAwaiting = false;
		do  {
			if (lock.tryLock(10, TimeUnit.MILLISECONDS)) { // use a short time-out to not spin-wait for the lock to become available
				ownedLocks.get().add(lock);
				if (setAwaiting) {
					AWAITING_LOCKS.remove(Thread.currentThread());
				}
				if (debug.DEBUG_SERVICES) {
					Debug.println('[' + Thread.currentThread().getName() + "] getServiceLock[" //$NON-NLS-1$
							+ context.getBundle() + "](" + this + "), id:" + System.identityHashCode(lock) //$NON-NLS-1$ //$NON-NLS-2$
							+ ". holdCount:" + lock.getHoldCount() //$NON-NLS-1$
							+ ", queued:" + lock.getQueueLength()); //$NON-NLS-1$
				}
				return () -> {
					ownedLocks.get().remove(lock);
					lock.unlock();
				};
			}
			AWAITING_LOCKS.put(Thread.currentThread(), lock);
			setAwaiting = true;
			Thread owner = lock.getOwner();
			if (owner != null) { // lock could be released in the meantime
				ReentrantLock crossingLock = AWAITING_LOCKS.get(owner);
				if (ownedLocks.get().contains(crossingLock)) {
					throw new ServiceException(
							"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$
									+ lock.toString(),
							ServiceException.DEADLOCK);
				}
			}
			// Not a dead-lock. The lock was simply held by another thread. Just try again
		} while (true);
	}
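The loop above looks only one hop ahead: it asks whether the owner of the wanted lock is waiting for a lock held by the current thread. A cycle through three or more threads needs the whole chain followed. The following is a rough sketch of that chain walk on plain maps; the class and method names are hypothetical, thread and lock identities are modelled as generic keys, and it is not the code that was merged.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: T identifies a thread, L identifies a lock.
// Not thread-safe; a real registry would need concurrent maps.
final class WaitForChain<T, L> {
    private final Map<L, T> ownerOf = new HashMap<>();
    private final Map<T, L> awaiting = new HashMap<>();

    void acquired(T thread, L lock) {
        awaiting.remove(thread);
        ownerOf.put(lock, thread);
    }

    void released(L lock) {
        ownerOf.remove(lock);
    }

    void waiting(T thread, L lock) {
        awaiting.put(thread, lock);
    }

    /** True if 'requester' blocking on 'wanted' would close a wait cycle. */
    boolean wouldDeadlock(T requester, L wanted) {
        Set<T> seen = new HashSet<>();
        L lock = wanted;
        while (lock != null) {
            T owner = ownerOf.get(lock);
            if (owner == null) {
                return false; // the lock is free, so the chain ends here
            }
            if (owner.equals(requester)) {
                return true; // the chain leads back to the requester
            }
            if (!seen.add(owner)) {
                return false; // a cycle exists but does not include the requester
            }
            lock = awaiting.get(owner);
        }
        return false; // the last owner in the chain is not waiting for anything
    }
}
```

The seen set bounds the walk, so a cycle among other threads cannot trap the caller in an endless loop.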

bjhargrave · 2022-06-30T22:48:02Z

I have force-pushed to this PR with a number of updates and improvements.

bjhargrave · 2022-06-30T22:49:47Z

FYI, I authored all this ServiceUse code, so I am very familiar with it.

bjhargrave · 2022-06-30T22:58:03Z

I don't understand enough to know what the build error means and whether it is related to changes in this PR.

Error: Failed to execute goal org.eclipse.tycho.extras:tycho-p2-extras-plugin:3.0.0-SNAPSHOT:compare-version-with-baselines (compare-attached-artifacts-with-release) on project org.eclipse.equinox.supplement: Only qualifier changed for (org.eclipse.equinox.supplement/1.10.500.v20220630-2245). Expected to have bigger x.y.z than what is available in baseline (1.10.500.v20220516-2155) -> [Help 1]

laeubi · 2022-07-01T04:25:05Z

@bjhargrave you need to increment the version of org.eclipse.equinox.supplement in the manifest (and probably the pom, if there is one) from 1.10.500 to 1.10.600

github-actions · 2022-07-01T15:11:05Z

Unit Test Results

    24 files     24 suites 11m 59s ⏱️
2 134 tests 2 089 ✔️ 44 💤 1 ❌
2 178 runs 2 133 ✔️ 44 💤 1 ❌

For more details on these failures, see this check.

Results for commit ce17f97.

♻️ This comment has been updated with latest results.

bjhargrave · 2022-07-01T15:15:31Z

java.lang.AssertionError: The StaticCircuit2 component shouldn't have bound objects expected:<0> but was:<1>
at org.eclipse.equinox.ds.tests.tbc.DSTest.testCircularityHandling(DSTest.java:1426)

This test failure may be the result of the PR changing to break the deadlock using the timeout on the ReentrantLock.

tjwatson · 2022-07-03T12:42:34Z

This test failure may be the result of the PR changing to break the deadlock using the timeout on the ReentrantLock.

This test seems to fail only in the github actions, no one has been able to determine why. The test run on eclipse CI looks more suspicious https://ci.eclipse.org/equinox/job/equinox/job/PR-68/7/

But I don't think it is related. It has this deadlock caused by some issue with closing the framework storage:

    java.base@17.0.1/java.io.FileDescriptor.close0(Native Method)
    java.base@17.0.1/java.io.FileDescriptor.close(FileDescriptor.java:297)
    java.base@17.0.1/java.io.FileOutputStream$1.close(FileOutputStream.java:392)
    java.base@17.0.1/java.io.FileDescriptor.closeAll(FileDescriptor.java:355)
    java.base@17.0.1/java.io.FileOutputStream.close(FileOutputStream.java:390)
    org.eclipse.osgi.storagemanager.StorageManager.closeOutputStream(StorageManager.java:912)
    org.eclipse.osgi.storagemanager.ManagedOutputStream.close(ManagedOutputStream.java:56)

This has been a known issue from the past, but I've not seen it happen on the eclipse CI for a long time.

tjwatson · 2022-07-03T12:45:32Z

I worry about the performance penalty of this change for such a high use code path. Particularly the move away from simple synchronization and the use of many new lambdas.

We will need to do some performance analysis on this change @jhanders34

HannesWell · 2022-07-03T15:15:06Z

I worry about the performance penalty of this change for such a high use code path. Particularly the move away from simple synchronization and the use of many new lambdas.

If we use the AutoCloseable-lock approach suggested above (#68 (comment)), there is only one location with a lambda, which can easily be replaced by a corresponding anonymous class.
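For illustration, the two forms side by side. This is a sketch with hypothetical names (UnlockHandles and its methods are not from the PR); both methods return an object whose close() releases the given lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical holder class; only the Locked shape follows the suggestion above.
final class UnlockHandles {
    interface Locked extends AutoCloseable {
        @Override
        void close();
    }

    private UnlockHandles() {
    }

    /** Method-reference form, as in the suggested lock() method. */
    static Locked viaMethodReference(ReentrantLock lock) {
        return lock::unlock;
    }

    /** Equivalent anonymous-class form, avoiding the lambda machinery. */
    static Locked viaAnonymousClass(final ReentrantLock lock) {
        return new Locked() {
            @Override
            public void close() {
                lock.unlock();
            }
        };
    }
}
```

Whether the anonymous class is measurably cheaper on a hot path is something only profiling can answer.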

laeubi · 2022-07-04T05:44:26Z

As mentioned in the other change, I think it would be better to try to move to a more lock-free approach than to use time-outs.
I even wonder whether, according to the OSGi spec, a caller has to expect such "unusual" exceptions. I think none of the documented outcomes of org.osgi.framework.BundleContext.getService(ServiceReference<S>) allows for this; even returning null might be arguable, as the service obviously is registered.

castortech · 2022-07-04T09:04:00Z

Regarding the timeout, I can only say that when we captured debugging and analyzed the results, we could never see any delay during normal use and even cases that would produce deadlock would have very low delay. The original synchronized would obviously have delays of its own. IMHO the biggest question is the choice of the 10 secs timeout, which was arbitrary and might be better at a much lower value.

bjhargrave · 2022-07-04T17:15:47Z

Particularly the move away from simple synchronization

I am not sure this is wise either. But calling ServiceFactories has always been an issue. We must prevent multiple concurrent calls to a factory (so I am not sure how a lock free design can handle that), so we hold a mutex lock. Using intrinsic locks is simple but provides no way to back off when a timeout elapses, which can be evidence of a deadlock. Using a ReentrantLock allows a timeout, so we can back off and throw an exception indicating the ServiceFactory could not be called.

the use of many new lambdas.

There are now 3 lambda expressions. They can be replaced with anonymous classes if necessary.

I even wonder whether, according to the OSGi spec, a caller has to expect such "unusual" exceptions. I think none of the documented outcomes of org.osgi.framework.BundleContext.getService(ServiceReference<S>) allows for this; even returning null might be arguable, as the service obviously is registered.

ServiceException can always be thrown. I think returning null is wrong for a timeout as it wallpapers over a real issue. Throwing an exception allows the caller to know something is wrong and tell someone (e.g. log it).

the biggest question is the choice of the 10 secs timeout, which was arbitrary and might be better at a much lower value.

The challenge is to distinguish a real deadlock from a very slow ServiceFactory implementation which is running on another thread. For the latter, we want to wait for it. For the former, the deadlock will never resolve (or it wouldn't be a deadlock), so waiting a long time does no great harm. I would be more concerned that 10 seconds is not long enough to be convinced that it is not just a very slow ServiceFactory implementation.

HannesWell · 2022-07-04T17:22:43Z

the biggest question is the choice of the 10 secs timeout, which was arbitrary and might be better at a much lower value.

The challenge is to distinguish a real deadlock from a very slow ServiceFactory implementation which is running on another thread. For the latter, we want to wait for it. For the former, the deadlock will never resolve (or it wouldn't be a deadlock), so waiting a long time does no great harm. I would be more concerned that 10 seconds is not long enough to be convinced that it is not just a very slow ServiceFactory implementation.

As mentioned before I share this concern, therefore I suggested to use a 'real' deadlock detection, for which I described a prototype above: #68 (comment)
I have neither tested nor used it, so it might be flawed, but from my theoretical thinking it should work. But let me know if you see a problem.

bjhargrave · 2022-07-04T17:28:06Z

But let me know if you see a problem.

The main issue is that it is a big hammer to hit a very rare problem. It uses N+1 maps (N being the number of threads which call this code) and the maps are pinned to the thread via ThreadLocals. With changes coming to Java for virtual threads, we should steer clear of adding more ThreadLocals.

laeubi · 2022-07-04T17:43:48Z

so I am not sure how a lock free design can handle that

@HannesWell has made some experiments, and the deadlock could be prevented by putting a break point at the right place, so it is not really an unavoidable dead-lock but somewhere in the locking-chain there is an incomplete locking of required objects.

From the last investigations, I found that the sync is done for too many things (e.g. service use increments), and for both factories and plain services, while for a ServiceFactory one should actually lock on the BundleContext, as far as I can see, and for a plain service no sync is needed at all ...

@bjhargrave I wonder if you as the author of this code could tell how well it is covered by test, e.g. if one wants to adjust the locking model used here, do we need manual tests or can we rely on the automatic ones so if build is okay we are quite certain nothing breaks here?

bjhargrave · 2022-07-04T17:57:08Z

I wonder if you as the author of this code could tell how well it is covered by test

I don't know how extensive the build tests are.

The code has been used for many years in many production environments. So deep care must be taken with any changes here.

while e.g. for a ServiceFactory one should actually lock on the BundleContext as far as I can see

I assume you mean the user's bundle context. That is too coarse. That means a bundle can only get one service at a time (SCR might be upset). ServiceUse is the combination of the user of the service and the service which is why it is the lock object. It provides the finest grain for locking and, in general, should not be contended.
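The granularity argument can be sketched with a hypothetical lock registry keyed by the (user, service) pair. In the framework the ServiceUse object itself plays this role; the class below and its string keys are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical registry; one lock per (user bundle, service) pair.
final class UseLocks {
    private final ConcurrentMap<List<String>, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Returns the same lock for the same pair and a distinct lock otherwise. */
    ReentrantLock lockFor(String userBundle, String service) {
        return locks.computeIfAbsent(List.of(userBundle, service), key -> new ReentrantLock());
    }
}
```

With a lock per pair, one bundle obtaining two different services never contends with itself, whereas a single lock per user bundle would serialize those calls.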

laeubi · 2022-07-04T18:02:34Z

I assume you mean the user's bundle context. That is too coarse. That means a bundle can only get one service at a time (SCR might be upset)

Better slow/upset than dead-locking ;-)
The problem arises when two requestors of a service have crossing dependency chains; then it is not a problem of contention... see #15 for more details.

HannesWell · 2022-07-04T18:03:41Z

But let me know if you see a problem.

The main issue is that it is a big hammer to hit a very rare problem. It uses N+1 maps (N being the number of threads which call this code) and the maps are pinned to the thread via ThreadLocals. With changes coming to Java for virtual threads, we should steer clear of adding more ThreadLocals.

It is indeed a big hammer. Do you know if there is a better approach? I mean deadlock detection is actually a quite generic programming problem (not even Java specific), so I assume that some smart people already got better ideas than I did when thinking ~15min about the problem.

IIRC Eclipse Jobs can detect deadlocks? But probably only between jobs-rules? Or is this something we can inherit.

so I am not sure how a lock free design can handle that

@HannesWell has made some experiments, and the deadlock could be prevented by putting a break point at the right place, so it is not really an unavoidable dead-lock but somewhere in the locking-chain there is an incomplete locking of required objects.

I want to clarify that I had to set the break-point and thereby hold thread B back before it obtained the lock that is later wanted by thread A. So basically I manually scheduled the threads so that one can complete its work before the other one starts.
If we want to implement that, it will probably be more complicated and perform worse (is my guess, but I might be wrong) than just detecting it and aborting with an exception.

EDIT: But of course it would be even better to resolve the dead-lock than just aborting.

laeubi · 2022-07-04T18:09:06Z

If we want to implement that, it will probably be more complicated and perform worse (is my guess, but I might be wrong) than just detecting it and aborting with an exception.

I don't think that this is a performance problem

we don't see heavy multi-threading here, actually two threads; bigger setups might have a few more, but realistically, acquiring services with hundreds of threads concurrently is not really a usual use-case
Often one service acquires the other and activates, so we have more of a linear chain
the performance penalty of starting all over again (if the code actually is able to recover!) is much higher, e.g. how/when should a failed attempt be retried? Maybe with a backoff algorithm? Compared to waiting a bit more for something I need to wait for anyway (given we have not dead-locked), there is not much to gain

tjwatson · 2022-07-05T13:15:51Z

If we want to implement that, it will probably be more complicated and perform worse (is my guess, but I might be wrong) than just detecting it and aborting with an exception.

I don't think that this is a performance problem

That cannot be known until we run performance measurements. We do this regularly in OpenLiberty when we upgrade the framework.

we don't see heavy multi-threading here, actually two threads; bigger setups might have a few more, but realistically, acquiring services with hundreds of threads concurrently is not really a usual use-case

You cannot make that assumption. There are uses of Equinox that enable parallel bundle activation during framework startup where bundles within the same start-level are activated in parallel.

Often one service acquires the other and activates, so we have more of a linear chain

My comment about performance was more about the sheer number of additional methods being called here vs a simple intrinsic lock, and the use of a lambda, which has proven difficult for the JIT to optimize (at least on some versions of Java) when used in high-traffic code paths. All of that is on a code path that is heavily used in systems that have 1000s of service components.

the performance penalty of starting all over again (if the code actually is able to recover!) is much higher, e.g. how/when should a failed attempt be retried? Maybe with a backoff algorithm? Compared to waiting a bit more for something I need to wait for anyway (given we have not dead-locked), there is not much to gain

I'm not saying we have a better option. Only that we need to understand the cost of this for something that is arguably a bug in the design of service components involved in the deadlock and perhaps should be addressed by SCR itself that allows its own service factories to trigger the deadlock in the first place. On the other hand I see the appeal to having the deadlock break out from the framework level. But the result isn't going to fix the components involved in the deadlock. They will still be in a broken state, correct?

...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java

HannesWell · 2022-07-21T23:52:15Z

The tests are now complete and can be reviewed as well.

Using a reentrant lock on Service use instead of simple synchronized blocks. Also added quite a bit of logging to help tracing. Signed-off-by: Alain Picard <picard@castortech.com> Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Signed-off-by: Alain Picard <picard@castortech.com> Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Numerous comments and asserts needed updating. All uses under lock are run through withLock which has cleaned up the exception handling to properly manage interrupt state. NLS is used for exception messages. Lock subclass toString is cleaned up to avoid string concatenation when we are already using a StringBuilder. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

The ServiceUse.lock() method returns an AutoCloseable so the caller can use try-with-resources to obtain and then close the lock. This is much closer to intrinsic locking in coding practice. Instead of using a long-ish timeout for deadlock detection, which could be wrong on slow systems, we now use a map to hold the thread to lock pairings and examine the lock chain to see if the current thread is in the chain. Signed-off-by: Hannes Wellmann <wellmann.hannes1@gmx.net> Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

When throwing the deadlock exception we also need to remove the lock from the awaiting map. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

This is close to behavior of intrinsic locking which does not alter the thread interrupt status. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Remove unnecessary elements and move dead-lock detection logic into a dedicated isDeadLocked() method to make the code more readable. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Prevent one interrupted check on each call to lock(). From the start of lock() to useLock.tryLock(... everything should be as performant as possible. Afterwards we are usually waiting so it doesn't really matter since we cannot wait faster. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Reproducer test for: https://github.com/eclipse-equinox/equinox.framework/issues/55 Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

We move the concurrent map to be an instance field of ServiceRegistry instead of a static. We want to avoid static mutable state to support multiple framework instances in the same JVM. We also include the DEADLOCK type in the exception. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
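One of the commit notes above mentions restoring the interrupt status after locking, to stay close to intrinsic locking, which does not alter it. A rough sketch of how a timed acquisition can preserve that flag follows; the class and method names are hypothetical, and this is not the code from the commit.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical utility; illustrates the technique only.
final class InterruptPreserving {
    private InterruptPreserving() {
    }

    /**
     * Tries to acquire the lock within the timeout while ignoring interrupts,
     * then restores the interrupt flag if one was pending or arrived meanwhile.
     */
    static boolean tryLock(ReentrantLock lock, long timeoutNanos) {
        // Clear a pending interrupt so the timed tryLock does not throw at once.
        boolean interrupted = Thread.interrupted();
        long deadline = System.nanoTime() + timeoutNanos;
        try {
            while (true) {
                long remaining = deadline - System.nanoTime();
                try {
                    // A non-positive remaining time makes this a single non-waiting attempt.
                    return lock.tryLock(remaining, TimeUnit.NANOSECONDS);
                } catch (InterruptedException e) {
                    interrupted = true; // remember it and keep waiting
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

The caller sees the same interrupt status before and after the call, as it would around a synchronized block.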

tjwatson · 2022-07-25T12:59:15Z

I ran the proposed changes through an Open Liberty build and it passed all but one of our 86246 tests (the one failure appears to be an infrastructure issue). This helps confirm the fix does not have any functional issues, although we should get a run against the OSGi core TCK also.

I still need to get startup performance measurements done on a configuration that enables a very large set of service components during startup. I hope to have something in the next day or so.

tjwatson

The performance run across a wide range of configurations in Open Liberty showed no measurably significant slow down. In some cases we saw a small performance improvement.

I suspect the improvements observed are due to other improvements that we have yet to consume in Open Liberty though. For example eclipse-equinox/equinox.framework#54 and eclipse-equinox/equinox.framework#46. It is still possible that this change has some measurable cost but it could be getting offset by the other improvements. Regardless, it seems that the cost is not significant enough to be concerned about at this point.

Another observation is that the measurable CPU time did see a slight increase in some configurations, but that increased CPU utilization did not seem to impact the overall startup time of that configuration when 4 cores are assigned to the test run, implying that a bit more work is happening in the background. When restricting the run to a single core this did convert to a slightly slower startup time for these configurations. Regardless, that still seems to be a small cost to the overall startup time performance.

tjwatson · 2022-07-26T13:47:26Z

I should also state that in Open Liberty we use equinox.start.level.thread.count to enable parallel bundle activation by setting it to something > 1. This causes a flood of service registry activity from multiple threads when we are starting up. This in turn should exercise the locking mechanism here to a great extent.

vogella · 2022-08-02T08:16:32Z

@tjwatson with this fix, can you activate equinox.start.level.thread.count by default (setting it to > 1)?

laeubi · 2022-08-02T08:37:25Z

@vogella AFAIK this is non standard behavior and should best be activated on the EPP if desired.

tjwatson · 2022-08-02T12:43:58Z

@tjwatson with this fix, can you activate equinox.start.level.thread.count by default (setting it to > 1)?

Recall my comments from https://bugs.eclipse.org/bugs/show_bug.cgi?id=540507#c17

You should know that only that first number from "Starting application" can really be influenced by the parallel bundle activation from the start-level launch. By this time all the bundles have been activated by the framework. The rest of the activation that occurs in a typical Eclipse launch happens in response to a class load with lazy activation.

This is one reason I think you will find very little gain in parallel bundle activation at this level, because most of Eclipse bundles are lazy activated with a trigger class load. That activation then happens on the thread that is loading the class.

Furthermore, I really only think this parallel bundle activation from the framework launch helps in cases where many bundles are installed that are eagerly activated.

I see very little advantage to using this option by default for any Eclipse RCP usage as long as most all bundles are lazy activated.

@vogella AFAIK this is non standard behavior and should best be activated on the EPP if desired.

There is nothing non-standard here. The OSGi specification does not mandate that bundles within the same start-level be started by a single thread or in any particular order within the start-level. The implementation is free to start in parallel or not.

HannesWell · 2022-08-05T16:09:54Z

@tjwatson with this fix, can you activate equinox.start.level.thread.count by default (setting it to > 1)?

Can't you simply test with the latest I-build whether a value greater than one for that flag speeds up the start-up?

tjwatson · 2022-08-05T16:38:26Z

Can't you simply test with the latest I-build whether a value greater than one for that flag speeds up the start-up?

Yes, that is what I did when I first put this function into Equinox. I saw no improvement at all with it enabled. After some investigation, the reason turned out to be that almost everything is lazy-activated, so all the real work only happens when classes are loaded. The majority of class loading does not happen until all the bundles have been started and control has been returned to the thread that runs the SWT display thread. Because DS is lazy also, even the registration of the DS components does not load any classes until the service is actually used (unless the component is immediate).

Let me clarify the term "started". That is really just Bundle.start. But for lazy activated bundles that does not mean the BundleActivator.start gets called immediately. Instead that will be delayed until the first class is defined using the class loader of the bundle. The changes to do parallel starts for start-level really just make sure Bundle.start is called. But for the vast majority of our bundles that results in very little work to be done until a class load is triggered.

I personally do not plan to retest whether starting in parallel helps Eclipse RCP/IDE bring-up, because I don't think anything has changed that would make it help.

laeubi requested a review from tjwatson June 30, 2022 14:54

bjhargrave requested changes Jun 30, 2022

castortech requested a review from bjhargrave June 30, 2022 16:46

HannesWell requested changes Jun 30, 2022

bjhargrave force-pushed the master branch from b53662a to 1d4f467 June 30, 2022 22:46

bjhargrave changed the title ~~Address encountered deadlock issues with service registration.~~ Address encountered deadlock issues with service use. Jun 30, 2022

bjhargrave force-pushed the master branch from 7f75597 to c8d5d4a July 1, 2022 14:50

tjwatson added a commit to tjwatson/open-liberty that referenced this pull request Jul 21, 2022

Run test against eclipse-equinox/equinox#68

da34a35

tjwatson mentioned this pull request Jul 21, 2022

Run test against https://github.com/eclipse-equinox/equinox/pull/68 DO NOT MERGE OpenLiberty/open-liberty#21825

Closed

tjwatson reviewed Jul 21, 2022

...les/org.eclipse.osgi/container/src/org/eclipse/osgi/internal/serviceregistry/ServiceUse.java

HannesWell mentioned this pull request Jul 24, 2022

Execute tycho-ds-plugin for eclipse-test-plugins eclipse-tycho/tycho#1185

Merged

castortech and others added 14 commits July 24, 2022 18:54

Address comment from PR

1260b86

Signed-off-by: Alain Picard <picard@castortech.com>
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Avoid infinite loop when current thread not part of cycle

1bd6c6c

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Must clear added lock no matter how we exit the while loop

4843d03

When throwing the deadlock exception we also need to remove the lock from the awaiting map.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Restore interrupt status after lock

8fd75f2

This is close to the behavior of intrinsic locking, which does not alter the thread interrupt status.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Simplify code structure and apply general clean ups.

999e93e

Remove unnecessary elements and move dead-lock detection logic into a dedicated isDeadLocked() method to make the code more readable.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Add LazyServiceComponentActivationDeadLockTest

11605f7

Reproducer test for: https://github.com/eclipse-equinox/equinox.framework/issues/55

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

Add simple ServiceFactory based dead-lock test and generalize DS case

d0134bf

Minor comment/doc/variable adjustments

e0d3581

Fix test-case build and add contribution entry and update copyright year

ff5dd84

HannesWell force-pushed the master branch from ce17f97 to ff5dd84 July 24, 2022 16:55

HannesWell mentioned this pull request Jul 24, 2022

Recover from Service-Factory deadlock #87

Closed

tjwatson approved these changes Jul 26, 2022

tjwatson merged commit 4b397a2 into eclipse-equinox:master Jul 26, 2022

	private Duration lockTimeout = Duration.ofSeconds(10);
	private static long lockTimeout = Duration.ofSeconds(10).toMillis();
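As a rough illustration of how a millisecond timeout like the one above could be used together with the "Restore interrupt status after lock" behavior from the commit list, here is a minimal sketch built only on `java.util.concurrent.locks.ReentrantLock`. It is not the merged Equinox code: `TimedLockSketch`, `lockWithTimeout`, and `DEFAULT_TIMEOUT_MILLIS` are hypothetical names, and the merged change also contains dead-lock detection that is not modeled here.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper, not part of Equinox.
class TimedLockSketch {
    // Assumption: same 10 second value as the snippet above, kept in milliseconds.
    static final long DEFAULT_TIMEOUT_MILLIS = Duration.ofSeconds(10).toMillis();

    static boolean lockWithTimeout(ReentrantLock lock) {
        return lockWithTimeout(lock, DEFAULT_TIMEOUT_MILLIS);
    }

    // Tries to acquire the lock within the timeout. An interrupt does not abort
    // the wait; it is remembered and re-asserted before returning, so the
    // caller sees the same interrupt status it would see with intrinsic locking.
    static boolean lockWithTimeout(ReentrantLock lock, long timeoutMillis) {
        boolean interrupted = false;
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        final long deadline = System.nanoTime() + remainingNanos;
        try {
            while (true) {
                try {
                    // Returns false once the remaining time has elapsed.
                    return lock.tryLock(remainingNanos, TimeUnit.NANOSECONDS);
                } catch (InterruptedException e) {
                    interrupted = true;
                    remainingNanos = deadline - System.nanoTime();
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A caller would treat a `false` result as "could not obtain the lock in time" and report an error instead of blocking forever, and would call `unlock()` in a `finally` block after a `true` result.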
