
Address encountered deadlock issues with service use. #68

Merged: 14 commits, Jul 26, 2022

Conversation

castortech (Contributor)

This code was added at our end over 10 months ago, had one minor fix 5 months ago, and has been used in production since, as we wanted to make sure that no issues were seen over time. Now that we have a new release to sync with, it's time to create a formal PR.

Using a reentrant lock (the default implementation, extended only for a special toString) on ServiceUse instead of simple synchronized blocks.

Also added quite a bit of logging to help tracing.
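
For illustration only, a minimal sketch of what such an extended lock might look like (the class name ReentrantLockExt appears later in the review; the exact toString() content here is made up):

	// Sketch: the only extension over the default ReentrantLock is a toString()
	// that exposes extra state for the added trace logging.
	static final class ReentrantLockExt extends ReentrantLock {
		private static final long serialVersionUID = 1L;

		@Override
		public String toString() {
			return super.toString() + "[holdCount=" + getHoldCount() + ", queued=" + getQueueLength() + ']'; //$NON-NLS-1$ //$NON-NLS-2$
		}
	}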

@laeubi laeubi requested a review from tjwatson June 30, 2022 14:54
@HannesWell (Member)

Did you create this PR in order to address #15 and/or #31?
Because it sounds very related.

@castortech (Contributor, Author)

Did you create this PR in order to address #15 and/or #31? Because it sounds very related.

I didn't; as I said, for us this was about 10 months ago, but it does look very, very similar. Call it a coincidence.

@bjhargrave (Member) left a comment

Some requested changes.

@castortech castortech requested a review from bjhargrave June 30, 2022 16:46
@HannesWell (Member) left a comment

Did you create this PR in order to address #15 and/or #31? Because it sounds very related.

I didn't; as I said, for us this was about 10 months ago, but it does look very, very similar. Call it a coincidence.

Looks like a coincidence, yes. :)
Great, I just thought we should work on fixing that soon. I didn't expect to get a PR this fast, so thanks already.

Can you fetch #31 and report if this fixes it?

You solve the deadlock by throwing a ServiceException if the lock cannot be obtained after a fixed timeout has elapsed. IIRC this was also an approach discussed in #15.
So does the start-up then simply fail, or will the SCR actor perform another attempt to activate the component later?
@tjwatson just in case it is not yet implemented: Do you think Felix SCR could handle a ServiceException caused by a deadlock specifically by just re-attempting to activate the root component?

Besides that, I think it would be even better if we had 'real' deadlock detection, meaning that the chain of uses in this thread is tracked, and if one lock cannot be obtained it is checked whether there is another thread with a 'crossing' chain in the opposite direction.
I think this could be achieved by maintaining a set of locked uses per thread that is updated when the lock is acquired and released. When a lock cannot be acquired, it can then be checked whether the thread holding the desired lock is waiting for a lock held by this thread. The latter would also require a mapping lock -> owner, but since you already subclass ReentrantLock, the protected method can be made visible.

Having 'real' deadlock detection would be more reliable (maybe some components just take very long to activate) and often faster (for components that actually activate quickly, waiting 10 s is just a waste of time).

Comment on lines 64 to 67
if (debug.DEBUG_SERVICES) {
	Debug.println("getServiceObject[factory=" + registration.getBundle() + "](" + context.getBundleImpl() + "," + registration + ")"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$
	Debug.println('[' + Thread.currentThread().getName() + "] getServiceObject[PSfactory=" //$NON-NLS-1$
			+ registration.getBundle() + "](" + context.getBundleImpl() + "," + registration + ")"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
}

The log messages seem to be very similar.
I think it would make the code more readable to extract them into a dedicated message to which the required objects and the specific message part are passed.
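
A rough sketch of such an extraction (the helper name and the loose parameter types here are hypothetical, not taken from the PR):

	// Hypothetical helper that builds the shared "[thread] getServiceObject[...]" debug line.
	private static void debugGetServiceObject(String kind, Object factoryBundle, Object usingContext, Object registration) {
		Debug.println('[' + Thread.currentThread().getName() + "] getServiceObject[" + kind + '=' //$NON-NLS-1$ //$NON-NLS-2$
				+ factoryBundle + "](" + usingContext + "," + registration + ")"); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
	}

Each call site would then pass only the distinguishing part, e.g. "factory" or "PSfactory", plus its own objects.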

/** lock for this service */
private final ReentrantLock lock = new ReentrantLockExt();

private Duration lockTimeout = Duration.ofSeconds(10);

Suggested change
private Duration lockTimeout = Duration.ofSeconds(10);
private static long lockTimeout = Duration.ofSeconds(10).toMillis();

I agree that it makes the code clearer to use Duration, but this requires repetitive conversions to milliseconds that are not necessary. Therefore I suggest converting it immediately (and making it static).
This way it should be both readable and performant.

Comment on lines 68 to 102
void runWithLock(Runnable runnable) {
	callWithLock(Executors.callable(runnable));
}

<T> T callWithLock(Callable<T> callable) {
	try {
		if (lock.tryLock(getLockTimeout().toMillis(), TimeUnit.MILLISECONDS)) {
			try {
				return callable.call();
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt(); // To appease sonar
				return null;
			} catch (Exception e) {
				throw new RuntimeException("Callable threw exception", e); //$NON-NLS-1$
			} finally {
				lock.unlock();
			}
		}
		throw new ServiceException(
				"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$
						+ lock.toString(),
				ServiceException.DEADLOCK);
	} catch (InterruptedException e) {
		Thread.currentThread().interrupt(); // To appease sonar
		return null;
	}
}

Duration getLockTimeout() {
	return lockTimeout;
}

ReentrantLock getLock() {
	return lock;
}

Suggested change
void runWithLock(Runnable runnable) {
	callWithLock(Executors.callable(runnable));
}

<T> T callWithLock(Callable<T> callable) {
	try {
		if (lock.tryLock(getLockTimeout().toMillis(), TimeUnit.MILLISECONDS)) {
			try {
				return callable.call();
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt(); // To appease sonar
				return null;
			} catch (Exception e) {
				throw new RuntimeException("Callable threw exception", e); //$NON-NLS-1$
			} finally {
				lock.unlock();
			}
		}
		throw new ServiceException(
				"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$
						+ lock.toString(),
				ServiceException.DEADLOCK);
	} catch (InterruptedException e) {
		Thread.currentThread().interrupt(); // To appease sonar
		return null;
	}
}

Duration getLockTimeout() {
	return lockTimeout;
}

ReentrantLock getLock() {
	return lock;
}

interface Locked extends AutoCloseable {
	@Override
	void close();
}

Locked lock() throws InterruptedException {
	if (lock.tryLock(lockTimeout, TimeUnit.MILLISECONDS)) {
		if (debug.DEBUG_SERVICES) {
			Debug.println('[' + Thread.currentThread().getName() + "] getServiceLock[" //$NON-NLS-1$
					+ context.getBundle() + "](" + this + "), id:" + System.identityHashCode(lock) //$NON-NLS-1$ //$NON-NLS-2$
					+ ". holdCount:" + lock.getHoldCount() //$NON-NLS-1$
					+ ", queued:" + lock.getQueueLength()); //$NON-NLS-1$
		}
		return lock::unlock;
	}
	throw new ServiceException(
			"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$
					+ lock.toString(),
			ServiceException.DEADLOCK);
}

What do you and the others think about using the lock() method proposed above? It would allow using the 'locked' state as the resource of a try-block in a safe manner at the three locations where use.getLock() is now used:

try (Locked locked = use.lock()) {
	// perform the protected action
} catch (InterruptedException e) {
	Thread.currentThread().interrupt(); // To appease sonar
	// if necessary return a reasonable default
}

More or less, try (Locked locked = use.lock()) { would simply replace synchronized (use) {.
This would make the runWithLock() methods obsolete, since other exceptions can simply be propagated outside the try-block, and it would also simplify the lock's use in ServiceRegistrationImpl.getService(). Furthermore, code duplication would be avoided (e.g. the 'deadlock' ServiceException).

@HannesWell (Member)

It is not tested and maybe I have missed something, but I think the following code in ServiceUse could implement real deadlock detection:

	private final ReentrantLockExt lock = new ReentrantLockExt(); // needs an additional override of getOwner() to make it visible

	interface Locked extends AutoCloseable {
		@Override
		void close();
	}

	private static final ConcurrentMap<Thread, ReentrantLock> AWAITING_LOCKS = new ConcurrentHashMap<>();

	private final ThreadLocal<Set<ReentrantLock>> ownedLocks = ThreadLocal.withInitial(HashSet::new);

	Locked lock() throws InterruptedException {
		boolean setAwaiting = false;
		do  {
			if (lock.tryLock(10, TimeUnit.MILLISECONDS)) { // use a short time-out to not spin-wait for the lock to become available
				ownedLocks.get().add(lock);
				if (setAwaiting) {
					AWAITING_LOCKS.remove(Thread.currentThread());
				}
				if (debug.DEBUG_SERVICES) {
					Debug.println('[' + Thread.currentThread().getName() + "] getServiceLock[" //$NON-NLS-1$
							+ context.getBundle() + "](" + this + "), id:" + System.identityHashCode(lock) //$NON-NLS-1$ //$NON-NLS-2$
							+ ". holdCount:" + lock.getHoldCount() //$NON-NLS-1$
							+ ", queued:" + lock.getQueueLength()); //$NON-NLS-1$
				}
				return () -> {
					ownedLocks.get().remove(lock);
					lock.unlock();
				};
			}
			AWAITING_LOCKS.put(Thread.currentThread(), lock);
			setAwaiting = true;
			Thread owner = lock.getOwner();
			if (owner != null) { // lock could be released in the meantime
				ReentrantLock crossingLock = AWAITING_LOCKS.get(owner);
				if (ownedLocks.get().contains(crossingLock)) {
					throw new ServiceException(
							"Failed to acquire lock within try period, Lock, id:" + System.identityHashCode(lock) + ", " //$NON-NLS-1$ //$NON-NLS-2$
									+ lock.toString(),
							ServiceException.DEADLOCK);
				}
			}
			// Not a dead-lock. Lock was regularly hold by another thread. Just try again
		} while (true);
	}

@bjhargrave (Member)

I have force-pushed to this PR with a number of updates and improvements.

@bjhargrave (Member)

FYI, I authored all this ServiceUse code, so I am very familiar with it.

@bjhargrave (Member)

I don't understand enough to know what the build error means and whether it is related to the changes in this PR.

Error: Failed to execute goal org.eclipse.tycho.extras:tycho-p2-extras-plugin:3.0.0-SNAPSHOT:compare-version-with-baselines (compare-attached-artifacts-with-release) on project org.eclipse.equinox.supplement: Only qualifier changed for (org.eclipse.equinox.supplement/1.10.500.v20220630-2245). Expected to have bigger x.y.z than what is available in baseline (1.10.500.v20220516-2155) -> [Help 1]

@bjhargrave bjhargrave changed the title Address encountered deadlock issues with service registration. Address encountered deadlock issues with service use. Jun 30, 2022
@laeubi (Member)

laeubi commented Jul 1, 2022

@bjhargrave you need to increment the version of org.eclipse.equinox.supplement in the manifest (and probably the pom, if there is one) from 1.10.500 -> 1.10.600

@github-actions bot commented Jul 1, 2022

Unit Test Results

24 files, 24 suites, 11m 59s ⏱️
2 134 tests: 2 089 ✔️ passed, 44 💤 skipped, 1 failed
2 178 runs: 2 133 ✔️ passed, 44 💤 skipped, 1 failed

For more details on these failures, see this check.

Results for commit ce17f97.

♻️ This comment has been updated with latest results.

@bjhargrave (Member)

java.lang.AssertionError: The StaticCircuit2 component shouldn't have bound objects expected:<0> but was:<1>
at org.eclipse.equinox.ds.tests.tbc.DSTest.testCircularityHandling(DSTest.java:1426)

This test failure may be the result of the PR's change to break the deadlock using the timeout on the ReentrantLock.

@tjwatson (Contributor)

tjwatson commented Jul 3, 2022

This test failure may be the result of the PR's change to break the deadlock using the timeout on the ReentrantLock.

This test seems to fail only in the GitHub Actions runs; no one has been able to determine why. The test run on the Eclipse CI looks more suspicious: https://ci.eclipse.org/equinox/job/equinox/job/PR-68/7/

But I don't think it is related. It has this deadlock caused by some issue with closing the framework storage:

    java.base@17.0.1/java.io.FileDescriptor.close0(Native Method)
    java.base@17.0.1/java.io.FileDescriptor.close(FileDescriptor.java:297)
    java.base@17.0.1/java.io.FileOutputStream$1.close(FileOutputStream.java:392)
    java.base@17.0.1/java.io.FileDescriptor.closeAll(FileDescriptor.java:355)
    java.base@17.0.1/java.io.FileOutputStream.close(FileOutputStream.java:390)
    org.eclipse.osgi.storagemanager.StorageManager.closeOutputStream(StorageManager.java:912)
    org.eclipse.osgi.storagemanager.ManagedOutputStream.close(ManagedOutputStream.java:56)

This has been a known issue from the past, but I've not seen it happen on the eclipse CI for a long time.

@tjwatson (Contributor)

tjwatson commented Jul 3, 2022

I worry about the performance penalty of this change for such a high-use code path. Particularly the move away from simple synchronization and the use of many new lambdas.

We will need to do some performance analysis on this change @jhanders34

@HannesWell (Member)

I worry about the performance penalty of this change for such a high-use code path. Particularly the move away from simple synchronization and the use of many new lambdas.

If we use the AutoCloseable-lock approach suggested above (#68 (comment)) there is only one location with a lambda, which can easily be replaced by a corresponding anonymous class.

@laeubi (Member)

laeubi commented Jul 4, 2022

As mentioned in the other change, I think it would be better to try to move to a more lock-free approach than to use time-outs.
I even wonder if, according to the OSGi spec, a caller has to expect such "unusual" exceptions? I think none of the documented options in org.osgi.framework.BundleContext.getService(ServiceReference<S>) allows for such; even returning null might be arguable, as the service obviously is registered.

@castortech (Contributor, Author)

Regarding the timeout, I can only say that when we captured debugging and analyzed the results, we could never see any delay during normal use, and even cases that would produce a deadlock showed very low delay. The original synchronized blocks would obviously have delays of their own. IMHO the biggest question is the choice of the 10 secs timeout, which was arbitrary and might be better at a much lower value.

@bjhargrave (Member)

bjhargrave commented Jul 4, 2022

Particularly the move away from simple synchronization

I am not sure this is wise either. But calling ServiceFactories has always been an issue. We must prevent multiple concurrent calls to a factory (so I am not sure how a lock free design can handle that), so we hold a mutex lock. Using intrinsic locks is simple but provides no way to back off when a timeout elapses, which can be evidence of a deadlock. Using a ReentrantLock allows a timeout so we can back off and throw an exception indicating the ServiceFactory could not be called.
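
To make that concrete, a generic sketch of the back-off that tryLock enables and an intrinsic lock does not (this is not the PR's actual code; the DEADLOCK type follows its usage elsewhere in this PR):

	// Generic sketch, not the PR's actual code: unlike a synchronized block,
	// a ReentrantLock can give up after a timeout and report the problem.
	void callWithBackoff(ReentrantLock lock, Runnable serviceFactoryCall) throws InterruptedException {
		if (lock.tryLock(10, TimeUnit.SECONDS)) {
			try {
				serviceFactoryCall.run();
			} finally {
				lock.unlock();
			}
		} else {
			// a synchronized block would simply keep blocking here
			throw new ServiceException("failed to acquire the service use lock", ServiceException.DEADLOCK);
		}
	}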

the use of many new lambdas.

There are now 3 lambda expressions. They can be replaced with anonymous classes if necessary.

I even wonder if, according to the OSGi spec, a caller has to expect such "unusual" exceptions? I think none of the documented options in org.osgi.framework.BundleContext.getService(ServiceReference<S>) allows for such; even returning null might be arguable, as the service obviously is registered.

ServiceException can always be thrown. I think returning null is wrong for a timeout as it wallpapers over a real issue. Throwing an exception allows the caller to know something is wrong and tell someone (e.g. log it).

the biggest question is the choice of the 10 secs timeout, which was arbitrary and might be better at a much lower value.

The challenge is to distinguish a real deadlock from a very slow ServiceFactory implementation which is running on another thread. For the latter, we want to wait for it. For the former, the deadlock will never resolve (or it wouldn't be a deadlock), so waiting a long time is of no great harm. I would be more concerned that 10 seconds is not long enough to be convinced that it is not just a very slow ServiceFactory implementation.

@HannesWell (Member)

the biggest question is the choice of the 10 secs timeout, which was arbitrary and might be better at a much lower value.

The challenge is to distinguish a real deadlock from a very slow ServiceFactory implementation which is running on another thread. For the latter, we want to wait for it. For the former, the deadlock will never resolve (or it wouldn't be a deadlock), so waiting a long time is of no great harm. I would be more concerned that 10 seconds is not long enough to be convinced that it is not just a very slow ServiceFactory implementation.

As mentioned before, I share this concern; therefore I suggested using 'real' deadlock detection, for which I described a prototype above: #68 (comment)
I have neither tested nor used it, so it might be flawed, but from my theoretical thinking it should work. But let me know if you see a problem.

@bjhargrave (Member)

But let me know if you see a problem.

The main issue is that it is a big hammer to hit a very rare problem. It uses N+1 maps (N being the number of threads which call this code) and the maps are pinned to the thread via ThreadLocals. With changes coming to Java for virtual threads, we should steer clear of adding more ThreadLocals.

@laeubi (Member)

laeubi commented Jul 4, 2022

so I am not sure how a lock free design can handle that

@HannesWell has made some experiments, and the deadlock could be prevented by putting a breakpoint at the right place, so it is not really an unavoidable deadlock; rather, somewhere in the locking chain there is incomplete locking of the required objects.

From the last investigations, I found that the sync is just done for too many things (e.g. service use increments) and also for factories and for services, while e.g. for a ServiceFactory one should actually lock on the BundleContext as far as I can see, and for a plain service no sync is needed at all ...

@bjhargrave I wonder if you, as the author of this code, could tell how well it is covered by tests, e.g. if one wants to adjust the locking model used here, do we need manual tests or can we rely on the automatic ones, so that if the build is okay we can be quite certain nothing breaks here?

@bjhargrave (Member)

I wonder if you, as the author of this code, could tell how well it is covered by tests

I don't know how extensive the build tests are.

The code has been used for many years in many production environments. So great care must be taken with any changes here.

while e.g. for a ServiceFactory one should actually lock on the BundleContext as far as I can see

I assume you mean the user's bundle context. That is too coarse. That means a bundle can only get one service at a time (SCR might be upset). ServiceUse is the combination of the user of the service and the service, which is why it is the lock object. It provides the finest grain for locking and, in general, should not be contended.

@laeubi (Member)

laeubi commented Jul 4, 2022

I assume you mean the user's bundle context. That is too coarse. That means a bundle can only get one service at a time (SCR might be upset)

Better slow/upset than dead-locking ;-)
The problem arises if two requestors of a service have crossing dependency chains; then it is not a problem of contention... see #15 for more details.

@HannesWell (Member)

HannesWell commented Jul 4, 2022

But let me know if you see a problem.

The main issue is that it is a big hammer to hit a very rare problem. It uses N+1 maps (N being the number of threads which call this code) and the maps are pinned to the thread via ThreadLocals. With changes coming to Java for virtual threads, we should steer clear of adding more ThreadLocals.

It is indeed a big hammer. Do you know if there is a better approach? I mean, deadlock detection is actually a quite generic programming problem (not even Java-specific), so I assume that some smart people have already come up with better ideas than I did when thinking about the problem for ~15 minutes.

IIRC Eclipse Jobs can detect deadlocks? But probably only between job rules? Or is this something we can inherit?

so I am not sure how a lock free design can handle that

@HannesWell has made some experiments, and the deadlock could be prevented by putting a breakpoint at the right place, so it is not really an unavoidable deadlock; rather, somewhere in the locking chain there is incomplete locking of the required objects.

I want to clarify that I had to put the breakpoint in and with that hold thread B back before it obtained the lock that is later wanted by thread A. So basically I manually scheduled the threads so that one could complete its work before the other one starts.
If we want to implement that, it will probably be more complicated and have worse performance (that is my guess, but I might be wrong) than just detecting it and aborting with an exception.

EDIT: But of course it would be even better to resolve the deadlock than to just abort.

@laeubi (Member)

laeubi commented Jul 4, 2022

If we want to implement that, it will probably be more complicated and have worse performance (that is my guess, but I might be wrong) than just detecting it and aborting with an exception.

I don't think that this is a performance problem

  1. we don't see heavy multi-threading here, actually two threads; bigger setups might have a few more, but realistically acquiring services with hundreds of threads concurrently is not really a usual use case
  2. Often one service acquires the other and activates it, and we have more of a linear chain
  3. the performance penalty of starting all over again (if the code is actually able to recover!) is much higher, e.g. how/when should a failed lock-up be retried? Maybe with a backoff algorithm? Compared to waiting a bit more for something I need to wait for anyway (given we have not deadlocked), there is not much to gain

@tjwatson (Contributor)

tjwatson commented Jul 5, 2022

If we want to implement that, it will probably be more complicated and have worse performance (that is my guess, but I might be wrong) than just detecting it and aborting with an exception.

I don't think that this is a performance problem

That cannot be known until we run performance measurements. We do this regularly in Open Liberty when we upgrade the framework.

  1. we don't see heavy multi-threading here, actually two threads; bigger setups might have a few more, but realistically acquiring services with hundreds of threads concurrently is not really a usual use case

You cannot make that assumption. There are uses of Equinox that enable parallel bundle activation during framework startup where bundles within the same start-level are activated in parallel.

  2. Often one service acquires the other and activates it, and we have more of a linear chain

My comment about performance was more about the sheer number of additional methods being called here vs. a simple intrinsic lock, and the use of a lambda, which has proven difficult for the JIT to optimize (at least on some versions of Java) when used in high-traffic code paths. All of that on a code path that is heavily used in systems that have 1000s of service components.

  3. the performance penalty of starting all over again (if the code is actually able to recover!) is much higher, e.g. how/when should a failed lock-up be retried? Maybe with a backoff algorithm? Compared to waiting a bit more for something I need to wait for anyway (given we have not deadlocked), there is not much to gain

I'm not saying we have a better option. Only that we need to understand the cost of this for something that is arguably a bug in the design of the service components involved in the deadlock, and that perhaps should be addressed by SCR itself, which allows its own service factories to trigger the deadlock in the first place. On the other hand, I see the appeal of having the deadlock broken at the framework level. But the result isn't going to fix the components involved in the deadlock. They will still be in a broken state, correct?

@HannesWell (Member)

The tests are now complete and can be reviewed as well.

castortech and others added 14 commits July 24, 2022 18:54
Using a reentrant lock on Service use instead of simple synchronized blocks.

Also added quite a bit of logging to help tracing.

Signed-off-by: Alain Picard <picard@castortech.com>
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Signed-off-by: Alain Picard <picard@castortech.com>
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Numerous comments and asserts needed updating. All uses under lock
are run through withLock which has cleaned up the exception handling
to properly manage interrupt state. NLS is used for exception messages.
Lock subclass toString is cleaned up to avoid string concatenation when
we are already using a StringBuilder.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
The ServiceUse.lock() method returns an AutoCloseable so the
caller can use try-with-resources to obtain and then close the lock.

This is much closer to intrinsic locking in coding practice.

Instead of using a long-ish timeout for deadlock detection, which
could be wrong on slow systems, we now use a map to hold the thread
to lock pairings and examine the lock chain to see if the current
thread is in the chain.

Signed-off-by: Hannes Wellmann <wellmann.hannes1@gmx.net>
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
When throwing the deadlock exception we also need to remove the lock
from the awaiting map.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
This is closer to the behavior of intrinsic locking, which does not
alter the thread interrupt status.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Remove unnecessary elements and move the dead-lock detection logic into a
dedicated isDeadLocked() method to make the code more readable.
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
Prevent one interrupted check on each call to lock().

From the start of lock() to useLock.tryLock(... everything should be as
performant as possible. Afterwards we are usually waiting so it doesn't
really matter since we cannot wait faster.
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
We move the concurrent map to be an instance field of
ServiceRegistry instead of a static. We want to avoid static mutable
state to support multiple framework instances in the same JVM.

We also include the DEADLOCK type in the exception.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
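
As a rough illustration of the chain-walking detection described in the commits above (the names awaitingLocks and isDeadLocked follow the commit messages, and getOwner() is assumed to be exposed by the ReentrantLock subclass; the merged code may differ in detail):

	// Sketch only: walk the waits-for chain starting at the lock we want.
	private final ConcurrentMap<Thread, ReentrantLockExt> awaitingLocks = new ConcurrentHashMap<>();

	private boolean isDeadLocked(ReentrantLockExt wantedLock) {
		Thread currentThread = Thread.currentThread();
		Set<Thread> visited = new HashSet<>();
		for (ReentrantLockExt lock = wantedLock; lock != null;) {
			Thread owner = lock.getOwner(); // visibility widened by the ReentrantLock subclass
			if (owner == null) {
				return false; // the lock was released in the meantime
			}
			if (owner == currentThread) {
				return true; // the chain of waiters leads back to this thread: deadlock
			}
			if (!visited.add(owner)) {
				return false; // a cycle not involving this thread; let one of its threads report it
			}
			lock = awaitingLocks.get(owner); // the lock that owner is itself waiting for, if any
		}
		return false; // the owner is not waiting on another ServiceUse lock
	}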
@tjwatson (Contributor)

I ran the proposed changes through an Open Liberty build and it passed all but one of our 86246 tests (the one failure appears to be an infrastructure issue). This helps confirm the fix does not have any functional issues, although we should get a run against the OSGi core TCK also.

I still need to get startup performance measurements done on a configuration that enables a very large set of service components during startup. I hope to have something in the next day or so.

@tjwatson (Contributor) left a comment

The performance run across a wide range of configurations in Open Liberty showed no measurably significant slowdown. In some cases we saw a small performance improvement.

I suspect the improvements observed are due to other improvements that we have yet to consume in Open Liberty though. For example eclipse-equinox/equinox.framework#54 and eclipse-equinox/equinox.framework#46. It is still possible that this change has some measurable cost but it could be getting offset by the other improvements. Regardless, it seems that the cost is not significant enough to be concerned about at this point.

Another observation is that the measurable CPU time did see a slight increase in some configurations, but that increased CPU utilization did not seem to impact the overall startup time of that configuration when 4 cores are assigned to the test run, implying that a bit more work is happening in the background. When restricting the run to a single core this did convert to a slightly slower startup time for these configurations. Regardless, that still seems to be a small cost to the overall startup time performance.

@tjwatson (Contributor)

I should also state that in Open Liberty we use equinox.start.level.thread.count to enable parallel bundle activation by setting it to something > 1. This causes a flood of service registry activity from multiple threads when we are starting up. This in turn should exercise the locking mechanism here to a great extent.

@tjwatson tjwatson merged commit 4b397a2 into eclipse-equinox:master Jul 26, 2022
@vogella (Contributor)

vogella commented Aug 2, 2022

@tjwatson with this fix, can you activate equinox.start.level.thread.count by default (setting it to > 1)?

@laeubi (Member)

laeubi commented Aug 2, 2022

@vogella AFAIK this is non-standard behavior and should best be activated on the EPP if desired.

@tjwatson (Contributor)

tjwatson commented Aug 2, 2022

@tjwatson with this fix, can you activate equinox.start.level.thread.count by default (setting it to > 1)?

Recall my comments from https://bugs.eclipse.org/bugs/show_bug.cgi?id=540507#c17

You should know that only the first number from "Starting application" can really be influenced by the parallel bundle activation from the start-level launch. By this time all the bundles have been activated by the framework. The rest of the activation that occurs in a typical Eclipse launch happens in response to a class load with lazy activation.

This is one reason I think you will find very little gain in parallel bundle activation at this level, because most Eclipse bundles are lazily activated by a triggering class load. That activation then happens on the thread that is loading the class.

Furthermore, I really only think this parallel bundle activation from the framework launch helps in cases where many bundles are installed that are eagerly activated.

I see very little advantage to using this option by default for any Eclipse RCP usage as long as almost all bundles are lazily activated.

@vogella AFAIK this is non-standard behavior and should best be activated on the EPP if desired.

There is nothing non-standard here. The OSGi specification does not mandate that bundles within the same start-level be started by a single thread or in any particular order within the start-level. The implementation is free to start in parallel or not.

@HannesWell (Member)

@tjwatson with this fix, can you activate equinox.start.level.thread.count by default (setting it to > 1)?

Can't you simply test with the latest I-build whether a value greater than one for that flag speeds up the start-up?

@tjwatson (Contributor)

tjwatson commented Aug 5, 2022

Can't you simply test with the latest I-build whether a value greater than one for that flag speeds up the start-up?

Yes, that is what I did when I first put this function into Equinox. I saw no improvement at all with it enabled. After some investigation it was because almost everything is lazy-activated, so all the real work only happens when classes are loaded. The majority of class loading does not happen until all the bundles have been started and control has been returned to the thread that runs the SWT display thread. Because DS is lazy also, even the registration of the DS components does not load any classes until the service is actually used (unless the component is immediate).

Let me clarify the term "started". That is really just Bundle.start. But for lazily activated bundles that does not mean BundleActivator.start gets called immediately. Instead, that will be delayed until the first class is defined using the class loader of the bundle. The changes to do parallel starts for start-levels really just make sure Bundle.start is called. But for the vast majority of our bundles that results in very little work being done until a class load is triggered.
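
For illustration, the same distinction in plain OSGi API terms (the bundle location and class name below are made up):

	// For a bundle declaring Bundle-ActivationPolicy: lazy in its manifest:
	Bundle bundle = context.installBundle("reference:file:some.lazy.bundle.jar"); // hypothetical location
	bundle.start(Bundle.START_ACTIVATION_POLICY); // bundle becomes STARTING; its activator is not called yet
	// Only defining the first class through the bundle's class loader triggers activation:
	bundle.loadClass("some.lazy.bundle.SomeClass"); // BundleActivator.start runs now (hypothetical class)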

I personally do not plan to retest if starting in parallel helps for Eclipse RCP/IDE bring up again because I don't think anything has changed that would show it helping.

Development

Successfully merging this pull request may close these issues.

Deadlock between "SCR Component Actor" and "main" thread
7 participants