Deadlock between "SCR Component Actor" and "main" thread #15
Does clearing out our |
Although I'm a maintainer of Felix SCR, I must admit that I am not that familiar with how this "SCR Component Actor" thread is used. From looking at the code it seems that it does async work for calls to As to the locks on |
Yes! After adding Thank you Tom for the work-around. |
You can also provoke a lock if you do the following:
Even though the Spec warns about that case, I would have expected that in this case the service tracker simply remains empty unless the activator has started. Maybe this would be a simpler case to debug such blocking?
Do you remember where you blocked the SCR? It might be that the locking has some dead-lock (e.g. locking in the wrong order or something). I took a quick look at the code in So if you can describe which code paths might provoke the lock (e.g. if Thread A is on line X and then Thread B calls Y), that would help I think.
Looking at the stack trace I don't see anything 'crazy' at least the only non framework |
I share the assumption that there are some DS-components that reference each other and are created in opposite order by the SCR-thread and the main thread.
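For illustration only, a minimal plain-Java sketch of that suspected pattern. It is not Equinox or Felix code: the class name `OppositeOrderDemo`, the thread names and the 200 ms timeout are made up. Two threads each take one lock and then ask for the other one in the opposite order. `tryLock` with a timeout is used so the demo terminates instead of hanging the way the real start-up did.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class OppositeOrderDemo {

    /** Returns true if at least one thread could not get its second lock. */
    public static boolean run() {
        ReentrantLock lockA = new ReentrantLock();
        ReentrantLock lockB = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        boolean[] blocked = new boolean[2];

        // Hypothetical stand-ins for the "main" and "SCR Component Actor" threads.
        Thread mainLike = new Thread(
                () -> blocked[0] = acquire(lockA, lockB, bothHoldFirst), "main-like");
        Thread scrLike = new Thread(
                () -> blocked[1] = acquire(lockB, lockA, bothHoldFirst), "scr-like");
        mainLike.start();
        scrLike.start();
        try {
            mainLike.join();
            scrLike.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked[0] || blocked[1];
    }

    private static boolean acquire(ReentrantLock first, ReentrantLock second,
            CountDownLatch bothHoldFirst) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await(); // wait until each thread holds its first lock
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                second.unlock();
                return false; // got the second lock after the other thread gave up
            }
            return true; // timed out: with plain synchronized this would block forever
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return true;
        } finally {
            first.unlock();
        }
    }
}
```

With plain `synchronized` blocks in place of the timed `tryLock`, neither thread could ever back off, which would match the hang described in this issue.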
That bug sounds similar to the one I encountered. I'm also absolutely not familiar with the Felix-SCR code. But I would only add a time-out as a last resort, because as you said: it is hard to find a good default that fits everybody. And a user/downstream-consumer should not have to tweak settings in the hope of being able to start the application successfully. And if this dead-lock happens often, it can also delay the application start for so long that the application at least appears to hang. If I understand it correctly, Equinox holds locks for each service-object under construction?
Thanks for that example.
I put a break-point here: I suspect that there is one
Stupidly I immediately cleaned the Eclipse installation and didn't copy it before to reproduce the issue later, so unfortunately I cannot reproduce the issue. But maybe it is possible to construct one from the description of cyclic components above. |
I must confess I'm a bit curious whether it is really a circular dependency. Why would it no longer be circular once you clear some state? Besides that, Felix reports circular dependencies as far as I know and won't start them, so you should then see some components being unsatisfied...
My guess is that the start-order triggers the issue. When doing updates the updated bundles get larger bundle IDs so they end up getting started later (p2 does an uninstall-old, install-new bundle instead of update which would reuse the old bundle id). When you do -clean we start over and the bundles are installed by p2 ordered by the symbolic name (I think). |
I also noticed some issues with this behavior e.g. when a bundle contains native code ... is there any reason for this behavior?
But should the start-order not be irrelevant in case of service handling? So this is actually a framework issue and not a "user" issue? |
I imagine because of the location. From the
I cannot tell what you mean by the first sentence ... but:
The type of cycle that would cause issues is something like this:
A -> B -> (optional, dynamic, reference) -> C
Assume each service component A, B, C comes from a separate bundle and the start-order is A, B, C. Once B is started it will allow A to be satisfied. B can be satisfied without C's bundle being active yet, so now A and B can be constructed and activated by SCR. Once C's bundle is activated it can now also be satisfied by A and be dynamically injected into B.
On the other hand, if C's bundle is started first then it comes into play as soon as possible, which could trigger some crossing of paths while constructing the service instances by SCR.
This is not a framework issue. I would think this is more of an SCR issue.
I mean that, regardless of order, the framework should never "deadlock", e.g. even if there is a cycle (how?) between services?
Is it forbidden to call |
It is not forbidden. Only thing I can think of to "make this better" is to detect the deadlock somehow and throw a ServiceException immediately. But I don't see an obvious way to do that reliably. |
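As a purely diagnostic aside, and not a way for the framework itself to recover: the JVM can report lock cycles through the standard `java.lang.management.ThreadMXBean`. The sketch below (the class name `DeadlockReport` and the output format are made up) prints which thread waits for which lock. It only observes a deadlock that has already happened and does not address the question of throwing a ServiceException early.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockReport {

    /** Describes all threads the JVM currently considers deadlocked. */
    public static String describeDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Covers both object monitors (synchronized) and ownable synchronizers
        // such as ReentrantLock; returns null when nothing is deadlocked.
        long[] ids = bean.findDeadlockedThreads();
        if (ids == null) {
            return "no deadlock detected";
        }
        StringBuilder report = new StringBuilder();
        for (ThreadInfo info : bean.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // thread terminated in the meantime
            }
            report.append(info.getThreadName())
                    .append(" waits for ").append(info.getLockName())
                    .append(" held by ").append(info.getLockOwnerName())
                    .append('\n');
        }
        return report.toString();
    }
}
```

Calling `DeadlockReport.describeDeadlocks()` from a watchdog thread of a hung process would list the lock objects involved, similar in spirit to the trace in the issue description.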
There are too many locks involved :-\ I really wonder why we need such a complex |
Not sure, but that is unrelated to the deadlock. The lock is required by the specification to ensure more than one thread does not enter ServiceFactory.getService to construct the service instance. |
At least both threads compete for locking the ServiceUse :-)
Right, but if @HannesWell could make it work by holding one thread with the debugger before it enters the critical section, then it seems there is something wrong in how this locking works (i.e. there is a solution and the deadlock is not inevitable). But I don't have a good idea yet how to circumvent this without a reproducing test-case or a description of why the deadlock occurs. I have poked around a bit in the Felix code, but it seems there are no more locks involved, and I haven't found any waits yet that explain this behavior.
Any chance you know the version of eclipse you upgraded from? As the ResourcePlugin is involved, I think it might be an update of some of the "core-ide" bundles and fear other users might run into the same issue as well on the next release, so if we can find a solution here (even if it will not make it into this release) I think it would be very helpful.
Just wondering, is there a TCK that checks this particular requirement, so we can make changes here without the fear of breaking things?
It is the
But I use a very similar setup on my private computer at home too and there the update worked flawlessly. |
That's what I suspect as well.
Yes that would be definitely important. I think the part of the
I think that was the part you are referring to when mentioning the Resources-Plugin? |
don't think so :-)
No, I was referring to @tjwatson's comment that due to an update some plugin might be started "later". As the Resources-Plugin normally starts quite early, it might be that something changed so that the Resources-Plugin was updated by P2 and now starts a bit later... but that's just a guess.
I was just hoping one could reproduce this e.g. by upgrading a 2022-03 installation to 2022-09 ... but then we have to wait if we get this again somewhere and can capture some more context, given that a |
I was "lucky" and encountered the same problem today after re-running the Oomph setup of my I remote-debugged it and found the following service chains (in the same order as in the stack-trace): Thread
Thread
Derived from the implemented services the following DS-component chains cause the dead-lock.
So it is probably a combination of recent changes in the resources-plugin and m2e that revealed this issue. Luckily for us, m2e 2.0 did not make it into the 2022-06 SimRel. So users might not be affected right away, unless it appears somewhere else too. I will try to reproduce the problem in my m2e workspace by updating the target-platform.
The only two calls to
The runnable executed for the second call is then finally stuck in the deadlock. What's interesting is that if I update the target-platform in my m2e-workspace at the current master to use the latest 4.24-I build (i.e. Do you know where I can find out such early/auto-start configuration?
OK it seems to be the But even then a break-point at ServiceRegistrationImpl.getService(BundleContextImpl, ServiceConsumer):538 with condition
only stops in the main-thread. However, I think with that knowledge it should be possible to construct a test-case (actually just what Tom suggested) where the Plug-ins are started explicitly in an order that causes the dead-lock. Maybe it can even be achieved reliably by leveraging some
Many thanks @HannesWell for the analysis.
Go to the run config of your product, then to "plugins". There you can adjust the start-levels and whether a plugin is auto-started or not.
You probably should save the workspace and the eclipse install because I don't think the interesting things are stored in the workspace but in the configuration area of the product. Thus to reproduce it with another Eclipse-Launch you might need to point the config area to a copy of that as well (make sure not to clear it by accident!)
For me it would be interesting whether you can again fix this by putting a breakpoint somewhere, and whether this causes any issues (e.g. warnings, failing components). If not, could you try to add a dump
I don't think that any of those are very interesting, as these are "static" consumed services. You should maybe get what you expect by looking at the listener (e.g.
but can recover later from this when one of the components is activated, registered. So if you do not see this message, I think it is just a "simple" dynamic reference here.
Do you think you can list what service uses these currently hold? |
Sorry, I meant for an Eclipse installation. For launches that was clear to me.
Done that.
A conditional break-point at Since I'm remote debugging the installation I cannot simply change that, but I will try to replicate that with an ordinary java-launch configuration that copies the command parameters passed by the launcher. Then it should be possible to wrap that section of the code with a single global lock.
I added the
Do you mean the users of the services that are blocking each other? |
As always it depends ;-)
You can dynamically change the start level, but I don't know if P2 does this.
The lock-objects in your initial trace say:
So the interesting question is what those ServiceFactoryUse objects are and why we get into the undesirable state that one thread locks one while another requires it (and how we can prevent this).
It is a bit cumbersome to set up, and I have not tested it, but actually the following should work:
This should then allow you to launch that product from within Eclipse as if you had launched it from the command line, and to debug it. If that works, you should be able to import individual plugins of interest (e.g. m2e) into the workspace to change or debug their code. Just make sure it is the version installed in the Eclipse you are using, so you don't get errors because of changed API.
After detailed debugging/manual tracing of the start-up of my dead-locking Eclipse I was able to create a protocol of the events that lead to that situation. With that I was able to create a test case that reproduces that situation reliably: TL;DR:
Those The reference locking cycle is the following:
The problem is that the optional reference from The protocol of events that lead to this situation:
This is required for eclipse-equinox#15
Using a reentrant lock on Service use instead of simple synchronized blocks. Also added quite a bit of logging to help tracing. Fixes eclipse-equinox#15 Also-By: BJ Hargrave <hargrave@us.ibm.com> Also-By: Hannes Wellmann <wellmann.hannes1@gmx.net>
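A rough sketch of the general idea named in this commit message, not the actual Equinox change: guarding the service object with a `java.util.concurrent.locks.ReentrantLock` lets a waiting thread use `tryLock` with a timeout and fail with an exception rather than block forever, which a plain `synchronized` block cannot do. The class name `GuardedServiceUse`, the timeout parameter and the use of `IllegalStateException` are assumptions for illustration only.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class GuardedServiceUse<S> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Supplier<S> factory; // stand-in for ServiceFactory.getService
    private final long timeoutMillis;  // hypothetical setting, not an Equinox option
    private S service;

    public GuardedServiceUse(Supplier<S> factory, long timeoutMillis) {
        this.factory = factory;
        this.timeoutMillis = timeoutMillis;
    }

    /** Creates the service at most once; only one thread runs the factory. */
    public S getService() {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the service lock", e);
        }
        if (!acquired) {
            // A synchronized block would simply keep waiting at this point.
            throw new IllegalStateException("Timed out waiting for the service lock");
        }
        try {
            if (service == null) {
                service = factory.get();
            }
            return service;
        } finally {
            lock.unlock();
        }
    }
}
```

A caller would create one `GuardedServiceUse` per service object, e.g. `new GuardedServiceUse<>(() -> createService(), 5000)`, where `createService()` is a hypothetical factory method.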
This is required for eclipse-equinox#15 Also-By: BJ Hargrave <hargrave@us.ibm.com> Also-By: Alain Picard <apicard@benchmarkconsulting.com>
After updating one of my Eclipse installations (Eclipse Modeling-tools provided by Oomph plus some more plugins) on Linux, it no longer seemed to start (no UI showed up, not even the icon became visible in the task-bar). Attempting to start it a second time did not work because the workspace was blocked. Nevertheless the System Monitor showed that the eclipse process was running, and VisualVM detected the following deadlock: