
Possible thread exhaustion issue #1998

Closed
morph166955 opened this issue Dec 23, 2020 · 20 comments

@morph166955

We've been working through an odd issue for the past few weeks on the lutron binding (openhab/openhab-addons#9178). For no reason we've been able to identify, the binding just "stops working" and OH requires a restart. As we hadn't seen this on other bindings, we attributed it to the new LEAP code and were working down that path. This morning I had a rule "misbehave" because I was missing a check. It caused the rule to effectively trigger itself in an endless loop. When this happened, the Lutron went offline again. My ecobee also stopped working (but with no errors in the log). Some of my periodic rules also stopped working (for example, I have a periodic speedtest run by cron and it never fired). The only fix was to restart OH. I'm happy to attempt to dig through debugs to figure out what is happening, I just have no idea which debugs to turn on to monitor the threadpools. Any help in getting started would be appreciated.

@kaikreuzer
Member

If you have your installation in exactly such a situation, simply do a thread dump and post it here - this should help to identify which thread pool is exhausted and why.
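For readers following along: a thread dump is usually taken with `jstack <pid>` or by sending the JVM a `SIGQUIT`, but the same information can also be captured from inside the JVM. A minimal, self-contained sketch using the standard `ThreadMXBean` API (illustrative only, not openHAB code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpDemo {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Dump all live threads, including held monitors and ownable synchronizers.
        // Each ThreadInfo prints the thread name, state, and stack trace -
        // the same data you see in the attached dump files.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}
```

In a dump like the one attached here, the interesting threads are the pool workers (e.g. `OH-thingHandler-*`) stuck in `BLOCKED` or long-running `RUNNABLE` states.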

@morph166955
Author

Works for me. I'll wait until my system gets into a bad state and capture a thread dump. Are there any debug logs I should enable to try to capture this? I currently have org.openhab.core.common logging to a file, as I believe that's where the thread pool managers live right now.

While we wait for the system to go into a bad state: if there is an exhaustion issue, would that cause threads to not come back (e.g. lutron, ecobee, etc. doing periodic "stuff")?

@kaikreuzer
Member

No need for logs, the thread dump should say it all.

would that cause threads to not come back (e.g. lutron, ecobee, etc doing periodic "stuff")?

Those are not threads, but periodically scheduled jobs. So yes, if a thread pool is exhausted, those jobs would be impacted as well.

@morph166955
Author

Sounds good. I've reversed the fix for my rule to see if it will stress the system into a bad state. I'll post the thread dump once it goes. Could be a few days, seems to happen randomly when it fails. I've put a notification in to let me know as soon as it goes sideways.

@morph166955
Author

See attached thread dump. This was about 6 minutes after the Lutron things all went offline.
Thread-Exhaustion-1.txt

@kaikreuzer
Member

kaikreuzer commented Dec 23, 2020

Thanks. I think this shows a few bugs in bindings:

  1. DenonMarantz:
    It permanently blocks one thread from the thread pool as it uses it to listen to a telnet connection. It should use its own thread instead of the shared thread pool for that.
"OH-thingHandler-2" #226 daemon prio=5 os_prio=0 cpu=76.63ms elapsed=3419.19s tid=0x00007fbd1402c000 nid=0x40cdb runnable  [0x00007fbd00825000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(java.base@11.0.9.1/Native Method)
	at java.net.SocketInputStream.socketRead(java.base@11.0.9.1/SocketInputStream.java:115)
	at java.net.SocketInputStream.read(java.base@11.0.9.1/SocketInputStream.java:168)
	at java.net.SocketInputStream.read(java.base@11.0.9.1/SocketInputStream.java:140)
	at sun.nio.cs.StreamDecoder.readBytes(java.base@11.0.9.1/StreamDecoder.java:284)
	at sun.nio.cs.StreamDecoder.implRead(java.base@11.0.9.1/StreamDecoder.java:326)
	at sun.nio.cs.StreamDecoder.read(java.base@11.0.9.1/StreamDecoder.java:178)
	- locked <0x00000000c45000f0> (a java.io.InputStreamReader)
	at java.io.InputStreamReader.read(java.base@11.0.9.1/InputStreamReader.java:185)
	at java.io.BufferedReader.fill(java.base@11.0.9.1/BufferedReader.java:161)
	at java.io.BufferedReader.readLine(java.base@11.0.9.1/BufferedReader.java:326)
	- locked <0x00000000c45000f0> (a java.io.InputStreamReader)
	at java.io.BufferedReader.readLine(java.base@11.0.9.1/BufferedReader.java:392)
	at org.openhab.binding.denonmarantz.internal.connector.telnet.DenonMarantzTelnetClient.run(DenonMarantzTelnetClient.java:71)
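A hedged sketch of the pattern described above: the blocking read loop gets its own named daemon thread, so it never permanently occupies a shared pool thread. Class names are illustrative, not the binding's actual code, and a `StringReader` stands in for the telnet socket stream:

```java
import java.io.BufferedReader;
import java.io.StringReader;

public class DedicatedListenerDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the telnet connection's input stream.
        BufferedReader reader = new BufferedReader(new StringReader("PWON\nMV50\n"));

        // A dedicated, named daemon thread owns the (potentially forever-blocking)
        // readLine() loop, instead of a thread borrowed from a shared pool.
        Thread listener = new Thread(() -> {
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println("received: " + line);
                }
            } catch (Exception e) {
                // connection closed or thread interrupted - exit the loop
            }
        }, "denonmarantz-telnet-listener");
        listener.setDaemon(true);
        listener.start();
        listener.join();
    }
}
```

The key point is the lifetime mismatch: shared pool threads are meant for short tasks, while a connection listener lives as long as the connection does.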
  2. SamsungTV:
    It has almost every method of SamsungTvHandler marked as synchronized, which makes it block when multiple things happen at the same time (here, UPnP discovery and value reading).
"upnp-main-173" #5582 prio=5 os_prio=0 cpu=52.86ms elapsed=1184.08s tid=0x00007fbd78154800 nid=0x43ac1 waiting for monitor entry  [0x00007fbcf4eff000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:69)
	- waiting to lock <0x00000000c452b620> (a java.lang.Object)
	at org.openhab.binding.samsungtv.internal.protocol.RemoteControllerWebSocket.openConnection(RemoteControllerWebSocket.java:147)
	at org.openhab.binding.samsungtv.internal.service.RemoteControllerService.start(RemoteControllerService.java:203)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.checkCreateManualConnection(SamsungTvHandler.java:312)
	- locked <0x00000000c452ca20> (a org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.checkAndCreateServices(SamsungTvHandler.java:253)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.thingDiscovered(SamsungTvHandler.java:362)
	at org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl$1.run(DiscoveryServiceRegistryImpl.java:260)

"OH-thingHandler-1" #141 daemon prio=5 os_prio=0 cpu=5576.05ms elapsed=3419.65s tid=0x00007fbd20052800 nid=0x40c8a waiting for monitor entry  [0x00007fbd05d76000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.valueReceived(SamsungTvHandler.java:225)
	- waiting to lock <0x00000000c452ca20> (a org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler)
	at org.openhab.binding.samsungtv.internal.service.MediaRendererService.onValueReceived(MediaRendererService.java:205)
	at org.openhab.binding.samsungtv.internal.service.MediaRendererService.updateResourceState(MediaRendererService.java:240)
	at org.openhab.binding.samsungtv.internal.service.MediaRendererService.handleCommand(MediaRendererService.java:121)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.poll(SamsungTvHandler.java:217)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler$$Lambda$1161/0x0000000101137440.run(Unknown Source)
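One way to avoid this kind of monitor contention (a sketch with hypothetical names, not the binding's actual code) is to stop marking whole handler methods `synchronized` and instead guard only the shared state with a narrow lock, doing any slow work outside the critical section:

```java
import java.util.HashMap;
import java.util.Map;

public class NarrowLockHandler {
    // Dedicated lock object guarding only the shared state map.
    private final Object stateLock = new Object();
    private final Map<String, String> state = new HashMap<>();

    // Anti-pattern for comparison: "public synchronized void valueReceived(...)"
    // serializes unrelated callers (UPnP discovery vs. polling) on one monitor.

    public void valueReceived(String key, String value) {
        synchronized (stateLock) {        // hold the lock only while touching state
            state.put(key, value);
        }
        // Slow work (channel updates, network I/O) happens outside the lock.
        System.out.println("updated " + key + " -> " + value);
    }

    public String get(String key) {
        synchronized (stateLock) {
            return state.get(key);
        }
    }

    public static void main(String[] args) {
        NarrowLockHandler h = new NarrowLockHandler();
        h.valueReceived("volume", "50");
        System.out.println(h.get("volume"));
    }
}
```

With the lock scope narrowed, a blocked websocket connection attempt no longer prevents a poll callback from delivering a value, and vice versa.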

Additionally, it registers as a discovery listener (which is also not really allowed for a handler) and opens websocket connections to the remote device instead of returning immediately, which blocks the whole discovery infrastructure:

"upnp-main-173" #5582 prio=5 os_prio=0 cpu=52.86ms elapsed=1184.08s tid=0x00007fbd78154800 nid=0x43ac1 waiting for monitor entry  [0x00007fbcf4eff000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:69)
	- waiting to lock <0x00000000c452b620> (a java.lang.Object)
	at org.openhab.binding.samsungtv.internal.protocol.RemoteControllerWebSocket.openConnection(RemoteControllerWebSocket.java:147)
	at org.openhab.binding.samsungtv.internal.service.RemoteControllerService.start(RemoteControllerService.java:203)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.checkCreateManualConnection(SamsungTvHandler.java:312)
	- locked <0x00000000c452ca20> (a org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.checkAndCreateServices(SamsungTvHandler.java:253)
	at org.openhab.binding.samsungtv.internal.handler.SamsungTvHandler.thingDiscovered(SamsungTvHandler.java:362)
	at org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl$1.run(DiscoveryServiceRegistryImpl.java:260)
  3. Ecobee:
    The EcobeeThermostatBridgeHandler does nasty stuff by starting discovery services itself (and not returning immediately, which would anyhow be expected from startBackgroundDiscovery()). This causes multiple discoveries to run in parallel, blocking the 3 remaining threads of the thread pool:
"OH-thingHandler-3" #229 daemon prio=5 os_prio=0 cpu=6314.43ms elapsed=3419.08s tid=0x00007fbd14039000 nid=0x40cde waiting for monitor entry  [0x00007fbd00521000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl.thingDiscovered(DiscoveryServiceRegistryImpl.java:252)
	- waiting to lock <0x00000000c4f0d968> (a org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl)
	at org.openhab.core.config.discovery.AbstractDiscoveryService.thingDiscovered(AbstractDiscoveryService.java:266)
	at org.openhab.binding.ecobee.internal.discovery.SensorDiscoveryService.discoverSensors(SensorDiscoveryService.java:111)
	- locked <0x00000000cee7e450> (a org.openhab.binding.ecobee.internal.discovery.SensorDiscoveryService)
	at org.openhab.binding.ecobee.internal.discovery.SensorDiscoveryService.startBackgroundDiscovery(SensorDiscoveryService.java:84)
	at org.openhab.binding.ecobee.internal.handler.EcobeeThermostatBridgeHandler.discoverSensors(EcobeeThermostatBridgeHandler.java:847)

"OH-thingHandler-5" #238 daemon prio=5 os_prio=0 cpu=7112.63ms elapsed=3419.01s tid=0x00007fbd4c012800 nid=0x40cea waiting for monitor entry  [0x00007fbcff918000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl.thingDiscovered(DiscoveryServiceRegistryImpl.java:252)
	- waiting to lock <0x00000000c4f0d968> (a org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl)
	at org.openhab.core.config.discovery.AbstractDiscoveryService.thingDiscovered(AbstractDiscoveryService.java:266)
	at org.openhab.binding.ecobee.internal.discovery.ThermostatDiscoveryService.discoverThermostats(ThermostatDiscoveryService.java:118)
	- locked <0x00000000ce589800> (a org.openhab.binding.ecobee.internal.discovery.ThermostatDiscoveryService)
	at org.openhab.binding.ecobee.internal.discovery.ThermostatDiscoveryService.startBackgroundDiscovery(ThermostatDiscoveryService.java:88)
	at org.openhab.binding.ecobee.internal.handler.EcobeeAccountBridgeHandler.discoverThermostats(EcobeeAccountBridgeHandler.java:253)
	at org.openhab.binding.ecobee.internal.handler.EcobeeAccountBridgeHandler.refresh(EcobeeAccountBridgeHandler.java:226)
	at org.openhab.binding.ecobee.internal.handler.EcobeeAccountBridgeHandler$$Lambda$900/0x0000000100e30040.run(Unknown Source)

"OH-thingHandler-4" #237 daemon prio=5 os_prio=0 cpu=6348.00ms elapsed=3419.01s tid=0x00007fbd14049800 nid=0x40ceb waiting for monitor entry  [0x00007fbcff816000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl.thingDiscovered(DiscoveryServiceRegistryImpl.java:252)
	- waiting to lock <0x00000000c4f0d968> (a org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl)
	at org.openhab.core.config.discovery.AbstractDiscoveryService.thingDiscovered(AbstractDiscoveryService.java:266)
	at org.openhab.binding.ecobee.internal.discovery.SensorDiscoveryService.discoverSensors(SensorDiscoveryService.java:111)
	- locked <0x00000000cef465f8> (a org.openhab.binding.ecobee.internal.discovery.SensorDiscoveryService)
	at org.openhab.binding.ecobee.internal.discovery.SensorDiscoveryService.startBackgroundDiscovery(SensorDiscoveryService.java:84)
	at org.openhab.binding.ecobee.internal.handler.EcobeeThermostatBridgeHandler.discoverSensors(EcobeeThermostatBridgeHandler.java:847)
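The expected contract can be sketched as follows (hypothetical names, not the Ecobee binding's actual code): `startBackgroundDiscovery()` merely schedules the scan on the service's own scheduler and returns immediately, so the caller's pool thread is never blocked by the discovery work itself:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BackgroundDiscoveryDemo {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Returns immediately: only schedules the periodic scan.
    // The (possibly slow) discovery runs later on the scheduler thread.
    public void startBackgroundDiscovery() {
        scheduler.scheduleWithFixedDelay(this::discover, 0, 60, TimeUnit.SECONDS);
    }

    private void discover() {
        System.out.println("discovering sensors...");
    }

    public void stop() throws InterruptedException {
        scheduler.shutdown();
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        BackgroundDiscoveryDemo d = new BackgroundDiscoveryDemo();
        d.startBackgroundDiscovery();   // returns right away
        Thread.sleep(200);              // let the first scheduled run fire
        d.stop();
    }
}
```

Whether a bridge handler should be driving discovery at all is a separate question; the sketch only illustrates the "schedule and return" contract.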

TL;DR: Close this issue, but open 3 new ones for the respective bindings 😎 .

@morph166955
Author

Works for me! Before I kill it, could the SamsungTV binding have been the cause of that long, long JuPnP bug that went on forever?

@kaikreuzer
Member

I didn't follow that bug, but indeed the SamsungTV binding does bad things within the UPnP thread - I just added a second finding to my comment above.

@morph166955
Author

morph166955 commented Dec 23, 2020

The JuPnP "bug" was openhab/openhab-addons#5892

@morph166955
Author

3 bugs opened; I copy/pasted your comments and referenced this bug. Closing this. THANK YOU!

@kaikreuzer
Member

Thanks! You've missed my "Additionally" part on the SamsungTV binding.

@morph166955
Author

Not seeing what I missed. I see the "Additionally" upnp-main-173 note on 9495 when I pull it up.

@kaikreuzer
Member

All there, all good!

@morph166955
Author

Sweet. Hopefully fixing these up will fix some of the annoying thread issues floating around.

@morph166955
Author

One last question, I promise. Shouldn't the thread pool grow automatically when all threads are in use so that even events like this don't cause exhaustion?

@kaikreuzer
Member

No, these pools always have a maximum size. Anything else would only hide the issues and could result in resource leaks through a constantly growing number of threads.
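The failure mode is easy to reproduce in miniature with a plain JDK executor (illustrative only, not openHAB's actual pool configuration): once long-blocking tasks occupy every thread of a fixed-size pool, further submitted jobs sit in the queue until a thread frees up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolExhaustionDemo {
    public static void main(String[] args) throws Exception {
        // A fixed-size pool of 2 threads, bounded like openHAB's shared pools.
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
        CountDownLatch release = new CountDownLatch(1);

        // Two tasks block indefinitely, exhausting the pool...
        for (int i = 0; i < 2; i++) {
            pool.submit(() -> { release.await(); return null; });
        }

        // ...so this job cannot start until a pool thread is freed.
        Future<?> job = pool.submit(() -> System.out.println("periodic job ran"));
        System.out.println("job started within 300ms: " + tryWait(job, 300));

        release.countDown();   // unblock the hogging tasks
        System.out.println("after release: " + tryWait(job, 1000));
        pool.shutdownNow();
    }

    static boolean tryWait(Future<?> f, long ms) {
        try {
            f.get(ms, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (Exception e) {
            return false;
        }
    }
}
```

This mirrors what the dump shows: the DenonMarantz listener and the blocked Ecobee discoveries play the role of the latch-holding tasks, and the cron-triggered rules and periodic refresh jobs are the queued work that never runs.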

@morph166955
Author

Thank you. That makes sense. And this thread is now dead... (pun intended)

@lolodomo
Contributor

lolodomo commented Dec 24, 2020

That the SamsungTV binding is a source of problems has been clear to me for ages and was discussed many times in the JUpnp issue. I even proposed a fix for this binding's discovery, but it triggered no interest, no review and no testing. The PR is probably still open against the 2.5 branch, if I didn't close it.

@kaikreuzer
Member

@lolodomo If you have a fix for openhab/openhab-addons#9495, it would be great if you could comment on that issue and port your existing PR. I am pretty sure that it will trigger interest.

@morph166955
Author

@kaikreuzer I've loaded each of the 3 fixed bindings to see if this issue is resolved. I had to compile 2 from the PRs but confirmed that the git log included the patches. I'll let this stew for a bit to make sure the problems are gone.
