Conversation
Another variant where our threads are stuck:
^ These threads get stuck in TIMED_WAITING.
I ran the unit tests for solr-core on this branch and see no failures newer than on the release/8.8 branch; the failures I do see occur on both branches. I feel safe merging this PR.
Another, unrelated, good reason to upgrade is jetty/jetty.project#6072 (fixed in 9.4.39), which is covered by GHSA-26vr-8j45-3r4w. In Solr 8.9, Jetty was upgraded to a version (9.4.39) where that CVE was fixed, but the hang/stuck bug was only fixed later (9.4.42).
Jetty was upgraded to that version in Solr 8.10.1, per https://issues.apache.org/jira/browse/SOLR-15677. We should do the same.
@patsonluk can you please look into this?
TLDR
Long version

A similar issue can be reproduced by issuing simple queries (such as …). With a single-threaded load generator, the QA node hangs after several minutes; with the load generator using 3 threads, the QA node will start hanging on the … The interesting thing is that, in the "normal" execution path, there are only 2 threads of concern on the Solr side: …

With debugging, it was found that for thread 2 (the Jetty HTTP/2 thread), in normal circumstances, from line 100 to line 124 (…). That means if thread (…). While pausing the VM when the QA threads hang, it was found that thread 1 (…). This alone is not really a bug, as the Jetty task (…). There has to be some weird timing and pattern in the Jetty threads that triggers such a condition.

In fact, if we run the load generator with 3 threads and get all of them to hang (it might take at least 10+ minutes before all of them hang), and then go into the JVM, pause the threads briefly in the debugger, and resume, all those 3 hanging threads resume. The 3-thread load almost simulates the same behavior observed on our prod, except that when the threads resume some exceptions are printed, which is NOT observed in the originally reported issue (?)
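For concreteness, a load generator along these lines can be a small standalone Java program. The sketch below is only my approximation of the setup described above; the endpoint URL, query, and default thread count are placeholder assumptions, not the values from the original test.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal multi-threaded query load generator (placeholder URL and query).
public class QueryLoadGenerator {
    public static void main(String[] args) {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 3; // 1 vs 3 workers
        String url = "http://localhost:8983/solr/test/select?q=*:*";   // hypothetical endpoint
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        // If the node hangs, this call simply blocks waiting for a response.
                        HttpResponse<Void> rsp =
                            client.send(request, HttpResponse.BodyHandlers.discarding());
                        System.out.println(Thread.currentThread().getName() + " -> " + rsp.statusCode());
                    } catch (InterruptedException e) {
                        return; // stop this worker
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
    }
}
```

Running it with 1 worker versus 3 workers corresponds to the two scenarios described above: once the node hangs, the workers stop printing and sit blocked inside client.send().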
From initial testing with 9.4.44.v20210927, the random hangs still happen 😭
We should use a combination of this and #106.
I've finally managed to reproduce this problem, thanks @patsonluk. It seems this situation happens under heavy query load. On an 8-core (AMD Ryzen 5700G) machine, I was unable to reproduce the problem even after 15-20 minutes with 3 or 8 queries at a time; increasing this to 16 queries at a time reproduced it. This looks like a case of resource exhaustion, with Jetty not doing the right thing in such situations. It seems to me that #106 is a good workaround, since that patch will terminate such queries when the system is under heavy load.
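This is not the actual #106 patch, but as an illustration of the kind of load-based cutoff being discussed, a check along these lines (the MXBean-based load reading and the per-core threshold are my assumptions) could be used to reject or terminate queries once the machine is saturated:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical load-shedding check (not the #106 implementation): refuse new
// queries when the 1-minute system load average exceeds a per-core threshold.
public class LoadShedder {
    private final double maxLoadPerCore;
    private final OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

    public LoadShedder(double maxLoadPerCore) {
        this.maxLoadPerCore = maxLoadPerCore;
    }

    /** Returns true if an incoming query should be rejected instead of executed. */
    public boolean shouldReject() {
        double load = os.getSystemLoadAverage(); // -1 if unavailable on this platform
        int cores = os.getAvailableProcessors();
        return load >= 0 && (load / cores) > maxLoadPerCore;
    }
}
```

A request handler could consult such a check up front and fail fast with an "overloaded" response rather than queueing more work on a node that is already saturated, which matches the intent of terminating queries under heavy load described above.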
Most of our threads are waiting at:

    java.base@11.0.8/java.lang.Object.wait(Native Method)
    java.base@11.0.8/java.lang.Object.wait(Object.java:328)
    org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318)
    org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:90)
    org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:99)
    org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:217)
    org.apache.solr.common.util.JavaBinCodec._init(JavaBinCodec.java:211)
    org.apache.solr.common.util.JavaBinCodec.initRead(JavaBinCodec.java:202)
    org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:195)
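The frames above are Solr's client-side code reading a response body through Jetty's InputStreamResponseListener. As a rough standalone illustration of that pattern (the URL is hypothetical and this is not Solr's actual code), note that while get() takes a timeout for the response headers, reads from the body InputStream block until Jetty delivers content or fails the stream, which is where these threads are parked:

```java
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Response;
import org.eclipse.jetty.client.util.InputStreamResponseListener;

// Simplified illustration of the read pattern seen in the thread dump above.
public class StreamingResponseExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        client.start();
        try {
            InputStreamResponseListener listener = new InputStreamResponseListener();
            client.newRequest("http://localhost:8983/solr/test/select?q=*:*") // hypothetical URL
                  .send(listener);

            // get() waits only for the response *headers*, and it does take a timeout...
            Response response = listener.get(10, TimeUnit.SECONDS);
            System.out.println("status: " + response.getStatus());

            // ...but body reads block in Object.wait() inside
            // InputStreamResponseListener$Input.read() until Jetty delivers content
            // or fails the stream (e.g. on idle timeout): the frames in the dump above.
            try (InputStream in = listener.getInputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    System.out.println("read " + n + " bytes");
                }
            }
        } finally {
            client.stop();
        }
    }
}
```

If Jetty never delivers or fails the pending content, as in the bug being discussed, those read() calls wait until an idle timeout or an explicit abort intervenes.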
We find this issue suspiciously similar to what Jetty fixed in 9.4.44:
jetty/jetty.project#2570 (comment)