Skip to content
This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

upgrade jetty to 9.4.44.v20210927 #121

Closed
wants to merge 1 commit into from

Conversation

noblepaul
Copy link
Collaborator

@noblepaul noblepaul commented Nov 23, 2021

Most of our threads are waiting at

java.base@11.0.8/java.lang.Object.wait(Native Method) java.base@11.0.8/java.lang.Object.wait(Object.java:328) org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318) org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:90) org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:99) org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:217) org.apache.solr.common.util.JavaBinCodec._init(JavaBinCodec.java:211) org.apache.solr.common.util.JavaBinCodec.initRead(JavaBinCodec.java:202) org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:195)

We find this issue suspiciously similar to what Jetty has fixed in 9.4.44

jetty/jetty.project#2570 (comment)

@chatman
Copy link
Collaborator

chatman commented Nov 23, 2021

Other variant where our threads are stuck:

2021-11-08 00:01:06.485 ERROR (pool-2-thread-1) [    ] o.a.s.s.TimeOutPatrol Stack trace for httpShardExecutor-120-thread-2698527-processing-x:H2QQD http:////10.240.17.206:8983//solr//H2QQD_shard16_replica_n16// n:10.240.0.180:8984_solr c:H2QQD rid:18fc88214bd91462d78d82e764a954d5-77f52525a341357d [http:////10.240.17.206:8983//solr//H2QQD_shard16_replica_n16//]: [java.base@11.0.8/jdk.internal.misc.Unsafe.park(Native Method)
	java.base@11.0.8/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
	java.base@11.0.8/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1079)
	java.base@11.0.8/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1369)
	java.base@11.0.8/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:278)
	org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:214)
	org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:401)
	org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:761)
	org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
	org.apache.solr.handler.component.HttpShardHandler.request(HttpShardHandler.java:80)
	org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:136)
	org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:45)
	java.base@11.0.8/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	java.base@11.0.8/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	java.base@11.0.8/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
	org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
	org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$199/0x00000017feb29040.run(Unknown Source)
	java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	java.base@11.0.8/java.lang.Thread.run(Thread.java:834)]

^ These threads get stuck in TIMED_WAITING.

@chatman
Copy link
Collaborator

chatman commented Nov 23, 2021

@chatman
Copy link
Collaborator

chatman commented Nov 23, 2021

I ran the unit tests for solr-core for this branch, I see no newer failures than release/8.8 branch.

The failures I see on both branches:
[junit4]
[junit4] Tests with failures [seed: 3A53112B710015DB]:
[junit4] - org.apache.solr.metrics.reporters.solr.SolrCloudReportersTest.testDefaultPlugins
[junit4] - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
[junit4] - org.apache.solr.search.facet.DistributedFacetSimpleRefinementLongTailTest.test

I feel safe merging this PR.

@chatman
Copy link
Collaborator

chatman commented Nov 23, 2021

Another unrelated good reason to upgrade is jetty/jetty.project#6072 (Fixed in 9.4.39). This is covered in GHSA-26vr-8j45-3r4w. In Solr 8.9, Jetty was upgraded to a version (9.4.39) where this CVE was fixed, but the hang/stuck bug was fixed later (9.4.42).

@chatman
Copy link
Collaborator

chatman commented Nov 23, 2021

Jetty was upgraded to this version in Solr 8.10.1 as per https://issues.apache.org/jira/browse/SOLR-15677. We should too.

@hiteshk25
Copy link

@patsonluk can you please look this!

@patsonluk
Copy link

patsonluk commented Nov 23, 2021

TLDR

  1. The hang appears to be triggered by jetty under certain conditions, only occasionally
  2. Similar hangs can be reproduced locally on my machine (both solr and load generation are on my mac book) by issuing requests continuously (even with just 1 thread) after a while (several minutes). Might need a solr collection with a lot of shards (1000 in our test). The stack traces observed are similar to previous finding. Test driver used : https://gist.github.com/patsonluk/4e5ee44a649c3cee07ae1cba1962654c
  3. Such hang usually resolves itself at exactly 120 secs (detecting connection going away, trigger timeout exception), but in multiple thread load test, such timeout might not get triggered at 120 secs, sometimes it takes much longer before the hanging threads resume.
  4. The reproduced hang (3 threads) is very similar to what was reported - long hang (> 10 mins) and Solr appears to be still healthy (can still serve other external requests). But the difference is that an exception is observed from this test, but probably not on the originally reported issue?
  5. When the threads hang, and then later on resume, they all seem to resume at the same time, even if they started hanging from different time - this is also very similar to what we observed on prod

Long version

Similar issue can be reproduced by issue simple queries (such as http://127.0.0.1:8984/solr/fsloadtest/select?q=*%3A*) to the QA node repeatedly. My test environment is a single collector with 1000 shards (and 10m docs).

With a single threaded load generator, the QA node hangs after several minutes org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318), it will eventually self recovered by timing out (at 120 secs), which is not exactly the same as what we observed (hangs for longer than 120 secs)

However, with the load generator using 3 threads, the QA node will start hanging on the httpShardExecutor threads one by one (could be several minutes apart). Eventually all 3 threads from the load generator would not get any response (> 120 secs). The QA node however, still appears to be healthy (still responsive to manual queries), by inspecting at the threads, they are indeed all hang at InputStreamResponseListener.java:318

The interesting thing is that, in "normal" execution path, there are only 2 threads of concern on the solr side:

  1. The httpShardExecutor thread which lock and wait and 2 points - Countdown latch at InputStreamResponseListener.java:214 and Object lock at InputStreamResponseListener.java:318)
  2. The jetty http2 thread (h2sc..., usually h2sc-121-thread-1 for single threaded loads) which unlock such 2 points - InputStreamResponseListener.onHeaders (line 100) that unlocks countdown latch and InputStreamResponseListener.onContent (line 124) that unlocks the Object lock

With debugging, it's found that for thread 2 (jetty http2 thread) in normal circumstances, from line 100 to line 124 (InputStreamResponseListener) should be run by the same jetty thread (invokeTask(task) in doProduce line 303 in EatWhatYouKill, the task is org.eclipse.jetty.io.ChannelEndPoint$1, the debugger does not advance to next line, which means from line 100 to line 124(InputStreamResponseListener), it's all handled by the same task within the same thread. again, this is what happening under "normal circumstances" with no hangs.

That means if thread (httpShardExecutor) is waiting at line 318, the the countdown latch (line 214) should have been unlocked by thread 2 at line 100, and assuming thread 2 keeps running, it should eventually unlock the Object lock at line 124 (which it did not happen).

While pausing the VM when the QA threads hang, it's found that thread 1 (httpShardExecutor) does hang at line 318, however the thread 2 (jetty http2 thread, h2sc-121-thread-1) is NOT executing any code with the org.eclipse.jetty.io.ChannelEndPoint$1 task invoked by invokeTask(task) in doProduce line 303 in EatWhatYouKill, instead, it's not even invoking at task, but waiting at produceTask() in doProduce line 184 in EatWhatYouKill. Therefore, this is different form the "normal circumstances" flow. As this time, the task somehow exit before reaching InputStreamResponseListener:124 that unlocks the Object lock.

This alone is not really a bug, as the jetty task org.eclipse.jetty.io.ChannelEndPoint$1 itself is non blocking (as frame org.eclipse.jetty.util.thread.Invocable.invokeNonBlocking(Invocable.java:69) suggests), which means it might exit (hard to find out what conditions tho) before reaching InputStreamResponseListener:124. The issue tho, is that somehow with such condition, it takes long time before thread h2sc-121-thread-1 goes back to handling org.eclipse.jetty.io.ChannelEndPoint$1 which eventually unlocks at InputStreamResponseListener:124 - It did happen, but could sometimes be hours after.

There has to be some weird timing and patterns in jetty thread that trigger such condition. In fact, if we run the load generator with 3 threads, get all of them to hang (might take at least 10+ mins before all of them hang). And then go into JVM, debug and pause the threads briefly and then resume, all those 3 hanging httpShardExecutor threads InputStreamResponseListener.java:318 will be unlocked by the jetty threads.

The 3 threads load almost simulates the same behaviors observed on our prod, except that when they resume, it would print out some exceptions, which is NOT observed in the originally reported issue (?)

2021-11-23 18:56:35.191 ERROR (qtp1426725223-3065) [c:fsloadtest  r:127.0.0.1 x:fsloadtest rid:2556] o.a.s.h.RequestHandlerBase org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at null: parsing error
        at org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:698)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:412)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:761)
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
        at org.apache.solr.handler.component.HttpShardHandler.request(HttpShardHandler.java:80)
        at org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:130)
        at org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:41)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException: parsing error
        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:53)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:696)
        ... 14 more
Caused by: java.nio.channels.AsynchronousCloseException
        at org.eclipse.jetty.http2.client.http.HttpConnectionOverHTTP2.close(HttpConnectionOverHTTP2.java:133)
        at org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2.onClose(HttpClientTransportOverHTTP2.java:170)
        at org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2$SessionListenerPromise.onClose(HttpClientTransportOverHTTP2.java:232)
        at org.eclipse.jetty.http2.api.Session$Listener.onClose(Session.java:206)
        at org.eclipse.jetty.http2.HTTP2Session.notifyClose(HTTP2Session.java:1166)
        at org.eclipse.jetty.http2.HTTP2Session.onClose(HTTP2Session.java:598)
        at org.eclipse.jetty.http2.HTTP2Session.onGoAway(HTTP2Session.java:477)
        at org.eclipse.jetty.http2.parser.Parser$Listener$Wrapper.onGoAway(Parser.java:396)
        at org.eclipse.jetty.http2.parser.BodyParser.notifyGoAway(BodyParser.java:192)
        at org.eclipse.jetty.http2.parser.GoAwayBodyParser.onGoAway(GoAwayBodyParser.java:169)
        at org.eclipse.jetty.http2.parser.GoAwayBodyParser.parse(GoAwayBodyParser.java:139)
        at org.eclipse.jetty.http2.parser.Parser.parseBody(Parser.java:198)
        at org.eclipse.jetty.http2.parser.Parser.parse(Parser.java:127)
        at org.eclipse.jetty.http2.HTTP2Connection$HTTP2Producer.produce(HTTP2Connection.java:261)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
        at org.eclipse.jetty.http2.HTTP2Connection.produce(HTTP2Connection.java:183)
        at org.eclipse.jetty.http2.HTTP2Connection.onFillable(HTTP2Connection.java:138)
        at org.eclipse.jetty.http2.HTTP2Connection$FillableCallback.succeeded(HTTP2Connection.java:361)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at org.eclipse.jetty.util.thread.Invocable.invokeNonBlocking(Invocable.java:69)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.invokeTask(EatWhatYouKill.java:348)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:303)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
        ... 4 more

@patsonluk
Copy link

From initial testing with 9.4.44.v20210927, the random hangs still happen 😭

@noblepaul
Copy link
Collaborator Author

We should use a combination of this and #106

@chatman
Copy link
Collaborator

chatman commented Nov 29, 2021

I've finally managed to reproduce this problem, thanks @patsonluk.
Here's the config: http://sprunge.us/s5u81z
Here's the matchalldocs.json: http://sprunge.us/wb0ttN
Running the above with the ecommerce dataset (download link in README of github.com/fullstorydev/solr-bench) and code using ishan/qa-nodes branch.

Seems like this situation happens under heavy query load. On a 8 core (AMD Ryzen 5700G) machine, I was unable to reproduce this problem even after 15-20 minutes with 3 or 8 queries at a time. Increasing this to 16 queries at a time reproduced the problem. This looks like a case of resource exhaustion, Jetty not doing the right thing under such situations. Seems to me that #106 is a good workaround, since that patch will terminate such queries when system is under heavy load.

@noblepaul noblepaul closed this Jun 16, 2022
@noblepaul noblepaul deleted the noble/upgrade-jetty branch June 16, 2022 11:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants