-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Idle timeout occured sometimes on HTTP/2 client with InputStreamResponseListener
#8558
Comments
@clardeur thanks for the detailed report! I can definitely see a race condition, we're working to analyze it and fix it. |
…istener. The issue was that HttpReceiverOverHTTP2.ContentNotifier.offer() was racy, as a network thread could have offered a DATA frame, but not yet called process() -- yet an application thread could have stolen the DATA frame completed the response and started another response, causing the network thread to interact with the wrong response. The implementation has been changed so that HttpReceiverOverHTTP2.ContentNotifier does not have a queue anymore and it demands DATA frames to the Stream only when the application demands more -- a simpler model that just forwards the demand. Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@clardeur would you be able to try branch |
Hi @sbordet, could you backport the commit on the If it's not possible, I can backport the changes myself locally just for testing purpose. But in any case, we need the fix on the |
…istener. Updates after review. Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@clardeur I just merged the changes to Jetty 11 too, so you can build locally branch |
…istener. Fixed override of reset(). Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
I have re-run a job that crash every time due to this issue and it goes well 👍 My 2 cents |
@clardeur we have added your test to our suite, if that's what you meant. The problem is that we can have a million edge cases, with 7 different transports, with many different ways of doing the same thing, and while we try, we cannot possibly cover all of them. Your case only happens with HTTP/2, only with I'm open to suggestions, but the number of permutations (and some of them like thread scheduling can only be stochastic) is typically too large to have a perfect coverage. |
IMO my test case is very specific and related to this issue. I'm not sure if it's a good test to add.
Hard to find, indeed 😅
Yeah I know, a complex problem that requires time and expertise in testing. Since we have been using Jetty in production for 7 years now, some issues (#7107) (486829) that have been reported previously could have been avoided with some "basic" scenario. It's likely that some issues are more related to the client part of the project which is maybe less used in production compared to the server part. The first step might be to add a nominal test case for each transport layer and parametrize only one major component at a time to avoid an exponential explosion, like the connection pool, which is a core component in my opinion. And in a second step, adding some "chaos" like randomness, latency, etc. I know it's far from being a perfect coverage but it's like a safety net that does not require too much effort. In any case, @sbordet, it's always a pleasure to take time to explore the code, reproduce the problem, understand how it's works to submit a proper issue, open source maintainers deserve it ! |
A "randomized testing" philosophy is intended to address this. Given enough randomized executions, you may well cover all of the scenarios :-). Apache Lucene and Solr embrace this philosophy. We randomize the JDKs, GCs, TimeZones, and within Lucene & Solr, arbitrary choices in implementations or various things specific to a test. Like randomly picking some data to use to test on. The main requirement is an ability to have repeatable randomness so that there is a greater chance of being able to reproduce a problem that is discovered. Lucene & Solr use a testing utility (JUnit runners, etc.) with an apt name for this -- randomizedtesting. It's not all wonderful. Debugging a failure that does not reproduce, despite the seed, is a challenge. Sometimes the outcome is the test needing to be less random or guard against something that can occur based on the randomness -- thus not a real success. But it does find obscure bugs! |
Jetty version(s)
11.0.8
Java version/vendor
openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu118.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu118.04, mixed mode, sharing)
OS type/version
Linux 5.4.0-125-generic 141~18.04.1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
Description
We have switched in production the response listener of the HTTP/2 client, before we used the
FutureResponseListener
and now theInputStreamResponseListener
, since this change we are observing regularly failures with timeouts and it happens randomly.This is very similar in symptoms to the issue #7192
How to reproduce?
The problem is difficult to reproduce due to its randomness. As described in the comment of the similar issue (#7192), when we increase the log level to
DEBUG
, the problem is even less reproducible.I still managed to reproduce the problem thanks to a test in the jetty project that triggers the timeout almost every time.
Test case - Code
Test case - Failure Stacktrace
Test case - Output
Full output
Interesting snippets
The failure happened after that the
ContentNotifier
will be marked asinactive
andstalled
in this code path :After that the content notifier does not resume the processing and therefore does not notify the
InputStreamResponseListener
, hence the final timeout.After that my understanding of the code and the design of the components is not sufficient to go further in the analysis. I don't know if I'm completely on the wrong track or not, WDYT ?
The text was updated successfully, but these errors were encountered: