Reactor Netty in Spring Boot WebFlux application generates metrics with CancelledServerWebExchangeException exception and UNKNOWN outcome when there are no issues #33300
Comments
I created an example repo to reproduce this issue: https://github.com/MBalciunas/spring-webflux-netty-metrics-issue
This looks like a duplicate of #29599.
It's not a duplicate of #29599, as that issue is concerned with the status of the UNKNOWN metric.
This metric is produced when a client closes the connection prematurely. From a Framework perspective, we have no way to know whether the client went away because the response took too long and it cancelled the call, or whether an intermediary network device is at fault. This is why we're using the UNKNOWN outcome here. I'm happy to improve the situation. Any suggestion?

Note: if I remember correctly, apache bench opens and closes a connection for each request (arguably, this is a strange approach for benchmarks). Benchmarking the TCP layer like this locally might explain why some connections are closed before the exchange is complete, from a server perspective.
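As a way to confirm whether cancel signals are actually reaching the server, one option (not something proposed in the thread itself; the class name and log message are illustrative) is a small diagnostic WebFilter that logs every cancelled exchange:

```kotlin
import org.slf4j.LoggerFactory
import org.springframework.stereotype.Component
import org.springframework.web.server.ServerWebExchange
import org.springframework.web.server.WebFilter
import org.springframework.web.server.WebFilterChain
import reactor.core.publisher.Mono

// Hypothetical diagnostic filter: logs a line whenever the reactive pipeline for an
// exchange receives a cancel signal, which is what a premature client disconnect
// looks like from the server side.
@Component
class CancellationLoggingFilter : WebFilter {

    private val log = LoggerFactory.getLogger(CancellationLoggingFilter::class.java)

    override fun filter(exchange: ServerWebExchange, chain: WebFilterChain): Mono<Void> =
        chain.filter(exchange)
            .doOnCancel {
                log.warn("Exchange cancelled before completion: {} {}",
                    exchange.request.method, exchange.request.path)
            }
}
```

If such log lines correlate with the UNKNOWN metrics, the cancellations are real and originate on the client or network side.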
Just tried the example that @MBalciunas provided and for me it's the same.
Also, it seems weird that the sum of the metrics is bigger than the request count.
I'd like to focus on the issue being discussed here: in the case of connections closed prematurely by clients, how is the current behavior incorrect? What would be the expected behavior? Note that apache bench is not highly regarded as a benchmarking tool in general, and in this case it's not even using the keep-alive feature.
I tried with another benchmarking tool, wrk.
Still getting a considerable amount of unknown responses even though there are no errors or indications of clients closing connections. Considering there is a case of premature connection close, behaviour like this is fine. So you don't think there's a possibility that some bug/issue arises on the WebFlux/Netty metrics side?
With your latest tests, is it still the case that this behavior cannot be reproduced without Spring Security?
Without Spring Security:
(metrics screenshot omitted)
@MBalciunas Thanks - I'm reaching out to the Security team internally.
After reaching out to the Security team and testing things further, here's an update. A colleague from the Security team couldn't reproduce the issue on their (more powerful?) machine. I don't think this is really related to Spring Security anyway. I think the behavior is easier to reproduce when Security is involved, because request processing takes longer as we're creating new sessions all over the place. I've carefully reproduced this behavior without Spring Security being involved, by adding an artificial delay to a test controller method as follows:
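The controller snippet itself is not reproduced in the thread; a minimal sketch of the kind of delayed handler described (the /v1/hello path and the 1 ms value are assumptions taken from the rest of the discussion):

```kotlin
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController
import reactor.core.publisher.Mono
import java.time.Duration

// Sketch of a test controller with an artificial delay: the delay keeps more requests
// in flight when the benchmarking tool stops, which is when the cancel signals (and the
// UNKNOWN/CancelledServerWebExchangeException metrics) show up.
@RestController
class HelloController {

    @GetMapping("/v1/hello")
    fun hello(): Mono<String> =
        Mono.just("hello")
            .delayElement(Duration.ofMillis(1)) // artificial delay; exact value is an assumption
}
```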
The cancellation signals reliably happen at the end of the benchmark run. I think that what's happening is that the benchmarking tool sends lots of concurrent requests, stops when the chosen count/duration is reached, and then cancels all the remaining in-flight requests at that point. This explains why the request count is not the same in the benchmarking report and in the metrics count. This is also reported as expected behavior by the wrk maintainers. When running the benchmark for longer periods with the application in debug mode, my breakpoints in the metrics filter were only reached by the end of the benchmark. I think this explains the behavior we're seeing here.

Unless there is another data point that could lead us to a bug in Spring or Reactor, I think it would be interesting to dig a bit more into your production use case and understand where those client disconnects could come from. Could it be possible that a client (or even an intermediary network device) is closing connections before the responses complete?

Back to the actual metric recorded, let me know if you've got ideas to improve this part. From my perspective, the UNKNOWN outcome is the most accurate thing we can record when an exchange is cancelled before completion.
After some more testing locally, with another Kotlin service making java.net.http.HttpClient requests to the example endpoint, these are the results:
There is no indication of client cancels, and those UNKNOWN metrics are logged consistently throughout, not only at the end. So based on this, it really doesn't look like the issue is related to our production services or to the delay time, because a 1 ms delay shouldn't have an influence like this.
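For reference, a client loop along these lines (a hypothetical sketch, not the reporter's actual test code) is enough to drive the endpoint with java.net.http.HttpClient and surface any failed responses on the client side:

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Hypothetical load loop: sends synchronous requests to the sample endpoint and counts
// non-200 responses, so any server-side failure would be visible on the client.
fun main() {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create("http://localhost:8080/v1/hello"))
        .GET()
        .build()

    var failures = 0
    repeat(15_000) {
        val response = client.send(request, HttpResponse.BodyHandlers.ofString())
        if (response.statusCode() != 200) failures++
    }
    println("Failed responses: $failures")
}
```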
I don't think we're making progress here; we're changing the "load test" infrastructure with each comment and we're spending time on something that will not be useful to your problem. In my opinion, there are only three ways out of this:
In the meantime, I'm closing this issue as I can't justify spending more time on this right now. |
We have a bunch of Spring Boot WebFlux services in our project and almost all have this same issue. We use Prometheus for metrics and track the success of requests. However, in those services, from 1% to 20% of the http_server_requests metrics consist of outcome=UNKNOWN with exception=CancelledServerWebExchangeException, while there are no other indications of any issues in server responses, nor any indication that clients cancel that many requests. Examples:
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",platform="UNKNOWN",status="401",uri="UNKNOWN",} 87.0
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",platform="UNKNOWN",status="200",uri="UNKNOWN",} 110.0
I successfully reproduced this locally with a basic WebFlux application template and a single controller, bombarding it with Apache Bench (https://httpd.apache.org/docs/2.4/programs/ab.html):
ab -n 15000 -c 50 http://localhost:8080/v1/hello
I tried substituting Tomcat for Netty and these metrics no longer appeared.
While it doesn't seem to cause direct issues on services running in production, it still interferes with the correctness of the metrics and alerts. We can ignore all the UNKNOWN outcomes, but then we can't know whether those UNKNOWN results come from actual server/client cancels or just from this Netty issue.
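For completeness, if one did decide to drop these series entirely, a Micrometer MeterFilter bean along the following lines would do it (a sketch assuming the standard outcome tag on Spring Boot's http.server.requests metric); the drawback, as noted above, is that genuine client cancellations disappear along with the noise:

```kotlin
import io.micrometer.core.instrument.config.MeterFilter
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class MetricsFilterConfig {

    // Drops http.server.requests series tagged with outcome=UNKNOWN.
    // Spring Boot applies MeterFilter beans to the auto-configured registry.
    @Bean
    fun denyUnknownOutcome(): MeterFilter =
        MeterFilter.deny { id ->
            id.name == "http.server.requests" && id.getTag("outcome") == "UNKNOWN"
        }
}
```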
Someone already had this issue in the past but it was never resolved: https://stackoverflow.com/questions/69913027/webflux-cancelledserverwebexchangeexception-appears-in-metrics-for-seemingly-no
Versions used:
Spring Boot: 2.7.2 and 2.6.2, Kotlin: 1.7.10, JVM: 17