SpringFW 6 / SpringBoot 3.1.4 http_server_requests_seconds_count metrics incorrectly reports outcome=UNKNOWN for success requests and not reporting exception/errors for client cance #31367
Labels
status: duplicate
A duplicate of another issue
We use reactive programming and new observation framework.
In our testing we noticed that metrics "http_server_requests_seconds_count" reporting is incorrectly and result are misleading.
For example if we make 1 requests (warmup after restart) plus 2000 requests (20 threads, 100 times) from Jmeter, we expect to see 2001 success responses reported by metrics. Jmeter reports 2001 success response as expected.
However http_server_requests_seconds_count reports less that 2001 success
http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers/",ver="3.1.10-SNAPSHOT",} 1868.0
And we see one or more status="UNKNOWN", outcome="UNKNOWN"
http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",} 134.0
Total count might be larger than requests count, in our case we got one alltogether 2002 response when we expect to get 2001 success.
Jmeter is using connection pool and closing 20 connection only after the test.
If we use single thread issue is not seen. Looks that something in spring FW is generating false doOnCancel.
There is identical error created for springBoot #31028 but apparently issue not progressing as FW issue, not springboot issue.
Other metrics in system reports correctly 2001 requests.
Spring security, and reactor netty metrics showns correctly 2001 requests, also http client:
spring_security_filterchains_seconds_count{app="service-v3",cid="",error="none",host="hostx",spring_security_filterchain_position="9",spring_security_filterchain_size="9",spring_security_reached_filter_name="AuthorizationWebFilter",spring_security_reached_filter_section="before",springBoot="3.1.4",ver="3.1.10-SNAPSHOT",} 2001.0
http_client_requests_seconds_count{app="service-v3",cid="",client_name="localhost",clientName="localhost",error="none",exception="none",host="hostx",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers/{customer-id}",ver="3.1.10-SNAPSHOT",} 2001.0
reactor_netty_http_server_data_received_time_seconds_count{app="service-v3",cid="",host="hostx",method="GET",springBoot="3.1.4",uri="/",ver="3.1.10-SNAPSHOT",} 2001.0
reactor_netty_http_server_data_sent_time_seconds_count{app="service-v3",cid="",host="hostx",method="GET",springBoot="3.1.4",status="200",uri="/",ver="3.1.10-SNAPSHOT",} 2001.0
Seems that issue happens in ServerHttpObservationFilter class filter and doOnCancel get triggered randomly resulting reporting SUCCESS as UNKNOWN.
In addition we think there is regression for exception/error label handling. doOnCancel does not set exception/error label values.
If we generate client error by canceling JMETER forcefully during the test pressing stop (jmeter close socket or sends reset while waiting response), in springboot 2.5 onwards it reported consistently
"status="200", outcome="UNKNOWN", exception="CancelledServerWebExchangeException."
In SpringBoot 2 initially http_server metrics did not implement doOnCancel at all for metrics at all, so no metrics were seen. We raised the issue to and it was added to 2.5 SpringBoot, fix was apparently done to spring FW.
Unfortunately status was still incorrectly "200" as there is no standard error codes for client canceling connection (de facto 599 could be used) and spring developer made unfortunate decision to leave status to 200
Now in 3.1 it looks that client cancel is not detected at all or very seldom. And cancel errors are reported just like these incorrectly reported "UNKNOWN".
I can see that "cancels ends to be handled in ServerHttpObservationFilter filter as "doOnCancel", but now in 3.1 there are no exception/error set.
Actually ServerHttpObservationFilter filter has code which should catch DISCONNECTED_CLIENT_EXCEPTIONS but apparently lower level stack throw doOnCancel which get triggered instead.
http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",}
If service can response fast when requests is canceled sometimes we might get these metrics, but very seldom. AbortedException probably means that cancel is propagated, but not happening consistently. Sometimes also uri is changed to "UNKNOWN.
http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="UNKNOWN",ver="3.1.10-SNAPSHOT",} 5.0
http_server_requests_seconds_count{app="service-v3",cid="",error="AbortedException",exception="AbortedException",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",} 2.0
if service use http client and client calls timeout then WebClientRequestException is added correctly to server and client metrics.
http_server_requests_seconds_count{app="service-v3",cid="",error="WebClientRequestException",exception="WebClientRequestException",host="hostx",method="GET",outcome="SERVER_ERROR",springBoot="3.1.4",status="500",uri="/customers"} 1.0
http_client_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",client_name="localhost",clientName="localhost",error="WebClientRequestException",exception="WebClientRequestException",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="CLIENT_ERROR",uri="/customers/{customer-id}} 1.0
Sounds that DISCONNECTED_CLIENT_EXCEPTIONS is catching only http client calls the service initiates but fails due timeouts etc, but not cancel which http server receiver from calling client.
I think that "CancelledServerWebExchangeException" exception/error should somehow also propagated to "doOnCancel"
All this means that http_server metrics are now very unreliable, inaccurate, random , thus very hard or impossible to use for tracking and correlating to errors.
doOnCancel should always set exception/error to other than "none", to differentiate form success.
The text was updated successfully, but these errors were encountered: