Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SpringFW 6 / SpringBoot 3.1.4 http_server_requests_seconds_count metrics incorrectly reports outcome=UNKNOWN for success requests and not reporting exception/errors for client cancel #31371

Closed
jtorkkel opened this issue Oct 6, 2023 · 5 comments
Labels
status: duplicate A duplicate of another issue

Comments

@jtorkkel
Copy link

jtorkkel commented Oct 6, 2023

Current situation:

There are 3 main issues in http_server_requests_seconds_count/sum/max/bucket metrics in SpringBoot 3.x/Springframework 6.x

  1. success response are randomly often counted as status="UNKNOWN", outcome="UNKNOWN", error="none". Up to 10% of the calls might be incorrectly reported. The calls itself works correctly, so some doOnCancel due race before metrics are recorded
  2. random extra status="UNKNOWN", outcome="UNKNOWN", error="none" metrics added not matching the request count. Probability << 1%. Those are not real meatrics but some kind of duplicate false doOnCancel events, maybe race double doOnCancels
  3. calling client cancel (TCP close/reset) are not recorded with exception/error label but reported as status="UNKNOWN", outcome="UNKNOWN",error="none. doNotCancel source not recording to exception/error like in SringBoot2.5

Mixing success and UNKNOWN, and reporting incorrect counts results that http_server_requests_seconds_count/sum/max/bucket cannot be used to alerting or monitoring purpose. Data is false and cause false alerts.

metrics spring_security_filterchains_seconds_count and reactor_netty_http_server_data_sent_time_seconds_count report correct counts and status="200" for 1 and 2. Also all metrics generate by service like http_client_* and related netty metrics are correct.

canceled case 3 requests reported correctly in spring_security_filterchains_seconds_count filter spring_security_reached_filter_name="AuthorizationWebFilter"
Not reported in filter spring_security_reached_filter_name="ServerWebExchangeReactorContextWebFilter", also reactor_netty_http_server_data_sent_time_seconds_count seems not to report other than 200

Expected behavior:

  • metrics success count status="200, outcome="SUCCESS", error="none" to match requests count when no errors, success not reported as status="UNKNOWN", outcome="UNKNOWN", error="none"
  • no extra/duplicate status="UNKNOWN", outcome="UNKNOWN", error="none" metrics without matching requests
  • client calling the spring application and requests canceled should be counted as status="UNKNOWN", outcome="UNKNOWN", exception/error="CancelledServerWebExchangeException" instead of status="UNKNOWN", outcome="UNKNOWN", exception/error="none"

Tested Versions:

  • SpringBoot 3.1.4 (also 3.1.2)
  • Spring Framework 6.x matching springBoot
  • Wndows and Linux
  • Java 17
  • We use reactive programming and new observation framework.

How to reproduce:

1 and 2. Simplest way just generate requests usign Jmeter with paraller threads for long enough duration, for example 20 threads and 100000 request for each thread, then check that metrics report 2M success and no UNKNOWN.
3 can be tested by generating traffic similar way, but stopping the jmeter forcefully. Metrics should report canceled call with exception/error label different than "none", or as success if specific thread was not canceled. No status="UNKNOWN, outcome=UNKNOWN", error="none" should be reported.

Detailed description:

In our testing we noticed that metrics "http_server_requests_seconds_count" reporting is incorrectly and result are misleading.

For example if we make 1 requests (warmup after restart) plus 2000 requests (20 threads, 100 times) from Jmeter, we expect to see 2001 success responses reported by metrics. Jmeter reports 2001 success response as expected.

However http_server_requests_seconds_count reports less that 2001 success

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers/",ver="3.1.10-SNAPSHOT",} 1868.0

And we see one or more status="UNKNOWN", outcome="UNKNOWN"

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",} 134.0

Total count might be larger than requests count, in our case we got one alltogether 2002 response when we expect to get 2001 success.

Jmeter is using connection pool and closing 20 connection only after the test.

If we use single thread issue is not seen. Looks that something in spring FW is generating false doOnCancel.

There is identical error created for springBoot Spring Boot (version 3.1.2) has UNKNOWN status in prometheus metrics but apparently issue not progressing as FW issue, not springboot issue.

Other metrics in system reports correctly 2001 requests.
Spring security, and reactor netty metrics showns correctly 2001 requests, also http client:
spring_security_filterchains_seconds_count{app="service-v3",cid="",error="none",host="hostx",spring_security_filterchain_position="9",spring_security_filterchain_size="9",spring_security_reached_filter_name="AuthorizationWebFilter",spring_security_reached_filter_section="before",springBoot="3.1.4",ver="3.1.10-SNAPSHOT",} 2001.0
http_client_requests_seconds_count{app="service-v3",cid="",client_name="localhost",clientName="localhost",error="none",exception="none",host="hostx",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers/{customer-id}",ver="3.1.10-SNAPSHOT",} 2001.0
reactor_netty_http_server_data_received_time_seconds_count{app="service-v3",cid="",host="hostx",method="GET",springBoot="3.1.4",uri="/",ver="3.1.10-SNAPSHOT",} 2001.0
reactor_netty_http_server_data_sent_time_seconds_count{app="service-v3",cid="",host="hostx",method="GET",springBoot="3.1.4",status="200",uri="/",ver="3.1.10-SNAPSHOT",} 2001.0

Seems that issue happens in ServerHttpObservationFilter class filter and doOnCancel get triggered randomly resulting reporting SUCCESS as UNKNOWN.

	private static final Set<String> DISCONNECTED_CLIENT_EXCEPTIONS = Set.of("AbortedException",
			"ClientAbortException", "EOFException", "EofException");

	private Publisher<Void> filter(ServerWebExchange exchange, ServerRequestObservationContext observationContext, Mono<Void> call) {
		Observation observation = ServerHttpObservationDocumentation.HTTP_REACTIVE_SERVER_REQUESTS.observation(this.observationConvention,
				DEFAULT_OBSERVATION_CONVENTION, () -> observationContext, this.observationRegistry);
		observation.start();
		return call.doOnEach(signal -> {
					Throwable throwable = signal.getThrowable();
					if (throwable != null) {
						if (DISCONNECTED_CLIENT_EXCEPTIONS.contains(throwable.getClass().getSimpleName())) {
							observationContext.setConnectionAborted(true);
						}
						observationContext.setError(throwable);
					}
					onTerminalSignal(observation, exchange);
				})
				.doOnCancel(() -> {
					observationContext.setConnectionAborted(true);
					observation.stop();
				})
				.contextWrite(context -> context.put(ObservationThreadLocalAccessor.KEY, observation));
	}

In addition we think there is regression for exception/error label handling. doOnCancel does not set exception/error label values.

If we generate client error by canceling JMETER forcefully during the test pressing stop (jmeter close socket or sends reset while waiting response), in springboot 2.5 onwards it reported consistently
"status="200", outcome="UNKNOWN", exception="CancelledServerWebExchangeException."

In SpringBoot 2 initially http_server metrics did not implement doOnCancel at all for metrics at all, so no metrics were seen. We raised the issue to and it was added to 2.5 SpringBoot, fix was apparently done to spring FW.
Unfortunately status was still incorrectly "200" as there is no standard error codes for client canceling connection (de facto 599 could be used) and spring developer made unfortunate decision to leave status to 200

Now in 3.1 it looks that client cancel is not detected at all or very seldom. And cancel errors are reported just like these incorrectly reported "UNKNOWN".
I can see that "cancels ends to be handled in ServerHttpObservationFilter filter as "doOnCancel", but now in 3.1 there are no exception/error set.

Actually ServerHttpObservationFilter filter has code which should catch DISCONNECTED_CLIENT_EXCEPTIONS but apparently lower level stack throw doOnCancel which get triggered instead.

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",}

If service can response fast when requests is canceled sometimes we might get these metrics, but very seldom. AbortedException probably means that cancel is propagated, but not happening consistently. Sometimes also uri is changed to "UNKNOWN.

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="UNKNOWN",ver="3.1.10-SNAPSHOT",} 5.0
http_server_requests_seconds_count{app="service-v3",cid="",error="AbortedException",exception="AbortedException",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",} 2.0

if service use http client and client calls timeout then WebClientRequestException is added correctly to server and client metrics.
http_server_requests_seconds_count{app="service-v3",cid="",error="WebClientRequestException",exception="WebClientRequestException",host="hostx",method="GET",outcome="SERVER_ERROR",springBoot="3.1.4",status="500",uri="/customers"} 1.0
http_client_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",client_name="localhost",clientName="localhost",error="WebClientRequestException",exception="WebClientRequestException",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="CLIENT_ERROR",uri="/customers/{customer-id}} 1.0

Sounds that DISCONNECTED_CLIENT_EXCEPTIONS is catching only http client calls the service initiates and reports http client timeouts etc, but not cancel which http server receiver from calling client.

I think that "CancelledServerWebExchangeException" exception/error should somehow also propagated to "doOnCancel"

All this means that http_server metrics are now very unreliable, inaccurate, random, thus very hard or impossible to use for tracking and correlating to errors.

doOnCancel should always set exception/error to other than "none", to differentiate form success.

@jtorkkel
Copy link
Author

jtorkkel commented Oct 6, 2023

I create this as #31365 and #31367 were incorrectly closed and not reopened.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Oct 6, 2023
@snicoll
Copy link
Member

snicoll commented Oct 6, 2023

@jtorkkel stop creating duplicate issues and review our guidelines for contributing before contributing again. There are comments that justify why this was closed in the original issue you've raised.

@snicoll snicoll closed this as not planned Won't fix, can't repro, duplicate, stale Oct 6, 2023
@snicoll snicoll added status: duplicate A duplicate of another issue and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Oct 6, 2023
@sepanniemi
Copy link

@snicoll the original one was closed because it was reported with a typo in Spring Boot version and thus thought it was for unsupported version. This is a problem that appeared when Spring Boot version was updated from 2.7.x to 3.1.x and is preventing our production services now to be upgraded to latest Spring Boot.

@bclozel
Copy link
Member

bclozel commented Oct 6, 2023

@sepanniemi it was closed mainly because of the previous comments I added. I've reopened the original issue to reconsider it anyway.

@jtorkkel

This comment was marked as duplicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: duplicate A duplicate of another issue
Projects
None yet
Development

No branches or pull requests

5 participants