Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebFlux server metrics record http status 200 for CancelledServerWebExchangeException #29599

Closed
lhahne opened this issue Jan 31, 2022 · 5 comments
Labels
status: declined A suggestion or change that we don't feel we should currently apply type: enhancement A general enhancement

Comments

@lhahne
Copy link

lhahne commented Jan 31, 2022

Version: Spring Boot 2.6.3.RELEASE with Webflux & Metrics

Scenario

  • Client aborts the connection before server sends a response.
  • Server request processing takes forever (server configured with read/write timeout). Server closes the connection

In both the above cases, metrics are generated. However they are stored with status = 200 implying success. Eg:

http_server_requests_seconds{app="test-v0",cid="",exception="CancelledServerWebExchangeException",host="xxx",method="GET",outcome="UNKNOWN",springBoot="2",status="200",uri="/slow",ver="0.0.1",quantile="0.5",} 0.872415232

Please see earlier issue #23606 for reference source code.

The correct status code for this situation is clearly not well defined. However by making it 200, the result looks like success if we monitor fot HTTP Status code. Something from the 400 series would suit this better.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Jan 31, 2022
@bclozel
Copy link
Member

bclozel commented Jan 31, 2022

The HTTP status doesn't imply success or failure from an application nor a client perspective, it just points to the HTTP response status. If the status was not set when the connection was interrupted, then we can only imply the default: 200 OK.

Note that the outcome tag is set to "UNKNOWN", which is really the case here: we don't know if the server took longer than the configured timeout on the client side, or if the client didn't need the response after all. This can happen a lot with scatter/gather use cases with WebClient.

Here's the problem with setting HTTP 400 here:

  • if the client has a really short timeout or cancelled because it didn't need the info after all, should we still count this as an invalid request?
  • if server takes too long, maybe we would have hit a server timeout and a 500 status instead?
  • the exception and outcome tags already describe quite well the situation: client disconnected and we don't know why

As discussed in #18444 (comment), there is no strong agreement there on the metrics side of things, and "UNKNOWN" is the recommendation so far. As for the HTTP status itself, I don't think we should change its meaning - we should either take what was set or leave the default.

@bclozel bclozel added the status: waiting-for-feedback We need additional information before we can continue label Jan 31, 2022
@jtorkkel
Copy link

In metrics / Grafana we normally use only "status", outcome is not used as "overlapping". outcome was added only in springboot2. In spingboot1 outcome was missing, at least from initial versions.
Using both complicates all the queries because you must filter outcome="UNKNOWN", from success queries. To avoid extra filtering any status=4xx error code eliminate need to update tens of dashboards.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Jan 31, 2022
@bclozel
Copy link
Member

bclozel commented Feb 14, 2022

I completely understand the problem with dashboards, unfortunately there is no way to solve that without breaking other apps or making tags up from unknown data. I'm closing this issue for now, feel free to comment it with new information.

@bclozel bclozel closed this as completed Feb 14, 2022
@bclozel bclozel added status: declined A suggestion or change that we don't feel we should currently apply type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged status: feedback-provided Feedback has been provided labels Feb 14, 2022
@s-volkov-1
Copy link

Hello, I described related problem in stackoverflow post: https://stackoverflow.com/questions/69913027/webflux-cancelledserverwebexchangeexception-appears-in-metrics-for-seemingly-no
TLDR: up to ~10% of requests to my server end up with CancelledServerWebExchangeException, according to prometheus metrics. But from the client perspective, everything with these requests looks fine.

@s-volkov-1
Copy link

@bclozel could you have a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: declined A suggestion or change that we don't feel we should currently apply type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

5 participants