Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of cancelled requests when publishing WebClient metrics #18444

Closed
rubasace opened this issue Sep 30, 2019 · 10 comments
Closed
Assignees
Labels
type: enhancement A general enhancement
Milestone

Comments

@rubasace
Copy link

rubasace commented Sep 30, 2019

Currently, MetricsWebClientFilterFunction does handle the signal via doOnEach() method. When a timeout cancels the pipeline, there's no signal emitted so no metric will be recorded.
It seems to me that this can cause the WebClient metrics to be misleading, as they will only show numbers lower than the timeout applied to the pipeline.

@rubasace rubasace changed the title timeout MetricsWebClientFilterFunction doesn't handle timeout Sep 30, 2019
@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Sep 30, 2019
@philwebb philwebb added the for: team-attention An issue we'd like other members of the team to review label Sep 30, 2019
@bclozel
Copy link
Member

bclozel commented Oct 1, 2019

I'm not really sure about this.

We're recording metrics on client operations and we're adding tags using data available.
For HTTP clients, we're tagging the "status" as the HTTP status itself, IO_ERROR in case of IOException and CLIENT_ERROR for other errors.

In this case, it seems that the application code is cancelling the operation with a timeout operator. From an infrastructure point of view, we don't know if the cancel signals comes from an expected use case, such as using the fastest responding service and cancelling the others, or an actual application exception that cancels the demand upstream.
We don't support the now deprecated AsyncRestTemplate with an AsyncClientHttpRequestInterceptor, so we can't really align with an existing behavior of Future.cancel()

I've got a fix for this issue, but:

  • it only works for reactor 3.3.0 as I'm leveraging a new operator (so only applicable to Spring Boot 2.2.x)
  • so far the produced metrics will be outcome:UNKNOWN and status:CLIENT_ERROR, but I'm not sure this really makes sense

From metrics point of view, fixing this problem could show many errors with a timer right at the timeout value. If the timeout is really a feature, then dashboards will be overwhelmed by errors and would create false positives or hide actual errors. If the timeout is considered as an error, then it would help developers to tune their timeout value.

Maybe using a specific CANCEL status tag could at least avoid confusion.

@rubasace
Copy link
Author

rubasace commented Oct 2, 2019

I was thinking of a specific tag. As you state, UNKNOWN can cause a lot of false positives, while CANCEL will give enough granularity.

Regarding the new operator, can you provide any link to follow up/understand the context?

Also, I'm not sure about why the new operator is needed for this particular scenario, isn't the doOnCancel() good enough? I understand that current implementation depends on the SubscriberContext to store the startTime, and the context won't be available from the doOnCancel() runnable, but would it have any impact if the start time is just stored in a variable outside the pipeline?

@bclozel
Copy link
Member

bclozel commented Oct 2, 2019

I was thinking of a specific tag. As you state, UNKNOWN can cause a lot of false positives, while CANCEL will give enough granularity.

So you think the following tags outcome:UNKNOWN and status:CANCELLED would work?

Regarding the new operator, can you provide any link to follow up/understand the context?

I'm using Mono.deferredWithContext; there is another way to achieve the same thing but the CANCEL signal is not taken into account in my implementation in that case.

but would it have any impact if the start time is just stored in a variable outside the pipeline?

See this commit for why we're using the reactor context: bdd95f0

@rubasace
Copy link
Author

rubasace commented Oct 2, 2019

Sorry, on the comment above I meant CLIENT_ERROR can cause a lot of false positives.

So you think the following tags outcome:UNKNOWN and status:CANCELLED would work?

I think so, it offers enough information to filter them. Though I don't see the outcome tag now on the http_client_requests_* entries, only on the http_server_requests_* ones. Is this something being introduced on 2.2.x?

@philwebb philwebb added for: team-meeting An issue we'd like to discuss as a team to make progress and removed for: team-attention An issue we'd like other members of the team to review labels Oct 2, 2019
@bclozel
Copy link
Member

bclozel commented Oct 11, 2019

Before adding any new outcome tag value, we need to discuss this with @shakuzen as this use case might happen in with other parts of the infrastructure.

@rubasace
Copy link
Author

Sure, there's definitely no rush from my side on such scenario, and definitely I can hack a workaround on my side if needed. Could you, in the meantime, clarify this part? :

Though I don't see the outcome tag now on the http_client_requests_* entries, only on the http_server_requests_* ones. Is this something being introduced on 2.2.x?

@bclozel
Copy link
Member

bclozel commented Oct 15, 2019

@rubasace yes, see #15594

@wilkinsona wilkinsona removed the for: team-meeting An issue we'd like to discuss as a team to make progress label Jan 31, 2020
@wilkinsona wilkinsona added type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged labels Mar 12, 2020
@wilkinsona wilkinsona added this to the 2.3.x milestone Mar 12, 2020
@bclozel bclozel modified the milestones: 2.3.x, 2.3.0.RC1 Apr 10, 2020
@rubasace
Copy link
Author

@bclozel what's the current implementation behaviour when we are consuming the body as a Flux and we limit it with something like next() or take(n). I implemented something very similar on a custom ExchangeFilterFunction for logging and in those scenarios, I was logging false positives for timeouts as after receiving the requested elements a cancellation signal is sent upstream to stop the producer. I think the same might happen here with the current implementation.

The way I solved it was simply flagging somehow on the doOnNext() operations that we already received something, so if that happens I just ignore the cancellation signal assuming it's desired behaviour. Of course, there's the edge case where we are actually timing out on a Flux if a particular element takes too long, but I think it's an acceptable tradeoff given that we don't know the reason of the cancellation.

What do you think?

@bclozel
Copy link
Member

bclozel commented Apr 11, 2020

@rubasace this feature is not about recording timeouts, but recording cancelled requests. This can happen for various operators like timeout or take, and in all cases we can't know whether this is the intended behavior or an error.

From a pure WebClient point of view, we're recording cancellations because the request did not complete and we cannot record tags with the response (status, execution time).

The workaround you're describing might work, but we'd still get "false negatives" if the client receives the response headers but no body, or only a partial body.
You're right that in this case we could still record metrics twice for the same request: once when receiving the response, and another time if a cancel event happens while reading the body.

I think it's still a good compromise and I'll adapt these changes to only record cancellations if we did not receive any response at all.

@bclozel bclozel reopened this Apr 11, 2020
@PyvesB
Copy link

PyvesB commented Apr 23, 2020

I also just bumped into this issue recently. Mapping cancelled signals for client requests to CLIENT_ERROR sounds sensible and happens to be consistent with the status tag value that gets used when you configure a ReadTimeoutHandler or WriteTimeoutHandler at the Netty level. Thanks for working on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

6 participants