On Cancel, ObservationWebFilterDecorator Starts After-Filter Span without Stopping It #14031
Comments
Hi, @jtorkkel! Thanks for the detailed report. I've reproduced the issue you described and will work towards a fix.
Thanks. I also suspect that some normal counters like http.server.requests had double counting: spring-projects/spring-framework#31417. You can see from commit spring-projects/spring-framework@da95542 how they prevented the double counting.
Gotcha. We already need to do something similar for the before and after spans, so I think I'd try and reuse that. Will you please open a separate ticket for that so that it can be tracked independently of this fix for cancellations?
Depending on when a request is cancelled, the before and after observation starts and stops may be called out of order due to the order in which their doOnCancel handlers are invoked. To address this, the before filter-wrapper now always closes both the before observation and the after observation. Since the before filter-wrapper wraps the entire request, this ensures that either observation that was started is stopped, and that an observation that has not yet been started cannot inadvertently be started by any unexpected ordering of events that follows. Closes gh-14031
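The guarantee described in that commit message can be sketched in plain Java. This is illustrative only, not Spring Security's actual classes: an observation whose stop is idempotent and whose start becomes a no-op once the guard is closed, so an out-of-order doOnCancel callback can neither stop a span twice nor start a late "after" span.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch -- not Spring Security's real ObservationWebFilterDecorator.
final class GuardedObservation {
    private final AtomicBoolean started = new AtomicBoolean();
    private final AtomicBoolean closed = new AtomicBoolean();

    /** Starts only if not already closed; ignored after close(). */
    boolean start() {
        return !closed.get() && started.compareAndSet(false, true);
    }

    /** Closes the observation; returns true only for the call that stopped a running one. */
    boolean close() {
        return closed.compareAndSet(false, true) && started.get();
    }

    boolean isRunning() {
        return started.get() && !closed.get();
    }
}

public class CancelOrderingDemo {
    public static void main(String[] args) {
        GuardedObservation before = new GuardedObservation();
        GuardedObservation after = new GuardedObservation();

        before.start();   // request enters the filter chain

        // Request is cancelled: the before wrapper closes BOTH observations,
        // regardless of the order in which cancel handlers fire.
        before.close();
        after.close();

        // A late, out-of-order start of the "after" span is now a no-op,
        // so no long task timer is left running forever.
        boolean lateStart = after.start();
        System.out.println("before running: " + before.isRunning());
        System.out.println("after started late: " + lateStart);
    }
}
```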
Thanks. I verified using a 6.1.6 snapshot that the active counters are now correct.
Current situation:
We are using Spring Boot 3.1 in preproduction and have noticed that the new longTaskTimer metric "spring_security_filterchains_active_seconds_active_count{spring_security_reached_filter_name="none",spring_security_reached_filter_section="after"}" keeps increasing constantly. We have also noticed that the http_server_requests metric reports inaccurate request counts with status="UNKNOWN", outcome="UNKNOWN".
In a 24 h test where we made 50 million requests to the service using ~20 parallel threads over a connection pool, spring_security_filterchains_active_seconds_active_count increased to 8000; after 10 days the value was still 8000 (the server was running but received no more requests).
At the same time the http_server_requests count was 20000 larger than the netty_server metrics, hinting that some requests were counted twice.
The same issue happens when a test client like JMeter is forced to "stop" requests before getting a response. When repeated more than 10000 times, spring_security_filterchains_active_seconds_active_count increases to over 10000, indicating there is no maximum limit on a single longTaskTimer metric. Each longTaskTimer is stored in the Micrometer registry, so this is effectively a memory leak and the service will eventually OOM or slow down.
The cancel() also happens during short tests of a few seconds where a connection pool is used and no sockets are closed during the test.
Our analysis indicates that in all these cases the cancel() causes the following issues.
Issue status and related issues created
For example, after 5000 requests (10 threads, 500 requests each) the metrics might show 4836.0 with status="200" and 190.0 with status="UNKNOWN", totalling 5026 requests, while Netty shows that 5000 were handled.
Similarly, 135 spring_security_filterchains "after" longTaskTimers were created but never stopped, resulting in a leak.
spring_security_filterchains_active_seconds_active_count{spring_security_filterchain_position="0",spring_security_filterchain_size="0",spring_security_reached_filter_name="none",spring_security_reached_filter_section="after",} 135.0
spring_security_filterchains_seconds_count{error="none",spring_security_filterchain_position="13",spring_security_filterchain_size="13",spring_security_reached_filter_name="ServerWebExchangeReactorContextWebFilter",spring_security_reached_filter_section="after",} 4853.0
custom_channel_invocations_total{source="None",uri="/monoCancelNoLog",} 5000.0
reactor_netty_http_server_data_sent_bytes_count{uri="/",} 5000.0
reactor_netty_http_server_data_received_time_seconds_count{method="GET",uri="/",} 5000.0
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/monoCancelNoLog",} 4836.0
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/monoCancelNoLog",} 190.0
Desired situation:
If used in production, spring_security_filterchains_active_seconds_active_count might cause the longTaskTimer count to increase to very high values, probably into the millions.
Note that cancel() does not impact the business logic or the actual data streams; in testing no errors were seen in the HTTP responses. It only seems to affect the "observations".
Environment:
Repro steps:
We initially thought this only happened with JWT, but the cancel() and the spring_security_filterchains_active_seconds_active_count increase also reproduce with BASIC auth and without authentication.
It similarly reproduces with external calls like WebClient, but also with a simple Mono handler in a RestController.
The attached example application can be used to reproduce the cancel() and the spring_security_filterchains_active_seconds_active_count increase.
example.zip
How to repro random cancel():
Make a few thousand calls to the endpoint "GET localhost:8081/monoCancel" with a minimum of 10 threads; if no cancel() occurs, use 50 threads. The cancel() and the spring_security_filterchains_active_seconds_active_count increase do not reproduce with fewer threads.
It also seems that on Windows cancel() is much more frequent than on Linux; on Windows even 5-10% of the calls might result in a cancel(). On Windows I can always reproduce with 10 threads and 100 requests per thread; with 10 threads and 10 requests each I have to try a few times.
If there are too many console entries without a cancel(), use "GET localhost:8081/monoCancelNoLog"; then you only see a stack trace when a cancel() happens.
/monoCancel is the simplest possible RestController handler.
Expected results with onComplete() and no cancel() when two calls are made using a single thread to "GET localhost:8081/monoCancel":
In the above logs, Reactor Netty HttpServerOperations logging is enabled to show that client pooling works and the socket is reused.
CustomDefaultMeterObservationHandler extra logging is enabled, which logs http_server_requests and spring_security_filterchains onStart and onStop.
CustomWebFilter is disabled because reactive stream logging is enabled in the API; CustomWebFilter can be used to log APIs that do not have logging enabled.
CustomServerRequestObservationConvention customizes the http_server_requests status, outcome, and exception values; status will be fixed in the next Spring Framework release.
They are enabled using properties
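A sketch of what such an application.properties could look like. The custom.* keys are hypothetical names for the example app's toggles, not Spring Boot built-ins; only the logging.level.* entries (mentioned elsewhere in this report) are standard Spring Boot logging properties:

```properties
# Illustrative only -- the custom.* keys are hypothetical, not Spring Boot built-ins.
custom.meter-observation-handler.logging=true
custom.web-filter.enabled=false

# Standard Spring Boot logging properties used in the repro logs:
logging.level.reactor.netty.http.server.HttpServerOperations=DEBUG
logging.level.reactor.netty.channel.ChannelOperations=TRACE
```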
This result shows a random cancel() on request number 99 and, in another test slightly later, on sample 761. I was using 10 threads with 100 requests each and used console search to find the cancel.
In the second test only 10x10 was used, but it had to be run twice to get one cancel(). The extra onStart and onStop entries cannot be correlated and are not shown here, but when a cancel() occurs, the STOP for name=spring.security.filterchains is missing, indicating that the cancel() is not handled and therefore the longTaskTimer metric keeps running forever.
What is interesting here is that the cancel() timestamp is 5-6 ms after onComplete(); maybe the latency somehow triggers the extra cancel().
Full log in attached file channelOperationsExceptionCancel.txt
By also enabling logging.level.reactor.netty.channel.ChannelOperations=TRACE we can see a different exception during the cancel.
Full log in attachment channelOperationsExceptionCancel.txt
How to repro external cancel():
Simply make a single call to "GET localhost:8081/monoCancelDelay" and stop the request within 5 s. Each stopped call results in a cancel() and increases spring_security_filterchains_active_seconds_active_count by one.
It can also be tested using the API "GET localhost:8081/statusLocalDelay", which makes a WebClient call to the service's own status/health endpoint with a delay, allowing the client to be stopped, or by making a WebClient call to an external API using /status with a delayed response.
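The external-cancel scenario can be sketched with only the JDK, without Spring or JMeter: a server delays its response and the client abandons the request before the response arrives, which is the equivalent of stopping a load-test client mid-call. The endpoint path mirrors the example app's /monoCancelDelay, but everything here is a standalone illustration, not the attached application.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// JDK-only sketch of the "external cancel" repro (no Spring involved).
public class ExternalCancelRepro {
    public static void main(String[] args) throws Exception {
        // Tiny local server that delays its response for 2 seconds.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/monoCancelDelay", exchange -> {
            try { TimeUnit.SECONDS.sleep(2); } catch (InterruptedException ignored) { }
            byte[] body = "ok".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        int port = server.getAddress().getPort();

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:" + port + "/monoCancelDelay")).build();

        CompletableFuture<HttpResponse<String>> call =
                client.sendAsync(request, HttpResponse.BodyHandlers.ofString());
        TimeUnit.MILLISECONDS.sleep(300);

        // Abandon the request before the delayed response arrives -- on the
        // server side this surfaces as a cancelled exchange.
        boolean cancelled = call.cancel(true);
        System.out.println("request cancelled before response: " + cancelled);
        server.stop(0);
    }
}
```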
The example app also contains extra features that were used during the repro.
There are also a few other APIs. Use a mock server as the external server; adding a delay allows a cancel while waiting for the response.