
Spring Boot (version 3.1.2) has UNKNOWN status in prometheus metrics #31028

Closed
dimon8829 opened this issue Aug 10, 2023 · 14 comments
Labels
in: web (Issues in web modules: web, webmvc, webflux, websocket), status: duplicate (A duplicate of another issue)

Comments

@dimon8829

Hi all. When I upgraded Spring Boot from version 2.2.13.RELEASE to version 3.1.2, a problem appeared with the metrics.

application.yml:

management:
  endpoint:
    health:
      show-details: "ALWAYS"
    metrics:
      enabled: true
    prometheus:
      enabled: true
  endpoints:
    web:
      exposure:
        include: "*"
  metrics:
    export:
      prometheus:
        enabled: true

Namely, an UNKNOWN status appeared in some metrics, although in fact all requests were successful and should have had a status of 200:

http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 9993.0
http_server_requests_seconds_sum{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 80.860505828
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/actuator/prometheus",} 10.0
http_server_requests_seconds_sum{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/actuator/prometheus",} 0.170536004

This can easily be reproduced by sending many parallel requests (here, 10 threads with 1000 requests each) to the /actuator/prometheus endpoint:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Standalone reproducer; the class name is only a placeholder.
public class PrometheusLoadTest {

    public static void main(String[] args) {
        int numThreads = 10;
        int numRequestsPerThread = 1000;

        ExecutorService executorService = Executors.newFixedThreadPool(numThreads);
        HttpClient httpClient = HttpClient.newHttpClient();

        // Fire numThreads * numRequestsPerThread GET requests at the endpoint in parallel.
        for (int i = 0; i < numThreads; i++) {
            executorService.submit(() -> {
                for (int j = 0; j < numRequestsPerThread; j++) {
                    sendHttpRequest(httpClient, "http://localhost:8080/actuator/prometheus");
                }
            });
        }

        executorService.shutdown();
    }

    private static void sendHttpRequest(HttpClient httpClient, String url) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .GET()
                .build();

        try {
            HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
            // Only non-200 responses are logged.
            if (response.statusCode() != 200) {
                System.out.println("Response Code: " + response.statusCode());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

What could be causing this problem, and how can it be fixed? Thank you!

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Aug 10, 2023
@mhalbritter
Contributor

mhalbritter commented Aug 10, 2023

Hello!

I can't reproduce this:

> curlie :8080/actuator/prometheus | grep -i http_server_requests_seconds
HTTP/1.1 200
Content-Type: text/plain;version=0.0.4;charset=utf-8
Content-Length: 12198
Date: Thu, 10 Aug 2023 12:01:02 GMT

# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 20006.0
http_server_requests_seconds_sum{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 18.324825225
# HELP http_server_requests_seconds_max
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 0.0143335

Maybe it has something to do with dropped requests? Do you have a sample application where this happens all the time?

@mhalbritter mhalbritter added the status: waiting-for-feedback We need additional information before we can continue label Aug 10, 2023
@dimon8829
Author

dimon8829 commented Aug 10, 2023

Thank you for the response! Yes, I am running a Spring Boot app:

Reproducer: untitled.zip

And then I get this result:

# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 996.0
http_server_requests_seconds_sum{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 143.624052605
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/actuator/prometheus",} 5.0
http_server_requests_seconds_sum{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/actuator/prometheus",} 0.855936009

Nobody dropped these requests.
Thank you!

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Aug 10, 2023
@snicoll

This comment was marked as resolved.

@snicoll snicoll added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Aug 10, 2023
@dimon8829

This comment was marked as resolved.

@mhalbritter
Contributor

Those are the "successful" requests:

http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 996.0

Those are the "unsuccessful" ones:

http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/actuator/prometheus",} 5.0

So something is different with these requests. Let's see if we can find this when we get the sample.

@mhalbritter mhalbritter removed the status: waiting-for-feedback We need additional information before we can continue label Aug 10, 2023
@mhalbritter
Contributor

Nope, sorry, I can't reproduce this:

curlie :8080/actuator/prometheus | grep -i http_server_requests_seconds
HTTP/1.1 200 OK
Content-Type: text/plain;version=0.0.4;charset=utf-8
Content-Length: 11984

# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 1000.0
http_server_requests_seconds_sum{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 22.91518926
# HELP http_server_requests_seconds_max
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 0.197442417

@dimon8829
Author

What happens if you increase the number of requests from 1000 to 10000?

 int numRequests = 1000;

@mhalbritter
Contributor

After running it multiple times with 10000 requests, I now got it:

# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 19995.0
http_server_requests_seconds_sum{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 570.818983592
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/actuator/prometheus",} 9.0
http_server_requests_seconds_sum{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/actuator/prometheus",} 0.353837208
# HELP http_server_requests_seconds_max
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 0.282042959
http_server_requests_seconds_max{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/actuator/prometheus",} 0.163144375

@wilkinsona
Member

wilkinsona commented Aug 10, 2023

There's some overlap between this issue and spring-projects/spring-boot#33300. However, the concerns about the benchmarking tool closing connections prematurely (from the server's perspective) don't apply here unless the HTTP client is doing something similar.
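
For anyone who wants to test that hypothesis, here is a minimal sketch (not from the original discussion) that deliberately closes connections without reading the response; whether the server actually registers each exchange as aborted depends on timing and on the server in use, so treat it only as an experiment:

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class PrematureCloseSketch {

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 100; i++) {
            // Open a raw connection, send a request, and close the socket without
            // reading the response, so the exchange is abandoned from the client side.
            try (Socket socket = new Socket("localhost", 8080)) {
                OutputStream out = socket.getOutputStream();
                out.write(("GET /actuator/prometheus HTTP/1.1\r\n"
                        + "Host: localhost:8080\r\n"
                        + "\r\n").getBytes(StandardCharsets.UTF_8));
                out.flush();
                // Intentionally no read here; try-with-resources closes the socket immediately.
            }
        }
    }
}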

@mhalbritter
Contributor

mhalbritter commented Aug 10, 2023

It looks like the culprit is org.springframework.http.server.reactive.observation.DefaultServerRequestObservationConvention#status, which is in Spring Framework, not in Boot:

if (context.isConnectionAborted()) {
  return HTTP_OUTCOME_UNKNOWN;
}

Not sure if this can be triggered in any other case.

If you set a breakpoint in org.springframework.web.filter.reactive.ServerHttpObservationFilter#filter(org.springframework.web.server.ServerWebExchange, org.springframework.http.server.reactive.observation.ServerRequestObservationContext, reactor.core.publisher.Mono<java.lang.Void>), you'll see that observationContext.setConnectionAborted(true) is called when the Mono is cancelled.
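
In other words, the tags fall back to UNKNOWN once the context has been flagged as aborted. A simplified sketch of that fallback (not the verbatim Spring Framework source; the tag names simply mirror the labels shown in the metrics above):

import io.micrometer.common.KeyValue;
import org.springframework.http.server.reactive.ServerHttpResponse;
import org.springframework.http.server.reactive.observation.ServerRequestObservationContext;

class ObservationFallbackSketch {

    // Roughly what the convention's fallback amounts to: once the filter has called
    // setConnectionAborted(true) for a cancelled exchange, no real response code is
    // available, so the status (and outcome) tags are reported as UNKNOWN.
    KeyValue status(ServerRequestObservationContext context) {
        if (context.isConnectionAborted()) {
            return KeyValue.of("status", "UNKNOWN");
        }
        ServerHttpResponse response = context.getResponse();
        return (response != null && response.getStatusCode() != null)
                ? KeyValue.of("status", String.valueOf(response.getStatusCode().value()))
                : KeyValue.of("status", "UNKNOWN");
    }
}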

@mhalbritter
Contributor

Maybe it's a duplicate of #29720

@snicoll snicoll transferred this issue from spring-projects/spring-boot Aug 10, 2023
@dimon8829
Author

@mhalbritter This problem did not exist in version 2.2.13.RELEASE, but it appeared in version 3.1.2.
Therefore, I'm not sure this issue is a duplicate of #29720.

@rohinisb

rohinisb commented Sep 26, 2023

We are seeing the same issue in production. When hitting the API through Postman, I can see a 200 response code, but the same request is captured as UNKNOWN in /actuator/prometheus. We are using Spring Boot 3.1.0.

@bclozel
Member

bclozel commented Oct 5, 2023

Closing because of #29720 (comment)

@bclozel bclozel closed this as not planned Won't fix, can't repro, duplicate, stale Oct 5, 2023
@bclozel bclozel added status: duplicate A duplicate of another issue in: web Issues in web modules (web, webmvc, webflux, websocket) and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Oct 5, 2023