
Async support for HttpSender #3717

Closed
lenin-jaganathan opened this issue Mar 25, 2023 · 10 comments

Comments

@lenin-jaganathan
Contributor

lenin-jaganathan commented Mar 25, 2023

Please describe the feature request.
HttpUrlConnectionSender currently sends data synchronously, i.e. the metrics are sent and the thread blocks waiting for the response. Since publishing happens on a dedicated thread, this might not seem problematic when combined with timeouts for long-pending requests. But when applications have thousands of metrics that need to be batched and sent, it can result in metrics being measured over different intervals rather than the same step.

Scenario:
Assume an application has 15,000 meters in the registry and the registry has the default batch size of 10,000. The scheduler calls the publishing method at 00:00:00. Reading meters is fast, so we would finish reading the first 10,000 meters in x milliseconds, but most registries then call HttpUrlConnectionSender.send() for this first batch. Since this is blocking I/O, it depends on network latency and other external factors, and reading the next 5,000 meters is delayed by the response time of the first HTTP request. As a result, the first and second batches of meters don't take their measurements over the same period of time.

This is especially true for registries that export each histogram bucket as a gauge: 1,000 timers with 25 buckets each can easily add up to 25,000 meters, and reading these in batches while sending each batch synchronously undermines the Step concept, since the data sent is no longer normalized to a fixed window.

Rationale
To make sure that we are reporting measurements for the same step period (or, more precisely, to minimize the gap between when the different batches of data are read).

Perhaps something like an AsyncHttpSender that returns a CompletableFuture instead of blocking the thread; all of these futures can then be awaited once they have all been submitted.
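
A minimal sketch of what such an abstraction could look like (AsyncHttpSender and the publish helper below are hypothetical, not existing Micrometer API; HttpSender.Request and HttpSender.Response are the existing types):

```java
import io.micrometer.core.ipc.http.HttpSender;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical abstraction: submit a request and get a future back instead of blocking.
interface AsyncHttpSender {
    CompletableFuture<HttpSender.Response> sendAsync(HttpSender.Request request);
}

class AsyncPublishSketch {
    // Submit every batch without waiting, then await all responses at the end,
    // so reading meters for later batches is not delayed by earlier requests.
    static void publish(AsyncHttpSender sender, List<HttpSender.Request> batches) {
        List<CompletableFuture<HttpSender.Response>> inFlight = new ArrayList<>();
        for (HttpSender.Request batch : batches) {
            inFlight.add(sender.sendAsync(batch));
        }
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
}
```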

@shakuzen
Member

I'm not sure we need an async abstraction to deal with the problem you mentioned. We just need to separate the steps of preparing the data to send and sending it, no?

@shakuzen shakuzen added waiting for feedback We need additional information before we can continue and removed waiting-for-triage labels Mar 27, 2023
@lenin-jaganathan
Contributor Author

Hmm. I believe batches exist for a reason, and one of them is to avoid overloading a single HTTP request with a huge payload. But if you are referring to constructing all the payloads at once and just splitting them into multiple requests, I think that will lead to duplicated implementations in most of the registries, whereas a client that can return a CompletableFuture immediately would help in such scenarios. You also don't want to queue the HTTP requests, especially with a high number of meters (some registries export histograms as separate meters, which can increase the publishing time considerably and hence leave a large gap between the values recorded for the step).

@lenin-jaganathan
Contributor Author

#2818 is not a completely related issue, but this comment of mine again points to registry implementations taking a long time to complete publishing.

@shakuzen
Member

The batching feature was implemented because some backends have a limit on the number of metrics that can be in a single request.

But if you are referring to just constructing the payload at once but just splitting them into multiple requests, I guess that will lead to duplicate implementations in most of the Registries whereas a client that can return a CompletableFuture immediately can help in such scenarios.

I'm not sure I see the point. Each registry already has to implement the logic of preparing the payload, which is unique to each registry implementation. That's already there. The only difference is collecting a list of payloads and then sending them outside of the batch loop.
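
A minimal sketch of that separation, for illustration only (buildPayload() is a hypothetical stand-in for each registry's existing serialization logic, and the send step assumes Micrometer's HttpSender request builder):

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.ipc.http.HttpSender;

import java.util.ArrayList;
import java.util.List;

class PrepareThenSendSketch {
    static void publish(HttpSender sender, String uri, List<List<Meter>> batches) throws Throwable {
        // Step 1: read all meters and build every payload up front, so the
        // measurement window is not skewed by network latency between batches.
        List<String> payloads = new ArrayList<>();
        for (List<Meter> batch : batches) {
            payloads.add(buildPayload(batch));
        }
        // Step 2: send the prepared payloads (still serially, on the publishing thread).
        for (String payload : payloads) {
            sender.post(uri).withPlainText(payload).send();
        }
    }

    // Hypothetical placeholder for registry-specific serialization of a batch of meters.
    private static String buildPayload(List<Meter> batch) {
        return batch.toString();
    }
}
```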

And you also don't want to queue the HTTP requests

I assume you mean that you don't want to send the requests serially. Therein lies the complexity of what's being proposed here compared to the current arrangement. We currently publish on a single thread in most cases (some SDKs for metrics backends may use more threads under the hood, but they don't use HttpSender). If you're talking about making requests concurrently on multiple threads, there is now a lot more complexity to manage. We previously had (but never used) a configuration method called numThreads on PushRegistryConfig, which we deprecated.

If you want to minimize the time between a payload being prepared (metrics read) and its metrics being stored in the backend, then the current arrangement should be as good as or better than a parallel async sender.
If you want to minimize the time between different payloads (batches) being prepared (metrics read), then my earlier suggestion should work.
If you want to minimize the time between different payloads being stored, that's largely a resource problem, and we're also trying to minimize the resources Micrometer uses to keep its overhead down.

How long are you seeing publishing taking with how many meters and which backend(s)? Is it the time taken to send the HTTP request that is costly?
If it is taking time to send these serially due to some resource constraint, it's not entirely clear to me that sending more requests in a shorter period of time won't cause more issues. It sounds like we need some metrics around publishing metrics. Of course, the problem with this is that their usability is limited when we need them the most (when publishing is failing for some reason).

Sending more metrics payloads at once (what I think you're proposing) will exacerbate the issue in #2818.

@lenin-jaganathan
Contributor Author

If you're wanting to minimize the time between different payloads (batches) being prepared (metrics read), then my suggestion earlier should work.

This is the problem I am trying to bring up: basically, the reading of meters in the registry should not be slowed down by a blocking network call, because that skews the measurement window.

How long are you seeing publishing taking with how many meters and which backend(s)? Is it the time taken to send the HTTP request that is costly?

We have tried sending around 5-10 batches per publish, with batches of around 10k meters each. This became a problem especially once we introduced a backend that exports histograms as individual metrics (unlike OTLP, where everything is part of a single metric), and we have seen publishing last more than 10 seconds (step = 60 seconds).

It sounds like we need some metrics around publishing metrics. Of course, the problem with this is that their usability is limited when we need them the most (publishing is failing for some reason).

Agreed.

@jonatan-ivanov
Member

This is the problem I am trying to bring up: basically, the reading of meters in the registry should not be slowed down by a blocking network call, because that skews the measurement window.

I think what Tommy is trying to say is that we can change this so that blocking network calls do not slow down the reading of meters. If you "read-and-send" in a loop, reading will be slowed down, but if you read everything first (which usually does not involve I/O) and then send everything in batches, that should solve the issue, right? At least that's how I interpret this:

We just need to separate the steps of preparing the data to send and sending it, no?

@lenin-jaganathan
Contributor Author

What I was looking for is more like a "fire and forget" HTTP call which will reduce the publish time.

@marcingrzejszczak
Contributor

marcingrzejszczak commented Dec 20, 2023

Have we managed to agree on something here?


github-actions bot commented Jan 2, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@lenin-jaganathan
Contributor Author

I think resource consumption is a valid reason not to use an async sender. For now, I will let this one go. Let's see if we need something in the future.

@jonatan-ivanov jonatan-ivanov removed waiting for feedback We need additional information before we can continue feedback-reminder labels Jan 8, 2024