
Async support for HttpSender #3717

Closed
lenin-jaganathan opened this issue Mar 25, 2023 · 10 comments

Comments

@lenin-jaganathan
Contributor

lenin-jaganathan commented Mar 25, 2023

Please describe the feature request.
HttpUrlConnectionSender currently sends data synchronously, i.e. the metrics are sent and the thread blocks waiting for the response. Since publishing happens on a dedicated thread, this might not seem problematic when combined with timeouts for long-pending requests. But when applications have thousands of metrics that need to be batched and sent, it can result in metrics being measured over different intervals rather than the same step.

Scenario:
Assume an application has 15,000 meters in the registry and the registry has the default batch size of 10,000. The scheduler calls the publishing method at 00:00:00. Reading meters is fast, so we would finish reading the first 10,000 meters in x milliseconds, but most registries then call HttpUrlConnectionSender.send() for this first batch. Since this is blocking I/O, it depends on network latency and other external factors, and reading the next 5,000 meters is delayed by the response time of the first HTTP request. As a result, the first and second batches of meters don't take their measurements over the same period of time.

This is especially true for registries that export each histogram bucket as a gauge: 1,000 timers with 25 buckets each can easily add up to 25,000 meters, and reading these in batches while sending each batch synchronously undermines the Step concept, since the data sent is no longer normalized to a fixed window.

Rationale
To make sure that we are reporting measurements for the same step period (or, more precisely, to minimize the gap between when the different batches of data are read).

Perhaps something like an AsyncHttpSender that returns a CompletableFuture instead of blocking the thread; all of these futures can then be awaited once they have all been submitted.
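
A minimal sketch of what such an abstraction could look like (AsyncHttpSender and the publish helper below are hypothetical, not existing Micrometer API; HttpSender.Request and HttpSender.Response are the existing types):

```java
import io.micrometer.core.ipc.http.HttpSender;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical abstraction: submit a request and get a future back instead of blocking.
interface AsyncHttpSender {
    CompletableFuture<HttpSender.Response> sendAsync(HttpSender.Request request);
}

class AsyncPublishSketch {
    // Submit every batch without waiting, then await all responses at the end,
    // so reading meters for later batches is not delayed by earlier requests.
    static void publish(AsyncHttpSender sender, List<HttpSender.Request> batches) {
        List<CompletableFuture<HttpSender.Response>> inFlight = new ArrayList<>();
        for (HttpSender.Request batch : batches) {
            inFlight.add(sender.sendAsync(batch));
        }
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
}
```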

@shakuzen
Member

I'm not sure we need an async abstraction to deal with the problem you mentioned. We just need to separate the steps of preparing the data to send and sending it, no?

@shakuzen shakuzen added waiting for feedback We need additional information before we can continue and removed waiting-for-triage labels Mar 27, 2023
@lenin-jaganathan
Contributor Author

Hmm. I believe batches exist for a reason, and one of them is to avoid overloading a single HTTP request with a huge payload. But if you are referring to constructing all the payloads at once and just splitting them into multiple requests, I think that will lead to duplicated implementations in most of the registries, whereas a client that can return a CompletableFuture immediately would help in such scenarios. You also don't want to queue the HTTP requests, especially with a high number of meters (some registries export histograms as separate meters, which can increase the publishing time considerably and hence leave a large gap between the values recorded for the step).

@lenin-jaganathan
Contributor Author

#2818 is not a completely related issue, but this comment of mine again points to registry implementations taking a long time to complete publishing.

@shakuzen
Member

The batching feature was implemented because some backends have a limit on the number of metrics that can be in a single request.

But if you are referring to just constructing the payload at once but just splitting them into multiple requests, I guess that will lead to duplicate implementations in most of the Registries whereas a client that can return a CompletableFuture immediately can help in such scenarios.

I'm not sure I see the point. Each registry already has to implement the logic of preparing the payload, which is unique to each registry implementation. That's already there. The only difference is collecting a list of payloads and then sending them outside of the batch loop.
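
A minimal sketch of that separation, for illustration only (buildPayload() is a hypothetical stand-in for each registry's existing serialization logic, and the send step assumes Micrometer's HttpSender request builder):

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.ipc.http.HttpSender;

import java.util.ArrayList;
import java.util.List;

class PrepareThenSendSketch {
    static void publish(HttpSender sender, String uri, List<List<Meter>> batches) throws Throwable {
        // Step 1: read all meters and build every payload up front, so the
        // measurement window is not skewed by network latency between batches.
        List<String> payloads = new ArrayList<>();
        for (List<Meter> batch : batches) {
            payloads.add(buildPayload(batch));
        }
        // Step 2: send the prepared payloads (still serially, on the publishing thread).
        for (String payload : payloads) {
            sender.post(uri).withPlainText(payload).send();
        }
    }

    // Hypothetical placeholder for registry-specific serialization of a batch of meters.
    private static String buildPayload(List<Meter> batch) {
        return batch.toString();
    }
}
```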

And you also don't want to queue the HTTP requests

I assume you mean that you don't want to send the requests serially. Therein lies the complexity of what's being proposed here compared to the current arrangement. We currently publish on a single thread in most cases (some SDKs for metrics backends may use more threads under the hood, but they don't use HttpSender). If you're talking about making requests concurrently on multiple threads, there is now a lot more complexity to manage. We previously had (but never used) a configuration method called numThreads on PushRegistryConfig, which we deprecated.

If you want to minimize the time between a payload being prepared (metrics read) and its metrics being stored in the backend, then the current arrangement should be as good as or better than a parallel async sender.
If you want to minimize the time between different payloads (batches) being prepared (metrics read), then my earlier suggestion should work.
If you want to minimize the time between different payloads being stored, that's largely a resource problem, and we're also trying to minimize the resources Micrometer uses to keep its overhead down.

How long are you seeing publishing taking with how many meters and which backend(s)? Is it the time taken to send the HTTP request that is costly?
If it is taking time to send these serially due to some resource constraint, it's not entirely clear to me that sending more requests in a shorter period of time won't cause more issues. It sounds like we need some metrics around publishing metrics. Of course, the problem with this is that their usability is limited when we need them the most (when publishing is failing for some reason).

Sending more metrics payloads at once (what I think you're proposing) will exacerbate the issue in #2818.

@lenin-jaganathan
Contributor Author

If you're wanting to minimize the time between different payloads (batches) being prepared (metrics read), then my suggestion earlier should work.

This is the problem I am trying to bring up: basically, the reading of meters in the registry should not be slowed down by a blocking network call, because that skews the measurement window.

How long are you seeing publishing taking with how many meters and which backend(s)? Is it the time taken to send the HTTP request that is costly?

We have tried sending around 5-10 batches per publish, with batches of around 10k meters each. This became a problem especially once we introduced a backend that exports histograms as individual metrics (unlike OTLP, where everything is part of a single metric), and we have seen publishing last more than 10 seconds (step = 60 seconds).

It sounds like we need some metrics around publishing metrics. Of course, the problem with this is that their usability is limited when we need them the most (publishing is failing for some reason).

Agreed.

@jonatan-ivanov
Member

This is the problem I am trying to bring up: basically, the reading of meters in the registry should not be slowed down by a blocking network call, because that skews the measurement window.

I think what Tommy is trying to say is that we can change this so that blocking network calls do not slow down the reading of meters. If you "read-and-send" in a loop, reading will be slowed down, but if you read everything first (which usually does not involve I/O) and then send everything in batches, that should solve the issue, right? At least that's how I interpret this:

We just need to separate the steps of preparing the data to send and sending it, no?

@lenin-jaganathan
Contributor Author

What I was looking for is more like a "fire and forget" HTTP call which will reduce the publish time.

@marcingrzejszczak
Contributor

marcingrzejszczak commented Dec 20, 2023

Have we managed to agree on something here?


github-actions bot commented Jan 2, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@lenin-jaganathan
Contributor Author

I think resource consumption is a valid reason not to use an async sender. For now, I will let this one go. Let's see if we need something in the future.

@jonatan-ivanov jonatan-ivanov removed waiting for feedback We need additional information before we can continue feedback-reminder labels Jan 8, 2024