
Avoid global publishing alignment #2818

Closed
shakuzen opened this issue Oct 13, 2021 · 17 comments · Fixed by #3750

@shakuzen
Member

Could you confirm my understanding that all replicas will now push their points at the exact same time?

Yes, that appears to be the case now.
The distributed coordination on publication time is a side effect we probably overlooked in merging #1218. Thank you for the investigation on that. While we want consistency within a process (or registry), we don't want coordination across processes on publishing, as you suggested.
That said, the publication happening at a different millisecond across instances should not be relied upon either.

Originally posted by @shakuzen in #2787 (comment)

@shakuzen shakuzen added bug A general bug module: micrometer-core An issue that is related to our core module labels Oct 13, 2021
@shakuzen shakuzen added this to the 1.7.x milestone Oct 13, 2021
@shakuzen
Member Author

shakuzen commented Oct 14, 2021

To be clear, the issue with global alignment on publishing time is that it causes a concentration of work handling metrics from all application instances at the same time, rather than naturally dispersing that work over the step interval based on when application instances start up. This results in spikes of network traffic moving metrics to the backend, and spikes in CPU and memory usage processing all the incoming metrics on the metrics backend. These spikes are followed by idle time for the rest of the step interval, during which nothing is published by any application instance.

@shakuzen shakuzen changed the title Avoid global publication alignment Avoid global publishing alignment Oct 14, 2021
@shakuzen
Member Author

shakuzen commented Nov 10, 2021

I will use diagrams to illustrate the relation between step boundaries, registry start, and publications. Each timeline represents a different application instance. The legend is as follows:

  • - marks the passage of time
  • | marks step interval boundaries
  • p marks publications
  • s marks when start is called on the MeterRegistry
  • S1/S2 indicate ordinal step intervals since the start

When a publication happens, metrics based on the values recorded during the previous step interval are published. Step boundaries are relative to epoch time.

Before the fix for #1218, publication times were naturally distributed throughout a step interval because application instances start up at different times.

     S1         S2         S3         S4         S5
|------s---|------p---|------p---|------p---|------p---|
|---------s|---------p|---------p|---------p|---------p|
|---s------|---p------|---p------|---p------|---p------|

After the fix for #1218, we have the following situation.

     S1         S2         S3         S4         S5
|------s---|p---------|p---------|p---------|p---------|
|---------s|p---------|p---------|p---------|p---------|
|---s------|p---------|p---------|p---------|p---------|

Regardless of when applications start up, publication is scheduled for the same time across all instances.

@shakuzen
Member Author

One potential solution would be to record the time a PushMeterRegistry is constructed and pass that to StepMeters so they can align their step to the creation time of the registry, and likewise align the publish time to that. That should avoid the original issue described in #1218 and this issue. If someone would like to try implementing that, or has feedback on it, let us know.
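
To make that concrete, here is a minimal sketch of the idea (hypothetical names, not Micrometer's actual API): a step clock whose boundaries are offset by the registry's creation time instead of the epoch, which the step meters and the publish scheduler would both use.

    // Hypothetical sketch, not Micrometer's actual API: a step clock whose boundaries are
    // offset by the registry's creation time instead of the epoch. Step meters and the
    // publish scheduler would both derive their timing from it.
    class CreationAlignedStep {
        private final long stepMillis;
        private final long originMillis; // captured when the PushMeterRegistry is constructed

        CreationAlignedStep(long stepMillis, long originMillis) {
            this.stepMillis = stepMillis;
            this.originMillis = originMillis;
        }

        // Start of the step interval containing the given wall-clock time.
        long stepStart(long nowMillis) {
            long sinceOrigin = nowMillis - originMillis;
            return originMillis + (sinceOrigin / stepMillis) * stepMillis;
        }

        // Delay until the next step boundary, which is also the next publish time.
        long millisUntilNextStep(long nowMillis) {
            return stepStart(nowMillis) + stepMillis - nowMillis;
        }
    }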

@igr

igr commented Mar 22, 2022

The solution that worked for us is this: /pull/3054. We added a configuration option and opted out of the alignment by simply allowing components to start on their own schedule. In our microservices mesh, that produced enough randomness to spread out the I/O spikes. A PR is included.

@shakuzen shakuzen modified the milestones: 1.7.x, 1.8.x May 11, 2022
@karniemi

karniemi commented Sep 29, 2022

We just figured out that this problem, together with #1679, hits our InfluxDB server. Hundreds/thousands of JVMs open a new TCP connection at the same wall-clock time, and this causes SYN buffer overflows on the RHEL server that hosts InfluxDB. Even worse, this probably causes a spike across our datacenter, because all the JVMs perform the reporting at the very same moment. The load spike (CPU/disk/network) on one JVM might not be huge, but multiplying that by hundreds/thousands of JVMs doing it synchronously across a datacenter could be quite a spike (I don't have measurements, i.e. facts, for that, though).

Edit: fix ticket reference

@shakuzen
Member Author

In case it helps others visualize and understand, I'll continue with the diagrams I've previously used to explain the behavior with the proposed fix. It gets a little more complicated because not everything will be aligned to the epoch-based step intervals, so we need to differentiate between those and the step intervals used for StepMeterRegistry meters.

  • \ marks epoch-based step interval boundaries
  • | continues to mark the step boundary for meters
  • t is the passage of time since the epoch, in seconds (the diagram below uses a 60-second step interval)
                S1          S2          S3          S4          S5
    \------|s---\------|p---\------|p---\------|p---\------|p---\
    \---------|s\---------|p\---------|p\---------|p\---------|p\
    \---|s------\---|p------\---|p------\---|p------\---|p------\
t = 0           60         120         180         240         300

The step used by meters and for publishing will be based on the registry creation time by default. This allows different application instances to publish metrics at different times, while keeping publishing aligned with the step values, as was fixed in #1218.

@brharrington
Contributor

One concern with this change is that it makes reasoning about the aggregate more difficult. To create an aggregate value across instances the data will have to get normalized to a consistent step and that normalization can skew the data and create confusion. As a simple example, consider a service with a single node where the node gets replaced for deployments. Assume we have a 1m step and receive 1 request/second uniformly over the interval. The traffic is all being handled by node1 with an offset of 10s from the epoch step. For a deployment the traffic will transition to node2 and it gets an offset of 50s from the epoch step. Suppose the traffic transitions at an offset of 55s and it happens with no issues. The user looking at an aggregate sum of the RPS for the service would expect a steady line showing 1 request per second. With the scheme here the data published would be:

node1
- from 00:10 to 01:10 -- 1.00 RPS
- from 01:10 to 02:10 -- 0.75 RPS (traffic transitions at 01:55, so 45 / 60)
- from 02:10 to 03:10 -- 0.00 RPS

node2
- from 00:50 to 01:50 -- 0.00 RPS
- from 01:50 to 02:50 -- 0.92 RPS (traffic transitions at 01:55, so 55 / 60)
- from 02:50 to 03:50 -- 1.00 RPS

If we normalize the data to the epoch step using a simple weighted scheme, we get something like:

  1.00 RPS * 10/60 (node1 had 1 RPS for the first 10s of the epoch step)
+ 0.75 RPS * 50/60 (node1 had 0.75 RPS for the final 50s of the epoch step)
------------------
  0.79 RPS         (normalized RPS for the epoch step)

The normalized sum assuming a simple weighted scheme:

                     node1        node2       sum
00:00 to 01:00       1.00         0.00        1.00
01:00 to 02:00       0.79         0.17        0.96
02:00 to 03:00       0.13         0.93        1.06
03:00 to 04:00       0.00         1.00        1.00

A user viewing the sum would see a drop in the rate followed by a spike that didn't really happen. When looking at aggregates over thousands or millions of time series, these sorts of effects can add up and be really difficult to debug.

In Spectator, the meters are always aligned to the epoch step, but publishing is randomized over the subsequent step interval to spread out load. To avoid the issues in #1218 we give a buffer of around 10% of the step, so for a 1m step it would try to publish sometime between :06 and :54. This has worked fairly well for us.
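
For illustration, a minimal sketch of that kind of scheduling (assumed names, not Spectator's or Micrometer's actual code): meters stay aligned to the epoch step, and each instance picks a random publish offset inside the step, keeping roughly 10% of the step as a buffer at each end.

    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative sketch of the scheduling described above (assumed names): meters stay
    // aligned to the epoch step, and each instance publishes at a random offset inside
    // the step, keeping ~10% of the step as a buffer at each end.
    class RandomizedPublishOffset {
        static long publishDelayMillis(long stepMillis) {
            long buffer = stepMillis / 10;          // guard band at the start and end of the step
            long window = stepMillis - 2 * buffer;  // usable window, e.g. :06 - :54 for a 1m step
            return buffer + ThreadLocalRandom.current().nextLong(window);
        }
    }

For a 60 s step this picks a delay between 6 000 ms and 54 000 ms after the step boundary, so the reported data stays epoch-aligned while the actual network calls are spread across instances.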

@lenin-jaganathan
Contributor

lenin-jaganathan commented Nov 18, 2022

@brharrington I agree that this will cause some difficulty since the steps are completely randomized (though I believe the effect will be very negligible). The first option I tried was to randomize only the read, not the actual meters. The issue I observed is that we have tens of thousands of meters in the registry and the time taken to read them varies significantly, so we can't really guarantee that all the data will be read within a small interval. That made us increase the buffer to 20%, then to 30%, and so on as the number of meters grew, which brought us back to the original problem of publishing happening at the same time (or, in this case, within a small interval of 10-20 seconds). And since Micrometer has different implementations, each of which reads the data in its own way, we went with aligning the meters.

@manolama

manolama commented Nov 29, 2022

+1 to what @brharrington said, @lenin-jaganathan. If Micrometer is step based, rather than counter based, then randomly aligning step boundaries re-introduces the problem counter-based metrics systems have with aggregation skew. For step systems, the steps need to be aligned to wall-clock boundaries, but the actual publishing can be spread across the step interval. I'd revisit the registry architecture if tens of thousands of meters can't be read within 10s on a moderately busy JVM.

And in practice, I've seen the step skew issue during aggregation many times and it's really not fun discussing it with customers.

@lenin-jaganathan
Contributor

Agreed. I could see where the problem comes from.

" the time taken to read them varies significantly" - For more clarity, let me try to use more accurate words. The time to complete one publish cycle varies significantly.

This is primarily because it is not the reading itself that takes time; rather, the implementations read in batches, and for every batch some of the implementations post the data to the backend synchronously before reading the next batch. This is what really makes publishing slow as a whole.

I see two approaches here:

  • Go with what @brharrington suggested above: "In Spectator, the meters are always aligned to the epoch step, but publishing is randomized over the subsequent step interval to spread out the load. To avoid the issues in #1218 (PushMeterRegistry doesn't align publish() with step boundaries) we give a buffer of around 10% of the step, so for a 1m step it would try to publish sometime between :06 and :54. This has worked fairly well for us."
  • Another thing I have tried is to read at the top of the step so there is a large window for reading metrics (rather than relying on the ~10% window above), offer the data that was read to a queue, and then truly de-synchronize publishing the data to the backend (a rough sketch follows at the end of this comment).

I would really love to keep this thread going, as this is a critical issue that can have huge cost implications for Micrometer users.
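
A rough sketch of the second option, assuming snapshots can be taken cheaply at the step boundary and a step of tens of seconds or more (MetricSnapshot, snapshot(), and send() are stand-ins, not Micrometer APIs):

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    // Sketch of the queue-based option: snapshot the meters right at the step boundary,
    // park the snapshot in a queue, and send it to the backend at a random offset later
    // in the step. MetricSnapshot, snapshot() and send() are stand-ins, not Micrometer APIs.
    class QueuedPublisher {
        private final BlockingQueue<List<MetricSnapshot>> queue = new LinkedBlockingQueue<>();
        private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

        void start(long stepMillis) {
            long untilNextStep = stepMillis - (System.currentTimeMillis() % stepMillis);
            // Read all meters at the top of every step so the values stay close to step-aligned.
            scheduler.scheduleAtFixedRate(() -> queue.offer(snapshot()),
                    untilNextStep, stepMillis, TimeUnit.MILLISECONDS);
            // Drain and send with a per-instance random delay, leaving a buffer at the end of the step.
            long sendDelay = untilNextStep + ThreadLocalRandom.current().nextLong(1_000, stepMillis * 9 / 10);
            scheduler.scheduleAtFixedRate(() -> {
                List<MetricSnapshot> batch = queue.poll();
                if (batch != null) {
                    send(batch); // ship to the metrics backend, possibly in smaller batches
                }
            }, sendDelay, stepMillis, TimeUnit.MILLISECONDS);
        }

        List<MetricSnapshot> snapshot() { /* read all meters */ return List.of(); }

        void send(List<MetricSnapshot> batch) { /* POST the batch to the backend */ }
    }

    record MetricSnapshot(String name, double value) { }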

@shakuzen
Member Author

shakuzen commented Apr 5, 2023

Thank you everyone for the input, and sorry for the delay in getting back to this. I think we can go with the approach suggested by @brharrington of scheduling publishing within the epoch-aligned step interval at a random offset, as long as we leave some buffer at the end of the step interval to avoid #1218. Is there any reason to avoid scheduling at the beginning of a step interval, though?

@shakuzen shakuzen modified the milestones: 1.11 backlog, 1.11.0-RC1 Apr 5, 2023
@lenin-jaganathan
Contributor

@shakuzen I just went over it again, but if my understanding is right, what @brharrington suggested might not be straightforward to implement in Micrometer. Some of this is due to the fact that Micrometer does not have a truly "step-based" design: rolling over the step happens only while reading the meters, not on every activity, which might prevent us from implementing what was proposed above. (This would lead us to the same issue that was in my PR.)

Example: Assume a counter with a 60s step that is incremented once every second, so at the end of the minute the count will be 60. At the 61st second, Spectator will do rollCount(), set the previous value to 60, and record the new value as 1, whereas Micrometer will record it as 61. If poll is called at the 61st second, Spectator will return 60 and the current value will be 1, whereas Micrometer will return 61 and the current value will be 0.

So, just rescheduling the publishing will not solve this problem in Micrometer.
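
To make the behavioral difference concrete, here is a simplified sketch of the two rolling strategies (stand-in classes, not the actual StepCounter or AtlasCounter code), assuming a 60 s step and one increment per second:

    // Simplified sketch of the two rolling strategies; stand-in classes only.
    class RollOnPoll {                        // Micrometer-style: roll only when the value is read
        double current;

        void increment() {
            current++;                        // writes never check the step boundary
        }

        double poll() {                       // at second 61 this returns 61 and resets to 0
            double completed = current;
            current = 0;
            return completed;
        }
    }

    class RollOnRecord {                      // Spectator-style: every write checks the boundary first
        double current;
        double previous;
        long stepStart;

        void increment(long nowMillis, long stepMillis) {
            if (nowMillis - stepStart >= stepMillis) { // crossed a boundary (skipped steps ignored here)
                previous = current;                    // the 60 recorded in the completed step
                current = 0;
                stepStart += stepMillis;
            }
            current++;                                 // the 61st increment lands in the new step
        }

        double poll() {
            return previous;                           // at second 61 this returns 60, current is 1
        }
    }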

@shakuzen
Member Author

shakuzen commented Apr 6, 2023

Thanks for taking a look at it again @lenin-jaganathan and pointing that out. We've been masking that behavior by publishing at the beginning of the step since #1218. Prior to #1218, we published at an arbitrary offset from the start of a step based on when the registry was started. It seems the rolling behavior difference between Micrometer's StepCounter and the Spectator AtlasCounter was an intentional decision, but I don't know what the reasoning behind it was.

@lenin-jaganathan
Contributor

The easiest reason I can think of is that it avoids an additional rollCount() call for every recording. Assuming we record 1M times per minute cumulatively on a StepCounter, we are probably looking at 1M calls to rollCount(), and maybe that was the reason for the trade-off. Probably @jkschneider knows more about that; Git history seems to point to him for that comment.

@shakuzen
Member Author

shakuzen commented Apr 6, 2023

I too suspect the reason was to reduce overhead. I'll try to reach out to Jon to see if he remembers or has any thoughts on this.

I had a chat internally and @marcingrzejszczak had an idea: schedule preparing (reading) the metrics at the start of the interval (as now) but add a random delay between that and sending the payloads to the metrics backend. That would minimize the deviation between the epoch-based step and the period to which the published values correspond. It's similar to @lenin-jaganathan's suggested approach in #2818 (comment), but I think the delay is still important so we avoid too much concentration of sending metrics to the backend at the same time from different instances (the problem described in this issue). A challenge of this approach is that it requires changes in each registry implementation, unlike changing when publishing is scheduled, which we can do in one place.

@lenin-jaganathan
Contributor

lenin-jaganathan commented Apr 6, 2023

I thought for a moment about preparing payloads and sending them later, but that might be a memory-inefficient solution depending on the registry, because it almost guarantees that we'll hold on to a huge payload (which will eventually be garbage collected) a lot longer than expected. At least, I have seen some registries where there is no provision for adding common metadata (see OTLP resource attributes, which are one way of adding common tags as a static thing sent only once per request), so we just create and add the same tags over and over. (I am thinking of raising this as a new issue.)

I did have one more implementation where polling and rolling over happen independently. Remember, rolling over StepMeters (which does not include Gauge) is a trivial operation, but during publishing different registries poll data differently: most of them construct the payload during this phase and some export it with blocking calls too. So, for step registries, I am thinking about rolling over at the step boundary and exporting at a random point in the middle of the step. But that too depends on the implementations sending the data within the step (see #3717), which anyway would mean more things have to be stored for a longer time.

For registries with smaller steps (say under 5-10 s), this problem may not be huge since the data is exported almost continuously. The issue is real for use cases where the step is 30 s or more (most systems use 1-minute steps, though that is my assumption that 1 minute is the more common window).

PS: Getting too obsessed with the micrometer these days it seems🙂. But just a thought on this.

@shakuzen
Member Author

PS: Getting too obsessed with the micrometer these days it seems🙂.

We appreciate all the contributions you've been making. Micrometer is better for having you involved. Thank you!

So the approach I'm currently thinking of is to not make things worse than they are now while solving the issue at hand. A generic way we can hopefully resolve this is to schedule a background thread to poll meters at the start of each step, and move publishing from the beginning of a step to somewhere random within the step (but not too close to the end). That should keep the reported values as close to epoch-aligned steps as now (it should even be a little better in cases where batching would have caused some meters to roll over later). We can separately consider a configuration option for strict step alignment like e.g. AtlasCounter, but I would like to decouple that from this issue.
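
A minimal sketch of that scheduling idea (illustrative only, not the merged implementation): one task rolls over (polls) the step meters right at each epoch-aligned step boundary, and a separate task publishes at a per-instance random offset within the first 80% of the step.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch, not the merged implementation.
    class StepScheduling {
        private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

        void start(long stepMillis, Runnable pollMeters, Runnable publish) {
            long untilNextStep = stepMillis - (System.currentTimeMillis() % stepMillis);
            // Roll over all step meters as early as possible in each step.
            scheduler.scheduleAtFixedRate(pollMeters, untilNextStep, stepMillis, TimeUnit.MILLISECONDS);
            // Publish somewhere in the first 80% of the step so sending never spills into the next step.
            long randomOffset = ThreadLocalRandom.current().nextLong(stepMillis * 8 / 10);
            scheduler.scheduleAtFixedRate(publish, untilNextStep + randomOffset, stepMillis, TimeUnit.MILLISECONDS);
        }
    }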

shakuzen added a commit that referenced this issue Apr 11, 2023
This avoids different instances calling publish at the same time. Publishing is not scheduled in the last 20% of a step interval, to avoid publishing spilling over into the subsequent step.
An additional background thread is added for StepMeterRegistry that will poll all meters at the start of the step, separate from when publishing happens. This is so the values published are the ones recorded in the previous step. We call rollCount for step meters as early as possible in the step to minimize the number of recordings that will be reported for the wrong step. This is not perfect, but it is no worse than the status quo.

Resolves gh-2818