Avoid global publishing alignment #2818
To be clear, the issue with global alignment on publishing time is that it causes a concentration of work handling metrics from all application instances at the same time, rather than naturally dispersing that work over the step interval based on when application instances start up. This results in spikes of network traffic moving metrics to the backend, and spikes in CPU and memory usage processing all the incoming metrics on the metrics backend. These spikes are followed by idle time where nothing is published for the rest of the step interval by any application instance.
I will use diagrams to illustrate the relation between step boundaries, registry start, and publications. Each timeline represents a different application instance. The legend is as follows:
When a publication happens, metrics based on the values recorded during the previous step interval are published. Step boundaries are relative to epoch time. Before the fix for #1218 the publication times would naturally be distributed throughout a step interval by application instances starting up at different times.
After the fix for #1218, we have the following situation.
Regardless of when applications start up, publication times are scheduled for the same time across all instances.
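To make the alignment concrete, here is a minimal sketch of epoch-aligned publish scheduling (assumed names, not Micrometer's actual scheduling code): because the initial delay is derived only from the wall clock and the step size, every instance lands on the same wall-clock boundary no matter when it started.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class EpochAlignedPublisher {
    // Sketch of the problematic behavior: every instance computes the same
    // delay to the next epoch-aligned step boundary, so with a 60s step all
    // instances publish at the same wall-clock second.
    void start(Runnable publish) {
        long stepMillis = 60_000L;
        long now = System.currentTimeMillis();
        long delayToNextBoundary = stepMillis - (now % stepMillis);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(publish, delayToNextBoundary, stepMillis, TimeUnit.MILLISECONDS);
    }
}
```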
One potential solution would be to record the time a
The solution that worked for us is this: /pull/3054 We added a configuration option that removes the registry from the global alignment by simply allowing components to publish on their own time. In our microservices mesh, that produced enough randomness to spread out the I/O spikes. PR is included.
We just figured out that this problem, together with #1679, hits our InfluxDB server. Hundreds/thousands of JVMs start a new TCP connection at the same wall-clock time, and this causes SYN buffer overflows on the RHEL server that hosts InfluxDB. Even worse, this probably causes a spike on our datacenter, because all the JVMs perform the reporting at the very same moment. The load spike (CPU/disk/network) on one JVM might not be huge, but multiply that by hundreds/thousands of JVMs doing it synchronously in a datacenter and it could be quite a spike (I don't have measurements, i.e. facts, for that, though). Edit: fix ticket reference
In case it helps others visualize and understand, I'll continue with the diagrams I've previously used to explain the behavior with the proposed fix. It gets a little more complicated because everything won't be aligned to the epoch-based step intervals, so we need to differentiate between those and the step intervals used for StepMeterRegistry Meters.
The step used by meters and for publishing will be based on the registry creation, by default. This allows the time when different application instances publish metrics to differ, while keeping publishing aligned with the step values as was fixed in #1218.
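A sketch of that idea, with hypothetical names: deriving step boundaries from the registry's creation time keeps publishing aligned to a consistent per-instance step while letting different instances publish at different wall-clock offsets.

```java
class RegistryStartAlignedPublisher {
    private final long stepMillis = 60_000L;
    // Captured when the registry is created; different instances start at
    // different times, so this offset differs per instance.
    private final long offsetWithinStep = System.currentTimeMillis() % stepMillis;

    // Delay until the next registry-relative step boundary, rather than the
    // next epoch-aligned one.
    long initialDelayMillis() {
        long now = System.currentTimeMillis();
        return stepMillis - ((now - offsetWithinStep) % stepMillis);
    }
}
```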
One concern with this change is that it makes reasoning about the aggregate more difficult. To create an aggregate value across instances, the data will have to be normalized to a consistent step, and that normalization can skew the data and create confusion. As a simple example, consider a service with a single node where the node gets replaced for deployments. Assume we have a 1m step and receive 1 request/second uniformly over the interval. The traffic is all being handled by
If we normalize the data to the epoch step using a simple weighted scheme, we get something like:
The normalized sum assuming a simple weighted scheme:
A user viewing the sum would see a drop in the rate followed by a spike that didn't really happen. When looking at aggregates over thousands or millions of time series, these sorts of effects can add up and be really difficult to debug. In Spectator, the meters are always aligned to the epoch step, but publishing is randomized over the subsequent step interval to spread out load. To avoid the issues in #1218 we give a buffer of around 10% of the step, so for a 1m step it would try to publish sometime between
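A minimal sketch of that kind of scheduling, assuming a 10% buffer and hypothetical names (the exact window from the comment above is not preserved here): meters stay aligned to the epoch step, and only the publish time within the following step is randomized.

```java
import java.util.concurrent.ThreadLocalRandom;

class RandomizedPublishOffset {
    // Publish the previous step's (epoch-aligned) values at a random point in
    // the current step, but never in the last 10% of it, so a slow export does
    // not spill into the next step (the #1218 problem).
    static long publishDelayMillis(long stepMillis) {
        long buffer = stepMillis / 10;
        return ThreadLocalRandom.current().nextLong(stepMillis - buffer);
    }
}
```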
@brharrington I agree that this will have some difficulty since the steps are completely randomized (but I believe this will be negligible). The first option I tried was to randomize just the read and not the actual meters. The issue I observed is that we have tens of thousands of meters in the registry and the time taken to read them varies significantly; we can't really guarantee that all the data will be read within a small interval, and that made us increase the buffer to 20%, then 30%, and so on as the number of meters significantly increased. That brought us back to the original problem of publishing happening at the same time (or in this case within a small interval of 10-20 seconds). And since Micrometer has different implementations, each reading data in its own way, we went with aligning the meters.
+1 to what @brharrington said, @lenin-jaganathan. If Micrometer is step based, rather than counter based, then randomly aligning step boundaries re-introduces the problem counter-based metrics systems have with aggregation skew. For step systems, the steps need to be aligned to wall-clock boundaries, but the actual publishing can be spread across the step interval. I'd revisit the registry architecture if tens of thousands of meters can't be read within 10s on a moderately busy JVM. And in practice, I've seen the step skew issue during aggregation many times and it's really not fun discussing it with customers.
Agreed, I can see where the problem comes from. "The time taken to read them varies significantly": for more clarity, let me try to use more accurate words. The time to complete one publish cycle varies significantly. This is primarily because it is not the reading that takes time; rather, the implementations read in batches, and for every batch some of the implementations post the data to the backend synchronously before reading the next batch. This is what makes publishing as a whole really slow. I see two approaches here:
I would really love to keep this thread going, as this is a critical issue that will have huge cost implications for Micrometer users.
Thank you everyone for the input, and sorry for the delay in getting back to this. I think we can go with the approach suggested by @brharrington of scheduling publishing within the epoch-aligned step interval with a random offset, as long as we leave some buffer at the end of the step interval to avoid #1218. Is there any reason to avoid scheduling at the beginning of a step interval, though?
@shakuzen I just went over it again, but if my understanding is right, what @brharrington proposed might not be straightforward to implement in Micrometer. Some of this is due to the fact that Micrometer is not a truly "step-based" design. Rolling over of the step happens only while reading the meters, not on every recording, which might prevent us from implementing what was proposed above. (This will lead us to the same issue that was in my PR.) Example: assume a counter with a 60s step and an increment operation every second, so at the end of the minute the polled count will be zero (nothing has rolled over yet). At the 61st second, Spectator will do rollCount(), set the previous value to 60, and record the new value as 1, whereas Micrometer will record it as 61. If poll is called at the 61st second, Spectator will return 60 and the current value will be 1, whereas Micrometer will return 61 and the current value will be 0. So, just rescheduling the publishing will not solve this problem in Micrometer.
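To illustrate the difference described above, here is a small self-contained sketch (hypothetical classes, not the actual Micrometer or Spectator implementations) contrasting the two rolling strategies for a 60s step counter incremented once per second:

```java
import java.util.function.LongSupplier;

class StepRollingDemo {

    static final class StepCounter {
        private final long stepMillis;
        private final LongSupplier clock;   // wall clock in milliseconds
        private final boolean rollOnRecord; // true = roll eagerly on every recording
        private long stepStart;
        private long current;
        private long previous;

        StepCounter(long stepMillis, LongSupplier clock, boolean rollOnRecord) {
            this.stepMillis = stepMillis;
            this.clock = clock;
            this.rollOnRecord = rollOnRecord;
            long now = clock.getAsLong();
            this.stepStart = now - now % stepMillis;
        }

        void increment() {
            if (rollOnRecord) {
                roll(); // check the step boundary before every recording
            }
            current++;
        }

        long poll() {
            roll(); // both variants roll (at the latest) when polled
            return previous;
        }

        private void roll() {
            long now = clock.getAsLong();
            if (now >= stepStart + stepMillis) {
                previous = current;
                current = 0;
                stepStart = now - now % stepMillis;
            }
        }
    }

    public static void main(String[] args) {
        long[] nowMillis = { 0L };
        LongSupplier clock = () -> nowMillis[0];
        StepCounter eager = new StepCounter(60_000L, clock, true);  // rolls on record
        StepCounter lazy = new StepCounter(60_000L, clock, false);  // rolls only on poll

        // One increment per second during the first minute (seconds 0..59) ...
        for (int second = 0; second < 60; second++) {
            nowMillis[0] = second * 1000L;
            eager.increment();
            lazy.increment();
        }
        // ... and one more at the 61st second, just past the step boundary.
        nowMillis[0] = 60_000L;
        eager.increment();
        lazy.increment();

        System.out.println("roll-on-record poll: " + eager.poll()); // 60, current is now 1
        System.out.println("roll-on-poll poll:   " + lazy.poll());  // 61, current is now 0
    }
}
```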
Thanks for taking a look at it again @lenin-jaganathan and pointing that out. We've been masking that behavior by publishing at the beginning of the step since #1218. Prior to #1218, we did publish at an arbitrary offset from the start of a step based on when the registry was started. It seems the rolling behavior difference between Micrometer's StepCounter and the Spectator AtlasCounter was an intentional decision, but I don't know what the reasoning behind it was.
The easiest explanation I can think of is that it avoids the additional rollCount() call on every recording. Assuming we record 1M times per minute cumulatively on a StepCounter, we are probably looking at 1M calls to rollCount(), and maybe that was the reason for the trade-off. @jkschneider probably knows more about that; Git history seems to point to him for that.
I too suspect the reason was to reduce overhead. I'll try to reach out to Jon to see if he remembers or has any thoughts on this. I had a chat internally and @marcingrzejszczak had an idea, which was to schedule preparing (reading) the metrics at the start of the interval (as now) but to add a random delay between that and sending the payloads to the metrics backend. That would minimize the deviation between the epoch-based step and the period to which published values correspond. That's similar to @lenin-jaganathan's suggested approach in #2818 (comment), but I think the delay is still important so we avoid too much concentration of sending metrics to the backend at the same time from different instances (the problem described in this issue). A challenge of this approach is that it requires changes in each registry implementation, unlike changing when publishes are scheduled, which we can change in one place.
I thought for a moment about preparing the payloads and sending them later, but that might be a memory-inefficient solution depending on the registry, because it almost guarantees that we'll hold on to a huge payload (which will eventually be garbage collected) a lot longer than expected. At least, I have seen some registries where there is no provision for adding common metadata (see OTLP resource attributes, which are one way of adding common tags as a static thing sent only once per request) and we just create and add the same tags over and over. (This I am thinking of raising as a new issue.)
I did have one more implementation where polling and rolling over happen independently. Remember, rolling over on step meters (which does not include Gauge) is a trivial operation. But during publishing, different registries poll data differently; most of them construct the payload during this phase, and some export it with blocking calls too. So, for step registries, I am thinking of rolling over at the step boundary and exporting randomly in the middle. But that too will depend on the implementations sending the data within the step (see #3717), which anyway would mean more things stored for a longer time. For registries with smaller steps (say under 5/10s), this problem may not be huge since the data is exported almost consistently. This issue is real for use cases where the step is 30s or more (most systems use 1-minute steps, though it is my assumption that 1 minute is the more common window).
PS: Getting too obsessed with Micrometer these days, it seems 🙂. But just a thought on this.
We appreciate all the contributions you've been making. Micrometer is better for having you involved. Thank you! So the approach I'm currently thinking of is to not make things worse than now while solving the issue at hand here. A generic way we can hopefully resolve this is to schedule a background thread to poll meters at the start of each step, and move publishing from the beginning of a step to somewhere random within the step (but not too close to the end). That should keep the values reported as close to epoch-aligned steps as now (it should even be a little better where batching would have caused some meters to roll over later). We can separately consider a configuration option for strict step alignment like e.g. AtlasCounter, but I would like to decouple that from this issue.
This avoids different instances calling publish at the same time. It will avoid scheduling publishing in the last 20% of a step interval so that publishing does not spill over into a subsequent step. An additional background thread is added for StepMeterRegistry that will poll all meters at the start of the step, separate from when publishing happens. This is so the values published are the ones recorded in the previous step. We call rollCount for step meters as early as possible in the step to minimize the number of recordings that will be reported for the wrong step. This is not perfect, but it is no worse than the status quo. Resolves gh-2818
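A rough sketch of the scheduling described in that change, with hypothetical names (not the actual Micrometer code): one task rolls the step meters right at each epoch-aligned step boundary, and publishing runs at a per-instance random offset within the first 80% of the step.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

class StepPublishScheduler {
    private final long stepMillis = 60_000L;
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    void start(Runnable rollOverStepMeters, Runnable publish) {
        long now = System.currentTimeMillis();
        long delayToNextBoundary = stepMillis - (now % stepMillis);

        // Roll step meters as early as possible in each step so the values
        // published correspond to the previous (epoch-aligned) step.
        scheduler.scheduleAtFixedRate(rollOverStepMeters, delayToNextBoundary, stepMillis,
                TimeUnit.MILLISECONDS);

        // Publish at a random offset within the first 80% of the step so that
        // instances do not all publish at the same moment, and a slow export
        // does not spill into the following step.
        long randomOffset = ThreadLocalRandom.current().nextLong((long) (stepMillis * 0.8));
        scheduler.scheduleAtFixedRate(publish, delayToNextBoundary + randomOffset, stepMillis,
                TimeUnit.MILLISECONDS);
    }
}
```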
Yes, that appears to be the case now.
The distributed coordination on publication time is a side effect we probably overlooked in merging #1218. Thank you for the investigation on that. While we want consistency within a process (or registry), we don't want coordination across processes on publishing, as you suggested.
That said, the publication happening at a different millisecond across instances should not be relied upon either.
Originally posted by @shakuzen in #2787 (comment)