system.mem.used doesn't agree with procps free(1) #3106
This calculation is now done in the system check: dd-agent/checks/system/unix.py, lines 375 to 382 (at commit be37ae6).
If you are using kernel 3.14+ you shouldn't be affected, as the check uses the kernel's own estimation of available memory (MemAvailable in /proc/meminfo). The issue arises when you run on kernels without MemAvailable, where the agent does not take at least reclaimable slabs (SReclaimable) into consideration. I reported this issue to support back in November.
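To make the distinction concrete, here is a minimal sketch of the behavior described above (not dd-agent's actual code; the function names are illustrative): prefer the kernel's own MemAvailable when present, and otherwise fall back to an estimate that includes reclaimable slab.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            info[key.strip()] = int(rest.split()[0])  # values are in kB
    return info

def available_kb(info):
    if "MemAvailable" in info:  # kernel 3.14+ reports this directly
        return info["MemAvailable"]
    # Rough fallback for older kernels: free pages plus page cache
    # plus reclaimable slab (SReclaimable).
    return (info.get("MemFree", 0)
            + info.get("Buffers", 0)
            + info.get("Cached", 0)
            + info.get("SReclaimable", 0))
```

On a 3.14+ kernel the two paths can differ, since the kernel's estimate also accounts for memory that cannot actually be reclaimed.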
@pjoakimsson Are we talking about the same thing? I'm referring to the
Hi @mfischer-zd. It's because at the time of that commit, procps was doing kb_main_used = kb_main_total - kb_main_free;. That's why we introduced the metric the way we did. It seems that procps has since changed that. I'm going to close this issue, as fixing it would induce a big change in the meaning of the metric, and we already have a metric that reports the right thing.
What is the metric that "reports the right thing"?
Hey, sorry for the confusion. What happened is that, traditionally, Linux memory tools (procps' free) included shared and cached memory in their total used memory (in practice, free was just taking the total amount of memory and subtracting the amount of free memory reported by the kernel). We decided to mimic that behavior in our metric. However, as you know, this metric is not that useful, since cached and buffered memory can be reclaimed for use by the OS / processes. That's why we introduced a separate metric. However, it seems that procps changed that 2 years ago. We did not change our metric because:
So tl;dr:
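The two "used" definitions under discussion can be written out side by side. This is a sketch with made-up helper names and numbers, not the agent's code:

```python
def used_old_procps(meminfo):
    # Historical procps behavior, which the agent mimicked:
    # everything that isn't free counts as used, including buffers/cache.
    return meminfo["MemTotal"] - meminfo["MemFree"]

def used_new_procps(meminfo):
    # Later procps behavior: buffers and cache no longer count as used.
    return (meminfo["MemTotal"] - meminfo["MemFree"]
            - meminfo["Buffers"] - meminfo["Cached"])

# Illustrative values in kB.
sample = {"MemTotal": 8000, "MemFree": 1000, "Buffers": 500, "Cached": 2500}
print(used_old_procps(sample))  # 7000
print(used_new_procps(sample))  # 4000
```

The gap between the two is exactly the buffer and page-cache memory the thread is arguing about.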
We think the actual amount of used memory should be the thing that's graphed. There's an open question as to whether preserving the current formula is a good thing for your remaining customer base. My feeling is that they'd benefit more from future accuracy than from preserving the broken behavior of previous versions. Besides, past data wouldn't change; only metrics submitted after an agent upgrade would. However, if any dashboards already correct for the existing incorrect data, those would begin to break, which is a shame. On the other hand, I don't think forcing the user to make corrections in dashboards should be encouraged as a way to work around the agent's misbehavior. To keep multiple parties happy, perhaps it's worth making a compromise, such as providing a new, correctly calculated metric alongside the existing one.
@mfischer-zd The mem_used in new procps you are referring to does not count cache and buffers as used memory. To me, cache and buffers are still used memory, so dd-agent (and the kernel) is displaying the correct value. You could use system.mem.total - system.mem.usable to get a value that matches the new procps, but it will only be correct on newer kernel versions.
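The workaround above is simple arithmetic on two existing metrics. A quick sketch with illustrative numbers (system.mem.usable is derived from MemAvailable, so this only matches newer procps on kernels that report it):

```python
# Illustrative values in GB; not real agent output.
mem_total = 8.0    # system.mem.total
mem_usable = 5.5   # system.mem.usable (kernel's MemAvailable)

# Approximates newer procps' "used", with buffers/cache discounted.
used = mem_total - mem_usable
print(used)  # 2.5
```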
Suppose "used" memory included buffers and cache (which Datadog's metric does today, but procps no longer does). Why would you want to graph it? What would such a value suggest when it reaches the amount of installed (total) memory? The natural conclusion, particularly for a less experienced observer, would be that the system is running out of memory and that immediate action needs to be taken. But that is not necessarily so, and that is the fallacy that http://www.linuxatemyram.com/ tried to dispel. That's also why procps started discounting buffers and cache as long as 5 years ago, and no longer includes them at all in more recent versions. We want metrics that are not merely accurate (for whatever sense of the word you want to use) but useful. And, as @remh tacitly admits above, the current metric isn't. A useful graph of memory on Linux distinguishes truly used memory from reclaimable buffers and cache (perhaps with some amendments); the "used" area should be what the metric reports. The user shouldn't have to go through these sorts of gymnastics to get something like this. Nor should the user have to risk embarrassing herself when she submits a plot built on a misleading metric.
Agreed, we can improve things to make clearer what's useful. Reopening this for discussion. We are going to make the default host dashboard use memory usable as a first step (should be done at the beginning of next week). We are considering adding a new metric; however, it doesn't seem necessary, as we already have the usable memory. We could add a new metric (that has yet to be named) that would be system.mem.total - system.mem.usable.
If you don't create a new metric or correct its calculation in the agent, anyone who wants to plot the discounted value of used memory (which is the correct value for most diagnostic purposes) will have to create an arithmetic expression in the UI to plot it. This is a suboptimal user experience, in my view. There are a few different options here:
Solution (2) seems to achieve the best balance between correcting the behavior and giving users the opportunity to adjust to it in due course.
@mfischer-zd I wanted to follow up on our conversation regarding this issue. My $0.02 is that, of the outlined options, #3 would be preferred. As a general rule of thumb, I believe we should not change the definition or behavior of a metric once it has been released, and metrics should have consistent definitions. In the past we've renamed metrics when a change in behavior was required, and I believe we should stick with this approach. With that in mind, I think we should come to an agreement on a name for the undiscounted memory metric while we look into implementing it in a future agent release. I have made a similar recommendation internally. To be transparent on timelines, though:
As far as next steps, I am discussing with our product team and agent engineering team what the timeline for the following release will be, and how this feature request may fit in. I believe the Integration SDK work may be able to speed this process along, but I'll check back in with you here on this issue next week with updates on my findings.
As of a very old commit (ca59049), system.mem.used no longer agrees with the used column of free in procps. (Oddly, the comment in the commit says the change is to be consistent with procps, but it isn't.) This should be fixed right away, because the system.mem.used metric is currently too high in many cases, impeding our diagnoses of problems. The algorithm can be found at https://gitlab.com/procps-ng/procps/blob/master/proc/sysinfo.c#L679-797

In sum:
See #64 where this issue was first raised.
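For reference, the estimate in the linked sysinfo.c is in the spirit of the kernel's own MemAvailable heuristic. Below is a heavily simplified sketch of that idea, not a transcription of the procps code: the real implementation reads low watermarks from /proc/zoneinfo, and the function name, arguments, and numbers here are illustrative.

```python
def estimate_available_kb(free, page_cache, slab_reclaimable, watermark_low):
    """Rough kernel-style estimate of available memory (all values in kB)."""
    # Not all cache/slab is actually reclaimable; discount each by up to
    # half, bounded by the low watermark, following the kernel's heuristic.
    cache = page_cache - min(page_cache // 2, watermark_low)
    slab = slab_reclaimable - min(slab_reclaimable // 2, watermark_low)
    available = free - watermark_low + cache + slab
    return max(available, 0)

print(estimate_available_kb(1000, 4000, 800, 200))  # 5200
```

The point of the heuristic is exactly the one argued in this thread: raw free memory understates what the system can actually use, while free + all cache overstates it.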