Get metrics regarding opened file handlers #853
Comments
This is a good idea. Thanks very much.
The name is confusing, but it already exists in the process check: https://github.com/DataDog/dd-agent/blob/master/checks.d/process.py#L13
Sorry, I misread your use case of getting it as a percentage.
I think we also want the system limit, not just handles open per process.
@clutchski you are right. Now we may have another problem: there is a system limit and a limit per process, and we will probably need both. From my experience you may want to monitor file handles for the monitored processes, as these are the ones that "surprise" you from time to time. Also, I have no idea how to enable these metrics. I checked the process.yaml file and it contains information only on how to monitor different processes, not on how to enable these metrics (obviously I tried to search for them using the web UI and they are not there). And regarding documentation, the best way to improve it is to improve the yaml templates and include all supported parameters in them. If something is too hard / complex to explain in the yaml file, you can always put a URL to a knowledge base article :)
Just discovered that psutil was not installed. Should I open another bug as "the installer does not try to install psutil by default"? I installed psutil, but now what do I need to do? Do I need to restart dd-agent, change something in the config? ... I wasn't able to see any error related to psutil in the dd-agent logs.
We don't bundle check dependencies with the agent to avoid conflicts with existing versions on the user's system. But we are working towards a self-contained agent which would actually install these dependencies, so there is no need to open another bug for that. Regarding the process check, it currently doesn't collect the system limit, but it does collect the number of open file descriptors for your watched processes. Can you get in touch with support@datadoghq.com to help you configure the check?
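For illustration, a minimal sketch of roughly what that per-process collection amounts to, using psutil; this is not the agent's actual check code, and the "nginx" process name is just an example:

```python
# Rough sketch only, not the agent's actual process check.
import psutil

def open_fds_for(name):
    """Sum open file descriptors across all processes matching `name` (Unix only)."""
    total = 0
    for proc in psutil.process_iter():
        try:
            if proc.name() == name:
                total += proc.num_fds()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or belongs to another user
    return total

print(open_fds_for("nginx"))  # "nginx" is only an illustrative process name
```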
I will contact support, they are really good and also quick :) Now, just as customer experience: I find it annoying that by default only ⅓ of the functionality is available just because you do not have the required libraries installed. I hope the next installer will try to install them one way or another, I don't care how. The 2nd annoyance is that the default .yaml files are far from extensive enough. I do think you should make a rule of updating these with all available options, so they can serve as a primary source of documentation. Most Linux tools have config files with commented options inside, and most of the time this is all you need in order to configure the product. That's what I call self-documented. Thanks. Also, it would be great to build a list of metrics with a description for each one, so we would know which ones we want to track and exactly what a metric means; sometimes the name is not explicit enough and you may not be aware of the range of values it will take, unit of measure, ....
Please do try to install psutil when installing the agent, otherwise you are just providing a bad user experience. It is ok to ignore it if it fails, but doing an ...
Thanks for the feedback @ssbarnea. As of Agent 5.0.0, psutil is bundled in the deb, rpm and msi packages of the agent, and is installed on the fly with source installs.

$ /opt/datadog-agent/embedded/bin/python -c "import psutil; print psutil.__version__"
2.2.1

We will work on this issue to implement the count of opened file handlers, as it's an important metric, but feel free to open a pull request if you've already done so! Thanks again for the feedback!
I am quite busy fixing other broken things around, but be sure that if I implement something in Datadog I will make pull requests; I prefer not to run my own patched versions. I had an outage due to file handles being exhausted for one of the monitored processes (nginx) and it took me some time to find out the cause. So if Datadog can monitor the % of file handles it would be perfect, as we could have a single rule: if % open files (curr/max) is over 90%, raise an alarm. I do like being able to have relative conditions, as they are much easier to manage and you also do not have to update the monitors when you tune the configuration on the server side.
Looks like we could get that from /proc/sys/fs/file-nr. @ssbarnea what do you think?
This doesn't seem to fix the issue; we need to be able to read the number of file descriptors per user, and this seems to return the same result for any user.
Thanks for the feedback @ssbarnea. It's not possible to get the number of open FDs per user without root access. Reading the number of open FDs and the limit in /proc/sys/fs/file-nr, on the other hand, would be pretty straightforward, fast to execute, doesn't require root access, and would give you the visibility to detect FD leaks. So it's likely the way we will go. What do you think?
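For reference, a minimal sketch of what reading that file involves (not the agent's implementation); on Linux, /proc/sys/fs/file-nr holds three counters: allocated handles, allocated-but-unused handles, and the system-wide maximum:

```python
# Sketch only: parse the three counters exposed by /proc/sys/fs/file-nr.
def read_file_nr():
    with open("/proc/sys/fs/file-nr") as f:
        allocated, unused, maximum = (int(x) for x in f.read().split())
    return allocated, unused, maximum

allocated, unused, maximum = read_file_nr()
print("allocated=%d unused=%d max=%d" % (allocated, unused, maximum))
```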
We are running ~6 serious JVM applications on the same bare-metal machine, each of them under its own username, and they all have custom ulimits. We never ran out of file handles for the system itself, but every 3-4 months we have an issue related to them, caused by either a bug or just a normal usage increase. If we monitored only the global number of file handles, we would not be able to spot who is generating the problem. As a workaround I could set up the same limits for all applications, having each of them at 90% of the total system limit, and monitor only the total values. I do agree that under no circumstances should we count all FDs for each PID. Needing root access is not a problem from my point of view; doing proper monitoring almost always requires root access. There are ways to secure this; allowing the datadog user to run a specific command as root could be one option.
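A sketch of the per-process angle described here, assuming psutil is available and the agent is allowed to inspect the target processes (Process.rlimit is Linux-only); the pid and the 90% threshold are illustrative, not anything the agent ships with:

```python
# Sketch only: compare a process's open FDs to its own RLIMIT_NOFILE soft limit.
import psutil

def fd_usage_percent(pid):
    proc = psutil.Process(pid)
    open_fds = proc.num_fds()                              # Unix only
    soft_limit, _hard = proc.rlimit(psutil.RLIMIT_NOFILE)  # Linux only
    return 100.0 * open_fds / soft_limit

for pid in [1234]:  # illustrative pid list of the watched JVMs
    if fd_usage_percent(pid) > 90:
        print("pid %d is above 90%% of its open file limit" % pid)
```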
I hope someone from DataDog will pull DataDog/ansible-datadog#13, which is needed for this bug.
@ssbarnea thanks for the feedback. One way to do that would be for you to add lsof access for dd-agent in the sudoers file. Then we could have the process check call lsof on the pids found by the process check.
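A hypothetical sketch of that lsof-based approach, assuming a sudoers entry lets the dd-agent user run lsof without a password; the helper name and pid are illustrative:

```python
# Sketch only: count open files for a pid via `sudo lsof`, assuming passwordless sudo.
import subprocess

def open_files_via_lsof(pid):
    out = subprocess.check_output(["sudo", "-n", "lsof", "-p", str(pid)])
    lines = out.decode("utf-8", "replace").splitlines()
    return max(len(lines) - 1, 0)  # drop the lsof header line

print(open_files_via_lsof(1234))  # illustrative pid
```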
Yes; that, in addition to the correct configuration of lsof via ansible, would work. Thanks!
Just for the record, and to save others the time searching for it: currently Datadog only supports open file handles per process. For the system as a whole, there is no such metric yet.
@alexef Thanks for the information. Is there any possibility that the total open file handles per system will be included in the future?
+1 - @remh what happened to monitoring the relevant values in /proc/sys/fs/file-nr?
👍 the aggregate of /proc/sys/fs/file-nr would be super useful!
Reopening: we can indeed add the content of /proc/sys/fs/file-nr, although it's not as precise.
Any movement on this issue? This is quite an important metric for us.
@abeluck, I've got a PR at DataDog/integrations-core#715, but some changes were requested before it can be merged. I don't have time to implement them right now, though. FWIW, we have been using this patch as-is since August.
On Linux, the Agent now reports the total number of open file handles over the system limit, as a fraction of used handles over the maximum.
Another common problem with systems is the number of open files.
Datadog should provide metrics regarding their use and, more importantly, present one metric that measures % use, allowing us to add alerts if usage is above, let's say, 80%.
The numeric value is not of much use by itself, but when measured against the maximum value, which is configurable, its usefulness grows considerably.
http://www.cyberciti.biz/tips/linux-procfs-file-descriptors.html
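As a rough illustration of the percentage-based rule requested above (a sketch assuming a Linux host where /proc/sys/fs/file-nr holds the allocated handle counts and the system maximum; the 80% threshold is just the example value from the report):

```python
# Sketch only: express system-wide file handle usage as a percentage of the limit
# and compare it to an example 80% threshold.
def file_handle_usage_percent():
    with open("/proc/sys/fs/file-nr") as f:
        allocated, unused, maximum = (int(x) for x in f.read().split())
    return 100.0 * (allocated - unused) / maximum

if file_handle_usage_percent() > 80:
    print("open file handle usage is above 80% of the system limit")
```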