Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

procstat_lookup always shows 1 when monitoring systemd_unit #5300

Closed
Frakenz opened this issue Jan 16, 2019 · 8 comments
Closed

procstat_lookup always shows 1 when monitoring systemd_unit #5300

Frakenz opened this issue Jan 16, 2019 · 8 comments
Labels
area/procstat bug unexpected problem or unintended behavior
Milestone

Comments

@Frakenz
Copy link
Contributor

Frakenz commented Jan 16, 2019

System info:

[Include Telegraf version, operating system name, and other relevant details]
Telegraf 1.9.1 (git: HEAD 2063609)
CentOS 7
InfluxDB shell version: 1.7.2

Steps to reproduce:

  1. sudo systemctl stop puppet.service
  2. Wait for telegraf interval to update, in my case 1 minute
  3. Enter into influxdb CLI with the corresponding database and run this query:
    SELECT last("pid_count") FROM "autogen"."procstat_lookup" WHERE ("systemd_unit" = 'puppet.service') AND time > now()-2m GROUP BY "myTag"

Expected behavior:

name: procstat_lookup
tags: myTag1
time last


1547670360000000000 0

Actual behavior:

name: procstat_lookup
tags: myTag1
time last


1547670360000000000 1

Additional info:

I only learned a bit of GO to read the source code, but from what I understood, the problem is that here len(pids) is not enough, since running systemctl show puppet.service when the service is not running will give MainPID=0 (see also #3612) and systemdUnitPIDs() does not check for zero values.

[Include gist of relevant config, logs, etc.]

Relevant telegraf.d/puppet.conf:

[[inputs.procstat]]
systemd_unit = "puppet.service"

@glinton
Copy link
Contributor

glinton commented Jan 17, 2019

There are cases where MainPID=0 when the service is running, which seems to be caused by using the incorrect systemd unit, or incorrectly constructed file. Maybe systemdUnitPIDs() should check for zero values to encourage the proper use of the units/files. ...or maybe it should check for the pids in a different way as many of the units/files are created by default when installing a service.

@Frakenz
Copy link
Contributor Author

Frakenz commented Jan 17, 2019

A different way to check if the service is running could be by checking the SubState or the ActiveState in the same way that the MainPID is being currently obtained.
In any case it would be good if someone could confirm if this is a bug or if it has something to do with my setup, so I can know how to proceed :)

@glinton
Copy link
Contributor

glinton commented Jan 17, 2019

That's a great suggestion. It seems like pid_count is a bad place to populate if MainPID=0 as the service has many child processes. There would also need to be a new path to determine "running" if that is the case, likely using the method you linked to - systemctl show -p SubState something.service. I suppose for compatibility sake, we set pid_count to 1 if the 'SubState' is "running" and 0 otherwise.

@glinton glinton added bug unexpected problem or unintended behavior area/procstat labels Jan 17, 2019
@danielnelson
Copy link
Contributor

Since 0 is not a valid PID, we should just check for it and skip it when reading the MainPID. The pid_count isn't meant to indicate if the service is running or not, just how many pids were found. We will likely have a solution for monitoring systemd service status in #4823.

@glinton
Copy link
Contributor

glinton commented Jan 17, 2019

Agreed, though I feel we should drop the pid_count metric entirely if systemd_unit is specified because it doesn't give any sort of accurate count as there will always only be a single pid in MainPID

edit:
If we do keep pid_count for systemd, I do think we should set it to 1 if the service is running (even if mainpid is 0) for more consistent behavior with properly configured/named services.

@Frakenz
Copy link
Contributor Author

Frakenz commented Jan 17, 2019

@danielnelson I thought that was what it was meant to do as per #4237

@danielnelson
Copy link
Contributor

This plugin doesn't report if systemd thinks a service is running or not, it searches for PIDs and then reads the process information for them. This field should contain the number of PIDs that have been found, if MainPID=0 then it hasn't found any valid PIDs so it should be zero.

This is why if you are using the pidfile method and the file exists containing a PID, the pid_count will report 1 no matter if the process is running or not.

It may be that we are not finding the PIDs correctly for systemd, which would be another issue, but I don't think we need to special case systemd in any way.

Frakenz added a commit to Frakenz/telegraf that referenced this issue Jun 8, 2019
Attempt at fixing influxdata#5300. Not sure if using an OR statement or writing a separate if was a better practice. Since both check for valid PIDs I used an OR statement.
Also I tested this on golang and know almost nothing of GO, so there might be better ways to write this.
@danielnelson danielnelson added this to the 1.11.1 milestone Jun 14, 2019
@danielnelson
Copy link
Contributor

Closed in #5972

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/procstat bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants