procstat_lookup always shows 1 when monitoring systemd_unit #5300

Frakenz · 2019-01-16T20:38:53Z

System info:

[Include Telegraf version, operating system name, and other relevant details]
Telegraf 1.9.1 (git: HEAD 2063609)
CentOS 7
InfluxDB shell version: 1.7.2

Steps to reproduce:

sudo systemctl stop puppet.service
Wait for telegraf interval to update, in my case 1 minute
Enter into influxdb CLI with the corresponding database and run this query:
SELECT last("pid_count") FROM "autogen"."procstat_lookup" WHERE ("systemd_unit" = 'puppet.service') AND time > now()-2m GROUP BY "myTag"

Expected behavior:

name: procstat_lookup
tags: myTag1
time last

1547670360000000000 0

Actual behavior:

name: procstat_lookup
tags: myTag1
time last

1547670360000000000 1

Additional info:

I only learned a bit of GO to read the source code, but from what I understood, the problem is that here len(pids) is not enough, since running systemctl show puppet.service when the service is not running will give MainPID=0 (see also #3612) and systemdUnitPIDs() does not check for zero values.

[Include gist of relevant config, logs, etc.]

Relevant telegraf.d/puppet.conf:

[[inputs.procstat]]
systemd_unit = "puppet.service"

The text was updated successfully, but these errors were encountered:

glinton · 2019-01-17T00:56:25Z

There are cases where MainPID=0 when the service is running, which seems to be caused by using the incorrect systemd unit, or incorrectly constructed file. Maybe systemdUnitPIDs() should check for zero values to encourage the proper use of the units/files. ...or maybe it should check for the pids in a different way as many of the units/files are created by default when installing a service.

Frakenz · 2019-01-17T14:06:38Z

A different way to check if the service is running could be by checking the SubState or the ActiveState in the same way that the MainPID is being currently obtained.
In any case it would be good if someone could confirm if this is a bug or if it has something to do with my setup, so I can know how to proceed :)

glinton · 2019-01-17T17:26:32Z

That's a great suggestion. It seems like pid_count is a bad place to populate if MainPID=0 as the service has many child processes. There would also need to be a new path to determine "running" if that is the case, likely using the method you linked to - systemctl show -p SubState something.service. I suppose for compatibility sake, we set pid_count to 1 if the 'SubState' is "running" and 0 otherwise.

danielnelson · 2019-01-17T21:28:32Z

Since 0 is not a valid PID, we should just check for it and skip it when reading the MainPID. The pid_count isn't meant to indicate if the service is running or not, just how many pids were found. We will likely have a solution for monitoring systemd service status in #4823.

glinton · 2019-01-17T22:04:47Z

Agreed, though I feel we should drop the pid_count metric entirely if systemd_unit is specified because it doesn't give any sort of accurate count as there will always only be a single pid in MainPID

edit:
If we do keep pid_count for systemd, I do think we should set it to 1 if the service is running (even if mainpid is 0) for more consistent behavior with properly configured/named services.

Frakenz · 2019-01-17T22:10:19Z

@danielnelson I thought that was what it was meant to do as per #4237

danielnelson · 2019-01-17T22:31:42Z

This plugin doesn't report if systemd thinks a service is running or not, it searches for PIDs and then reads the process information for them. This field should contain the number of PIDs that have been found, if MainPID=0 then it hasn't found any valid PIDs so it should be zero.

This is why if you are using the pidfile method and the file exists containing a PID, the pid_count will report 1 no matter if the process is running or not.

It may be that we are not finding the PIDs correctly for systemd, which would be another issue, but I don't think we need to special case systemd in any way.

Attempt at fixing influxdata#5300. Not sure if using an OR statement or writing a separate if was a better practice. Since both check for valid PIDs I used an OR statement. Also I tested this on golang and know almost nothing of GO, so there might be better ways to write this.

danielnelson · 2019-06-14T19:46:41Z

Closed in #5972

glinton added bug unexpected problem or unintended behavior area/procstat labels Jan 17, 2019

glinton mentioned this issue Jan 17, 2019

Fix systemd_unit in procstat_lookup #5307

Closed

Frakenz mentioned this issue Jun 8, 2019

Fix pid_count with systemd_unit in procstat_lookup #5972

Merged

3 tasks

danielnelson added this to the 1.11.1 milestone Jun 14, 2019

danielnelson closed this as completed Jun 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

procstat_lookup always shows 1 when monitoring systemd_unit #5300

procstat_lookup always shows 1 when monitoring systemd_unit #5300

Frakenz commented Jan 16, 2019 •

edited

Loading

glinton commented Jan 17, 2019

Frakenz commented Jan 17, 2019

glinton commented Jan 17, 2019

danielnelson commented Jan 17, 2019

glinton commented Jan 17, 2019 •

edited

Loading

Frakenz commented Jan 17, 2019

danielnelson commented Jan 17, 2019

danielnelson commented Jun 14, 2019

procstat_lookup always shows 1 when monitoring systemd_unit #5300

procstat_lookup always shows 1 when monitoring systemd_unit #5300

Comments

Frakenz commented Jan 16, 2019 • edited Loading

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

Relevant telegraf.d/puppet.conf:

glinton commented Jan 17, 2019

Frakenz commented Jan 17, 2019

glinton commented Jan 17, 2019

danielnelson commented Jan 17, 2019

glinton commented Jan 17, 2019 • edited Loading

Frakenz commented Jan 17, 2019

danielnelson commented Jan 17, 2019

danielnelson commented Jun 14, 2019

Frakenz commented Jan 16, 2019 •

edited

Loading

glinton commented Jan 17, 2019 •

edited

Loading