Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telegraf 0.11.1 (Influx Backend) - Stops Gathering Metrics #1067

Closed
sharkannon opened this issue Apr 20, 2016 · 11 comments · Fixed by #1125
Closed

Telegraf 0.11.1 (Influx Backend) - Stops Gathering Metrics #1067

sharkannon opened this issue Apr 20, 2016 · 11 comments · Fixed by #1125
Labels
bug unexpected problem or unintended behavior

Comments

@sharkannon
Copy link

We have Telegraf running on hundreds of servers, but I've been keeping an eye on a specific server as we keep losing data for it. It appears that every once in awhile Telegraf just stops gathering metrics. There doesn't appear to be anything in the logs other than:

2016/04/20 16:29:00 Gathered metrics, (1m0s interval), from 12 inputs in 225.847914ms
2016/04/20 16:29:00 Wrote 43 metrics to output influxdb in 23.440481ms
2016/04/20 16:30:00 Gathered metrics, (1m0s interval), from 12 inputs in 119.598162ms
2016/04/20 16:30:00 Wrote 44 metrics to output influxdb in 13.196372ms
2016/04/20 16:31:00 Wrote 38 metrics to output influxdb in 98.190384ms
2016/04/20 16:32:00 Wrote 4 metrics to output influxdb in 6.661739ms
2016/04/20 16:33:00 Wrote 0 metrics to output influxdb in 8.779388ms
2016/04/20 16:34:00 Wrote 0 metrics to output influxdb in 4.029673ms
2016/04/20 16:35:00 Wrote 0 metrics to output influxdb in 9.429614ms
2016/04/20 16:36:00 Wrote 0 metrics to output influxdb in 7.780781ms
2016/04/20 16:37:00 Wrote 0 metrics to output influxdb in 1.074654ms
2016/04/20 16:38:00 Wrote 0 metrics to output influxdb in 3.897474ms
2016/04/20 16:39:00 Wrote 0 metrics to output influxdb in 26.61162ms
2016/04/20 16:40:00 Wrote 0 metrics to output influxdb in 17.096148ms
2016/04/20 16:41:00 Wrote 0 metrics to output influxdb in 1.129129ms
2016/04/20 16:42:00 Wrote 0 metrics to output influxdb in 2.935744ms

Just suddenly stops "Gathering".

[agent]
  hostname = "SANITIZED"
  interval = "60s"
  round_interval = true
  metric_buffer_limit = 1000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "0s"
  debug = false
  quiet = false

we have the following inputs:

inputs.cpu
inputs.diskio
inputs.kernel
inputs.mem
inputs.net
inputs.netstat
inputs.ping
inputs.processes
inputs.statsd
inputs.swap
inputs.system
@sharkannon
Copy link
Author

Note: A simple restart of the service gets things going again.

@sparrc
Copy link
Contributor

sparrc commented Apr 20, 2016

Does this only happen on one server? Any chance you could send a SIGQUIT (^\) to the process when this is happening and provide the output?

@sharkannon
Copy link
Author

@sparrc It doesn't appear to happen on just one server, the next time it happens I'll try to get a dump for you.

@sparrc sparrc added the bug unexpected problem or unintended behavior label Apr 26, 2016
@sharkannon
Copy link
Author

sharkannon commented Apr 27, 2016

This help?

"telegraf.log" [readonly] 1232L, 83060C                                                                                                                                                                                                                                                                                                                   79,1          Top
net.runtime_pollWait(0x7fd6deac2e58, 0x72, 0xc8200620a0)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc82000ba30, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc82000ba30, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).Read(0xc82000b9d0, 0xc820266000, 0x1000, 0x1000, 0x0, 0x7fd6dc6a9028, 0xc8200620a0)
        /usr/local/go/src/net/fd_unix.go:232 +0x23a
net.(*conn).Read(0xc820022478, 0xc820266000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:172 +0xe4
net/http.noteEOFReader.Read(0x7fd6deac31c8, 0xc820022478, 0xc82014a9f8, 0xc820266000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/transport.go:1370 +0x67
net/http.(*noteEOFReader).Read(0xc8205b37a0, 0xc820266000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        <autogenerated>:126 +0xd0
bufio.(*Reader).fill(0xc82024c9c0)
        /usr/local/go/src/bufio/bufio.go:97 +0x1e9
bufio.(*Reader).Peek(0xc82024c9c0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/bufio/bufio.go:132 +0xcc
net/http.(*persistConn).readLoop(0xc82014a9a0)
        /usr/local/go/src/net/http/transport.go:876 +0xf7
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:685 +0xc78

goroutine 6902 [syscall, 951 minutes]:
syscall.Syscall6(0x3d, 0x7b83, 0xc820318b34, 0x0, 0xc82052b830, 0x0, 0x0, 0xc820318b08, 0xc820318b10, 0xc820318b00)
        /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.wait4(0x7b83, 0xc820318b34, 0x0, 0xc82052b830, 0x90, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:172 +0x72
syscall.Wait4(0x7b83, 0xc820318b7c, 0x0, 0xc82052b830, 0xc82006a0b8, 0x0, 0x0)
        /usr/local/go/src/syscall/syscall_linux.go:256 +0x55
os.(*Process).wait(0xc8200faf20, 0x5, 0x0, 0x0)
        /usr/local/go/src/os/exec_unix.go:22 +0x105
os.(*Process).Wait(0xc8200faf20, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/doc.go:45 +0x2d
os/exec.(*Cmd).Wait(0xc82016c8c0, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:380 +0x211
os/exec.(*Cmd).Run(0xc82016c8c0, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:258 +0x64
os/exec.(*Cmd).CombinedOutput(0xc82016c8c0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:424 +0x310
github.com/influxdata/telegraf/plugins/inputs/ping.hostPinger(0xc820096780, 0x8, 0xa, 0x0, 0x0, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/ping/ping.go:125 +0xe6
github.com/influxdata/telegraf/plugins/inputs/ping.(*Ping).Gather.func1(0xc82032c2d0, 0xc82000c960, 0xc820254540, 0xc820184801, 0xa, 0x7fd6deac35d0, 0xc820252f40)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/ping/ping.go:77 +0xc1
created by github.com/influxdata/telegraf/plugins/inputs/ping.(*Ping).Gather
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/ping/ping.go:101 +0x18c

goroutine 6889 [semacquire, 951 minutes]:
sync.runtime_Semacquire(0xc82032c2dc)
        /usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*WaitGroup).Wait(0xc82032c2d0)
        /usr/local/go/src/sync/waitgroup.go:126 +0xb4
github.com/influxdata/telegraf/plugins/inputs/ping.(*Ping).Gather(0xc82000c960, 0x7fd6deac35d0, 0xc820252f40, 0x0, 0x0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/ping/ping.go:104 +0x1b9
github.com/influxdata/telegraf/agent.(*Agent).gatherParallel.func1(0xc820412d60, 0xc820060b40, 0xc8200220c0, 0x0, 0xc8200e1110)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:137 +0x432
created by github.com/influxdata/telegraf/agent.(*Agent).gatherParallel
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:141 +0x442

rax    0x0
rbx    0xc820326d80
rcx    0xffffffffffffffff
rdx    0x200
rdi    0x10
rsi    0xc8200ae600
rbp    0x0
rsp    0xc820020418
r8     0x0
r9     0x0
r10    0x0
r11    0x246
r12    0x17b3d80
r13    0x128578c
r14    0x1fffffffffffff
r15    0x39
rip    0x4a05c2
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

@sparrc
Copy link
Contributor

sparrc commented Apr 27, 2016

yes, very helpful, it looks like the ping command is hanging, I'll see what I can do about setting a timeout for that.

@sparrc
Copy link
Contributor

sparrc commented Apr 27, 2016

actually it looks like the timeout is settable, but right now it defaults to "no timeout", could you try setting timeout to something like 5s and see if the problem is solved?

I'm going to change the ping plugin to default to 5s as well.

@sharkannon
Copy link
Author

I can, it doesn't show up very often so may take a week or 2 to actually notice the diff.

@sharkannon
Copy link
Author

[[inputs.ping]]
  count = 1
  timeout = 1.0
  urls = ["MASKED"]

Looks like we already had it set :(

@sparrc
Copy link
Contributor

sparrc commented Apr 27, 2016

that's strange, my only guess is that your system's ping command is hanging for some reason when this happens?

I think I can implement a timeout for this in Telegraf but it might leave behind a zombie process. Do you have any relevant system logs when this occurs?

@sharkannon
Copy link
Author

I think I can implement a timeout for this in Telegraf but it might leave behind a zombie process. Do you have any relevant system logs when this occurs?

Nothing obvious.

sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
sparrc added a commit that referenced this issue Apr 29, 2016
First is to write an internal CombinedOutput and Run function with a
timeout.

Second, the following instances of command runners need to have timeouts:

    plugins/inputs/ping/ping.go
    125:	out, err := c.CombinedOutput()

    plugins/inputs/exec/exec.go
    91:	if err := cmd.Run(); err != nil {

    plugins/inputs/ipmi_sensor/command.go
    31:	err := cmd.Run()

    plugins/inputs/sysstat/sysstat.go
    194:	out, err := cmd.CombinedOutput()

    plugins/inputs/leofs/leofs.go
    185:	defer cmd.Wait()

    plugins/inputs/sysstat/sysstat.go
    282:	if err := cmd.Wait(); err != nil {

closes #1067
@sparrc
Copy link
Contributor

sparrc commented Apr 29, 2016

Thanks for the detailed report @sharkannon! This fix will be available in Telegraf 0.13

@realware-it realware-it mentioned this issue Jul 31, 2017
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants