Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.10.3] current total CPU time is less than previous total CPU time #721

Closed
realdavidops opened this issue Feb 19, 2016 · 6 comments
Closed

Comments

@realdavidops
Copy link

Debian 8. Brand new build.

CPU check was working fine. Then suddenly IOwait climbed rapidly in the middle of the night and the telegraf service stopped. (No error). Now on restart I have no CPU metrics. Here is an output:

david@mx:~$ sudo /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -test 
* Plugin: disk, Collection 1
> disk,fstype=rootfs,path=/ free=41135247360i,inodes_free=2830243i,inodes_total=2880000i,inodes_used=49757i,total=47542140928i,used=3974348800i,used_percent=8.81042868551364 1455892928281272090
> disk,fstype=sysfs,path=/sys free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928281456481
> disk,fstype=proc,path=/proc free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928281505283
> disk,fstype=devtmpfs,path=/dev free=10485760i,inodes_free=254818i,inodes_total=255119i,inodes_used=301i,total=10485760i,used=0i,used_percent=0 1455892928281541266
> disk,fstype=devpts,path=/dev/pts free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928281580422
> disk,fstype=tmpfs,path=/run free=410550272i,inodes_free=256833i,inodes_total=257308i,inodes_used=475i,total=421576704i,used=11026432i,used_percent=2.6155221328358786 1455892928281737891
> disk,fstype=ext4,path=/ free=41135247360i,inodes_free=2830243i,inodes_total=2880000i,inodes_used=49757i,total=47542140928i,used=3974348800i,used_percent=8.81042868551364 1455892928281846315
> disk,fstype=securityfs,path=/sys/kernel/security free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928281890167
> disk,fstype=tmpfs,path=/dev/shm free=1053933568i,inodes_free=257307i,inodes_total=257308i,inodes_used=1i,total=1053933568i,used=0i,used_percent=0 1455892928281927970
> disk,fstype=tmpfs,path=/run/lock free=5242880i,inodes_free=257305i,inodes_total=257308i,inodes_used=3i,total=5242880i,used=0i,used_percent=0 1455892928281974749
> disk,fstype=tmpfs,path=/sys/fs/cgroup free=1053933568i,inodes_free=257295i,inodes_total=257308i,inodes_used=13i,total=1053933568i,used=0i,used_percent=0 1455892928282166420
> disk,fstype=cgroup,path=/sys/fs/cgroup/systemd free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282253266
> disk,fstype=pstore,path=/sys/fs/pstore free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282298984
> disk,fstype=cgroup,path=/sys/fs/cgroup/cpuset free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282407824
> disk,fstype=cgroup,path=/sys/fs/cgroup/cpu\,cpuacct free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282486306
> disk,fstype=cgroup,path=/sys/fs/cgroup/devices free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282586400
> disk,fstype=cgroup,path=/sys/fs/cgroup/freezer free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282631127
> disk,fstype=cgroup,path=/sys/fs/cgroup/net_cls\,net_prio free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282665437
> disk,fstype=cgroup,path=/sys/fs/cgroup/blkio free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282708708
> disk,fstype=cgroup,path=/sys/fs/cgroup/perf_event free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282799651
> disk,fstype=tmpfs,path=/etc/machine-id free=410550272i,inodes_free=256833i,inodes_total=257308i,inodes_used=475i,total=421576704i,used=11026432i,used_percent=2.6155221328358786 1455892928282854889
> disk,fstype=autofs,path=/proc/sys/fs/binfmt_misc free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928282927342
> disk,fstype=hugetlbfs,path=/dev/hugepages free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928283012625
> disk,fstype=debugfs,path=/sys/kernel/debug free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928283083024
> disk,fstype=mqueue,path=/dev/mqueue free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928283163702
> disk,fstype=tmpfs,path=/run/user/0 free=210788352i,inodes_free=257304i,inodes_total=257308i,inodes_used=4i,total=210788352i,used=0i,used_percent=0 1455892928283202130
> disk,fstype=binfmt_misc,path=/proc/sys/fs/binfmt_misc free=0i,inodes_free=0i,inodes_total=0i,inodes_used=0i,total=0i,used=0i,used_percent=0 1455892928283244467
> disk,fstype=tmpfs,path=/run/user/1000 free=210788352i,inodes_free=257304i,inodes_total=257308i,inodes_used=4i,total=210788352i,used=0i,used_percent=0 1455892928283284761
* Plugin: mem, Collection 1
> mem available=1222287360i,available_percent=57.986926174079315,buffered=43929600i,cached=1176727552i,free=106393600i,total=2107867136i,used=2001473536i,used_percent=94.95254714194661 1455892928283873889
* Plugin: swap, Collection 1
> swap free=2147438592i,in=0i,out=40960i,total=2147479552i,used=40960i,used_percent=0.001907352270798246 1455892928284270508
* Plugin: system, Collection 1
> system load1=0.08,load15=0.05,load5=0.03,uptime=44783i,uptime_format="12:26" 1455892928345733551
* Plugin: cpu, Collection 1
* Plugin: cpu, Collection 2
2016/02/19 07:42:08 Error: current total CPU time is less than previous total CPU time

And my config

david@mx:~$ cat /etc/telegraf/telegraf.conf 
# Telegraf configuration

# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.

# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.

# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 10000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = false
  ## Override default hostname, if empty use os.Hostname()
  hostname = ""


###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  # The full HTTP or UDP endpoint URL for your InfluxDB instance.
  # Multiple urls can be specified but it is assumed that they are part of the same
  # cluster, this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["[[[[[changed]]]]]"] # required
  # The target database for metrics (telegraf will create it if not exists)
  database = "telegraf" # required
  # Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  # note: using second precision greatly helps InfluxDB compression
  precision = "s"

  # Connection timeout (for the connection with InfluxDB), formatted as a string.
  # If not provided, will default to 0 (no timeout)
  # timeout = "5s"
  username = "telegraf"
  password = "[[[[[CHANGED]]]]]"
  # Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  # Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512


###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Read metrics about cpu usage
[[inputs.cpu]]
  # Whether to report per-cpu stats or not
  percpu = true
  # Whether to report total system cpu stats or not
  totalcpu = true
  # Comment this line if you want the raw CPU time metrics
  drop = ["time_*"]

# Read metrics about disk usage by mount point
[[inputs.disk]]
  # By default, telegraf gather stats for all mountpoints.
  # Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points=["/"]

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration


###############################################################################
#                              SERVICE INPUTS                                 #
###############################################################################
david@mx:~$ 
@realdavidops
Copy link
Author

I don't think this is related to #720, but it might be.

@tripledes
Copy link

It comes from: cpu.go#L81-L84

Doesn't it recover after awhile?

@sparrc
Copy link
Contributor

sparrc commented Feb 19, 2016

@david-AT That should happen only once, and then recover, can you give more details?

@realdavidops
Copy link
Author

@sparrc - What specific details would you like?

I did just check the box again and the telegraf service has fallen over again:

david@mx:~$ sudo systemctl status telegraf
[sudo] password for david: 
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/lib/systemd/system/telegraf.service; enabled)
   Active: failed (Result: exit-code) since Fri 2016-02-19 07:38:49 MST; 2h 25min ago
     Docs: https://github.com/influxdata/telegraf
  Process: 5640 ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d ${TELEGRAF_OPTS} (code=exited, status=2)
 Main PID: 5640 (code=exited, status=2)

Feb 19 07:38:10 mx telegraf[5640]: 2016/02/19 07:38:10 Wrote 31 metrics to output influxdb in 3.330322ms
Feb 19 07:38:20 mx telegraf[5640]: 2016/02/19 07:38:20 Error in input [cpu]: Error: current total CPU time is less than previous total CPU time
Feb 19 07:38:20 mx telegraf[5640]: 2016/02/19 07:38:20 Gathered metrics, (10s interval), from 5 inputs in 37.255017ms
Feb 19 07:38:20 mx telegraf[5640]: 2016/02/19 07:38:20 Wrote 31 metrics to output influxdb in 3.233115ms
Feb 19 07:38:30 mx telegraf[5640]: 2016/02/19 07:38:30 Error in input [cpu]: Error: current total CPU time is less than previous total CPU time
Feb 19 07:38:30 mx telegraf[5640]: 2016/02/19 07:38:30 Gathered metrics, (10s interval), from 5 inputs in 36.818794ms
Feb 19 07:38:30 mx telegraf[5640]: 2016/02/19 07:38:30 Wrote 31 metrics to output influxdb in 4.027961ms
Feb 19 07:38:40 mx telegraf[5640]: 2016/02/19 07:38:40 Error in input [cpu]: Error: current total CPU time is less than previous total CPU time
Feb 19 07:38:40 mx telegraf[5640]: 2016/02/19 07:38:40 Gathered metrics, (10s interval), from 5 inputs in 59.201251ms
Feb 19 07:38:40 mx telegraf[5640]: 2016/02/19 07:38:40 Wrote 31 metrics to output influxdb in 3.697831ms
Feb 19 07:38:49 mx systemd[1]: telegraf.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 19 07:38:49 mx systemd[1]: Unit telegraf.service entered failed state.

@sparrc
Copy link
Contributor

sparrc commented Feb 19, 2016

Is your system operationally sound during this time otherwise? I'm curious why IOwait climbed in the middle of the night

Would be helpful if you could run telegraf manually and get some debug output, like this:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -debug

@sparrc
Copy link
Contributor

sparrc commented Mar 1, 2016

closing for lack of info, please re-open if you can provide more details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants