
vSphere Input Plugin intermittent silent collection gaps for some hosts & virtual machines #5421

Closed
ghost opened this issue Feb 12, 2019 · 39 comments
Labels
area/vsphere, bug (unexpected problem or unintended behavior)

Comments

@ghost

ghost commented Feb 12, 2019

Relevant telegraf.conf:

System info:

Telegraf 1.9.0-1 (RPM package installed)
RHEL 7.5 Maipo (32GB RAM, 8x 2.2GHz CPU)

Steps to reproduce:

  1. Enable Telegraf.
  2. 'journalctl -f -t telegraf' shows no instances of '[agent] input "inputs.vsphere" did not complete within its interval'.
  3. Missing data / gaps in metrics appear for some ESX Hosts and Virtual Machines.

Expected behavior:

Collection of all listed metrics for Hosts & VMs is successful, and where it is not, a failure message is logged.

Actual behavior:

Intermittent gaps in metrics for some hosts & VMs are observed, as shown below:
[screenshot]

Additional info:

I have tried increasing and decreasing the goroutine counts and increasing the interval to 2 minutes or more, with no success. I have left the collector running for over 12 hours with no change in behavior.

Configuration for telegraf agent

[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = true
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

# # Read metrics from VMware vCenter
 [[inputs.vsphere]]
   ## List of vCenter URLs to be monitored. These three lines must be uncommented
   ## and edited for the plugin to work.
   vcenters = ["https://london22s304.generic.com/sdk", "https://test04.generic.com/sdk", "https://test06.generic.com/sdk", "https://test101.generic.com/sdk" ]
   username = "acc@generic.com"
   password = "password"
   interval = "60s"

   ## VMs
   ## Typical VM metrics (if omitted or empty, all metrics are collected)
   vm_metric_include = [
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.run.summation",
     "cpu.usagemhz.average",
     "cpu.usage.average",
     "cpu.used.summation",
     "cpu.wait.summation",
     "mem.active.average",
     "mem.granted.average",
     "mem.latency.average",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.usage.average",
     "power.power.average",
     "virtualDisk.numberReadAveraged.average",
     "virtualDisk.numberWriteAveraged.average",
     "virtualDisk.read.average",
     "virtualDisk.readOIO.latest",
     "virtualDisk.throughput.usage.average",
     "virtualDisk.totalReadLatency.average",
     "virtualDisk.totalWriteLatency.average",
     "virtualDisk.write.average",
     "virtualDisk.writeOIO.latest",
     "sys.uptime.latest",
   ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default

   ## Hosts
   ## Typical host metrics (if omitted or empty, all metrics are collected)
   host_metric_include = [
     "cpu.coreUtilization.average",
     "cpu.costop.summation",
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.swapwait.summation",
     "cpu.usage.average",
     "cpu.usagemhz.average",
     "cpu.used.summation",
     "cpu.utilization.average",
     "cpu.wait.summation",
     "disk.deviceReadLatency.average",
     "disk.deviceWriteLatency.average",
     "disk.kernelReadLatency.average",
     "disk.kernelWriteLatency.average",
     "disk.numberReadAveraged.average",
     "disk.numberWriteAveraged.average",
     "disk.read.average",
     "disk.totalReadLatency.average",
     "disk.totalWriteLatency.average",
     "disk.write.average",
     "mem.active.average",
     "mem.latency.average",
     "mem.state.latest",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.totalCapacity.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.errorsRx.summation",
     "net.errorsTx.summation",
     "net.usage.average",
     "power.power.average",
     "storageAdapter.numberReadAveraged.average",
     "storageAdapter.numberWriteAveraged.average",
     "storageAdapter.read.average",
     "storageAdapter.write.average",
     "sys.uptime.latest",
   ]
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default
   ## Clusters
   # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
   cluster_metric_exclude = ["*"] ## Nothing excluded by default
   # cluster_instances = true ## true by default

   ## Datastores
   # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
   datastore_metric_exclude = ["*"] ## Nothing excluded by default
   # datastore_instances = false ## false by default for Datastores only

   ## Datacenters
   # datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
   datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
   # datacenter_instances = false ## false by default for Datastores only

   ## Plugin Settings
   ## separator character to use for measurement and field names (default: "_") 
     # separator = "_"

   ## number of objects to retrieve per query for realtime resources (vms and hosts)
   ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
   max_query_objects = 64

   ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
   ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
   max_query_metrics = 64

   ## number of go routines to use for collection and discovery of objects and metrics
   collect_concurrency = 5
   discover_concurrency = 5

   ## whether or not to force discovery of new objects on initial gather call before collecting metrics
   ## when true for large environments this may cause errors for time elapsed while collecting metrics
   ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
   #   force_discover_on_init = false

   ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
   # object_discovery_interval = "300s"

   ## timeout applies to any of the API requests made to vCenter
   timeout = "60s"

   ## Optional SSL Config
   # ssl_ca = "/path/to/cafile"
   # ssl_cert = "/path/to/certfile"
   # ssl_key = "/path/to/keyfile"
   ## Use SSL but skip chain & host verification
   insecure_skip_verify = true

   
  ## This is a Windows Server vCenter, not an appliance.
 [[inputs.vsphere]]
   ## List of vCenter URLs to be monitored. These three lines must be uncommented
   ## and edited for the plugin to work.
   vcenters = ["https://london22s302.generic.com/sdk" ]
   username = "acc@generic.com"
   password = "password"
   interval = "60s"

   ## VMs
   ## Typical VM metrics (if omitted or empty, all metrics are collected)
   vm_metric_include = [
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.run.summation",
     "cpu.usagemhz.average",
     "cpu.usage.average",
     "cpu.used.summation",
     "cpu.wait.summation",
     "mem.active.average",
     "mem.granted.average",
     "mem.latency.average",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.usage.average",
     "power.power.average",
     "virtualDisk.numberReadAveraged.average",
     "virtualDisk.numberWriteAveraged.average",
     "virtualDisk.read.average",
     "virtualDisk.readOIO.latest",
     "virtualDisk.throughput.usage.average",
     "virtualDisk.totalReadLatency.average",
     "virtualDisk.totalWriteLatency.average",
     "virtualDisk.write.average",
     "virtualDisk.writeOIO.latest",
     "sys.uptime.latest",
   ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default

   ## Hosts
   ## Typical host metrics (if omitted or empty, all metrics are collected)
   host_metric_include = [
     "cpu.coreUtilization.average",
     "cpu.costop.summation",
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.swapwait.summation",
     "cpu.usage.average",
     "cpu.usagemhz.average",
     "cpu.used.summation",
     "cpu.utilization.average",
     "cpu.wait.summation",
     "disk.deviceReadLatency.average",
     "disk.deviceWriteLatency.average",
     "disk.kernelReadLatency.average",
     "disk.kernelWriteLatency.average",
     "disk.numberReadAveraged.average",
     "disk.numberWriteAveraged.average",
     "disk.read.average",
     "disk.totalReadLatency.average",
     "disk.totalWriteLatency.average",
     "disk.write.average",
     "mem.active.average",
     "mem.latency.average",
     "mem.state.latest",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.totalCapacity.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.errorsRx.summation",
     "net.errorsTx.summation",
     "net.usage.average",
     "power.power.average",
     "storageAdapter.numberReadAveraged.average",
     "storageAdapter.numberWriteAveraged.average",
     "storageAdapter.read.average",
     "storageAdapter.write.average",
     "sys.uptime.latest",
   ]
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default
   ## Clusters
   # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
   cluster_metric_exclude = ["*"] ## Nothing excluded by default
   # cluster_instances = true ## true by default

   ## Datastores
   # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
   datastore_metric_exclude = ["*"] ## Nothing excluded by default
   # datastore_instances = false ## false by default for Datastores only

   ## Datacenters
   # datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
   datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
   # datacenter_instances = false ## false by default for Datastores only

   ## Plugin Settings
   ## separator character to use for measurement and field names (default: "_")
   # separator = "_"

   ## number of objects to retrieve per query for realtime resources (vms and hosts)
   ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
   max_query_objects = 64

   ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
   ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
   max_query_metrics = 64

   ## number of go routines to use for collection and discovery of objects and metrics
   collect_concurrency = 5
   discover_concurrency = 5

   ## whether or not to force discovery of new objects on initial gather call before collecting metrics
   ## when true for large environments this may cause errors for time elapsed while collecting metrics
   ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
   # force_discover_on_init = false

   ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
   # object_discovery_interval = "300s"

   ## timeout applies to any of the API requests made to vCenter
   timeout = "60s"

   ## Optional SSL Config
   # ssl_ca = "/path/to/cafile"
   # ssl_cert = "/path/to/certfile"
   # ssl_key = "/path/to/keyfile"
   ## Use SSL but skip chain & host verification
   insecure_skip_verify = true

@prydin
Contributor

prydin commented Feb 13, 2019

This should be addressed in 1.10, which is scheduled to be released next week.

@ghost
Author

ghost commented Feb 13, 2019

@prydin
Fastest response time in history... thanks!

Any idea what causes this? We were previously utilising Oxalide's vmware collector with no issues. Just curious is all :-)

@danielnelson danielnelson added this to the 1.10.0 milestone Feb 13, 2019
@danielnelson
Contributor

@bdeam It would be a great help if you could test out the 1.10 before the final release. Here is the latest automated build, with some additional improvements to the vSphere plugin from today. We will also probably do a release candidate next week:

@ghost
Author

ghost commented Feb 13, 2019

@bdeam It would be a great help if you could test out the 1.10 before the final release. Here is the latest automated build, with some additional improvements to the vSphere plugin from today. We will also probably do a release candidate next week:

Will do!

@ghost
Author

ghost commented Feb 13, 2019

Seems as though the issue persists for me.

telegraf -version
Telegraf unknown (git: master ce507e52)

The plugin has been running for approx. 40mins:
[screenshot]

Same configuration as above.

@prydin
Contributor

prydin commented Feb 13, 2019

Could you run it with debug and send some logs, please?

@prydin
Contributor

prydin commented Feb 13, 2019

Also, I'm running some scale tests with that exact version to check if there might have been something fishy happening in the merge. Stand by!

@danielnelson danielnelson reopened this Feb 13, 2019
@prydin
Contributor

prydin commented Feb 13, 2019

I tried this by ingesting metrics from 7000 VMs into InfluxDB and putting them all on a stacked chart. Looks OK to me. You typically see a lot of jitter on a stacked graph if there's data missing.

[screenshot, 2019-02-13 1:56 PM]

@prydin
Contributor

prydin commented Feb 13, 2019

You may see missing samples in the last "tick" of the graph. This is because vCenter is sometimes late in reporting metrics. They will be picked up in the next collection round and their timestamps will be backdated. This is what it looks like.
[screenshot, 2019-02-13 2:09 PM]

@ghost
Author

ghost commented Feb 13, 2019

You may see missing samples in the last "tick" of the graph. This is because vCenter is sometimes late in reporting metrics. They will be picked up in the next collection round and their timestamps will be backdated. This is what it looks like.
[screenshot, 2019-02-13 2:09 PM]

Thanks for the tests - I suspected that might be what was happening, but in the second screenshot I posted above ("CPU READY") you can see that all of the series have large gaps, some of them for up to 10-15 minutes, which don't seem to get backdated.

I recreated the same stacked graph you have - this is what mine looks like (wouldn't fill(previous) simply mask missing collections?):

[screenshot]

Edit:
I am definitely still seeing missing collections across multiple vCenters, hosts and VMs. Happy to get some debug/logs to you - just let me know what you need.

[screenshot]

@prydin
Contributor

prydin commented Feb 14, 2019

What does your "internal_write.metrics_dropped" look like? I was able to reproduce the problem when I fed the InfluxDB output more data than it could flush out to InfluxDB. That causes the InfluxDB output plugin to drop some samples. Try this query:

SELECT mean("metrics_dropped") AS "mean_metrics_dropped" FROM "telegraf"."autogen"."internal_write" WHERE time > :dashboardTime: GROUP BY time(:interval:) FILL(null)

Here's what mine looks like. Clearly, we have a problem here! I think the fix is to increase the buffer size.

[screenshot, 2019-02-13 7:29 PM]
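
The metrics_dropped query above only returns data if Telegraf's own internal plugin is enabled. A minimal sketch of that input, using stock option names (collect_memstats is optional and shown only for completeness):

[[inputs.internal]]
  ## Also collect Go runtime memstats; the internal_write measurement
  ## referenced in the query above is emitted regardless of this setting.
  collect_memstats = true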

@prydin
Contributor

prydin commented Feb 14, 2019

It looks like you're selecting a metric from a single VM and getting some very strange data. Can you simplify your query to just this:

SELECT mean("demand_average") AS "mean_demand_average" FROM "telegraf"."autogen"."vsphere_vm_cpu" WHERE time > :dashboardTime: AND "vmname"='A vm with a known name' GROUP BY time(:interval:) FILL(null)

(You obviously need to insert a VM name that's valid in your environment).

Here's what I get in an environment with 7000 VMs.

[screenshot, 2019-02-13 7:39 PM]

@ghost
Author

ghost commented Feb 14, 2019

It looks like you're selecting a metric from a single VM and getting some very strange data. Can you simplify your query to just this:

SELECT mean("demand_average") AS "mean_demand_average" FROM "telegraf"."autogen"."vsphere_vm_cpu" WHERE time > :dashboardTime: AND "vmname"='A vm with a known name' GROUP BY time(:interval:) FILL(null)

(You obviously need to insert a VM name that's valid in your environment).

Here's what I get in an environment with 7000 VMs.

[screenshot, 2019-02-13 7:39 PM]

[screenshot]

You can see the large gaps here too. I tried this with different VMs across different vCenters and see the same. I just restarted Telegraf with the internal input configured; give me a few minutes and I'll get a screenshot posted.

@prydin
Contributor

prydin commented Feb 14, 2019

Do you see any errors/warnings in the log?

@ghost
Author

ghost commented Feb 14, 2019

I can't see any dropped Telegraf metrics or dropped writes via the internal measurement.

This is the error/warning log over 24hrs (each colour is a different vCenter):
[screenshot]

The gaps exist consistently, not just when these errors are reported.

Using debug I can see this despite having set both max_query_metrics and max_query_objects to 64:

[screenshot]

I also reconfigured telegraf to point to a local InfluxDB instance as I couldn't rule network issues out, but still see the same behavior.

@prydin
Contributor

prydin commented Feb 14, 2019

How many VMs do you have?

@ghost
Author

ghost commented Feb 14, 2019

vCenter #1: 61
vCenter #2: 1057
vCenter #3: 274
vCenter #4: 280

Separate instance of the plugin:
vCenter #5: 3216

@prydin
Contributor

prydin commented Feb 14, 2019

I've been spending a fair amount of time trying to reproduce this, but haven't come across a smoking gun yet. I'm at a point where further troubleshooting without full debug logs is becoming difficult.

I understand that you can't share full logs here. Perhaps you have a VMware technical account manager, solutions engineer or something like that you trust? Since I'm a VMware employee, we may use that channel if we need to exchange information.

In the meantime, there are a few things you can do:

  1. Issue the following query:
    SELECT mean("gather_count") AS "mean_gather_count" FROM "telegraf"."autogen"."internal_vsphere" WHERE time > :dashboardTime: AND "resourcetype"='vm' GROUP BY time(1m), "vcenter" FILL(0)

This should tell you how many datapoints the plugin found. This number will likely bounce up and down a bit due to delayed metrics, but variations should be less than 1%.

  2. Set up a file output (a minimal sketch follows after this list). This way we can make sure that the problem isn't somewhere between Telegraf and InfluxDB. You should be able to use a series of grep commands to isolate a single metric from a single VM and check if you have any drop-outs.

  3. Issue the following query:
    SELECT mean("gather_duration_ns") AS "mean_gather_duration" FROM "telegraf"."autogen"."internal_vsphere" WHERE time > :dashboardTime: AND "resourcetype"='vm' GROUP BY time(1m), "vcenter" FILL(0)

This should tell you how long the data collection for each resource type is taking (in nanoseconds). If you're getting close to 60s on each collection, there's a risk you're timing out.
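
For the file output mentioned in item 2, a minimal sketch using the stock outputs.file plugin with the InfluxDB line-protocol serializer; the file path is only an example:

[[outputs.file]]
  ## Write a copy of every metric to a local file so drop-outs can be checked
  ## independently of the InfluxDB output. The path below is just an example.
  files = ["/tmp/vsphere-metrics.out"]
  data_format = "influx"

A single series can then be isolated with something like a grep for vsphere_vm_cpu followed by a grep for vmname=<some VM> (VM name is a placeholder), and the timestamps checked for gaps.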

@ghost
Author

ghost commented Feb 15, 2019

I've been spending a fair amount of time trying to reproduce this, but haven't come across a smoking gun yet. I'm at a point where further troubleshooting without full debug logs is becoming difficult.

I understand that you can't share full logs here. Perhaps you have a VMware technical account manager, solutions engineer or something like that you trust? Since I'm a VMware employee, we may use that channel if we need to exchange information.

Understood. We have an open support agreement; I'll engage with our TAM next week. Hopefully it won't be an issue to get logs to you, as I understand how tricky this can be ;-)

In the meantime, there are a few things you can do:

  1. Issue the following query:
    SELECT mean("gather_count") AS "mean_gather_count" FROM "telegraf"."autogen"."internal_vsphere" WHERE time > :dashboardTime: AND "resourcetype"='vm' GROUP BY time(1m), "vcenter" FILL(0)

This should tell you how many datapoints the plugin found. This number will likely bounce up and down a bit due to delayed metrics, but variations should be less than 1%.

[screenshot]

There's a couple of blips, but mostly consistent.

  2. Set up a file output. This way we can make sure that the problem isn't somewhere between Telegraf and InfluxDB. You should be able to use a series of grep commands to isolate a single metric from a single VM and check if you have any drop-outs.

I have generated the output file metrics.txt; it's about 30 minutes of collections. I'll have to analyse this on Monday my time.

  3. Issue the following query:
    SELECT mean("gather_duration_ns") AS "mean_gather_duration" FROM "telegraf"."autogen"."internal_vsphere" WHERE time > :dashboardTime: AND "resourcetype"='vm' GROUP BY time(1m), "vcenter" FILL(0)

This should tell you how long the data collection for each resource type is taking (in nanoseconds). If you're getting close to 60s on each collection, there's a risk you're timing out.

[screenshot]

Times seem to be okay besides the one spike around 14:00, and that spike is on our largest vCenter. Does the fact that our biggest vCenter isn't an appliance affect collections?

@danielnelson danielnelson removed this from the 1.10.0 milestone Feb 26, 2019
@danielnelson danielnelson added the bug (unexpected problem or unintended behavior) label Feb 26, 2019
@ZHumphries

We have the same issue in our environment. Reasonably large, with multiple vCenters (5) and about 7000 VMs.

There are also no errors in the logs and no missed collections (that I can see). I'll run a couple of the debug steps you've asked for previously and see if it's exactly the same.

@prydin
Contributor

prydin commented Mar 8, 2019

@ZHumphries which version are you on? 1.10?

@ZHumphries

@prydin Yes, 1.10, although the issue was there in 1.8 & 1.9 as well. Our vCenters are a mix of 5.5 and 6.5 and the problem shows up on all 5 of them. I didn't have time yesterday to get the _internal telegraf stats set up and get any meaningful data re collection time and number of datapoints. I'll post an update on Monday with the data.

@ZHumphries

There seem to be some gaps in my gathering process (stacked graphs). This is with a 60 or 120 second interval.

Something else I've noticed: data points are only ever at the minute mark; it seems that no matter what my settings are, it's never collecting the 20/40-second-interval realtime metrics. Am I missing something stupidly obvious here?

I don't want to hijack this thread - would I be better off starting a new issue?

[screenshot]

[screenshot]

@prydin
Contributor

prydin commented Mar 14, 2019

I've been trying without any luck to reproduce this in my lab. I would need debug logs to proceed. Any chance you can provide me with that?

@ZHumphries

Do you have an email address or some way I can send them directly to you?

@prydin
Contributor

prydin commented Mar 15, 2019 via email

@pypb

pypb commented Apr 15, 2019

I'm seeing the same problem with gaps in collected metrics here. Telegraf 1.10.1, two vCenters, roughly 5000 VMs in one and 1800 in the other. The difference from you guys is that I'm using a Graphite database, so at least it doesn't seem to be related to the output plugin.

I could provide you with debug logs; I just have to clean out the identifiable bits first, if that would be helpful.

@prydin
Contributor

prydin commented Apr 15, 2019

Debug logs would be helpful if you can provide them!

@pypb

pypb commented Apr 16, 2019

I started collecting some logs yesterday and when looking through them I saw a lot of "[outputs.graphite] buffer fullness: 2000 / 2000 metrics". I figured the buffer was too small and metrics were being dropped because of it, so I increased it - first to 10K, then 100K and finally 1M.

It doesn't really solve the problem, though; it just postpones the point where the buffer gets filled up - for some reason it's not outputting the metrics quickly enough to transmit all of them during a collection interval (1m). I've tried switching from carbon-relay to carbon-relay-ng, and I've even tried a dummy netcat to /dev/null as a receiver - the buffer still gets filled up eventually.

This might mean that my problem is different from bdeam's, unless there's an output buffering/performance issue in the influxdb output plugin as well.

@pypb

pypb commented Apr 17, 2019

I'm still struggling with this. If I set the metric_buffer high enough to avoid drops I get a lot of duplicated metrics; if I lower it, I get drops.

According to the documentation, metric_buffer is only used when output writes fail ("This buffer only fills when writes fail to output plugin(s)"), so for some reason Telegraf seems to think it fails to write to the output and buffers metrics it has already sent. I can't see anything about that in the logs, though. I've tried both outputs.graphite and outputs.socket_writer and the result is the same. I've used netcat as a receiver to collect the raw data and have been able to verify that there are a lot of duplicate metrics in the output.

I have attached a debug log. During this run I got close to a million metrics/min sent to Graphite, which is way more than it should be.

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "60s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 25000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 500000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "60s"

  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  #precision = "60s"

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = true
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  #hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = true


# Configuration for Graphite server to send metrics to
[[outputs.graphite]]
  ## TCP endpoint for your graphite instance.
  ## If multiple endpoints are configured, output will be load balanced.
  ## Only one of the endpoints will be written to with each iteration.
  servers = ["server:62003"]

  ## Prefix metrics name
  prefix = "vmware.telegraf"
  ## Graphite output template
  ## see https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  template = "host.tags.measurement.field"

  ## Enable Graphite tags support
  graphite_tag_support = true

  ## timeout in seconds for the write connection to graphite
  timeout = 60

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

# Read metrics from one or many vCenters
[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ 
    "https://vcenter/sdk"
  ]
  username = "username"
  password = "password"

  ## VMs
  ## Typical VM metrics (if omitted or empty, all metrics are collected)
  # vm_include = [ "/*/vm/**"] # Inventory path to VMs to collect (by default all are collected)
  vm_metric_include = [
    "cpu.costop.summation",
    "cpu.ready.summation",
    "cpu.usage.average",
    "cpu.used.summation",
    "datastore.numberReadAveraged.average",
    "datastore.numberWriteAveraged.average",
    "datastore.read.average",
    "datastore.write.average",
    "datastore.totalReadLatency.average",
    "datastore.totalWriteLatency.average",
    "mem.usage.average",
    "mem.granted.average",
    "mem.active.average",
    "mem.compressed.average",
    "mem.swapped.average",
    "mem.vmmemctl.average",
    "virtualDisk.read.average",
    "virtualDisk.readOIO.latest",
    "virtualDisk.writeOIO.latest",
    "virtualDisk.write.average",
    "virtualDisk.readLatencyUS.latest",
    "virtualDisk.writeLatencyUS.latest",
    "virtualDisk.numberReadAveraged.average",
    "virtualDisk.numberWriteAveraged.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.packetsRx.summation",
    "net.packetsTx.summation"
  ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default

  ## Hosts
  ## Typical host metrics (if omitted or empty, all metrics are collected)
  # host_include = [ "/*/host/**"] # Inventory path to hosts to collect (by default all are collected)
  host_metric_include = [
    "cpu.costop.summation",
    "cpu.ready.summation",
    "cpu.usage.average",
    "cpu.used.summation",
    "datastore.numberReadAveraged.average",
    "datastore.numberWriteAveraged.average",
    "datastore.read.average",
    "datastore.write.average",
    "datastore.totalReadLatency.average",
    "datastore.totalWriteLatency.average",
    "datastore.datastoreReadOIO.latest",
    "datastore.datastoreWriteOIO.latest",
    "mem.granted.average",
    "mem.active.average",
    "mem.compressed.average",
    "mem.vmmemctl.average",
    "mem.usage.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.errorsRx.summation",
    "net.errorsTx.summation",
    "net.packetsRx.summation",
    "net.packetsTx.summation"
  ]
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default

  ## Clusters
  # cluster_include = [ "/*/host/**"] # Inventory path to clusters to collect (by default all are collected)
  # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
  cluster_metric_exclude = [ "*" ] ## Nothing excluded by default
  # cluster_instances = false ## false by default

  ## Datastores
  # cluster_include = [ "/*/datastore/**"] # Inventory path to datastores to collect (by default all are collected)
  # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
  # datastore_metric_exclude = [] ## Nothing excluded by default
  # datastore_instances = false ## false by default

  ## Datacenters
  # datacenter_include = [ "/*/host/**"] # Inventory path to clusters to collect (by default all are collected)
  #datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
  datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
  # datacenter_instances = false ## false by default

  ## Plugin Settings
  ## separator character to use for measurement and field names (default: "_")
  # separator = "_"

  ## number of objects to retrieve per query for realtime resources (vms and hosts)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_objects = 256

  ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_metrics = 256

  ## number of go routines to use for collection and discovery of objects and metrics
  collect_concurrency = 10
  discover_concurrency = 10

  ## whether or not to force discovery of new objects on initial gather call before collecting metrics
  ## when true for large environments this may cause errors for time elapsed while collecting metrics
  ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
  # force_discover_on_init = false

  ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
  object_discovery_interval = "300s"

  ## timeout applies to any of the API requests made to vCenter
  timeout = "30s"

  ## When set to true, all samples are sent as integers. This makes the output
  ## data types backwards compatible with Telegraf 1.9 or lower. Normally all
  ## samples from vCenter, with the exception of percentages, are integer
  ## values, but under some conditions, some averaging takes place internally in
  ## the plugin. Setting this flag to "false" will send values as floats to
  ## preserve the full precision when averaging takes place.
  use_int_samples = true

  ## Optional SSL Config
  # ssl_ca = "/path/to/cafile"
  # ssl_cert = "/path/to/certfile"
  # ssl_key = "/path/to/keyfile"
  ## Use SSL but skip chain & host verification
  insecure_skip_verify = true

telegraf_debug.log

@prydin
Contributor

prydin commented Apr 17, 2019

At least we've found the reason for the missing samples. Setting the buffers too low will definitely cause that. As for duplicates, they can happen occasionally when the plugin is trying to catch samples that vCenter is late in reporting. You seem to use graphite. Are they causing problems for you in graphite? In most tools, duplicate samples are just ignored.

@pypb

pypb commented Apr 17, 2019

They are causing problems because there are so many duplicates that have to be processed by Graphite. If I were to switch to Telegraf from our current vSphere poller, I would probably have to scale out Graphite just to be able to handle all the duplicates coming in.

@prydin
Contributor

prydin commented Apr 17, 2019

That sounds strange. You will see an occasional duplicate, but you shouldn't see any significant number of duplicates. Can you estimate the percentage of duplicates you're seeing?

@pypb

pypb commented Apr 17, 2019

During a ten minute run:

# cat from_telegraf.out | wc -l
10760731
# cat from_telegraf.out | sort -u | wc -l
7448050

That's (10,760,731 − 7,448,050) / 10,760,731 ≈ 31%, so roughly 30% duplicates.

30% doesn't actually explain the high amount of metrics I'm seeing, though; there has to be something more besides the duplicates. I'm collecting the same counters using another poller and it's not even close to the amount of metrics generated by Telegraf. I'll have to compare the outputs and see what Telegraf is collecting that the other one doesn't.

But still, 30% duplicates does seem high to me.

@pypb

pypb commented Apr 17, 2019

Ah, now I know what it is. The other poller eliminates all counters with value 0 and assumes you handle nulls in the dashboard. That, plus 30% duplicates, should account for the high amount of metrics from Telegraf.

@danielnelson
Contributor

danielnelson commented Apr 17, 2019

"This buffer only fills when writes fail to output plugin(s)"

This is no longer accurate; the buffer is filled directly and the outputs send from the buffer. I'll fix the documentation: #5741

For monitoring the metric buffers, I recommend enabling the internal plugin. If you have a high throughput install you may even want to send the results to a secondary location. There are metrics created for the number of dropped metrics on a per output basis.

There has been some talk about adding a threshold processor; this could remove the 0 values for you so you could keep the same strategy.
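
As a rough illustration of the internal-plugin suggestion above, a sketch that enables the plugin and routes only its measurements to a secondary sink. Using a file output here, along with its path and the namepass pattern, is an assumption for illustration rather than anything prescribed in the thread:

[[inputs.internal]]
  ## Emit Telegraf's own agent and plugin statistics (internal_write, internal_gather, ...).
  collect_memstats = false

[[outputs.file]]
  ## Secondary location just for the agent's self-metrics; the path is an example.
  files = ["/var/log/telegraf/self-metrics.out"]
  data_format = "influx"
  ## Route only the internal_* measurements to this output.
  namepass = ["internal*"]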

@pypb

pypb commented Apr 18, 2019

For monitoring the metric buffers, I recommend enabling the internal plugin. If you have a high throughput install you may even want to send the results to a secondary location. There are metrics created for the number of dropped metrics on a per output basis.

That's a very good idea, I'll look into it.

There has been some talk about adding a threshold processor; this could remove the 0 values for you so you could keep the same strategy.

It does save a lot of updates that the database has to do. At least with Graphite this has a big impact due to the one .wsp file per metric design.

@ghost
Author

ghost commented Jun 28, 2019

Still not sure as to the exact root cause of this. I spun up a Telegraf 1.11 instance and a 1.7.4 InfluxDB instance on a separate RHEL7 server and left them all at default configuration. I then enabled the vSphere plugin and noticed 'Dropped metrics' from Telegraf's 'Internal' plugin:

[screenshot]

I fixed this by adjusting the Telegraf [agent] metric_batch_size up to 20000 and the metric_buffer_limit to 300000 - no gaps since :-)

Happy to close this from my perspective.
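
For reference, a minimal sketch of the [agent] change described in the comment above, using the values quoted there (the Telegraf defaults are 1000 and 10000):

[agent]
  ## Larger write batches and a much larger buffer than the defaults, so bursts
  ## from the vSphere input are not dropped between flushes.
  metric_batch_size = 20000
  metric_buffer_limit = 300000

The follow-up comment below suggests keeping metric_batch_size at 10000 or less.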

@danielnelson
Contributor

I'd probably adjust metric_batch_size down to no more than 10k; this just modifies how large each request to InfluxDB is, and if it's too large you might run into request timeouts. The metric_buffer_limit setting seems about right.

We also now log when metrics are dropped, though I still recommend using the internal plugin for almost everyone. I think this logging will solve being surprised by dropped metrics, so I think we can close this now.
