
vSphere Input Plugin intermittent silent collection gaps for some hosts & virtual machines #5421

Closed
ghost opened this issue Feb 12, 2019 · 39 comments
Labels
area/vsphere, bug (unexpected problem or unintended behavior)

Comments

@ghost

ghost commented Feb 12, 2019

Relevant telegraf.conf:

System info:

Telegraf 1.9.0-1 (RPM package installed)
RHEL 7.5 Maipo (32GB RAM, 8x 2.2GHz CPU)

Steps to reproduce:

  1. Enable Telegraf.
  2. 'journalctl -f -t telegraf' shows no instances of '[agent] input "inputs.vsphere" did not complete within its interval'.
  3. Missing data / gaps in metrics appear for some ESX Hosts and Virtual Machines.

Expected behavior:

Collection of all listed metrics for Hosts & VMs is successful, and where it is not, a failure message is logged.

Actual behavior:

Intermittent gaps in metrics for some hosts & VMs are observed, as shown below:
[screenshot]

Additional info:

I have tried increasing and decreasing the goroutine counts and increasing the interval to 2 minutes or more, with no success. I have left the collector running for over 12 hours with no change in behavior.

Configuration for telegraf agent

[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = true
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

# # Read metrics from VMware vCenter
 [[inputs.vsphere]]
   ## List of vCenter URLs to be monitored. These three lines must be uncommented
   ## and edited for the plugin to work.
   vcenters = ["https://london22s304.generic.com/sdk", "https://test04.generic.com/sdk", "https://test06.generic.com/sdk", "https://test101.generic.com/sdk" ]
   username = "acc@generic.com"
   password = "password"
   interval = "60s"

   ## VMs
   ## Typical VM metrics (if omitted or empty, all metrics are collected)
   vm_metric_include = [
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.run.summation",
     "cpu.usagemhz.average",
     "cpu.usage.average",
     "cpu.used.summation",
     "cpu.wait.summation",
     "mem.active.average",
     "mem.granted.average",
     "mem.latency.average",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.usage.average",
     "power.power.average",
     "virtualDisk.numberReadAveraged.average",
     "virtualDisk.numberWriteAveraged.average",
     "virtualDisk.read.average",
     "virtualDisk.readOIO.latest",
     "virtualDisk.throughput.usage.average",
     "virtualDisk.totalReadLatency.average",
     "virtualDisk.totalWriteLatency.average",
     "virtualDisk.write.average",
     "virtualDisk.writeOIO.latest",
     "sys.uptime.latest",
   ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default

   ## Hosts
   ## Typical host metrics (if omitted or empty, all metrics are collected)
   host_metric_include = [
     "cpu.coreUtilization.average",
     "cpu.costop.summation",
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.swapwait.summation",
     "cpu.usage.average",
     "cpu.usagemhz.average",
     "cpu.used.summation",
     "cpu.utilization.average",
     "cpu.wait.summation",
     "disk.deviceReadLatency.average",
     "disk.deviceWriteLatency.average",
     "disk.kernelReadLatency.average",
     "disk.kernelWriteLatency.average",
     "disk.numberReadAveraged.average",
     "disk.numberWriteAveraged.average",
     "disk.read.average",
     "disk.totalReadLatency.average",
     "disk.totalWriteLatency.average",
     "disk.write.average",
     "mem.active.average",
     "mem.latency.average",
     "mem.state.latest",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.totalCapacity.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.errorsRx.summation",
     "net.errorsTx.summation",
     "net.usage.average",
     "power.power.average",
     "storageAdapter.numberReadAveraged.average",
     "storageAdapter.numberWriteAveraged.average",
     "storageAdapter.read.average",
     "storageAdapter.write.average",
     "sys.uptime.latest",
   ]
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default
   ## Clusters
   # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
   cluster_metric_exclude = ["*"] ## Nothing excluded by default
   # cluster_instances = true ## true by default

   ## Datastores
   # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
   datastore_metric_exclude = ["*"] ## Nothing excluded by default
   # datastore_instances = false ## false by default for Datastores only

   ## Datacenters
   # datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
   datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
   # datacenter_instances = false ## false by default for Datastores only

   ## Plugin Settings
   ## separator character to use for measurement and field names (default: "_") 
     # separator = "_"

   ## number of objects to retrieve per query for realtime resources (vms and hosts)
   ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
   max_query_objects = 64

   ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
   ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
   max_query_metrics = 64

   ## number of go routines to use for collection and discovery of objects and metrics
   collect_concurrency = 5
   discover_concurrency = 5

   ## whether or not to force discovery of new objects on initial gather call before collecting metrics
   ## when true for large environments this may cause errors for time elapsed while collecting metrics
   ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
   #   force_discover_on_init = false

   ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
   # object_discovery_interval = "300s"

   ## timeout applies to any of the API requests made to vCenter
   timeout = "60s"

   ## Optional SSL Config
   # ssl_ca = "/path/to/cafile"
   # ssl_cert = "/path/to/certfile"
   # ssl_key = "/path/to/keyfile"
   ## Use SSL but skip chain & host verification
   insecure_skip_verify = true

   
  ## This is a Windows Server vCenter, not an appliance.
 [[inputs.vsphere]]
   ## List of vCenter URLs to be monitored. These three lines must be uncommented
   ## and edited for the plugin to work.
   vcenters = ["https://london22s302.generic.com/sdk" ]
   username = "acc@generic.com"
   password = "password"
   interval = "60s"

   ## VMs
   ## Typical VM metrics (if omitted or empty, all metrics are collected)
   vm_metric_include = [
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.run.summation",
     "cpu.usagemhz.average",
     "cpu.usage.average",
     "cpu.used.summation",
     "cpu.wait.summation",
     "mem.active.average",
     "mem.granted.average",
     "mem.latency.average",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.usage.average",
     "power.power.average",
     "virtualDisk.numberReadAveraged.average",
     "virtualDisk.numberWriteAveraged.average",
     "virtualDisk.read.average",
     "virtualDisk.readOIO.latest",
     "virtualDisk.throughput.usage.average",
     "virtualDisk.totalReadLatency.average",
     "virtualDisk.totalWriteLatency.average",
     "virtualDisk.write.average",
     "virtualDisk.writeOIO.latest",
     "sys.uptime.latest",
   ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default

   ## Hosts
   ## Typical host metrics (if omitted or empty, all metrics are collected)
   host_metric_include = [
     "cpu.coreUtilization.average",
     "cpu.costop.summation",
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.swapwait.summation",
     "cpu.usage.average",
     "cpu.usagemhz.average",
     "cpu.used.summation",
     "cpu.utilization.average",
     "cpu.wait.summation",
     "disk.deviceReadLatency.average",
     "disk.deviceWriteLatency.average",
     "disk.kernelReadLatency.average",
     "disk.kernelWriteLatency.average",
     "disk.numberReadAveraged.average",
     "disk.numberWriteAveraged.average",
     "disk.read.average",
     "disk.totalReadLatency.average",
     "disk.totalWriteLatency.average",
     "disk.write.average",
     "mem.active.average",
     "mem.latency.average",
     "mem.state.latest",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.totalCapacity.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.errorsRx.summation",
     "net.errorsTx.summation",
     "net.usage.average",
     "power.power.average",
     "storageAdapter.numberReadAveraged.average",
     "storageAdapter.numberWriteAveraged.average",
     "storageAdapter.read.average",
     "storageAdapter.write.average",
     "sys.uptime.latest",
   ]
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default
   ## Clusters
   # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
   cluster_metric_exclude = ["*"] ## Nothing excluded by default
   # cluster_instances = true ## true by default

   ## Datastores
   # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
   datastore_metric_exclude = ["*"] ## Nothing excluded by default
   # datastore_instances = false ## false by default for Datastores only

   ## Datacenters
   # datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
   datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
   # datacenter_instances = false ## false by default for Datastores only

   ## Plugin Settings
   ## separator character to use for measurement and field names (default: "_")
   # separator = "_"

   ## number of objects to retrieve per query for realtime resources (vms and hosts)
   ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
   max_query_objects = 64

   ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
   ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
   max_query_metrics = 64

   ## number of go routines to use for collection and discovery of objects and metrics
   collect_concurrency = 5
   discover_concurrency = 5

   ## whether or not to force discovery of new objects on initial gather call before collecting metrics
   ## when true for large environments this may cause errors for time elapsed while collecting metrics
   ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
   # force_discover_on_init = false

   ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
   # object_discovery_interval = "300s"

   ## timeout applies to any of the API requests made to vCenter
   timeout = "60s"

   ## Optional SSL Config
   # ssl_ca = "/path/to/cafile"
   # ssl_cert = "/path/to/certfile"
   # ssl_key = "/path/to/keyfile"
   ## Use SSL but skip chain & host verification
   insecure_skip_verify = true

@prydin
Contributor

prydin commented Feb 13, 2019

This should be addressed in 1.10, which is scheduled to be released next week.

@ghost
Author

ghost commented Feb 13, 2019

@prydin
Fastest response time in history... thanks!

Any idea what causes this? We were previously utilising Oxalide's vmware collector with no issues. Just curious is all :-)

@danielnelson danielnelson added this to the 1.10.0 milestone Feb 13, 2019
@danielnelson
Contributor

@bdeam It would be a great help if you could test out the 1.10 before the final release. Here is the latest automated build, with some additional improvements to the vSphere plugin from today. We will also probably do a release candidate next week:

@ghost
Author

ghost commented Feb 13, 2019

@bdeam It would be a great help if you could test out the 1.10 before the final release. Here is the latest automated build, with some additional improvements to the vSphere plugin from today. We will also probably do a release candidate next week:

Will do!

@ghost
Author

ghost commented Feb 13, 2019

Seems as though the issue persists for me.

telegraf -version
Telegraf unknown (git: master ce507e52)

The plugin has been running for approx. 40mins:
[screenshot]

Same configuration as above.

@prydin
Contributor

prydin commented Feb 13, 2019

Could you run it with debug and send some logs, please?

@prydin
Contributor

prydin commented Feb 13, 2019

Also, I'm running some scale tests with that exact version to check if there might have been something fishy happening in the merge. Stand by!

@danielnelson danielnelson reopened this Feb 13, 2019
@prydin
Contributor

prydin commented Feb 13, 2019

I tried this by ingesting metrics from 7000 VMs into InfluxDB and putting them all on a stacked chart. Looks OK to me. You typically see a lot of jitter on a stacked graph if there's data missing.

[screenshot, 2019-02-13 1:56 PM]

@prydin
Contributor

prydin commented Feb 13, 2019

You may see missing samples in the last "tick" of the graph. This is because vCenter is sometimes late in reporting metrics. They will be picked up in the next collection round and their timestamps will be backdated. This is what it looks like.
[screenshot, 2019-02-13 2:09 PM]

@ghost
Author

ghost commented Feb 13, 2019

You may see missing samples in the last "tick" of the graph. This is because vCenter is sometimes late in reporting metrics. They will be picked up in the next collection round and their timestamps will be backdated. This is what it looks like.
[screenshot, 2019-02-13 2:09 PM]

Thanks for the tests - I suspected that might be what was happening, but in the second screenshot I posted above ("CPU READY") you can see that all of the series have large gaps, some of them for up to 10-15 minutes, which don't seem to get backdated.

I recreated the same stacked graph you have - this is what mine looks like (wouldn't fill(previous) simply mask missing collections?):

[screenshot]

Edit:
I am definitely still seeing missing collections across multiple vCenters, hosts and VMs. Happy to get some debug/logs to you - just let me know what you need.

[screenshot]

@prydin
Contributor

prydin commented Feb 14, 2019

What does your "internal_write.metrics_dropped" look like? I was able to reproduce the problem when I fed the InfluxDB output more data than it could flush out to InfluxDB. That causes the InfluxDB output plugin to drop some samples. Try this query:

SELECT mean("metrics_dropped") AS "mean_metrics_dropped" FROM "telegraf"."autogen"."internal_write" WHERE time > :dashboardTime: GROUP BY time(:interval:) FILL(null)

Here's what mine looks like. Clearly, we have a problem here! I think the fix is to increase the buffer size.

[screenshot, 2019-02-13 7:29 PM]
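
The metrics_dropped query above only returns data if Telegraf's own internal plugin is enabled. A minimal sketch of that input, using stock option names (collect_memstats is optional and shown only for completeness):

[[inputs.internal]]
  ## Also collect Go runtime memstats; the internal_write measurement
  ## referenced in the query above is emitted regardless of this setting.
  collect_memstats = true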

@prydin
Contributor

prydin commented Feb 14, 2019

It looks like you're selecting a metric from a single VM and getting some very strange data. Can you simplify your query to just this:

SELECT mean("demand_average") AS "mean_demand_average" FROM "telegraf"."autogen"."vsphere_vm_cpu" WHERE time > :dashboardTime: AND "vmname"='A vm with a known name' GROUP BY time(:interval:) FILL(null)

(You obviously need to insert a VM name that's valid in your environment).

Here's what I get in an environment with 7000 VMs.

[screenshot, 2019-02-13 7:39 PM]

@ghost
Author

ghost commented Feb 14, 2019

It looks like you're selecting a metric from a single VM and getting some very strange data. Can you simplify your query to just this:

SELECT mean("demand_average") AS "mean_demand_average" FROM "telegraf"."autogen"."vsphere_vm_cpu" WHERE time > :dashboardTime: AND "vmname"='A vm with a known name' GROUP BY time(:interval:) FILL(null)

(You obviously need to insert a VM name that's valid in your environment).

Here's what I get in an environment with 7000 VMs.

[screenshot, 2019-02-13 7:39 PM]

[screenshot]

You can see the large gaps here too. I tried this with different VMs across different vCenters and see the same. I just restarted Telegraf with the internal input configured; give me a few minutes and I'll get a screenshot posted.

@prydin
Contributor

prydin commented Feb 14, 2019

Do you see any errors/warnings in the log?

@ghost
Author

ghost commented Feb 14, 2019

I can't see any dropped Telegraf metrics or dropped writes via the internal measurement.

This is the error/warning log over 24hrs (each colour is a different vCenter):
[screenshot]

The gaps exist consistently, not just when these errors are reported.

Using debug I can see this despite having set both max_query_metrics and max_query_objects to 64:

[screenshot]

I also reconfigured telegraf to point to a local InfluxDB instance as I couldn't rule network issues out, but still see the same behavior.

@prydin
Contributor

prydin commented Feb 14, 2019

How many VMs do you have?

@ghost
Author

ghost commented Feb 14, 2019

vCenter #1: 61
vCenter #2: 1057
vCenter #3: 274
vCenter #4: 280

Separate instance of the plugin:
vCenter #5: 3216

@prydin
Contributor

prydin commented Feb 14, 2019

I've been spending a fair amount of time trying to reproduce this, but haven't come across a smoking gun yet. I'm at a point where further troubleshooting without full debug logs is becoming difficult.

I understand that you can't share full logs here. Perhaps you have a VMware technical account manager, solutions engineer or something like that you trust? Since I'm a VMware employee, we may use that channel if we need to exchange information.

In the meantime, there are a few things you can do:

  1. Issue the following query:
    SELECT mean("gather_count") AS "mean_gather_count" FROM "telegraf"."autogen"."internal_vsphere" WHERE time > :dashboardTime: AND "resourcetype"='vm' GROUP BY time(1m), "vcenter" FILL(0)

This should tell you how many datapoints the plugin found. This number will likely bounce up and down a bit due to delayed metrics, but variations should be less than 1%.

  2. Set up a file output (a minimal sketch follows after this list). This way we can make sure that the problem isn't somewhere between Telegraf and InfluxDB. You should be able to use a series of grep commands to isolate a single metric from a single VM and check if you have any drop-outs.

  3. Issue the following query:
    SELECT mean("gather_duration_ns") AS "mean_gather_duration" FROM "telegraf"."autogen"."internal_vsphere" WHERE time > :dashboardTime: AND "resourcetype"='vm' GROUP BY time(1m), "vcenter" FILL(0)

This should tell you how long the data collection for each resource type is taking (in nanoseconds). If you're getting close to 60s on each collection, there's a risk you're timing out.
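
For the file output mentioned in item 2, a minimal sketch using the stock outputs.file plugin with the InfluxDB line-protocol serializer; the file path is only an example:

[[outputs.file]]
  ## Write a copy of every metric to a local file so drop-outs can be checked
  ## independently of the InfluxDB output. The path below is just an example.
  files = ["/tmp/vsphere-metrics.out"]
  data_format = "influx"

A single series can then be isolated with something like a grep for vsphere_vm_cpu followed by a grep for vmname=<some VM> (VM name is a placeholder), and the timestamps checked for gaps.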

@ghost
Author

ghost commented Feb 15, 2019

I've been spending a fair amount of time trying to reproduce this, but haven't come across a smoking gun yet. I'm at a point where further troubleshooting without full debug logs is becoming difficult.

I understand that you can't share full logs here. Perhaps you have a VMware technical account manager, solutions engineer or something like that you trust? Since I'm a VMware employee, we may use that channel if we need to exchange information.

Understood. We have an open support agreement; I'll engage with our TAM next week. Hopefully it won't be an issue to get logs to you, as I understand how tricky this can be ;-)

In the meantime, there are a few things you can do:

  1. Issue the following query:
    SELECT mean("gather_count") AS "mean_gather_count" FROM "telegraf"."autogen"."internal_vsphere" WHERE time > :dashboardTime: AND "resourcetype"='vm' GROUP BY time(1m), "vcenter" FILL(0)

This should tell you how many datapoints the plugin found. This number will likely bounce up and down a bit due to delayed metrics, but variations should be less than 1%.

[screenshot]

There's a couple of blips, but mostly consistent.

  2. Set up a file output. This way we can make sure that the problem isn't somewhere between Telegraf and InfluxDB. You should be able to use a series of grep commands to isolate a single metric from a single VM and check if you have any drop-outs.

I have generated the output file metrics.txt; it's about 30 minutes of collections. I'll have to analyse this on Monday my time.

  3. Issue the following query:
    SELECT mean("gather_duration_ns") AS "mean_gather_duration" FROM "telegraf"."autogen"."internal_vsphere" WHERE time > :dashboardTime: AND "resourcetype"='vm' GROUP BY time(1m), "vcenter" FILL(0)

This should tell you how long the data collection for each resource type is taking (in nanoseconds). If you're getting close to 60s on each collection, there's a risk you're timing out.

[screenshot]

Times seem to be okay besides the one spike around 14:00, and that spike is on our largest vCenter. Does the fact that our biggest vCenter isn't an appliance affect collections?

@danielnelson danielnelson removed this from the 1.10.0 milestone Feb 26, 2019
@danielnelson danielnelson added the bug (unexpected problem or unintended behavior) label Feb 26, 2019
@ZHumphries

We have the same issue in our environment. Reasonably large, with multiple vCenters (5) and about 7000 VMs.

There are also no errors in the logs and no missed collections (that I can see). I'll run a couple of the debug steps you've asked for previously and see if it's exactly the same.

@prydin
Contributor

prydin commented Mar 8, 2019

@ZHumphries which version are you on? 1.10?

@ZHumphries

@prydin Yes, 1.10, although the issue was there in 1.8 & 1.9 as well. Our vCenters are a mix of 5.5 and 6.5 and the problem shows up on all 5 of them. I didn't have time yesterday to get the _internal telegraf stats set up and get any meaningful data re collection time and number of datapoints. I'll post an update on Monday with the data.

@ZHumphries

There seem to be some gaps in my gathering process (stacked graphs). This is with a 60 or 120 second interval.

Something else I've noticed: data points are only ever at the minute mark; it seems that no matter what my settings are, it's never collecting the 20/40-second-interval realtime metrics. Am I missing something stupidly obvious here?

I don't want to hijack this thread - would I be better off starting a new issue?

[screenshot]

[screenshot]

@prydin
Contributor

prydin commented Mar 14, 2019

I've been trying without any luck to reproduce this in my lab. I would need debug logs to proceed. Any chance you can provide me with that?

@ZHumphries

Do you have an email address or some way I can send them directly to you?

@prydin
Contributor

prydin commented Mar 15, 2019 via email

@pypb

pypb commented Apr 15, 2019

I'm seeing the same problem with gaps in collected metrics here. Telegraf 1.10.1, two vCenters, roughly 5000 VMs in one and 1800 in the other. The difference from you guys is that I'm using a Graphite database, so at least it doesn't seem to be related to the output plugin.

I could provide you with debug logs; I just have to clean out the identifiable bits first, if that would be helpful.

@prydin
Contributor

prydin commented Apr 15, 2019

Debug logs would be helpful if you can provide them!

@pypb

pypb commented Apr 16, 2019

I started collecting some logs yesterday and when looking through them I saw a lot of "[outputs.graphite] buffer fullness: 2000 / 2000 metrics". I figured the buffer was too small and metrics were being dropped because of it, so I increased it - first to 10K, then 100K and finally 1M.

It doesn't really solve the problem, though; it just postpones the point where the buffer gets filled up - for some reason it's not outputting the metrics quickly enough to transmit all of them during a collection interval (1m). I've tried switching from carbon-relay to carbon-relay-ng, and I've even tried a dummy netcat to /dev/null as a receiver - the buffer still gets filled up eventually.

This might mean that my problem is different from bdeam's, unless there's an output buffering/performance issue in the influxdb output plugin as well.

@pypb

pypb commented Apr 17, 2019

I'm still struggling with this. If I set the metric_buffer high enough to avoid drops I get a lot of duplicated metrics; if I lower it, I get drops.

According to the documentation, metric_buffer is only used when output writes fail ("This buffer only fills when writes fail to output plugin(s)"), so for some reason Telegraf seems to think it fails to write to the output and buffers metrics it has already sent. I can't see anything about that in the logs, though. I've tried both outputs.graphite and outputs.socket_writer and the result is the same. I've used netcat as a receiver to collect the raw data and have been able to verify that there are a lot of duplicate metrics in the output.

I have attached a debug log. During this run I got close to a million metrics/min sent to Graphite, which is way more than it should be.

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "60s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 25000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 500000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "60s"

  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  #precision = "60s"

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = true
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  #hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = true


# Configuration for Graphite server to send metrics to
[[outputs.graphite]]
  ## TCP endpoint for your graphite instance.
  ## If multiple endpoints are configured, output will be load balanced.
  ## Only one of the endpoints will be written to with each iteration.
  servers = ["server:62003"]

  ## Prefix metrics name
  prefix = "vmware.telegraf"
  ## Graphite output template
  ## see https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  template = "host.tags.measurement.field"

  ## Enable Graphite tags support
  graphite_tag_support = true

  ## timeout in seconds for the write connection to graphite
  timeout = 60

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

# Read metrics from one or many vCenters
[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ 
    "https://vcenter/sdk"
  ]
  username = "username"
  password = "password"

  ## VMs
  ## Typical VM metrics (if omitted or empty, all metrics are collected)
  # vm_include = [ "/*/vm/**"] # Inventory path to VMs to collect (by default all are collected)
  vm_metric_include = [
    "cpu.costop.summation",
    "cpu.ready.summation",
    "cpu.usage.average",
    "cpu.used.summation",
    "datastore.numberReadAveraged.average",
    "datastore.numberWriteAveraged.average",
    "datastore.read.average",
    "datastore.write.average",
    "datastore.totalReadLatency.average",
    "datastore.totalWriteLatency.average",
    "mem.usage.average",
    "mem.granted.average",
    "mem.active.average",
    "mem.compressed.average",
    "mem.swapped.average",
    "mem.vmmemctl.average",
    "virtualDisk.read.average",
    "virtualDisk.readOIO.latest",
    "virtualDisk.writeOIO.latest",
    "virtualDisk.write.average",
    "virtualDisk.readLatencyUS.latest",
    "virtualDisk.writeLatencyUS.latest",
    "virtualDisk.numberReadAveraged.average",
    "virtualDisk.numberWriteAveraged.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.packetsRx.summation",
    "net.packetsTx.summation"
  ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default

  ## Hosts
  ## Typical host metrics (if omitted or empty, all metrics are collected)
  # host_include = [ "/*/host/**"] # Inventory path to hosts to collect (by default all are collected)
  host_metric_include = [
    "cpu.costop.summation",
    "cpu.ready.summation",
    "cpu.usage.average",
    "cpu.used.summation",
    "datastore.numberReadAveraged.average",
    "datastore.numberWriteAveraged.average",
    "datastore.read.average",
    "datastore.write.average",
    "datastore.totalReadLatency.average",
    "datastore.totalWriteLatency.average",
    "datastore.datastoreReadOIO.latest",
    "datastore.datastoreWriteOIO.latest",
    "mem.granted.average",
    "mem.active.average",
    "mem.compressed.average",
    "mem.vmmemctl.average",
    "mem.usage.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.errorsRx.summation",
    "net.errorsTx.summation",
    "net.packetsRx.summation",
    "net.packetsTx.summation"
  ]
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default

  ## Clusters
  # cluster_include = [ "/*/host/**"] # Inventory path to clusters to collect (by default all are collected)
  # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
  cluster_metric_exclude = [ "*" ] ## Nothing excluded by default
  # cluster_instances = false ## false by default

  ## Datastores
  # cluster_include = [ "/*/datastore/**"] # Inventory path to datastores to collect (by default all are collected)
  # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
  # datastore_metric_exclude = [] ## Nothing excluded by default
  # datastore_instances = false ## false by default

  ## Datacenters
  # datacenter_include = [ "/*/host/**"] # Inventory path to clusters to collect (by default all are collected)
  #datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
  datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
  # datacenter_instances = false ## false by default

  ## Plugin Settings
  ## separator character to use for measurement and field names (default: "_")
  # separator = "_"

  ## number of objects to retrieve per query for realtime resources (vms and hosts)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_objects = 256

  ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_metrics = 256

  ## number of go routines to use for collection and discovery of objects and metrics
  collect_concurrency = 10
  discover_concurrency = 10

  ## whether or not to force discovery of new objects on initial gather call before collecting metrics
  ## when true for large environments this may cause errors for time elapsed while collecting metrics
  ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
  # force_discover_on_init = false

  ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
  object_discovery_interval = "300s"

  ## timeout applies to any of the API requests made to vCenter
  timeout = "30s"

  ## When set to true, all samples are sent as integers. This makes the output
  ## data types backwards compatible with Telegraf 1.9 or lower. Normally all
  ## samples from vCenter, with the exception of percentages, are integer
  ## values, but under some conditions, some averaging takes place internally in
  ## the plugin. Setting this flag to "false" will send values as floats to
  ## preserve the full precision when averaging takes place.
  use_int_samples = true

  ## Optional SSL Config
  # ssl_ca = "/path/to/cafile"
  # ssl_cert = "/path/to/certfile"
  # ssl_key = "/path/to/keyfile"
  ## Use SSL but skip chain & host verification
  insecure_skip_verify = true

telegraf_debug.log

@prydin
Contributor

prydin commented Apr 17, 2019

At least we've found the reason for the missing samples. Setting the buffers too low will definitely cause that. As for duplicates, they can happen occasionally when the plugin is trying to catch samples that vCenter is late in reporting. You seem to use graphite. Are they causing problems for you in graphite? In most tools, duplicate samples are just ignored.

@pypb

pypb commented Apr 17, 2019

They are causing problems because there are so many duplicates that have to be processed by Graphite. If I were to switch to Telegraf from our current vSphere poller, I would probably have to scale out Graphite just to be able to handle all the duplicates coming in.

@prydin
Contributor

prydin commented Apr 17, 2019

That sounds strange. You will see an occasional duplicate, but you shouldn't see any significant number of duplicates. Can you estimate the percentage of duplicates you're seeing?

@pypb

pypb commented Apr 17, 2019

During a ten minute run:

# cat from_telegraf.out | wc -l
10760731
# cat from_telegraf.out | sort -u | wc -l
7448050

That's (10,760,731 − 7,448,050) / 10,760,731 ≈ 31%, so roughly 30% duplicates.

30% doesn't actually explain the high amount of metrics I'm seeing, though; there has to be something more besides the duplicates. I'm collecting the same counters using another poller and it's not even close to the amount of metrics generated by Telegraf. I'll have to compare the outputs and see what Telegraf is collecting that the other one doesn't.

But still, 30% duplicates does seem high to me.

@pypb

pypb commented Apr 17, 2019

Ah, now I know what it is. The other poller eliminates all counters with value 0 and assumes you handle nulls in the dashboard. That, plus 30% duplicates, should account for the high amount of metrics from Telegraf.

@danielnelson
Contributor

danielnelson commented Apr 17, 2019

"This buffer only fills when writes fail to output plugin(s)"

This is no longer accurate; the buffer is filled directly and the outputs send from the buffer. I'll fix the documentation: #5741

For monitoring the metric buffers, I recommend enabling the internal plugin. If you have a high throughput install you may even want to send the results to a secondary location. There are metrics created for the number of dropped metrics on a per output basis.

There has been some talk about adding a threshold processor; this could remove the 0 values for you so you could keep the same strategy.
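
As a rough illustration of the internal-plugin suggestion above, a sketch that enables the plugin and routes only its measurements to a secondary sink. Using a file output here, along with its path and the namepass pattern, is an assumption for illustration rather than anything prescribed in the thread:

[[inputs.internal]]
  ## Emit Telegraf's own agent and plugin statistics (internal_write, internal_gather, ...).
  collect_memstats = false

[[outputs.file]]
  ## Secondary location just for the agent's self-metrics; the path is an example.
  files = ["/var/log/telegraf/self-metrics.out"]
  data_format = "influx"
  ## Route only the internal_* measurements to this output.
  namepass = ["internal*"]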

@pypb

pypb commented Apr 18, 2019

For monitoring the metric buffers, I recommend enabling the internal plugin. If you have a high throughput install you may even want to send the results to a secondary location. There are metrics created for the number of dropped metrics on a per output basis.

That's a very good idea, I'll look into it.

There has been some talk about adding a threshold processor; this could remove the 0 values for you so you could keep the same strategy.

It does save a lot of updates that the database has to do. At least with Graphite this has a big impact due to the one .wsp file per metric design.

@ghost
Author

ghost commented Jun 28, 2019

Still not sure as to the exact root cause of this. I spun up a Telegraf 1.11 instance and a 1.7.4 InfluxDB instance on a separate RHEL7 server and left them all at default configuration. I then enabled the vSphere plugin and noticed 'Dropped metrics' from Telegraf's 'Internal' plugin:

[screenshot]

I fixed this by adjusting the Telegraf [agent] metric_batch_size up to 20000 and the metric_buffer_limit to 300000 - no gaps since :-)

Happy to close this from my perspective.
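
For reference, a minimal sketch of the [agent] change described in the comment above, using the values quoted there (the Telegraf defaults are 1000 and 10000):

[agent]
  ## Larger write batches and a much larger buffer than the defaults, so bursts
  ## from the vSphere input are not dropped between flushes.
  metric_batch_size = 20000
  metric_buffer_limit = 300000

The follow-up comment below suggests keeping metric_batch_size at 10000 or less.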

@danielnelson
Contributor

I'd probably adjust metric_batch_size down to no more than 10k; this just modifies how large each request to InfluxDB is, and if it's too large you might run into request timeouts. The metric_buffer_limit setting seems about right.

We also now log when metrics are dropped, though I still recommend using the internal plugin for almost everyone. I think this logging will solve being surprised by dropped metrics, so I think we can close this now.
