vSphere Input Plugin intermittent silent collection gaps for some hosts & virtual machines #5421
Comments
This should be addressed in 1.10, which is scheduled to be released next week. |
@prydin Any idea what causes this? We were previously utilising Oxalide's vmware collector with no issues. Just curious is all :-) |
@bdeam It would be a great help if you could test out the 1.10 before the final release. Here is the latest automated build, with some additional improvements to the vSphere plugin from today. We will also probably do a release candidate next week: |
Will do! |
Could you run it with debug and send some logs, please? |
Also, I'm running some scale tests with that exact version to check if there might have been something fishy happening in the merge. Stand by! |
What does your "internal_write.metrics_dropped" look like? I was able to reproduce the problem when I fed the InfluxDB output more data than it could flush out to InfluxDB. That causes the InfluxDB output plugin to drop some samples. Try this query:
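(The query itself didn't survive this copy of the thread. A minimal sketch of the kind of InfluxQL meant here, assuming Telegraf's internal plugin is enabled and writing into the same InfluxDB; measurement and field names are the internal plugin's, the time ranges are illustrative:)

```sql
-- Per-output dropped-metric counts over the last hour. metrics_dropped is cumulative,
-- so take the derivative to see when drops actually happen.
SELECT non_negative_derivative(max("metrics_dropped"), 1m)
FROM "internal_write"
WHERE time > now() - 1h
GROUP BY time(1m), "output"
```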
Here's what mine looks like. Clearly, we have a problem here! I think the fix is to increase the buffer size. |
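(The buffer being referred to is the agent-level metric buffer in telegraf.conf; a minimal sketch, with an illustrative value rather than a number recommended anywhere in this thread:)

```toml
[agent]
  ## Default is 10000; raise it if internal_write.metrics_dropped keeps climbing.
  metric_buffer_limit = 100000
```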
It looks like you're selecting a metric from a single VM and getting some very strange data. Can you simplify your query to just this:
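(The simplified query was lost in this copy as well. A hedged sketch of what a single-VM query against the vSphere plugin's output could look like; the measurement, field, and VM name below are illustrative:)

```sql
-- CPU usage for one VM; replace 'my-vm-01' with a VM name valid in your environment.
SELECT mean("usage_average")
FROM "vsphere_vm_cpu"
WHERE "vmname" = 'my-vm-01' AND time > now() - 6h
GROUP BY time(1m) fill(null)
```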
(You obviously need to insert a VM name that's valid in your environment). Here's what I get in an environment with 7000 VMs. |
You can see the large gaps here too. I tried this with different VMs across different vCenters and see the same. I just restarted Telegraf with the internal input configured - give me a few minutes and I'll get a screenshot posted. |
Do you see any errors/warnings in the log? |
How many VMs do you have? |
I've been spending a fair amount of time trying to reproduce this, but haven't come across a smoking gun yet. I'm at a point where further troubleshooting without full debug logs is becoming difficult. I understand that you can't share full logs here. Perhaps you have a VMware technical account manager, solutions engineer, or someone like that whom you trust? Since I'm a VMware employee, we may use that channel if we need to exchange information. In the meantime, there are a few things you can do:
This should tell you how many datapoints the plugin found. This number will likely bounce up and down a bit due to delayed metrics, but variations should be less than 1%.
This should tell you how long the data collection for each resource type is taking (in nanoseconds). If you're getting close to 60s on each collection, there's a risk you're timing out. |
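(The concrete steps these two checks refer to were lost in this copy. Assuming the internal plugin is enabled and its data lands in InfluxDB, queries along these lines should show the same thing; the internal_vsphere measurement and its gather_count / gather_duration_ns fields are the plugin's self-stats as I understand them, so verify the exact names against your own data:)

```sql
-- Datapoints found per collection cycle; expect only small (<1%) variation.
SELECT sum("gather_count")
FROM "internal_vsphere"
WHERE time > now() - 1h
GROUP BY time(1m)

-- Collection time in nanoseconds; values approaching 60s suggest timeouts.
-- Add your vCenter / resource-type tags to the GROUP BY for a per-resource view.
SELECT max("gather_duration_ns")
FROM "internal_vsphere"
WHERE time > now() - 1h
GROUP BY time(1m)
```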
Understood. We have an open support agreement; I'll engage with our TAM next week. Hopefully it won't be an issue to get logs to you, as I understand how tricky this can be ;-)
There's a couple of blips, but mostly consistent.
I have generated the output file metrics.txt; it's about 30 mins of collections. I'll have to analyse this on Monday my time.
Times seem to be okay besides the one spike around 14:00, and that spike is on our largest vCenter. Does the fact that our biggest vCenter isn't an appliance affect collections? |
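(For anyone reproducing this: dumping everything Telegraf collects to a flat file such as the metrics.txt mentioned above can be done with the file output plugin; a minimal sketch:)

```toml
[[outputs.file]]
  ## Write every collected metric in line protocol for offline analysis.
  files = ["metrics.txt"]
  data_format = "influx"
```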
We have the same issue in our environment. Reasonably large, with multiple vCenters (5) and about 7000 VMs. There are also no errors in the logs and no missed collections (that I can see). I'll run a couple of the debug steps you've asked for previously and see if it's exactly the same. |
@ZHumphries which version are you on? 1.10? |
@prydin Yes, 1.10, although the issue was there in 1.8 & 1.9 as well. Our vCenters are a mix of 5.5 and 6.5 and the problem shows up on all 5 of them. I didn't have time yesterday to get the _internal Telegraf stats set up and get any meaningful data re collection time and amount of datapoints. I'll post an update on Monday with the data. |
Seems to be some gaps in my gathering process (stacked graphs). This is with a 60 or 120 second interval. Something else I've noticed: data points are only ever at the minute mark; it seems that no matter what my settings, it's never collecting the 20/40 second interval realtime metrics. Am I missing something stupidly obvious here? I don't want to hijack this thread - would I be better off starting a new issue? |
I've been trying without any luck to reproduce this in my lab. I would need debug logs to proceed. Any chance you can provide me with that? |
Do you have an email address or some way I can send them directly to you? |
I'm seeing the same problem with gaps in collected metrics here. Telegraf 1.10.1, two vCenters, roughly 5000 VMs in one and 1800 in the other. The difference from you guys is that I'm using a Graphite database, so at least it doesn't seem to be related to the output plugin. I could provide you with debug logs, I just have to clean out the identifiable bits first, if that would be helpful. |
Debug logs would be helpful if you can provide them! |
I started collecting some logs yesterday and when looking through them I saw a lot of "[outputs.graphite] buffer fullness: 2000 / 2000 metrics". I figured the buffer was too small and metrics were being dropped because of it, so I increased it. First to 10K, then 100K and finally 1M. It doesn't really solve the problem, though; it just postpones the point where the buffer fills up - for some reason Telegraf isn't outputting the metrics quickly enough to transmit all of them during a collection interval (1m). I've tried switching from carbon-relay to carbon-relay-ng and I've even tried a dummy netcat to /dev/null as a receiver - the buffer still fills up eventually. This might mean that my problem is different from bdeam's, unless there's an output buffering/performance issue in the influxdb output plugin as well. |
I'm still struggling with this. If I set the metric_buffer high enough to avoid drops I get a lot of duplicated metrics; if I lower it I get drops. According to the documentation, metric_buffer is only used when output writes fail ("This buffer only fills when writes fail to output plugin(s)"), so for some reason Telegraf seems to think it is failing to write to the output and buffers metrics it has already sent. I can't see anything about that in the logs, though. I've tried both outputs.graphite and outputs.socket_writer and the result is the same. I've used netcat as a receiver to collect the raw data and have been able to verify that there are a lot of duplicate metrics in the output. I have attached a debug log. During this run I got close to a million metrics/min sent to Graphite, which is way more than it should be. |
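(Since the observation above is that metrics leave Telegraf too slowly to clear a 1m collection interval, the relevant knobs besides metric_buffer_limit are the flush interval and batch size. A sketch with purely illustrative values and a hypothetical relay endpoint:)

```toml
[agent]
  interval = "60s"
  flush_interval = "10s"        ## Flush well within each collection interval.
  metric_batch_size = 5000      ## More metrics per write call to the output.
  metric_buffer_limit = 100000  ## Headroom for bursts; raising this alone only delays drops.

[[outputs.graphite]]
  servers = ["carbon-relay.example.local:2003"]  ## Hypothetical carbon endpoint.
  prefix = "telegraf"
```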
At least we've found the reason for the missing samples. Setting the buffers too low will definitely cause that. As for duplicates, they can happen occasionally when the plugin is trying to catch samples that vCenter is late in reporting. You seem to be using Graphite. Are they causing problems for you in Graphite? In most tools, duplicate samples are just ignored. |
They are causing problems because there are so many duplicates that have to be processed by Graphite. If I were to switch to Telegraf from our current vSphere poller, I would probably have to scale out Graphite just to be able to handle all the duplicates coming in. |
That sounds strange. You will see an occasional duplicate, but you shouldn't see any significant number of duplicates. Can you estimate the percentage of duplicates you're seeing? |
During a ten minute run:
This would be roughly 30% duplicates. 30% doesn't actually explain the high amount of metrics I'm seeing; there has to be something more besides the duplicates. I'm collecting the same counters using another poller and it's not even close to the amount of metrics generated by Telegraf. I'll have to compare the outputs and see what Telegraf is collecting that the other one doesn't. But still, 30% duplicates does seem high to me. |
Ah, now I know what it is. The other poller eliminates all counters with value 0 and assumes you handle nulls in the dashboard. That, plus 30% duplicates, should account for the high amount of metrics from Telegraf. |
This is no longer accurate; the buffer is filled directly and the outputs send from the buffer. I'll fix the documentation. #5741 For monitoring the metric buffers, I recommend enabling the internal plugin. There has been some talk about adding a threshold processor, which could remove the 0 values for you so you could keep the same strategy. |
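(Enabling the internal plugin is a small config addition; a minimal sketch:)

```toml
[[inputs.internal]]
  ## Emits internal_write, internal_agent, etc., including buffer_size and metrics_dropped.
  collect_memstats = false
```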
That's a very good idea, I'll look into it.
It does save a lot of updates that the database has to do. At least with Graphite this has a big impact due to the one .wsp file per metric design. |
Still not sure of the exact root cause for this. I spun up a Telegraf 1.11 instance and a 1.7.4 InfluxDB instance on a separate RHEL7 server and left them all at default configuration. I then enabled the vSphere plugin and noticed 'Dropped metrics' from Telegraf's 'Internal' plugin. I fixed this by adjusting the Telegraf [agent] buffer settings. Happy to close this from my perspective. |
I'd probably adjust that setting as well. We now log when metrics are dropped, though I still recommend using the internal plugin for monitoring. |
Relevant telegraf.conf:
System info:
Telegraf 1.9.0-1 (RPM package installed)
RHEL 7.5 Maipo (32GB RAM, 8x 2.2GHz CPU)
Steps to reproduce:
Expected behavior:
Collection of all listed metrics for Hosts & VMs is successful, and where it is not, a failure message is logged
Actual behavior:
Intermittent gaps in metrics for some hosts & VMs are observed, as below:
Additional info:
I have tried increasing & decreasing the goroutines and increasing intervals to 2 mins+ with no success. I have left the collector to run for over 12 hours with no change in behavior.
Configuration for telegraf agent
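(The attached agent/plugin configuration did not survive this copy of the issue. For context, a representative sketch of a vSphere input block of the sort under discussion; every value below is illustrative, not the reporter's actual configuration:)

```toml
[[inputs.vsphere]]
  ## Hypothetical vCenter endpoint and credentials.
  vcenters = ["https://vcenter.example.local/sdk"]
  username = "telegraf@vsphere.local"
  password = "secret"

  ## Metric selection; an empty list means "collect everything".
  vm_metric_include = ["cpu.usage.average", "mem.usage.average", "net.usage.average"]
  host_metric_include = ["cpu.usage.average", "mem.usage.average", "net.usage.average"]

  ## Concurrency settings (the "goroutines" mentioned above).
  collect_concurrency = 1
  discover_concurrency = 1

  insecure_skip_verify = true
```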