Metrics for only one host or datastore instance are reported #40

Closed
karlism opened this issue Dec 20, 2018 · 16 comments
Labels: bug

@karlism
Contributor

karlism commented Dec 20, 2018

Hello,

VMware exporter returns metrics only for one host or datastore instance.

Environment:
vmware_exporter.py v0.3.0
Python 3.5.2 (Ubuntu) & Python 3.6.6 (OpenBSD)
I've checked that all permissions are correct on the vSphere side and also tried connecting with an administrator account.

Steps to reproduce:

$ cat config.yml
default:
    vsphere_host: "hostname"   # this is vCenter host
    vsphere_user: 'username'
    vsphere_password: 'password'
    ignore_ssl: True
    collect_only:
        vms: False
        vmguests: False
        datastores: True
        hosts: True
        snapshots: False
$ ./vmware_exporter.py -c config.yml
$ curl -s localhost:9272/metrics | grep -v "#"
vmware_host_power_state{cluster_name="XXX cluster",dc_name="DC_NAME",host_name="vm1.example.com"} 1.0

Applying the following patch shows that only the first host or datastore is processed in the for loop:

--- vmware_exporter.py.orig     2018-12-20 14:11:48.535447407 +0100
+++ vmware_exporter.py  2018-12-20 14:15:59.207702036 +0100
@@ -299,7 +299,9 @@
         """
         log("Starting datastore metrics collection")
         datastores = self._vmware_get_obj(content, [vim.Datastore])
+        print(datastores)
         for datastore in datastores:
+            print(datastore)
             # ds.RefreshDatastoreStorageInfo()
             summary = datastore.summary
             ds_name = summary.name
@@ -461,7 +463,9 @@
         """
         log("Starting host metrics collection")
         hosts = self._vmware_get_obj(content, [vim.HostSystem])
+        print(hosts)
         for host in hosts:
+            print(host)
             summary = host.summary
             host_name, host_dc_name, host_cluster_name = self._vmware_host_metadata(inventory, host)
             host_metadata = [host_name, host_dc_name, host_cluster_name]

Output after applying patch:

$ ./vmware_exporter.py -c config.yml 
[2018-12-20 13:21:36.743208+00:00] Starting web server on port 9272
[2018-12-20 13:21:41.860451+00:00] Start collecting metrics from hostname
[2018-12-20 13:21:42.041749+00:00] Starting inventory collection
[2018-12-20 13:21:43.712368+00:00] Finished inventory collection
[2018-12-20 13:21:43.712878+00:00] Starting datastore metrics collection
[2018-12-20 13:21:43.713506+00:00] Starting host metrics collection
(ManagedObject) [
   'vim.Datastore:datastore-1263',
   'vim.Datastore:datastore-26',
   'vim.Datastore:datastore-29',
   'vim.Datastore:datastore-25',
   'vim.Datastore:datastore-28',
   'vim.Datastore:datastore-6324',
   'vim.Datastore:datastore-1261',
   'vim.Datastore:datastore-369',
   'vim.Datastore:datastore-3412',
   'vim.Datastore:datastore-6169',
   'vim.Datastore:datastore-6167',
   'vim.Datastore:datastore-2064',
   'vim.Datastore:datastore-2065',
   'vim.Datastore:datastore-12841',
   'vim.Datastore:datastore-12847',
   'vim.Datastore:datastore-27354',
   'vim.Datastore:datastore-27349'
]
'vim.Datastore:datastore-1263'
(ManagedObject) [
   'vim.HostSystem:host-1487',
   'vim.HostSystem:host-1315',
   'vim.HostSystem:host-1390',
   'vim.HostSystem:host-1260',
   'vim.HostSystem:host-1460',
   'vim.HostSystem:host-1462',
   'vim.HostSystem:host-24',
   'vim.HostSystem:host-27',
   'vim.HostSystem:host-6322',
   'vim.HostSystem:host-3411',
   'vim.HostSystem:host-367',
   'vim.HostSystem:host-6246',
   'vim.HostSystem:host-6276',
   'vim.HostSystem:host-6255',
   'vim.HostSystem:host-6194',
   'vim.HostSystem:host-6166',
   'vim.HostSystem:host-6205',
   'vim.HostSystem:host-2063',
   'vim.HostSystem:host-12840',
   'vim.HostSystem:host-12846',
   'vim.HostSystem:host-27348',
   'vim.HostSystem:host-27353'
]
'vim.HostSystem:host-1487'
[2018-12-20 13:21:43.828089+00:00] Finished collecting metrics from hostname
@dannyk81
Collaborator

dannyk81 commented Dec 21, 2018

@pryorda I'm hitting this too, but it seems to be intermittent. Some scrapes come back with all or most of the expected metrics, others with fewer.

/cc @Jc2k - any ideas?

Hosts:
image

Datastores:
image

I guess this is why I missed it at first.

Could it be a convergence issue in the new threading model?
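
One way to quantify how many instances each scrape returns is to count the distinct label values for a metric in the `/metrics` text. This is a rough sketch, not part of the exporter: the `distinct_label_values` helper is hypothetical, the sample text is made up, and the parsing is simplified (it assumes label values contain no escaped quotes).

```python
import re

# Matches a sample line such as: metric_name{label="value",...} 1.0
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s+\S+$')


def distinct_label_values(metrics_text, metric, label):
    """Return the set of values `label` takes on samples of `metric`."""
    label_re = re.compile(r'(?:^|,)\s*%s="([^"]*)"' % re.escape(label))
    values = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == metric:
            found = label_re.search(match.group(2))
            if found:
                values.add(found.group(1))
    return values


# Made-up sample in the same shape as the exporter output shown earlier.
sample = """\
# made-up sample data
vmware_host_power_state{cluster_name="c1",dc_name="dc",host_name="vm1.example.com"} 1.0
vmware_host_power_state{cluster_name="c1",dc_name="dc",host_name="vm2.example.com"} 1.0
"""

hosts = distinct_label_values(sample, "vmware_host_power_state", "host_name")
print(len(hosts), sorted(hosts))
```

Running this against the output of several consecutive scrapes and comparing the counts with the number of hosts in vCenter would show whether a given scrape came back short.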

dannyk81 added the bug label Dec 21, 2018
@Jc2k
Collaborator

Jc2k commented Dec 21, 2018

I was expecting problems with people probing multiple standalone ESX's or multiple vCenters due to blocking the twisted event loop, but not enumerating a single vCenter.

There are far fewer threads in the new model, so I'd actually expect fewer threading issues. Previously it didn't wait for all threads to finish, just the last one it spawned. So again, I'd expect fewer threading issues.

Now it runs the threads in a ThreadPoolExecutor and calls shutdown(wait=True), which should wait for all tasks to finish before returning, or at least that's how I read the docs.
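
The documented behaviour can be checked in isolation with a minimal standard-library sketch. The `slow_task` helper and its timings are made up for illustration and are not the exporter's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(n):
    # Hypothetical stand-in for one metrics collection task.
    time.sleep(0.05 * n)
    return n


executor = ThreadPoolExecutor(max_workers=4)
futures = [executor.submit(slow_task, n) for n in range(1, 5)]

# Blocks until every submitted task has completed, not just the last one.
executor.shutdown(wait=True)

all_done = all(f.done() for f in futures)
results = [f.result() for f in futures]
print(all_done, results)
```

If this holds, the missing metrics are more likely caused by something failing inside a task than by the executor returning early.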

One annoying thing with this new code is that it swallows exceptions, which might otherwise give us a clue. If it wasn't waiting for the thread to finish, the thread would still finish, just out of order. So we need to add exception handling. Give me a second...
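
A minimal sketch of why the exceptions go unseen and one way to surface them, assuming nothing about the actual patch: an exception raised in a worker is stored on its future and only re-raised when `result()` is called. The `future_done` and `broken_task` names and the `logged` list are hypothetical; real code would call the exporter's own logging instead.

```python
import traceback
from concurrent.futures import ThreadPoolExecutor

logged = []  # collected tracebacks, standing in for a real log() call


def future_done(future):
    # result() re-raises whatever exception the worker thread hit.
    try:
        future.result()
    except Exception:
        logged.append(traceback.format_exc())


def broken_task():
    # Hypothetical task that fails partway through collection.
    raise ValueError("simulated collection failure")


with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(broken_task)
    # Without this callback the ValueError would never be reported.
    future.add_done_callback(future_done)

print(logged[0].splitlines()[-1])
```

The callback runs when the future completes (or immediately, if it already has), so by the time the `with` block exits the failure has been recorded.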

@Jc2k
Collaborator

Jc2k commented Dec 21, 2018

Can someone seeing the issue try #41? It might need fixing, but it's a relatively small change. It should hopefully log exceptions seen in the child threads.

@dannyk81
Collaborator

@Jc2k I have a possible fix in #42

This fixed the issue for me.

@Jc2k
Collaborator

Jc2k commented Dec 21, 2018

I think that introduces another race. Do you have a clue about why that helps? Is ThreadPoolExecutor not behaving as documented?

@dannyk81
Collaborator

@Jc2k I'm not entirely sure; this was more of a hunch than a reasoned fix.

Can you elaborate on your race concern? Why did you want to shut down the threads before getting the VM performance metrics, which also fires threads?

@dannyk81
Collaborator

@Jc2k as discussed, #42 is not a good fix as it introduces another issue (I closed it). However, the current implementation submits tasks to the threader after it has already been shut down, which is problematic as well.
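
A standalone sketch of that failure mode, using only the standard library (the `parent` and `child` tasks and the `shutdown_started` event are hypothetical, not the exporter's code): a task that tries to schedule nested work after shutdown has begun gets a `RuntimeError`, and because it is raised inside a worker it ends up stored on the future rather than printed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
shutdown_started = threading.Event()


def child():
    return "child result"


def parent():
    # Wait until the main thread has begun shutting the executor down,
    # then try to schedule nested work on the same executor.
    shutdown_started.wait(timeout=5)
    return executor.submit(child).result()


future = executor.submit(parent)
executor.shutdown(wait=False)  # marks the executor as shut down; queued tasks still run
shutdown_started.set()

# The RuntimeError raised by the nested submit() is stored on the future.
error = future.exception(timeout=5)
executor.shutdown(wait=True)
print(type(error).__name__, error)
```

Here `wait=False` is used only to make the ordering deterministic in the sketch. With `wait=True` the same thing can happen whenever a running task submits nested work after the shutdown flag has been set.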

Hope we can fix that with your help.

/cc @pryorda

@Jc2k
Collaborator

Jc2k commented Dec 21, 2018

Working on moving the affected code paths over to PropertyCollectors to avoid nested threads and racing.

@Jc2k
Collaborator

Jc2k commented Dec 21, 2018

#43 removes all nested threads. It's working in my test env, but I don't have access to a vCenter env to test on right now.

@Jc2k
Collaborator

Jc2k commented Dec 21, 2018

I instrumented the HTTP library pyvmomi is using, as I was worried the snapshot code might trigger API calls as it traverses the snapshot tree. But it looks like all the info is captured in the initial PropertyCollect call.

@Jc2k
Collaborator

Jc2k commented Dec 21, 2018

@dannyk81 could you see if #43 works for you? As I said, it's not tested with vCenter, so maybe just poke it with curl first and see if the output looks sane.

Incidentally, I've incorporated and improved #41, so if it does go wrong we might even know why! :)

It's fairly late here so I might not reply for a while.

@pryorda
Owner

pryorda commented Dec 21, 2018

@Jc2k testing now.

@pryorda
Owner

pryorda commented Dec 22, 2018

summary.uncommitted doesn't always exist. I'm working on seeing why. We could compensate with a simple if, but I'm not sure that is the right path. I should have more info shortly.

@pryorda
Owner

pryorda commented Dec 22, 2018

So I think the RCA (or at least one of them) might be that summary.uncommitted doesn't always exist, which would sometimes cause the threads to crash, resulting in incorrect results. I still think we should do all your other fixes, including the one I'm pushing to the PR.

[2018-12-22 00:01:17.649313+00:00] Start collecting metrics from vmware-vcenter.pryorda.net
[2018-12-22 00:01:17.766380+00:00] Starting inventory collection
[2018-12-22 00:01:17.940364+00:00] Finished inventory collection
[2018-12-22 00:01:17.940927+00:00] Starting datastore metrics collection
[2018-12-22 00:01:17.943159+00:00] Starting host metrics collection
[2018-12-22 00:01:17.944080+00:00] Starting VM Guests metrics collection
[2018-12-22 00:01:17.960276+00:00] Traceback (most recent call last):
  File "/opt/vmware_exporter/vmware_exporter/vmware_exporter.py", line 54, in _future_done
    future.result()
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/vmware_exporter/vmware_exporter/vmware_exporter.py", line 316, in _vmware_get_datastores
    ds_uncommitted = float(datastore['summary.uncommitted']) if datastore['summary.uncommitted'] else 0
KeyError: 'summary.uncommitted'

[2018-12-22 00:01:17.970265+00:00] Finished host metrics collection
[2018-12-22 00:01:17.978065+00:00] Finished VM Guests metrics collection
[2018-12-22 00:01:18.218414+00:00] START: _vmware_get_vm_perf_manager_metrics
[2018-12-22 00:01:18.219940+00:00] START: _vmware_get_vm_perf_manager_metrics: QUERY
[2018-12-22 00:01:18.253393+00:00] FIN: _vmware_get_vm_perf_manager_metrics: QUERY
[2018-12-22 00:01:18.253643+00:00] FIN: _vmware_get_vm_perf_manager_metrics
[2018-12-22 00:01:18.264557+00:00] Finished collecting metrics from vmware-vcenter.pryorda.net

Thoughts?

pryorda pushed a commit that referenced this issue Dec 24, 2018
* Use PropertyCollector to do fast VM stats without threads
* Fix race conditions around #40
* Fix summary.uncommitted fetching. 
* Dockerfile fixes.
@pryorda
Owner

pryorda commented Dec 24, 2018

Update to v0.3.1 as it should fix your issues @karlism

@pryorda pryorda closed this as completed Dec 24, 2018
pryorda added a commit that referenced this issue Dec 24, 2018
##  🔖 v0.3.1 Release
* #38: Allow cached requirements and build dependencies - Removed
* Property collectors (#43)
* Use PropertyCollector to do fast VM stats without threads
* Fix race conditions around #40
* Fix summary.uncommitted fetching. 
* Dockerfile fixes.

Precommit-Verified: 308310afc2ea3e4d3b73469b49afc75811db75800d8d183c39c1bad0637d0dc3
@karlism
Contributor Author

karlism commented Jan 3, 2019

@pryorda, I updated to version 0.4.2 today and everything seems to be working great. Thank you very much for your help, and sorry for the late reply!
