Improve Telemetry for Dispatch Jobs #4422

Miserlou · 2018-06-15T17:37:00Z

Issue

We use Nomad to dispatch ~~many hundreds of thousands~~ tens of millions of dispatch jobs.

Currently, the telemetry about these dispatched jobs is extremely poor. There are no summary statistics. The only slightly useful metric is stats.gauges.nomad.nomad.blocked_evals.total_blocked.mbp, and even that is of limited value in a complex system because it isn't broken down by job type.

There is a stats.gauges.nomad.nomad.job_summary.complete category, but unfortunately it doesn't actually provide any summary statistics; it's just a list of hundreds of thousands of dispatch job names, each with a value of 0. This is almost worse than useless.

It would be excellent if there were a stats.gauges.nomad.nomad.dispatch_summary category in which dispatches were broken down by type, so that we could see avg/max usage of CPU/IOPS/disk/memory for each of our dispatch job types. Without this, there is no useful telemetry for Nomad Dispatch-based systems.
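To make the request concrete, here is a rough sketch of the kind of roll-up I mean, done client-side in Python. It is not how Nomad implements anything. It assumes dispatched job IDs look like `<parent>/dispatch-<unix-ts>-<hex suffix>`, and the job names, sample values, and the `parent_job`/`summarize` helpers are all made up for illustration.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed ID shape for dispatched jobs: "<parent>/dispatch-<unix-ts>-<hex suffix>".
DISPATCH_ID = re.compile(r"^(?P<parent>.+)/dispatch-\d+-[0-9a-f]+$")


def parent_job(job_id):
    """Return the parent job name for a dispatched job ID, or the ID unchanged."""
    match = DISPATCH_ID.match(job_id)
    return match.group("parent") if match else job_id


def summarize(samples):
    """Collapse (job_id, value) samples into per-parent count/avg/max."""
    grouped = defaultdict(list)
    for job_id, value in samples:
        grouped[parent_job(job_id)].append(value)
    return {
        parent: {"count": len(vals), "avg": mean(vals), "max": max(vals)}
        for parent, vals in grouped.items()
    }


# Hypothetical memory samples (MB) for three dispatched jobs of two types.
samples = [
    ("downloader/dispatch-1529084220-3f9a1c2e", 512.0),
    ("downloader/dispatch-1529084221-77b0d4aa", 768.0),
    ("processor/dispatch-1529084300-0a1b2c3d", 2048.0),
]
summary = summarize(samples)
```

With this, `summary["downloader"]` holds one count/avg/max entry for the whole job type instead of one gauge per dispatched job, which is the shape a dispatch_summary category would ideally expose directly.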

Miserlou · 2018-06-15T17:44:33Z

Related-ish: #4422

Miserlou · 2018-09-14T14:45:14Z

We've found that the current telemetry is not only worse than useless, it also adds about 15-25 seconds of overhead per connection, making it worse-than-worse-than-useless.

dadgar added type/enhancement theme/metrics labels Jun 19, 2018

preetapan added theme/metrics and removed theme/metrics labels Jun 20, 2018

Miserlou mentioned this issue Sep 5, 2018

job name tagging in telemetry should be less specific for periodic jobs (should not include /periodic-1235... info) #4646

Closed

Miserlou mentioned this issue Sep 14, 2018

Bumps nomad timeout, adds a few missing dependencies, speeds up Nomad… AlexsLemonade/refinebio#617

Merged

Miserlou mentioned this issue Oct 1, 2018

Poor parameterized job scheduling performance #4736

Closed

Miserlou mentioned this issue Nov 13, 2018

Nomad Lead Server at 100% CPU Usage Leading to Total System Death #4864

Closed

nickethier mentioned this issue Nov 14, 2018

Metric prefix filtering #4878

Merged
