We use Nomad to dispatch tens of millions of jobs.
Currently, the telemetry about these dispatched jobs is extremely poor. There are no summary statistics. The only slightly useful metric is stats.gauges.nomad.nomad.blocked_evals.total_blocked.mbp, but even that is of limited use in a complex system because it isn't broken down by type.
There is a stats.gauges.nomad.nomad.job_summary.complete category, but unfortunately it doesn't actually provide any summary statistics; it's just a list of hundreds of thousands of dispatch job names, each with a value of 0. This is almost worse than useless.
It would be excellent if there were a stats.gauges.nomad.nomad.dispatch_summary where dispatches could be broken down by type, so that we could see avg/max usage for cpu/iops/disk/memory for each of our dispatch job types. Without this, there is no useful telemetry for Nomad dispatch-based systems.
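To make the request concrete, here is a rough Python sketch of the kind of per-type rollup I mean. It is not how Nomad does or would implement this. It assumes dispatched job IDs look like `<parent>/dispatch-<timestamp>-<id>`, and the `parent_job` and `summarize` helpers, the sample job names, and the usage numbers are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def parent_job(job_id: str) -> str:
    """Return the parameterized (parent) job name for a dispatched job ID.

    Assumes IDs of the form '<parent>/dispatch-<timestamp>-<id>'.
    IDs without that suffix are returned unchanged.
    """
    return job_id.split("/dispatch-", 1)[0]


def summarize(samples):
    """Group per-job usage samples by parent job and compute avg/max per resource."""
    grouped = defaultdict(lambda: defaultdict(list))
    for job_id, usage in samples:
        for resource, value in usage.items():
            grouped[parent_job(job_id)][resource].append(value)
    return {
        job: {
            resource: {"avg": mean(values), "max": max(values)}
            for resource, values in resources.items()
        }
        for job, resources in grouped.items()
    }


# Hypothetical samples: (dispatched job ID, resource usage in arbitrary units).
samples = [
    ("encode/dispatch-1485379325-cb38d00d", {"cpu": 100, "memory": 256}),
    ("encode/dispatch-1485379400-1f3a9c2e", {"cpu": 300, "memory": 512}),
    ("thumbnail/dispatch-1485379500-77ab01de", {"cpu": 50, "memory": 64}),
]

summary = summarize(samples)
for job, resources in summary.items():
    print(job, resources)
```

The point is that the number of gauges would scale with the number of dispatch job types (a handful) rather than with the number of dispatched jobs (millions).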
We've found that the current telemetry is not only worse than useless; it also adds about 15-25 seconds of overhead per connection, making it worse-than-worse-than-useless.