WMAgent monitoring

This wiki is meant to describe some aspects of WMAgent that require better monitoring, aiming to ease the debugging process. In general, there will be two major tasks (maybe three):

  1. fetch the necessary information with the AgentStatusWatcher component (following a defined polling cycle and keeping in mind the additional load on the agent)
  2. publish this information and make it retrievable via HTTP
  3. work on data visualization (possibly in the far future)

The main idea behind these improvements is to make the agent status information (internal queues, database tables, job status, site/task thresholds, component status, etc.) easily accessible, making the operator's and developer's life easier.

How to proceed

A proposal on how to progress with this development (Alan's opinion):

  1. implement a reasonable part (50%?) of the monitoring points described below
  2. make this information initially available in the component log (or a JSON file, as sketched after this list)
  3. publish this information either to LogDB or to the HTCondor collector (condor seems to be easier and better); still to be investigated
  4. make the information queryable (API) and accessible via HTTP
  5. at this point, stop sending this info to the component log
  6. in the future (several months ahead), work on another, centralized service/tool that would make those JSON data user friendly (graphs, tables, etc.), hoping to get rid of most of the personal monitoring scripts
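
As a rough illustration of step 2, a minimal sketch of what dumping such a status document to a JSON file could look like; the key names, values and file path here are assumptions for illustration, not the actual AgentStatusWatcher output:

```python
import json
import time

# Illustrative only: key names, values and path are assumptions,
# not the real AgentStatusWatcher schema.
agent_info = {
    "agent_url": "vocms0999.cern.ch",   # hypothetical agent host name
    "timestamp": int(time.time()),
    "wmbsCountByState": {"created": 1200, "executing": 5400, "jobcooloff": 30},
    "activeRunJobByStatus": {"New": 10, "Idle": 2480, "Running": 2900},
    "total_query_time": 12,             # seconds
}

with open("/tmp/agent_monitoring.json", "w") as fobj:
    json.dump(agent_info, fobj, indent=2)
```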

What to monitor

Most of what is listed here belongs to WMAgent; however, there are also some aspects of the central Global Queue that would need to be monitored. The initial and most important aspects are listed below (a small aggregation sketch follows the list):

  1. For a given agent, WMBS info:
     • number of jobs in WMBS in each state
     • number of jobs in WMBS in 'created' status, sorted by job type
       • BONUS: also sorted by job priority and by site
     • number of jobs in WMBS in 'executing' status, sorted by job type (should be equal to the number of jobs in condor)
     • thresholds for data acquisition from GQ to LQ
     • thresholds for job creation (LQ to WMBS)
     • thresholds for job submission (WMBS to the condor global pool)
     • total running and pending thresholds for each site in Drain or Normal status (and their status)
  2. For a given agent, WorkQueue info:
     • number of local workqueue elements in each state (plus the total number of possible AND unique jobs per site)
     • number of local workqueue_inbox elements in each state (plus the total number of possible AND unique jobs per site)
     • worst workqueue element offenders (single elements that create > 30k jobs) in workqueue_inbox in Acquired or Running status
     • BONUS: number of workqueue elements in workqueue and number of jobs, sorted by priority
       • NOTE: the number of jobs derived from workqueue elements cannot be taken for granted, since it does not count utilitarian jobs nor provide a precise number for chained requests
  3. Agent health:
     • for each registered thread in a Daemon: the current state (running/idle), the time since the last successful execution, the time spent in the current state, the length of the last cycle, and the status of the last cycle (success/failure)
     • for WorkQueueManager, the number of blocks obtained from GQ and the number of blocks obtained by LQ
     • for JobCreator, the number of jobs created in the last cycle
     • for JobSubmitter, the number of jobs submitted in the last cycle
  4. Central Global Queue (job counts are not accurate, since they do not account for jobs from further steps):
     • for Available workqueue elements:
       1. total number of elements and total number of estimated jobs
       2. number of WQE and estimated number of possible AND unique jobs, sorted by team name
       3. number of WQE and estimated number of possible AND unique jobs, sorted by request priority
       4. number of WQE and estimated number of possible AND unique jobs, sorted by site
       5. WQE without a common site list (i.e. that do not pass the work restrictions)
       6. WQE older than 7 days (or whatever number we decide)
       7. WQE that create > 30k jobs (or whatever number we decide)
     • for Acquired workqueue elements:
       1. total number of elements and total number of estimated jobs
       2. number of WQE and estimated number of possible AND unique jobs, sorted by team name
       3. number of WQE and estimated number of possible AND unique jobs, sorted by agent (ChildQueueUrl)
       4. number of WQE and estimated number of possible AND unique jobs, sorted by request priority
       5. number of WQE and estimated number of possible AND unique jobs, sorted by site
       6. WQE older than 7 days (or whatever number we decide)
     • for Running workqueue elements:
       1. total number of elements and total number of estimated jobs
       2. number of WQE and estimated number of possible AND unique jobs, sorted by team name
       3. number of WQE and estimated number of possible AND unique jobs, sorted by agent (ChildQueueUrl)
       4. number of WQE and estimated number of possible AND unique jobs, sorted by request priority
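
For illustration, a minimal sketch of the kind of aggregation described above for workqueue elements, assuming the elements have already been fetched as a list of dictionaries; the field names and values are assumptions made for this example:

```python
from collections import defaultdict

# Hypothetical workqueue element documents; field names are assumptions.
elements = [
    {"Status": "Available", "TeamName": "production", "Priority": 110000, "Jobs": 2500},
    {"Status": "Acquired",  "TeamName": "production", "Priority": 85000,  "Jobs": 40000},
    {"Status": "Available", "TeamName": "relval",     "Priority": 500000, "Jobs": 300},
]

# number of WQE and estimated jobs per (status, team name), as in the list above
summary = defaultdict(lambda: {"numElem": 0, "sumJobs": 0})
for elem in elements:
    key = (elem["Status"], elem["TeamName"])
    summary[key]["numElem"] += 1
    summary[key]["sumJobs"] += elem["Jobs"]

# worst offenders: single elements estimated to create more than 30k jobs
offenders = [e for e in elements if e["Jobs"] > 30000]
```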

HTCondor Data Path

One channel through which the monitoring data is reported is the HTCondor ClassAd. The ClassAd is a simple key-expression format (here, we'll use it as key-values) that is slightly more expressive than JSON. The monitoring data from AgentStatusWatcher will be converted from JSON to ClassAd format and sent as part of the schedd ClassAd to the global pool collector.

Once in the collector, these key/values can be queried by the various monitoring systems (Ganglia, Kibana, Grafana, ElasticSearch) used by the Submit Infrastructure team.

To integrate with HTCondor, AgentStatusWatcher will ship with a script that reads the current JSON and writes the equivalent output in ClassAd format. Periodically, the condor_schedd process will call this script in a non-blocking manner. The schedd will take the output ClassAd, merge it with its internal ClassAd, and send the merged version in the next update.
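
A minimal sketch of such a converter, assuming a flat (or one-level nested) JSON document and a hypothetical "WMAgent_" attribute prefix; this is not the actual shipped script, just an illustration of the JSON-to-ClassAd translation:

```python
#!/usr/bin/env python
"""Toy JSON-to-ClassAd converter; a sketch, not the actual shipped script."""
import json
import sys

PREFIX = "WMAgent_"  # hypothetical attribute prefix to avoid name clashes


def to_classad_value(value):
    """Render a python value as a ClassAd literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # everything else goes in as a quoted string
    return '"%s"' % str(value).replace('"', "'")


def main():
    with open(sys.argv[1]) as fobj:
        data = json.load(fobj)
    for key, value in data.items():
        if isinstance(value, dict):
            # flatten one level of nesting, e.g. wmbsCountByState_created
            for subkey, subval in value.items():
                print("%s%s_%s = %s" % (PREFIX, key, subkey, to_classad_value(subval)))
        else:
            print("%s%s = %s" % (PREFIX, key, to_classad_value(value)))
    # a bare '-' line marking the end of the ad is an assumption based on
    # HTCondor daemon cron conventions
    print("-")


if __name__ == "__main__":
    main()
```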

Description of the metrics already collected

This list is still expanding; however, the metrics already available as of 1.0.19.patch4, together with a short description of each, are listed below (a small sketch showing how some of them relate follows the list):

  • activeRunJobByStatus: represents all the jobs active in the BossAir database, which are also in the condor pool. Jobs should not remain in the “New” status longer than a JobStatusLite component cycle.
  • completeRunJobByStatus: represents jobs no longer active in the BossAir database, which are also gone from the condor pool. TODO: how do the cleanup and state transition happen in this case?
  • sitePendCountByPrio: freeSlots(minusRunning=True) code-wise. Number of pending jobs keyed by the request priority, for each site. Affects data acquisition from LQ to WMBS.
  • thresholds: provides the state, the pending and running thresholds for each site.
  • thresholdsGQ2LQ: freeSlots(minusRunning=True) code-wise. Calculates the number of free slots for each site, using the same call used for data acquisition from GQ to LQ. Split into two large queries whose results are aggregated to define the final available thresholds:
    • assigned jobs: checks jobs with a location assigned and in several states (created, *cooloff, executing, etc.). Skips jobs Running in condor.
    • unassigned jobs: checks jobs without a location assigned, in all states except killed and cleanup.
  • wmbsCountByState: number of wmbs jobs in each status. Data is cleaned up as workflows get archived.
  • wmbsCreatedTypeCount: number of wmbs jobs created for each job type.
  • wmbsExecutingTypeCount: number of wmbs jobs in executing state for each job type. An executing job can be either pending, running or just gone from condor.
  • total_query_time: the time it took to collect all this information from the SQL database.
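
As a quick illustration of how these metrics relate to each other (for instance, executing WMBS jobs versus jobs known to BossAir/condor), a small sketch assuming the metrics have been loaded into a dictionary with the keys above; all numbers are made up:

```python
# Sketch: sanity-check a metrics document (values are made up for illustration).
metrics = {
    "wmbsCountByState": {"created": 1200, "executing": 5400, "jobcooloff": 30},
    "wmbsExecutingTypeCount": {"Processing": 5000, "Merge": 300, "LogCollect": 100},
    "activeRunJobByStatus": {"New": 10, "Idle": 2480, "Running": 2900},
    "total_query_time": 12,
}

executing_wmbs = metrics["wmbsCountByState"].get("executing", 0)
executing_by_type = sum(metrics["wmbsExecutingTypeCount"].values())
active_in_condor = sum(metrics["activeRunJobByStatus"].values())

# wmbsExecutingTypeCount is just wmbsCountByState['executing'] split by job type
assert executing_wmbs == executing_by_type

# executing jobs can also be "just gone from condor", so this is only roughly equal
print("executing in WMBS: %d, active in BossAir/condor: %d"
      % (executing_wmbs, active_in_condor))
```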

Use cases to be implemented at the visualization layer

Use case 1 - amount of work in the agent's local queue

  • sub case 1: get the number of 'created' jobs from 'wmbsCountByState'. This gives a hint of how much work (jobs) JobSubmitter has in its local queue for submission to condor.
  • sub case 2: we could even split this info into 'CPU-bound' vs 'I/O-bound' jobs by looking at 'wmbsCreatedTypeCount': summing up all the 'Processing' and 'Production' jobs gives the former, while all the other job types are considered I/O-bound (a sketch of this split follows this list).
  • note: this information is useful both on a per-agent basis and aggregated over all the agents.
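
A minimal sketch of the CPU-bound vs I/O-bound split from sub case 2, assuming 'wmbsCreatedTypeCount' has already been extracted from the monitoring document; the numbers are made up:

```python
# Illustrative wmbsCreatedTypeCount payload (made-up numbers)
wmbsCreatedTypeCount = {"Processing": 900, "Production": 250,
                        "Merge": 80, "Cleanup": 40, "LogCollect": 20, "Harvesting": 5}

CPU_BOUND_TYPES = ("Processing", "Production")

cpu_bound = sum(count for jtype, count in wmbsCreatedTypeCount.items()
                if jtype in CPU_BOUND_TYPES)
io_bound = sum(count for jtype, count in wmbsCreatedTypeCount.items()
               if jtype not in CPU_BOUND_TYPES)
```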

Use case 2 - amount of job failures

  • sub case 1: plot the number of non-final failures for each agent by looking at keys matching '*cooloff' (e.g. jobcooloff) in wmbsCountByState.
  • sub case 2: plot the number of final failures for each agent by looking at keys matching '*failed' (e.g. jobfailed) in wmbsCountByState (see the sketch below).
  • note: both the per-agent and the aggregated view are interesting.
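
A similar sketch for the failure counts, again assuming an already-extracted 'wmbsCountByState' dictionary; the numbers and the exact state names are only illustrative:

```python
# Illustrative wmbsCountByState payload (made-up numbers and state names)
wmbsCountByState = {"created": 1200, "executing": 5400,
                    "jobcooloff": 30, "submitcooloff": 5,
                    "jobfailed": 12, "submitfailed": 1}

# non-final failures: every '*cooloff' state; final failures: every '*failed' state
non_final_failures = sum(v for k, v in wmbsCountByState.items() if k.endswith("cooloff"))
final_failures = sum(v for k, v in wmbsCountByState.items() if k.endswith("failed"))
```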

Use case 3 - amount of work in condor queue

  • sub case 1: check the number of jobs running in condor by looking at the 'Running' value of 'activeRunJobByStatus'.
  • sub case 2: check the number of jobs pending in condor by summing the 'New' and 'Idle' values of 'activeRunJobByStatus'.
  • note: both the per-agent and the aggregated view are important (a small sketch covering use cases 3 and 4 follows use case 4).

Use case 4 - idea of the WMAgent load

  • sub case 1: the 'total_query_time' metric should give us a rough idea of how the agent and the SQL database are performing. If this rises to the level of minutes, then something is not healthy in the agent.
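
Finally, a small sketch covering use cases 3 and 4 together, assuming the 'activeRunJobByStatus' and 'total_query_time' metrics have been extracted; the values and the alarm threshold are assumptions:

```python
# Illustrative metric values (made up)
activeRunJobByStatus = {"New": 10, "Idle": 2480, "Running": 2900}
total_query_time = 12  # seconds

# use case 3: work in the condor queue
running_in_condor = activeRunJobByStatus.get("Running", 0)
pending_in_condor = activeRunJobByStatus.get("New", 0) + activeRunJobByStatus.get("Idle", 0)

# use case 4: flag the agent if the SQL collection time grows towards minutes
QUERY_TIME_ALARM = 300  # seconds; the threshold is an assumption
agent_unhealthy = total_query_time > QUERY_TIME_ALARM
```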