WMAgent monitoring

This wiki describes aspects of WMAgent that require better monitoring, with the aim of easing the debugging process. In general, there will be two major tasks (maybe three):

fetch the necessary information with the AgentStatusWatcher component (following a defined polling cycle and keeping in mind the additional load on the agent)
publish this information and make it retrievable via http
work on data visualization (possibly in the far future)

The main idea behind these improvements is to make the agent status information (internal queues, database tables, job status, site/task thresholds, component status, etc.) easily accessible, and so make the operator's and developer's life easier.
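As a rough illustration of the fetch step, the sketch below runs a set of collector functions once per polling cycle and records how long each one takes, so that the extra load on the agent can be tracked. The function `run_collectors` and the collector callables are hypothetical names chosen for this example; they are not part of AgentStatusWatcher.

```python
import time


def run_collectors(collectors):
    """Run each collector once; keep its result, error and elapsed time.

    `collectors` is a hypothetical dict mapping a metric name to a
    zero-argument callable that returns JSON-serializable data.
    """
    snapshot = {}
    for name, func in collectors.items():
        start = time.monotonic()
        try:
            data, error = func(), None
        except Exception as exc:
            # A failing collector must not break the whole polling cycle.
            data, error = None, str(exc)
        snapshot[name] = {
            "data": data,
            "error": error,
            "elapsed_secs": time.monotonic() - start,
        }
    return snapshot
```

Timing every collector separately makes it easier to spot which query is too expensive for the chosen polling cycle.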

How to proceed

A proposal on how to progress on this development would be (Alan's opinion):

implement a reasonable part (50%?) of the monitoring points as described below
make this information initially available in the component log (or a json file)
publish this information either to LogDB or to the HTCondor collector (HTCondor seems to be easier and better); still to be investigated
the information needs to be queryable (API) and accessible via http
At this point, stop sending this info to the component log.
In the future (several months ahead or more), we could work on another service/tool that would present this json data in a user-friendly (graphs, tables, etc.) and centralized way, hoping to get rid of most of the personal monitoring scripts.
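The "json file" step in the proposal above could be sketched as follows. This is a minimal example under assumptions: `write_snapshot`, the payload layout and the file name are hypothetical, not an existing WMAgent interface. The file is written atomically so that anything reading it never sees a half-written document.

```python
import json
import os
import tempfile
import time


def write_snapshot(metrics, path):
    """Atomically write `metrics` (a JSON-serializable dict) to `path`."""
    payload = {"timestamp": int(time.time()), "metrics": metrics}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it,
    # so readers only ever see a complete file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise
    return payload
```

A call such as `write_snapshot({"wmbs_jobs": {"created": 10}}, "agent_status.json")` would then be made once per polling cycle.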

What to monitor

Most of what is listed here belongs to WMAgent; however, some aspects of the central Global Queue need to be monitored as well. The initial and most important aspects are:

For a given agent, WMBS info for:
number of jobs in WMBS in each state.
number of jobs in WMBS in 'created' status, sorted by job type
BONUS: also sort by job priority and site
number of jobs in WMBS in 'executing' status, sorted by job type (should be equal to the number of jobs in condor)
thresholds for data acquisition from Global Queue (GQ) to Local Queue (LQ)
thresholds for job creation (LQ to WMBS)
thresholds for job submission (WMBS to condor global pool)
total running and pending thresholds for each site in Drain or Normal status (and their status)
For a given agent, WorkQueue info for:
number of local workqueue elements in each state (+ the total number of possible AND unique number of jobs per site)
number of local workqueue_inbox elements in each state (+ the total number of possible AND unique number of jobs per site)
worst workqueue element offenders (single elements that create > 30k jobs) in workqueue_inbox in Acquired or Running status.
BONUS: number of workqueue elements in workqueue and number of jobs sorted by priority
NOTE: the number of jobs retrieved from workqueue elements cannot be taken for granted, because it neither counts utilitarian jobs nor gives a precise number for chained requests.
Agent health:
for each registered thread in a Daemon, the current state (running/idle), the time since last successful execution, the time in the current state, the length of the last cycle, and the status of the last cycle (success / failure).
For WorkQueueManager, the # of blocks obtained from GQ and # of blocks obtained by LQ
For JobCreator, the number of jobs created in the last cycle.
For JobSubmitter, the number of jobs submitted in the last cycle.
Central Global Queue (job counts are not accurate since they do not count further steps):
for Available workqueue elements:
1. total number of elements and total number of estimated jobs
2. number of WQE, estimated number of possible AND unique jobs sorted by team name
3. number of WQE, estimated number of possible AND unique jobs sorted by request priority
4. number of WQE, estimated number of possible AND unique jobs sorted by site
5. WQE without a common site list (that does not pass the work restrictions)
6. WQE older than 7 days (or whatever number we decide)
7. WQE that create > 30k jobs (or whatever number we decide)
for Acquired workqueue elements:
1. total number of elements and total number of estimated jobs
2. number of WQE, estimated number of possible AND unique jobs sorted by team name
3. number of WQE, estimated number of possible AND unique jobs sorted by agent (ChildQueueUrl)
4. number of WQE, estimated number of possible AND unique jobs sorted by request priority
5. number of WQE, estimated number of possible AND unique jobs sorted by site
6. WQE older than 7 days (or whatever the number is)
for Running workqueue elements:
1. total number of elements and total number of estimated jobs
2. number of WQE, estimated number of possible AND unique jobs sorted by team name
3. number of WQE, estimated number of possible AND unique jobs sorted by agent (ChildQueueUrl)
4. number of WQE, estimated number of possible AND unique jobs sorted by request priority
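Several of the workqueue items above reduce to the same aggregation: group elements by one attribute, count elements and estimated jobs, and flag elements that are too large or too old. The sketch below shows one way to do that on plain dictionaries. The keys `TeamName`, `Jobs` and `InsertTime` are assumptions made for this example and may not match the real workqueue element schema; the 30k jobs and 7 days limits are the placeholder values from the list. It does not attempt the "possible AND unique jobs per site" calculation.

```python
from collections import defaultdict

JOB_LIMIT = 30000               # elements creating more jobs are "offenders"
MAX_AGE_SECS = 7 * 24 * 3600    # elements older than 7 days are flagged


def summarize_elements(elements, now, group_key="TeamName"):
    """Count elements and estimated jobs per `group_key`, and flag outliers.

    `elements` is a list of dicts (hypothetical schema); `now` and
    `InsertTime` are epoch seconds.
    """
    summary = defaultdict(lambda: {"num_elem": 0, "sum_jobs": 0})
    too_many_jobs, too_old = [], []
    for elem in elements:
        jobs = elem.get("Jobs", 0)
        bucket = summary[elem.get(group_key)]
        bucket["num_elem"] += 1
        bucket["sum_jobs"] += jobs
        if jobs > JOB_LIMIT:
            too_many_jobs.append(elem)
        if now - elem.get("InsertTime", now) > MAX_AGE_SECS:
            too_old.append(elem)
    return {
        "by_group": dict(summary),
        "too_many_jobs": too_many_jobs,
        "too_old": too_old,
    }
```

Passing `group_key="Priority"` or `group_key="ChildQueueUrl"` (again, assumed key names) would give the per-priority and per-agent views with the same function.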
