
Request Status

lucacopa edited this page Nov 22, 2013 · 16 revisions

This page describes the possible statuses of a request in ReqMgr.

State transitions

[Figure: request state transition diagram]

Legend:

  • M: It represents a transition that requires human intervention.
  • A: It represents a transition made automatically by some component in the system, e.g. global WorkQueue or WMAgent.

States

new

A newly created request. This status is usually skipped: requests created with the script tools go directly to assignment-approved.

assignment-approved

Requests in this status are awaiting review and assignment by the CompOps L2s. A request is moved to rejected if there is a problem with it; otherwise it is assigned.

assigned

Assigned requests have been reviewed and modified by the CompOps L2s, who provide them with an appropriate site whitelist, acquisition era, processed dataset and other attributes. Requests in this state are examined by the WorkQueue and moved to acquired once the work elements have been created. If work element creation fails, the request is moved to failed.

acquired

Acquired requests have been split by the global WorkQueue into work elements, but no work element has yet been injected into the SQL database (i.e. WMBS) of any WMAgent, so the request is not yet considered to be running.

running-open

Running-open requests have at least one work element injected into WMBS and likely have jobs running as well. They are marked as open because new work elements can still be acquired from the input data. If there is an open block in the input dataset, the request remains in running-open until all input blocks are closed and a specified time has passed since the last block was closed. Production requests, i.e. MonteCarlo and LHEStepZero, skip the running-open state.
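The closing condition described above can be sketched as a small predicate. This is only an illustration of the rule as stated on this page, not the actual WorkQueue code: the function name, the `(is_open, closed_at)` tuple layout and the `close_delay` parameter are all hypothetical.

```python
from datetime import datetime, timedelta


def can_move_to_running_closed(blocks, now, close_delay):
    """Return True if a running-open request may move to running-closed.

    blocks      -- list of (is_open, closed_at) tuples, one per input block
                   (hypothetical layout; closed_at is None while a block is open)
    now         -- current time as a datetime
    close_delay -- timedelta that must elapse after the last block was closed
    """
    # Any open block keeps the request in running-open.
    if any(is_open for is_open, _ in blocks):
        return False
    # Assumption: with no input blocks there is nothing left to wait for.
    if not blocks:
        return True
    # All blocks are closed: wait until the delay has passed since the last closing.
    last_closed = max(closed_at for _, closed_at in blocks)
    return now - last_closed >= close_delay
```

For example, a request whose only input block was closed two hours ago would qualify with a one-hour delay, while the same request checked thirty minutes after the closing would not.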

running-closed

Running-closed requests have been in the running-open state long enough that new input data, i.e. more blocks in the input dataset, is unlikely to appear. They are guaranteed to have at least one work element injected into WMBS in one of the WMAgents and are likely to have jobs running.

completed

A request is marked as completed after all work elements are done, which means that the WMAgent(s) have processed all the jobs generated by each work element. This includes not only the top-level task, but also the auxiliary ones such as log collection and cleanup of unmerged data. CompOps then review a completed request to verify whether it succeeded. If the output is considered satisfactory the request is moved to closed-out; otherwise it is moved to rejected.

Note that a request in completed is not guaranteed to have all its output data registered in DBS and/or PhEDEx. This is usually taken for granted, but there are failure cases in which the registration does not happen automatically.

closed-out

Closed out status indicates that the output has been reviewed and is ready to be announced back to the requestors.

announced

An announced request has been announced to the requestors using the usual channels and can be archived.

normal-archived

After a request is announced, the WMAgent cleans up most of its monitoring information from the system, and the request is then marked as normal-archived.

rejected

A request is moved to rejected when it is considered invalid at assignment or when the produced output is not satisfactory.

rejected-archived

After a request is rejected, the WMAgent cleans up most of its monitoring information from the system, and the request is then marked as rejected-archived.

failed

A failed request has had a failure in one of its work elements, or did not produce any. Failed requests can be re-evaluated and reassigned to run again, or moved to rejected if the problem is unrecoverable.

aborted

If there is an unrecoverable problem with a request after it has been acquired, it can be moved to the aborted state. This triggers an internal action that kills all current jobs and runs only auxiliary tasks such as unmerged data cleanup and log collection. Once these actions are completed, the request is moved to aborted-completed.

aborted-completed

A request is marked as aborted-completed after all left-over jobs of an aborted request have been processed. A request in this state has been cleaned up from the WMAgents and the global WorkQueue and is ready to be archived.

aborted-archived

A request is moved to aborted-archived after it is aborted-completed and all monitoring information related to it has been cleaned up.
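The state descriptions above can be summarized as a transition table. The following is a minimal sketch derived only from the text of this page, not from the ReqMgr source: the `TRANSITIONS` and `can_transition` names are hypothetical, and a few edges are assumptions (noted in the comments) where the descriptions are not explicit.

```python
# Allowed next states for each request status, as read from the descriptions above.
TRANSITIONS = {
    "new": {"assignment-approved"},
    "assignment-approved": {"assigned", "rejected"},
    "assigned": {"acquired", "failed"},
    # Production requests (MonteCarlo, LHEStepZero) skip running-open.
    # Assumption: "after it has been acquired" means aborted is reachable from
    # acquired and from both running states.
    "acquired": {"running-open", "running-closed", "aborted"},
    "running-open": {"running-closed", "aborted"},
    "running-closed": {"completed", "aborted"},
    "completed": {"closed-out", "rejected"},
    "closed-out": {"announced"},
    "announced": {"normal-archived"},
    "rejected": {"rejected-archived"},
    # Assumption: "reassigned" is modelled as a move back to assigned.
    "failed": {"assigned", "rejected"},
    "aborted": {"aborted-completed"},
    "aborted-completed": {"aborted-archived"},
    # Archived states are terminal.
    "normal-archived": set(),
    "rejected-archived": set(),
    "aborted-archived": set(),
}


def can_transition(current, target):
    """Return True if moving from `current` to `target` is allowed by the table."""
    return target in TRANSITIONS.get(current, set())
```

For instance, `can_transition("completed", "closed-out")` holds, while `can_transition("announced", "rejected")` does not. The table does not distinguish manual (M) from automatic (A) transitions.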

Stuck Issues

Common issues in different states, and their solutions:

assigned

  1. Issue: Too many workflows stuck in assigned state

Problem: The team is not set properly. An agent pulls a block only if the team is set properly. The teams are: mc, mc_highprio, repro_lowprio, repro_highprio, step0, hlt, relval.
Solution: Send an elog about the problem; the workflow has to be resubmitted.

Problem: The site is not assigned properly. Some workflows (e.g. ACDCs) need to be submitted to specific sites.
Solution: There is no single solution; each case has to be checked. If the site where the block is available is not in the whitelist, send an elog about it.

Problem: The whitelist only includes the _Disk endpoint of a site.
Solution: For sites where disk and tape are separated, send an elog and recommend sending the request to both T1_XX_XXXX and T1_XX_XXXX_Disk. Read this: https://cmslogbook.cern.ch/elog/Workflow+processing/11311
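Two of the problems above (an unknown team, and a whitelist that contains only the _Disk entry of a site) can be spotted with a simple pre-check before sending an elog. The sketch below is illustrative only: `check_assignment` is a hypothetical helper, not part of ReqMgr, and the team list is copied from this page and may be out of date.

```python
# Team names as listed on this page.
KNOWN_TEAMS = {
    "mc", "mc_highprio", "repro_lowprio", "repro_highprio",
    "step0", "hlt", "relval",
}

DISK_SUFFIX = "_Disk"


def check_assignment(team, site_whitelist):
    """Return a list of human-readable problems found in the assignment."""
    problems = []
    if team not in KNOWN_TEAMS:
        problems.append(f"unknown team: {team!r}")
    sites = set(site_whitelist)
    for site in sorted(sites):
        # Flag a _Disk entry whose matching base site is missing from the whitelist.
        if site.endswith(DISK_SUFFIX) and site[:-len(DISK_SUFFIX)] not in sites:
            problems.append(f"whitelist has {site} but not {site[:-len(DISK_SUFFIX)]}")
    return problems
```

An empty result only means that these two checks passed; it says nothing about whether the whitelisted sites actually host the input blocks, which still has to be verified case by case.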

acquired
