WMCore debugging tools
This wiki is meant to list debugging use cases, either to solve/debug Operations issues or internal Dev ones.
Problem: Ops asks us to check why the workflow hasn't processed 100% of the lumi sections, even though all the failures have been recovered via ACDCs.
Solution: first we need to make sure that ACDCs have been created AND executed for every single task path (fileset_name, in terms of ACDC collection).
Details: what we need to retrieve/check is:
- did the ACDCs get created after the initial/original workflow moved to completed status?
- list the number of jobs/lumis in each fileset_name, from the ACDC collection
- query reqmgr2 for ACDC workflows recovering that workflow (and fetch their InitialTaskPath)
- make sure that those ACDC workflows are in completed status
- anything else
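The checks above can be sketched as a small helper. This is only a minimal sketch, assuming the ACDC collection filesets and the ACDC workflows have already been fetched (from the ACDC server and reqmgr2) into plain Python structures. The function name and the input shapes are hypothetical; only InitialTaskPath, RequestStatus and the completed status correspond to real request fields.

```python
def find_unrecovered_filesets(acdc_filesets, acdc_workflows, ok_statuses=("completed",)):
    """Return the fileset names (task paths) that still lack a completed ACDC.

    acdc_filesets: dict mapping fileset_name -> number of lumis to recover
                   (hypothetical shape, as summarised from the ACDC collection).
    acdc_workflows: list of dicts holding 'InitialTaskPath' and 'RequestStatus'
                    (hypothetical subset of the reqmgr2 request document).
    """
    # Task paths covered by an ACDC workflow that reached an accepted status
    recovered = {
        wf["InitialTaskPath"]
        for wf in acdc_workflows
        if wf.get("RequestStatus") in ok_statuses
    }
    # Filesets with lumis to recover but no accepted ACDC need attention
    return sorted(
        name for name, lumis in acdc_filesets.items()
        if lumis > 0 and name not in recovered
    )
```

Any fileset returned by this helper is a task path for which an ACDC either was never created or has not yet reached completed status.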
Problem: Ops asks us to investigate why the output datasets are missing statistics, even though there are no job failures reported (or they have all been recovered).
Solution: not necessarily a solution. However, part of the solution above has to be applied here, that is, check whether all lumis have been recovered. In addition, we could have a tool that takes a workflow as input, finds all the run/lumis meant to be processed, randomly selects one output dataset and compares it against the input dataset, finally yielding a list of run/lumis missing in the output dataset.
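The core of such a tool is a run/lumi difference. The following is a minimal sketch, assuming the run/lumi lists of both datasets have already been retrieved (for instance from DBS) into dictionaries; the function name and the dictionary shape are hypothetical.

```python
def missing_run_lumis(input_lumis, output_lumis):
    """Return {run: sorted lumis} present in the input but not in the output.

    Both arguments map a run number to an iterable of lumi section numbers
    (hypothetical shape; the real data would come from the DM/bookkeeping system).
    """
    missing = {}
    for run, lumis in input_lumis.items():
        # Lumis expected for this run that the output dataset does not have
        diff = set(lumis) - set(output_lumis.get(run, ()))
        if diff:
            missing[run] = sorted(diff)
    return missing
```

An empty dictionary means that the selected output dataset covers every run/lumi of the input.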
Problem: When we are completing the agent draining procedure, there are some rare cases where subscriptions are stuck in unfinished state (finished=0). It also usually means that there is at least one GQ workqueue element in Running state (and its equivalent LQ workqueue/workqueue_inbox element).
Solution: there are many possible reasons for having a subscription stuck, so there is no common solution. Among the checks we can perform are: correlate the subscription to its fileset and workflow task; check whether they have files either in the available or acquired tables.
Details: further details can be extracted from this github issue: https://github.com/dmwm/WMCore/issues/9568
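The correlation described in the solution above could be done with a single query. This is a sketch only: the table and column names (wmbs_subscription, wmbs_workflow, wmbs_fileset, wmbs_sub_files_available, wmbs_sub_files_acquired) are assumed to match the WMBS schema of the agent and must be verified against the actual database. The helper name is hypothetical.

```python
# ASSUMPTION: table/column names follow the WMBS schema; verify before use.
STUCK_SUBSCRIPTIONS_SQL = """
SELECT sub.id AS subscription_id,
       wf.name AS workflow_name,
       wf.task AS task_path,
       fs.name AS fileset_name,
       (SELECT COUNT(*) FROM wmbs_sub_files_available av
         WHERE av.subscription = sub.id) AS available_files,
       (SELECT COUNT(*) FROM wmbs_sub_files_acquired ac
         WHERE ac.subscription = sub.id) AS acquired_files
FROM wmbs_subscription sub
INNER JOIN wmbs_workflow wf ON wf.id = sub.workflow
INNER JOIN wmbs_fileset fs ON fs.id = sub.fileset
WHERE sub.finished = 0
ORDER BY sub.id
"""


def list_unfinished_subscriptions(conn):
    """Run the query on a DB-API connection and return the rows as tuples."""
    cursor = conn.cursor()
    cursor.execute(STUCK_SUBSCRIPTIONS_SQL)
    return cursor.fetchall()
```

Each row gives the subscription id, workflow name, task path and fileset name, followed by the number of files still in the available and acquired tables.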
Problem: A workqueue element gets stuck in the agent because it no longer passes the data location constraints. This is especially common with ACDC workflows, even though it can also happen to other workflow types. The GQE is pulled down by the agent (thus it passes all the data constraints), but those elements can live in the local workqueue for a while until data is inserted in WMBS and jobs are created. Meanwhile, the input/secondary data location may change, so the LQE no longer passes the data constraints and gets stuck in the agent database.
Solution: the real solution I'd like to pursue is that, once an agent pulls work down, it no longer updates the data location and creates jobs right away. For now, however, we could provide a tool to the Operations team so that they can identify where the GQE is stuck, fetch the corresponding LQE and run the check on the possible list of sites (this would give them the reason why the LQE is stuck).
Details: requires access to the agents.
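The site check that such a tool would run can be approximated as below. This is a simplified sketch, not the actual WMCore logic: the function name and argument shapes are hypothetical, and it ignores options such as TrustSitelists that relax the data location constraints.

```python
def explain_site_constraints(site_whitelist, site_blacklist, data_locations):
    """Return (possible_sites, reason) for a workqueue element.

    site_whitelist / site_blacklist: site names taken from the element.
    data_locations: dict mapping each input (or pileup) block/dataset name to
                    the sites currently hosting it.
    All shapes are hypothetical simplifications of the element document.
    """
    allowed = set(site_whitelist) - set(site_blacklist)
    if not allowed:
        return set(), "no site left after applying the blacklist"
    possible = set(allowed)
    for name, sites in data_locations.items():
        # Keep only the sites that also host this piece of data
        possible &= set(sites)
        if not possible:
            # Reports the first data item that empties the candidate list
            return set(), "no allowed site hosts %s" % name
    return possible, "ok"
```

An empty set of possible sites, together with the reason string, tells the operator which constraint the LQE currently fails.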
Problem: While traversing the whole chain from ReqMgr to the final worker node for calculation and back, a workflow can get stuck in any state (from 'new' to 'announced' [1]). For each of those states there are respective components in the system holding the workflow at that moment.
[1] https://github.com/dmwm/WMCore/blob/master/doc/wmcore/RequestStateTransition.png
Solution: As an example, the workflow may stay in 'acquired' or 'running-open' in the agent, but no jobs may have been created for it in Condor. The corresponding action in this case is to find the workflow in the local workqueue and, eventually, its jobs (if there are any) in the condor queue, and compare the results. One way of querying the local workqueue is to tunnel to the agent and access the CouchDB Futon interface. An alternative is to parse the WorkQueueManager logs, or to use dedicated scripts to access the workqueue [3]. For the condor queue, a simple condor_q with the proper constraints will do.
[3] https://github.com/amaltaro/ProductionTools/blob/master/getWQStatusByWorkflow.py
Details: This was just one of the possible status transitions discussed above. We need to add similar details for all the rest of the status transitions [2].
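The condor_q check mentioned in the solution above might look like the sketch below. It assumes that the job ClassAd attribute carrying the workflow name is WMAgent_RequestName; confirm this by inspecting a job ad with condor_q -long before relying on it. Both function names are hypothetical.

```python
import subprocess


def build_condor_q_command(workflow_name):
    """Build a condor_q call that summarises the jobs of one workflow.

    ASSUMPTION: WMAgent_RequestName is the ClassAd attribute holding the
    workflow name in the agent's schedd.
    """
    constraint = 'WMAgent_RequestName == "%s"' % workflow_name
    # -totals prints only the summary line (idle, running, held, ...)
    return ["condor_q", "-constraint", constraint, "-totals"]


def condor_totals(workflow_name):
    """Run condor_q on the agent node and return its raw text output."""
    result = subprocess.run(
        build_condor_q_command(workflow_name),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

If condor_q reports zero jobs while the local workqueue still holds elements for the workflow, the workflow is stuck before job submission rather than in the batch system.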
Problem: There might be a situation where we need to invalidate (in PhEDEx and DBS) blocks produced by a given workflow. Among the reasons, it could be that there were two workflows writing to the same output (like a duplicate ACDC).
Solution: we need to find out which agents were processing that given workflow. With that information in hand, we can then query their local SQL database and list all the output blocks (from all the tasks). What to do with the output blocks afterwards is out of the scope of this debugging.
Details: a SQL query like the following can yield all the output blocks (starting from files associated to blocks) for a given workflow:
SELECT DISTINCT dbsbuffer_block.id AS blockid, dbsbuffer_block.blockname AS blockname FROM dbsbuffer_block
INNER JOIN dbsbuffer_file ON dbsbuffer_block.id = dbsbuffer_file.block_id
INNER JOIN dbsbuffer_workflow ON dbsbuffer_file.workflow = dbsbuffer_workflow.id
WHERE dbsbuffer_workflow.name='cmsunified_ACDC0_Run2016B-v2-ZeroBias2-21Feb2020_UL2016_HIPM_1068p1_200313_133114_5167';
Problem: We need to start integrating Rucio with our utility scripts, such as the one used for workflow validation/integration and the one for debugging the number of files in each DM system.
Solution: address what is requested in this GH issue: https://github.com/dmwm/WMCore/issues/9620. Another possibility would be to integrate it into our WMCore toolset, but that is better done at a later stage.
Problem: Sometimes there are workflows that change their TotalInputLumis during the lifetime of the workflow, so the check executed by Unified (against the request dictionary) might be using outdated estimated lumi sections.
Solution: We need to re-calculate the estimated number of lumi sections in the input and compare the output against that number, providing the final workflow completion (number of lumis and completion ratio). Note that, if the workflow has no input dataset at all, TotalInputLumis cannot have changed and there is nothing else to be done with the input dataset, just a comparison of the output lumis against the already provided estimated lumis.
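The final comparison can be sketched as follows, assuming the number of lumis per output dataset and the (re-calculated or already provided) input estimate are at hand. The function name and the input shape are hypothetical.

```python
def completion_by_dataset(output_lumis, estimated_input_lumis):
    """Return {dataset: (lumis, completion ratio)} for each output dataset.

    output_lumis: dict mapping output dataset name -> number of lumis in it.
    estimated_input_lumis: re-calculated estimate for workflows with an input
        dataset, or TotalInputLumis from the request when there is no input.
    """
    if estimated_input_lumis <= 0:
        raise ValueError("estimated number of input lumis must be positive")
    return {
        dataset: (lumis, lumis / estimated_input_lumis)
        for dataset, lumis in output_lumis.items()
    }
```

A ratio below 1.0 for any output dataset points at lumis that still need to be recovered or explained.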