Current behaviour
When a workflow using Dask fails, the errors may not be fully captured and exposed back to users.
Here is one observation from running the reana-demo-dask-coffea example.
The status shows that the workflow failed:
$ rcg status -w dask-coffea-serial-kubernetes
NAME                            RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS   PROGRESS
dask-coffea-serial-kubernetes   1            2025-01-28T10:02:38   2025-01-28T10:02:51   2025-01-28T10:04:03   failed   0/1
The workflow logs show the same:
$ rcg logs -w dask-coffea-serial-kubernetes
==> Workflow engine logs
2025-01-28 10:02:51,384 | root | MainThread | INFO | Publishing step:0, cmd: python analysis.py, total steps 1 to MQ
2025-01-28 10:04:03,456 | root | MainThread | INFO | Workflow a98dfa47-b7a2-4979-b75f-44ee4b224776 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/a98dfa47-b7a2-4979-b75f-44ee4b224776.

==> Job logs
==> Step: process
==> Workflow ID: a98dfa47-b7a2-4979-b75f-44ee4b224776
==> Compute backend: Kubernetes
==> Job ID: reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4
==> Docker image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
==> Command: python analysis.py
==> Status: failed
==> Started: 2025-01-28T10:02:51
==> Finished: 2025-01-28T10:03:59
==> Logs:
...
-------------------------------------------------------------------
-------------------------------------------------------------------
----------------        DASK WORKER LOGS        ----------------
-------------------------------------------------------------------
-------------------------------------------------------------------
/usr/local/lib/python3.10/site-packages/distributed/cli/dask_worker.py:264: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
  warnings.warn(
2025-01-28 10:02:56,570 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.244.0.62:44707'
2025-01-28 10:02:57,354 - distributed.worker - INFO - Start worker at: tcp://10.244.0.62:41989
2025-01-28 10:02:57,354 - distributed.worker - INFO - Listening to: tcp://10.244.0.62:41989
2025-01-28 10:02:57,354 - distributed.worker - INFO - Worker name: reana-run-dask-a98dfa47-default-worker-1b35478060
2025-01-28 10:02:57,354 - distributed.worker - INFO - dashboard at: 10.244.0.62:8788
2025-01-28 10:02:57,354 - distributed.worker - INFO - Waiting to connect to: tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
2025-01-28 10:02:57,354 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,354 - distributed.worker - INFO - Threads: 16
2025-01-28 10:02:57,354 - distributed.worker - INFO - Memory: 2.00 GiB
2025-01-28 10:02:57,354 - distributed.worker - INFO - Local Directory: /tmp/dask-worker-space/worker-qrbfmwwd
2025-01-28 10:02:57,354 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,360 - distributed.worker - INFO - Registered to: tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
2025-01-28 10:02:57,360 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,361 - distributed.core - INFO - Starting established connection to tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Electron_photonIdx => Photon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FatJet_subJetIdx1 => SubJet
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FatJet_subJetIdx2 => SubJet
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FsrPhoton_muonIdx => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_electronIdx1 => Electron
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_electronIdx2 => Electron
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx1 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx2 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Muon_fsrPhotonIdx => FsrPhoton
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Photon_electronIdx => Electron
  warnings.warn(
However, when one consults the Kubernetes pod logs directly, one can see the root cause in the job pod:
$ k logs reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4-rvwr9
Defaulted container "job" out of: job, krb5-renew, krb5-init (init)
Matplotlib created a temporary cache directory at /tmp/matplotlib-t6r42k3b because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[######
Traceback (most recent call last):
  File "/var/reana/users/00000000-0000-0000-0000-000000000000/workflows/a98dfa47-b7a2-4979-b75f-44ee4b224776/analysis.py", line 65, in <module>
    output, metrics = run(fileset, "Events", processor_instance=Processor())
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1700, in __call__
    wrapped_out = self.run(fileset, processor_instance, treename)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1848, in run
    wrapped_out, e = executor(chunks, closure, None)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 974, in __call__
    else _decompress(work.result())
  File "/usr/local/lib/python3.10/site-packages/distributed/client.py", line 287, in result
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 221, in __call__
    out = self.function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1367, in automatic_retries
    raise e
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1336, in automatic_retries
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1575, in _work_function
    filecontext = uproot.open(
  File "/usr/local/lib/python3.10/site-packages/uproot/reading.py", line 141, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.10/site-packages/uproot/reading.py", line 580, in __init__
    self._source = Source(
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 272, in __init__
    self._open()
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 275, in _open
    self._resource = XRootDResource(self._file_path, self._timeout)
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 83, in __init__
    self._open()
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 92, in _open
    self._xrd_error(status)
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 118, in _xrd_error
    raise OSError(
OSError: XRootD error: [ERROR] Operation expired
in file root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root
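For reference, the same container logs can also be fetched programmatically. Below is a minimal sketch using the official kubernetes Python client, assuming the pod name and the default namespace from this particular run; it simply mirrors what the kubectl command above does and explicitly selects the "job" container, which is where the XRootD traceback lives:

from kubernetes import client, config

# Illustration only: read the logs of the "job" container of the failed
# REANA run-job pod. The pod name and namespace below are taken from this
# example run and will differ in other deployments.
config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

pod_name = "reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4-rvwr9"
job_logs = v1.read_namespaced_pod_log(
    name=pod_name,
    namespace="default",
    container="job",  # the pod also has krb5-renew and krb5-init containers
)
print(job_logs)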
This error did not seem to have been captured and exposed back to the user.
Expected behaviour
The user should see in the logs that the workflow failed due to an XRootD "Operation expired" data access error.
Note
It looks like we may need to check for exceptional situations when capturing and merging the Dask worker logs and the REANA job logs.
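For illustration, a defensive merge could look roughly like the sketch below. The function and its fetch_job_logs / fetch_dask_worker_logs callables are hypothetical placeholders, not the actual reana-workflow-controller code; the point is only that a failure while fetching one log source should be reported in the output rather than silently dropping the other source, which may contain the real traceback.

import logging

log = logging.getLogger(__name__)


def merge_job_and_dask_logs(fetch_job_logs, fetch_dask_worker_logs):
    """Hypothetical sketch: combine REANA job logs and Dask worker logs.

    Each source is fetched independently, so an exception raised while
    retrieving one source does not hide the other.
    """
    sections = []
    for title, fetch in [
        ("JOB LOGS", fetch_job_logs),
        ("DASK WORKER LOGS", fetch_dask_worker_logs),
    ]:
        try:
            sections.append(f"==> {title}:\n{fetch()}")
        except Exception as exc:  # surface the problem instead of truncating output
            log.exception("Could not fetch %s", title)
            sections.append(f"==> {title}: could not be retrieved ({exc})")
    return "\n".join(sections)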
Did you trim the logs between the job logs and the worker logs? The error is captured in the job logs, but you are looking at the worker logs in the output of the rcg logs -w dask-coffea-serial-kubernetes command. Can you paste a full log example for a complete reference?
Note 1: I assumed that by the three dots at the bottom of the job logs, you trimmed all the logs between the job logs and the scheduler logs.
Note 2: I tried to reproduce the issue, and I was able to capture the error 5/5 times in the job logs.