logs: better capturing of Dask and job logs in error situations #739

Open

@tiborsimko (Member) opened this issue on Jan 28, 2025 · 1 comment

Current behaviour

When a workflow using Dask fails, the errors may not be fully captured and exposed back to users.

Here is one observation from running the reana-demo-dask-coffea example.

The status shows that the workflow failed:

$ rcg status -w dask-coffea-serial-kubernetes
NAME                            RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS   PROGRESS
dask-coffea-serial-kubernetes   1            2025-01-28T10:02:38   2025-01-28T10:02:51   2025-01-28T10:04:03   failed   0/1

The logs show the same:

$ rcg logs -w dask-coffea-serial-kubernetes

==> Workflow engine logs
2025-01-28 10:02:51,384 | root | MainThread | INFO | Publishing step:0, cmd: python analysis.py, total steps 1 to MQ
2025-01-28 10:04:03,456 | root | MainThread | INFO | Workflow a98dfa47-b7a2-4979-b75f-44ee4b224776 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/a98dfa47-b7a2-4979-b75f-44ee4b224776.



==> Job logs
==> Step: process
==> Workflow ID: a98dfa47-b7a2-4979-b75f-44ee4b224776
==> Compute backend: Kubernetes
==> Job ID: reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4
==> Docker image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
==> Command: python analysis.py
==> Status: failed
==> Started: 2025-01-28T10:02:51
==> Finished: 2025-01-28T10:03:59
==> Logs:

...


                -------------------------------------------------------------------
                -------------------------------------------------------------------
                ----------------         DASK WORKER LOGS          ----------------
                -------------------------------------------------------------------
                -------------------------------------------------------------------




                /usr/local/lib/python3.10/site-packages/distributed/cli/dask_worker.py:264: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
  warnings.warn(
2025-01-28 10:02:56,570 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.244.0.62:44707'
2025-01-28 10:02:57,354 - distributed.worker - INFO -       Start worker at:    tcp://10.244.0.62:41989
2025-01-28 10:02:57,354 - distributed.worker - INFO -          Listening to:    tcp://10.244.0.62:41989
2025-01-28 10:02:57,354 - distributed.worker - INFO -           Worker name: reana-run-dask-a98dfa47-default-worker-1b35478060
2025-01-28 10:02:57,354 - distributed.worker - INFO -          dashboard at:           10.244.0.62:8788
2025-01-28 10:02:57,354 - distributed.worker - INFO - Waiting to connect to: tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
2025-01-28 10:02:57,354 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,354 - distributed.worker - INFO -               Threads:                         16
2025-01-28 10:02:57,354 - distributed.worker - INFO -                Memory:                   2.00 GiB
2025-01-28 10:02:57,354 - distributed.worker - INFO -       Local Directory: /tmp/dask-worker-space/worker-qrbfmwwd
2025-01-28 10:02:57,354 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,360 - distributed.worker - INFO -         Registered to: tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
2025-01-28 10:02:57,360 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,361 - distributed.core - INFO - Starting established connection to tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Electron_photonIdx => Photon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FatJet_subJetIdx1 => SubJet
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FatJet_subJetIdx2 => SubJet
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FsrPhoton_muonIdx => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_electronIdx1 => Electron
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_electronIdx2 => Electron
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx1 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx2 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Muon_fsrPhotonIdx => FsrPhoton
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Photon_electronIdx => Electron
  warnings.warn(

However, when one consults the Kubernetes pod logs directly, one can see the root cause in the job pod:

$ k logs reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4-rvwr9
Defaulted container "job" out of: job, krb5-renew, krb5-init (init)
Matplotlib created a temporary cache directory at /tmp/matplotlib-t6r42k3b because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[######                                Traceback (most recent call last):
  File "/var/reana/users/00000000-0000-0000-0000-000000000000/workflows/a98dfa47-b7a2-4979-b75f-44ee4b224776/analysis.py", line 65, in <module>
    output, metrics = run(fileset, "Events", processor_instance=Processor())
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1700, in __call__
    wrapped_out = self.run(fileset, processor_instance, treename)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1848, in run
    wrapped_out, e = executor(chunks, closure, None)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 974, in __call__
    else _decompress(work.result())
  File "/usr/local/lib/python3.10/site-packages/distributed/client.py", line 287, in result
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 221, in __call__
    out = self.function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1367, in automatic_retries
    raise e
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1336, in automatic_retries
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1575, in _work_function
    filecontext = uproot.open(
  File "/usr/local/lib/python3.10/site-packages/uproot/reading.py", line 141, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.10/site-packages/uproot/reading.py", line 580, in __init__
    self._source = Source(
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 272, in __init__
    self._open()
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 275, in _open
    self._resource = XRootDResource(self._file_path, self._timeout)
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 83, in __init__
    self._open()
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 92, in _open
    self._xrd_error(status)
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 118, in _xrd_error
    raise OSError(
OSError: XRootD error: [ERROR] Operation expired
in file root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root

This error did not seem to have been captured and exposed back to the user.
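
For reference, fetching the job container logs directly from the Kubernetes API is roughly what a log-capturing component would have to do to preserve this traceback after the pod fails. Below is a minimal sketch using the official kubernetes Python client; the pod name and the "job" container name are taken from the example above, and this is not REANA's actual implementation.

# Minimal sketch (not REANA's actual code): fetch the "job" container logs
# of a failed run-job pod directly from the Kubernetes API, so that the
# traceback above is not lost when the job errors out.
from kubernetes import client, config


def fetch_job_pod_logs(pod_name: str, namespace: str = "default") -> str:
    """Return the logs of the "job" container of a run-job pod."""
    config.load_kube_config()  # use config.load_incluster_config() inside the cluster
    core_v1 = client.CoreV1Api()
    try:
        return core_v1.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            container="job",  # skip the krb5-renew/krb5-init sidecars
        )
    except client.exceptions.ApiException as exc:
        # The pod may already have been garbage-collected; report that
        # instead of silently returning an empty string.
        return f"Could not retrieve logs for pod {pod_name}: {exc}"


if __name__ == "__main__":
    print(fetch_job_pod_logs("reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4-rvwr9"))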

Expected behaviour

The user should see in the logs that the workflow failed due to an XRootD "Operation expired" data access error.

Note

It looks like we may need to handle exceptional situations when capturing the Dask worker logs and the REANA job logs and merging them.
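
To illustrate the kind of defensive merging meant here, the sketch below combines the job logs and the Dask worker logs so that a failure while fetching one source does not discard the other. The get_reana_job_logs and get_dask_worker_logs helpers are hypothetical placeholders, not existing REANA functions.

# Hypothetical sketch of defensive log merging; get_reana_job_logs() and
# get_dask_worker_logs() are placeholders for whatever REANA uses internally.
import logging

DASK_BANNER = (
    "----------------         DASK WORKER LOGS          ----------------"
)


def merge_logs(get_reana_job_logs, get_dask_worker_logs) -> str:
    """Merge job and Dask worker logs, keeping whatever could be captured."""
    sections = []

    try:
        sections.append(get_reana_job_logs())
    except Exception:
        logging.exception("Could not capture REANA job logs")
        sections.append("(REANA job logs could not be captured)")

    try:
        sections.append(DASK_BANNER + "\n" + get_dask_worker_logs())
    except Exception:
        logging.exception("Could not capture Dask worker logs")
        sections.append("(Dask worker logs could not be captured)")

    # Even if one source fails, the user still sees the other one,
    # e.g. the XRootD traceback from the job container.
    return "\n\n".join(sections)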

@Alputer (Member) commented on Jan 30, 2025:

Did you trim the logs between the job logs and the worker logs? The error is captured in the job logs, but you are looking at the worker logs in the output of the rcg logs -w dask-coffea-serial-kubernetes command. Can you paste a full log example for a complete reference?

Note 1: I assume that the three dots at the bottom of the job logs mean that you trimmed everything between the job logs and the scheduler logs.
Note 2: I tried to reproduce the issue and was able to capture the error in the job logs 5 out of 5 times.
