htcondor 24.2.2 broken #38

Open
nsmith- opened this issue Dec 12, 2024 · 3 comments


@nsmith-
Member

nsmith- commented Dec 12, 2024

As of htcondor 24.2.2 we get a failure to spool some files:

ERROR:lpcjobqueue.cluster:DCSchedd::spoolJobFiles:7002:File transfer failed for target job 73893203.0: TOOL at 131.225.190.225 failed to send file(s) to <131.225.189.168:9618>: |Error: sending file /uscms/home/ncsmith/x509up_u49040; SCHEDD at 131.225.189.168 - |Error: receiving file /storage/local/data1/condor/spool/3203/0/cluster73893203.proc0.subproc0.tmp/x509up_u49040
2024-12-11 21:15:17,379 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7efe1aecf550>>, <Task finished name='Task-106' coro=<SpecCluster._correct_state_internal() done, defined at /usr/local/lib/python3.10/site-packages/distributed/deploy/spec.py:346> exception=AssertionError()>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/tornado/ioloop.py", line 750, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.10/site-packages/tornado/ioloop.py", line 774, in _discard_future_result
    future.result()
  File "/usr/local/lib/python3.10/site-packages/distributed/deploy/spec.py", line 390, in _correct_state_internal
    await asyncio.gather(*worker_futs)
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/usr/local/lib/python3.10/site-packages/distributed/deploy/spec.py", line 75, in _
    assert self.status == Status.running
AssertionError
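
For triage, the spool error can be split into its parts (job ID, sending host, schedd address, source file, spool destination). This is a minimal sketch: the pattern is written against the single log line quoted in this comment, `parse_spool_error` is a hypothetical helper name, and other HTCondor versions may format the message differently.

```python
import re
from typing import Optional

# Pattern derived from the one error line quoted in this issue; it is an
# assumption that other occurrences follow the same layout.
SPOOL_ERROR = re.compile(
    r"File transfer failed for target job (?P<job>\d+\.\d+): "
    r"TOOL at (?P<tool>[\d.]+) failed to send file\(s\) to <(?P<schedd>[\d.:]+)>: "
    r"\|Error: sending file\s+(?P<sent>\S+?); "
    r"SCHEDD at [\d.]+ - \|Error: receiving file (?P<received>\S+)"
)


def parse_spool_error(line: str) -> Optional[dict]:
    """Return the named parts of a spoolJobFiles error, or None if no match."""
    match = SPOOL_ERROR.search(line)
    return match.groupdict() if match else None
```

Collecting the `sent` field across failed jobs would show whether the failure is always on the same file.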
@nsmith-
Member Author

nsmith- commented Dec 12, 2024

Workaround: downgrade htcondor: pip install htcondor==24.2.1
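
To check whether an environment needs the downgrade, something like the following could be run before starting a cluster. It is only a sketch: `KNOWN_BAD` and `htcondor_status` are hypothetical names, the set holds just the version reported in this issue, and it assumes the bindings are registered under the distribution name `htcondor`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported as failing in this issue; an assumption that the
# problem is tied to the version number alone.
KNOWN_BAD = {"24.2.2"}


def htcondor_status(installed=None):
    """Return 'missing', 'known-bad', or 'ok' for the htcondor bindings."""
    if installed is None:
        try:
            installed = version("htcondor")
        except PackageNotFoundError:
            return "missing"
    return "known-bad" if installed in KNOWN_BAD else "ok"


if __name__ == "__main__":
    print(htcondor_status())
```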

oshadura added 3 commits to CoffeaTeam/af-images that referenced this issue Dec 13, 2024
@ikrommyd

ikrommyd commented Dec 18, 2024

This isn't limited to 24.2.2. Earlier versions fail as well in the latest images, so something else is going on.
In the 2024.9.1 coffea image we have

➜  egamma-tnp git:(master) ✗ ./shell coffeateam/coffea-dask-almalinux8:2024.9.1-py3.11
Singularity> pip list | grep htc
htcondor                  24.1.1
htcondor-cli              24.1.1

and it works fine.
Now, in the latest image with the exact same version of htcondor, the following

➜  egamma-tnp git:(master) ✗ ./shell coffeateam/coffea-dask-almalinux8:2024.11.0-py3.11
Singularity> ipython
/usr/local/lib/python3.11/site-packages/IPython/core/interactiveshell.py:937: UserWarning: Attempting to work in a virtualenv. If you encounter problems, please install IPython inside the virtualenv.
  warn(
Python 3.11.11 | packaged by conda-forge | (main, Dec  5 2024, 14:17:24) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.30.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from distributed import Client
   ...: from lpcjobqueue import LPCCondorCluster
   ...: import logging
   ...:
   ...:
   ...: logging.basicConfig(level=logging.DEBUG)
   ...:
   ...: cluster = LPCCondorCluster()
   ...: cluster.adapt(minimum=0, maximum=10)
   ...: client = Client(cluster)
   ...:
   ...: for future in client.map(lambda x: x * 5, range(10)):
   ...:     print(future.result())
   ...: cluster.close()

will error

251 - |Error: receiving file /storage/local/data1/condor/spool/2033/0/cluster8842033.proc0.subproc0.tmp/x509up_u3756
2024-12-17 19:53:00,140 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f8551752a10>>, <Task finished name='Task-117' coro=<SpecCluster._correct_state_internal() done, defined at /usr/local/lib/python3.11/site-packages/distributed/deploy/spec.py:346> exception=AssertionError()>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 750, in _run_callback
    ret = callback()
          ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 774, in _discard_future_result
    future.result()
  File "/usr/local/lib/python3.11/site-packages/distributed/deploy/spec.py", line 390, in _correct_state_internal
    await asyncio.gather(*worker_futs)
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 694, in _wrap_awaitable
    return (yield from awaitable.__await__())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/distributed/deploy/spec.py", line 75, in _
    assert self.status == Status.running
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Now if you install the exact same htcondor version, but from PyPI (in the images it is installed from conda-forge):

Singularity> pip install htcondor==24.2.1
Collecting htcondor==24.2.1
  Using cached htcondor-24.2.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.4 kB)
Using cached htcondor-24.2.1-cp311-cp311-manylinux_2_28_x86_64.whl (60.4 MB)
Installing collected packages: htcondor
  Attempting uninstall: htcondor
    Found existing installation: htcondor 24.1.1
    Not uninstalling htcondor at /usr/local/lib/python3.11/site-packages, outside environment /srv/.env
    Can't uninstall 'htcondor'. No files were found to uninstall.
Successfully installed htcondor-24.2.1
Singularity> pip install htcondor==24.1.1
Collecting htcondor==24.1.1
  Using cached htcondor-24.1.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.4 kB)
Using cached htcondor-24.1.1-cp311-cp311-manylinux_2_28_x86_64.whl (60.4 MB)
Installing collected packages: htcondor
  Attempting uninstall: htcondor
    Found existing installation: htcondor 24.2.1
    Uninstalling htcondor-24.2.1:
      Successfully uninstalled htcondor-24.2.1
Successfully installed htcondor-24.1.1

It magically works!
I don't know what kind of sorcery this is and where this weird interaction is happening.
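
Since the only visible difference is where the package came from, it may help to record which installer provided the copy that Python actually imports. A rough sketch, with `describe_install` as a hypothetical helper name: it reads the `INSTALLER` file from the package metadata (pip writes `pip` there; conda-built packages often write `conda`, but the file may be absent) and the path the module would be loaded from.

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, distribution


def describe_install(name):
    """Report version, installer, and import location for a package."""
    info = {"package": name, "version": None, "installer": None, "location": None}
    try:
        dist = distribution(name)
    except PackageNotFoundError:
        return info  # nothing installed under this distribution name
    info["version"] = dist.version
    installer = dist.read_text("INSTALLER")  # None when the file is missing
    info["installer"] = installer.strip() if installer else None
    try:
        # Assumes the import name equals the distribution name (true for htcondor).
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        spec = None
    info["location"] = spec.origin if spec else None
    return info


if __name__ == "__main__":
    print(describe_install("htcondor"))
```

Running this before and after the `pip install` in the same shell would show whether the import path moves from `/usr/local/lib/python3.11/site-packages` to the `/srv/.env` environment mentioned in the pip output.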

@ikrommyd

What's interesting is that nothing changed in the image building apart from versions.

diff --git a/coffea-dask/environment.yaml b/coffea-dask/environment.yaml
index b52d4d8..69fc0a8 100644
--- a/coffea-dask/environment.yaml
+++ b/coffea-dask/environment.yaml
@@ -11,7 +11,7 @@ dependencies:
   - xrootd
   # we have issues with conflicting openssl version and htcondor 10.8.0 version is last one
   # which we able to resolve in this environment.yaml
-  - htcondor
+  - htcondor=24.1.1 # pin HTCondor for LPC https://github.com/CoffeaTeam/lpcjobqueue/issues/38
   - curl
     # jupyter-related
   - jupyterlab
@@ -19,6 +19,10 @@ dependencies:
   - dask_labextension
   - dask-gateway
   - dask-jobqueue
+  # To be reverted: Dask needs to be pinned for now, do not use dask>=2024.12.0 with coffea, dask-awkward, or uproot
+  - dask=2024.11.2
+  # To be reverted with next coffea release: Dask needs to be pinned for now
+  - dask-awkward=2024.12.0
   - bokeh
   # Add workqueue
   - ndcctools
@@ -42,6 +46,7 @@ dependencies:
   - correctionlib
   - python-graphviz
     # scikit-hep
+  # FIXME: # disable microarches for awkward-cpp
   - awkward
   - vector
   - hist
@@ -50,7 +55,7 @@ dependencies:
   - pytorch
   - torch-scatter
   - pip
-  - coffea=2024.9.0
+  - coffea=2024.11.0
   - rucio-clients
     # pyg
   - pyg
@@ -60,6 +65,6 @@ dependencies:
   - pip:
     - fastjet # to be added to conda-forge: https://github.com/scikit-hep/fastjet/issues/133
     - tritonclient[all]
-    - tflite-runtime==2.14.0
+    - ai-edge-litert-nightly # ai-edge-litert as replacement of tflite is still not available for python
     - onnxruntime
     - fsspec-xrootd
diff --git a/coffea-dask/Dockerfile.almalinux8 b/coffea-dask/Dockerfile.almalinux8
index 3bcfde3..7d8adf6 100644
--- a/coffea-dask/Dockerfile.almalinux8
+++ b/coffea-dask/Dockerfile.almalinux8
@@ -22,6 +22,9 @@ RUN mamba install --yes python=${PYTHON_VERSION} \
      && mamba env update --file /environment.yaml \
      && mamba clean -y --all

+# FIXME: # disable microarches for awkward-cpp
+RUN pip uninstall -y awkward awkward-cpp && pip install awkward
+
 # Make a symbolic link between installation /opt/conda/etc/grid-security and actual directory /etc/grid-security
 RUN ln -s /usr/local/etc/grid-security /etc/grid-security && \
     curl -L https://github.com/opensciencegrid/osg-vo-config/archive/refs/heads/master.tar.gz | \
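
The diff only shows what was requested in environment.yaml; unpinned dependencies can still resolve to different versions between builds. One way to see what actually changed is to compare the `pip list` output of the two images. A small sketch, with `parse_pip_list` and `diff_versions` as hypothetical helper names, assuming the input is the plain two-column table that `pip list` prints (no shell prompt lines):

```python
def parse_pip_list(text):
    """Map package name -> version from `pip list` table output."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep "name version" rows; skip the header and the dashed separator.
        if len(parts) >= 2 and parts[0] != "Package" and not parts[0].startswith("-"):
            versions[parts[0].lower()] = parts[1]
    return versions


def diff_versions(old, new):
    """Return {name: (old_version, new_version)} for every package that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed
```

Packages present in only one image show up with `None` on the other side, so additions and removals are listed alongside version bumps.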
