On Slurm, law explicitly sets TMPDIR to a directory inside the job's own directory. This breaks packages that rely on the default arguments of multiprocessing.Listener, because the default address is a UNIX socket created under TMPDIR, and socket paths have a small, fixed length limit.
multiprocessing.Queue seems to expose the issue, but I cannot confirm this with 100% certainty.
Example stack trace (followed by a deadlock):
Traceback (most recent call last):
File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/mourelou/micromamba/envs/case/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 569, in reduce_storage
df = multiprocessing.reduction.DupFd(fd)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/reduction.py", line 198, in DupFd
return resource_sharer.DupFd(fd)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/resource_sharer.py", line 53, in __init__
self._id = _resource_sharer.register(send, close)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/resource_sharer.py", line 76, in register
self._start()
File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/resource_sharer.py", line 126, in _start
self._listener = Listener(authkey=process.current_process().authkey, backlog=128)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/connection.py", line 464, in __init__
self._listener = SocketListener(address, family, backlog)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/connection.py", line 608, in __init__
self._socket.bind(address)
OSError: AF_UNIX path too long
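The length problem behind this error can be estimated without starting any processes. The sketch below is only an approximation under stated assumptions: it assumes Linux, where the usable socket path is 107 bytes, and it assumes the CPython 3.11 naming scheme in which multiprocessing creates a `pymp-XXXXXXXX` directory under the temp dir and a `listener-XXXXXXXX` socket inside it. The names `SUN_PATH_MAX`, `estimated_socket_path_len` and `tmpdir_is_safe` are made up for illustration.

```python
import os

# Assumption: Linux, where sockaddr_un.sun_path holds 108 bytes including the
# terminating NUL byte, leaving 107 usable bytes for the path.
SUN_PATH_MAX = 107


def estimated_socket_path_len(tmpdir: str) -> int:
    """Estimate the byte length of a default Listener socket path under tmpdir.

    Assumption (CPython 3.11 behaviour): the socket lives at
    <tmpdir>/pymp-<8 random chars>/listener-<8 random chars>.
    """
    sample = os.path.join(tmpdir, "pymp-" + "x" * 8, "listener-" + "x" * 8)
    return len(os.fsencode(sample))


def tmpdir_is_safe(tmpdir: str) -> bool:
    """Return True if a socket created under tmpdir should fit the limit."""
    return estimated_socket_path_len(tmpdir) <= SUN_PATH_MAX


if __name__ == "__main__":
    current = os.environ.get("TMPDIR", "/tmp")
    print(current, estimated_socket_path_len(current), tmpdir_is_safe(current))
```

Under these assumptions the fixed overhead is 32 bytes, so a TMPDIR longer than roughly 75 bytes would be expected to trigger the error above.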
IMHO, this is indeed an odd constraint of UNIX sockets (the limit is 107 bytes, if I recall correctly).
It might depend on the actual cluster configuration, but AFAIK Slurm jobs use the submission directory (usually seen transparently by the submission node) as the base for temporary files. This can also make socket paths too long (depending on where your submission directory is located) and will definitely put a heavy load on the file system under the submission directory.
That's why law changes the default to a place that is cleaned up after the job terminates, but I think we should change that behavior for Slurm.
Thanks! I agree that the limit on UNIX sockets is weird, but it's baked into struct sockaddr_un, and making it longer would be a major change... For backward compatibility, one would basically need to duplicate the whole API.
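Until the default is changed, one possible user-side workaround is to point TMPDIR back at a short path from inside the job before multiprocessing is used. This is a sketch, not something law provides: `use_short_tmpdir` is a hypothetical helper, and `/tmp` is an assumed short, writable directory on the worker node.

```python
import os
import tempfile


def use_short_tmpdir(short_dir: str = "/tmp") -> str:
    """Point TMPDIR at short_dir and make tempfile pick it up.

    short_dir is an assumption for illustration; it must exist and be
    writable on the node where the job runs.
    """
    os.environ["TMPDIR"] = short_dir
    # tempfile caches its choice in tempfile.tempdir; resetting it to None
    # forces gettempdir() to consult TMPDIR again.
    tempfile.tempdir = None
    return tempfile.gettempdir()


if __name__ == "__main__":
    print(use_short_tmpdir())
```

This would need to run early in the task, since multiprocessing appears to cache its own temporary directory once it has been created. Files written there are no longer cleaned up together with the job directory.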