Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long TMPDIR breaks multiprocessing #184

Closed
lmoureaux opened this issue Jul 17, 2024 · 3 comments
Closed

Long TMPDIR breaks multiprocessing #184

lmoureaux opened this issue Jul 17, 2024 · 3 comments
Assignees
Labels

Comments

@lmoureaux
Copy link
Contributor

Bug description

On slurm, law explicitly changes TMPDIR to a place within the job's own directory. This breaks packages relying on the default arguments of multiprocessing.Listener because this involves creating a UNIX socket in TMPDIR, and socket paths can only be so long.

It seems multiprocessing.Queue exposes the issue, but cannot 100% confirm.

Example stack trace (followed by a deadlock):

Traceback (most recent call last):
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 569, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/reduction.py", line 198, in DupFd
    return resource_sharer.DupFd(fd)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/resource_sharer.py", line 53, in __init__
    self._id = _resource_sharer.register(send, close)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/resource_sharer.py", line 76, in register
    self._start()
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/resource_sharer.py", line 126, in _start
    self._listener = Listener(authkey=process.current_process().authkey, backlog=128)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/connection.py", line 464, in __init__
    self._listener = SocketListener(address, family, backlog)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mourelou/micromamba/envs/case/lib/python3.11/multiprocessing/connection.py", line 608, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
@riga
Copy link
Owner

riga commented Jul 18, 2024

Thanks for the info @lmoureaux !

IMHO, this is indeed an odd constraint of unix sockets (the limit is 107 bytes if I recall correctly).

It might depend on the actual cluster configuration, but afaik slurm jobs use the submission directory (usually transparently seen by the submission node) as the base for temporary files, which

  • can also cause socket paths to be too long (depending on where your submission directory is located)
  • and definitely will lead to a heavy use of the file system under the submission directory.

That's why law changes the default to a place that is cleaned up after the job terminates, but I think we should change that behavior for slurm.

A temporary fix on your side could be along these lines: https://github.com/columnflow/columnflow/blob/835020b57280947f500e232241b014131735d8f2/columnflow/tasks/framework/remote.py#L813-L818

However, I will make sure something similar is done centrally within law.

@riga riga closed this as completed in 6c714a8 Jul 18, 2024
@riga
Copy link
Owner

riga commented Jul 18, 2024

The fix I referenced above is generic enough, so I simply moved it to the slurm base workflow. Reopen the issue in case the issue persists 👍

@lmoureaux
Copy link
Contributor Author

Thanks! I agree that the limit on UNIX sockets is weird, but it's baked into socket sockaddr_un and it would be a major change to make it longer... For backward compatibility one would basically need to duplicate the whole API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants