[rllib] Deadlock error when running ES. #1773

@robertnishihara

When running

python ray/python/ray/rllib/train.py --redis-address=172.31.7.72:6379 --env=Humanoid-v1 --run=ES --config='{"episodes_per_batch": 1000, "timesteps_per_batch": 10000, "num_workers": 400}'

on a 100-machine cluster, I see the following error:

Traceback (most recent call last):
  File "/home/ubuntu/ray/python/ray/worker.py", line 1720, in fetch_and_execute_function_to_run
    function = pickle.loads(serialized_function)
  File "/home/ubuntu/ray/python/ray/rllib/__init__.py", line 17, in <module>
    _register_all()
  File "/home/ubuntu/ray/python/ray/rllib/__init__.py", line 14, in _register_all
    register_trainable(key, get_agent_class(key))
  File "/home/ubuntu/ray/python/ray/rllib/agent.py", line 229, in get_agent_class
    from ray.rllib import es
  File "/home/ubuntu/ray/python/ray/rllib/es/__init__.py", line 1, in <module>
    from ray.rllib.es.es import (ESAgent, DEFAULT_CONFIG)
  File "/home/ubuntu/ray/python/ray/rllib/es/es.py", line 19, in <module>
    from ray.rllib.es import policies
  File "<frozen importlib._bootstrap>", line 968, in _find_and_load
  File "<frozen importlib._bootstrap>", line 168, in __enter__
  File "<frozen importlib._bootstrap>", line 110, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('ray.rllib.es.policies') at 139937598221224

This likely has to do with recursive imports in rllib, probably related to #1716.
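
For illustration only, here is a minimal sketch of one common way to reduce import-time coupling: resolving a class by name lazily instead of importing every module when the package is first loaded (as `_register_all()` does in the traceback above). This is not the fix adopted for this issue, and it may or may not address the deadlock. The names `_LAZY_TARGETS` and `get_class` are hypothetical, and standard-library modules stand in for the agent modules.

```python
import importlib

# Hypothetical registry: maps a short key to "module:attribute".
# Standard-library targets are stand-ins for agent modules; nothing
# here is imported until get_class() is called for that key.
_LAZY_TARGETS = {
    "decoder": "json:JSONDecoder",
    "counter": "collections:Counter",
}


def get_class(key):
    """Import the module behind `key` on first request and return the class."""
    try:
        target = _LAZY_TARGETS[key]
    except KeyError:
        raise ValueError("unknown key: {!r}".format(key))
    module_name, attr = target.split(":")
    # importlib.import_module returns the cached module on later calls.
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

With this shape, importing the package that defines `get_class` does not trigger imports of the registered modules; each one is loaded only when a caller asks for it.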

Labels: bug (something that is supposed to be working, but isn't)
