E0216 from backup_poller.cc or ev_epollex_linux.cc #572

Open · uduse opened this issue Feb 16, 2022 · 13 comments
Labels: Bug (Something isn't working)

@uduse (Contributor) commented Feb 16, 2022

🐛 Bug

This happens when I spawn multiple CompilerGym sessions in parallel. The error comes in two forms, but I think they're similar:

either:

E0216 05:27:23.400729219  198153 ev_epollex_linux.cc:515]    Error shutting down fd 16. errno: 9

or:

E0216 05:27:43.469307484  293529 backup_poller.cc:134]       Run client channel backup poller: {"created":"@1645007263.469241393","description":"pollset_work","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":320,"referenced_errors":[{"created":"@1645007263.469233691","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":950,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}

To Reproduce

Steps to reproduce the behavior:

I managed to create this example that consistently recreates this behavior:

import compiler_gym
import ray


@ray.remote
def spawn():
    with compiler_gym.make("llvm-ic-v0") as env:
        env.reset()
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            raise RuntimeError


ray.get([spawn.remote() for _ in range(5000)])

example output from the above script:

$ python sandbox.py 
2022-02-16 05:31:32,323 INFO services.py:1374 -- View the Ray dashboard at http://127.0.0.1:8265
(spawn pid=438029) E0216 05:31:43.796126540  482891 backup_poller.cc:134]       Run client channel backup poller: {"created":"@1645007503.796063126","description":"pollset_work","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":320,"referenced_errors":[{"created":"@1645007503.796054426","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":950,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}
(spawn pid=437973) E0216 05:31:48.767850290  502732 backup_poller.cc:134]       Run client channel backup poller: {"created":"@1645007508.767783963","description":"pollset_work","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":320,"referenced_errors":[{"created":"@1645007508.767776654","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":950,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}

Sometimes the script above leaves orphan processes behind (related: #326), even though all the environments created above should be closed properly.

That is, right after the script finishes, running ps aux | grep compiler_gym-llvm-service | grep -v grep | awk '{print $2}' | wc -l sometimes yields a non-zero count.

Environment

Please fill in this checklist:

  • CompilerGym: compiler-gym==0.2.2
  • How you installed CompilerGym (conda, pip, source): pip
  • OS: Linux
  • Python version: 3.8.12
  • Build command you used (if compiling from source):
  • GCC/clang version (if compiling from source):
  • Bazel version (if compiling from source):
  • Versions of any other relevant libraries: ray==1.10.0
PyTorch version: 1.10.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1 
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:57:06)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-26-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 440.95.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.920
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] torch==1.10.1
[conda] mypy_extensions           0.4.3            py38h578d9bd_4    conda-forge
[conda] numpy                     1.19.5                   pypi_0    pypi
[conda] torch                     1.10.1                   pypi_0    pypi
uduse added the Bug (Something isn't working) label on Feb 16, 2022
@ChrisCummins (Contributor)

Hi Zeyi,

Thanks for the detailed bug report. It sure looks like a race condition in environment startup (or shutdown) that is causing the gRPC services to get confused. Unfortunately I haven't been able to reproduce the problem on a similar 24-core Linux machine. How many CPU cores do you have? ($ nproc)

Does the error reproduce if you only create the environments without interacting with them?

@ray.remote
def spawn():
    with compiler_gym.make("llvm-ic-v0") as env:
        pass

Workaround

This will massively slow down environment startup, but as a quick workaround you could use an inter-process lock to make sure that only a single environment is created at a time. Far from ideal, but a quick "fix":

import compiler_gym
import ray
from fasteners import InterProcessLock

def locked_compiler_gym_make(*args, **kwargs):
    """Wrapper around compiler_gym.make() that guarantees synchronous calls."""
    with InterProcessLock("/dev/shm/test.lock"):
        return compiler_gym.make(*args, **kwargs)


@ray.remote
def spawn():
    with locked_compiler_gym_make("llvm-ic-v0") as env:
        env.reset()
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            raise RuntimeError

ray.get([spawn.remote() for _ in range(5000)])

Cheers,
Chris

@uduse (Contributor, Author) commented Feb 16, 2022

In my case, nproc==72. The script can fail without env.step(), but it's much less likely. My "fix" was to limit the number of parallel instances: with fewer environments created in parallel, the problem occurs far less often. As a result, I wouldn't be surprised if it's hard to reproduce with nproc==24. I tried your workaround and it seems to work, but yes, it drastically slows things down 😭 A slightly better band-aid would be an inter-process semaphore (with a value of 24, for example), but of course the best case is CompilerGym actually fixing this 😉
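
For reference, a rough sketch of that band-aid, limiting concurrent make() calls with a named POSIX semaphore. This is untested and assumes the third-party posix_ipc package; the semaphore name and the limit of 24 are arbitrary choices:

import compiler_gym
import posix_ipc  # third-party package exposing named POSIX semaphores
import ray

SEM_NAME = "/compiler_gym_make"   # shared across processes
MAX_CONCURRENT = 24               # at most this many simultaneous make() calls


def throttled_make(*args, **kwargs):
    """Wrapper around compiler_gym.make() that limits concurrency."""
    sem = posix_ipc.Semaphore(
        SEM_NAME, flags=posix_ipc.O_CREAT, initial_value=MAX_CONCURRENT
    )
    sem.acquire()
    try:
        return compiler_gym.make(*args, **kwargs)
    finally:
        sem.release()
        sem.close()


@ray.remote
def spawn():
    with throttled_make("llvm-ic-v0") as env:
        env.reset()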

@uduse (Contributor, Author) commented Feb 16, 2022

Moreover, one difficulty in dealing with this problem is that no exception is raised in my Python process. If I could catch and handle that error, I would happily throw away the ~3 problematic environments out of 5000 and keep creating environments in parallel. Alternatively, the environment could report the failure in the return value of env.step() for the Python process to handle. Currently there seems to be no way for the Python process to know that the error happened under the hood.

@ChrisCummins (Contributor)

Okay, 72 cores isn't crazy; there's no reason CompilerGym shouldn't be able to support that. My hunch is that there is a data race in the gRPC service startup that can cause more than one environment service to be bound to the same port when a bunch of services are launched in parallel. We let the host operating system pick an unused port to launch a new service on; I would start by looking for race opportunities here:

https://github.com/facebookresearch/CompilerGym/blob/development/compiler_gym/service/runtime/CreateAndRunCompilerGymServiceImpl.h#L94-L98
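
To illustrate the kind of race I mean (this is not CompilerGym's actual logic, just a sketch of the generic pick-then-bind pattern): if a free port is discovered in one step and only bound by the service later, two processes can both observe the same port as free before either binds it.

import socket


def pick_free_port() -> int:
    """Ask the kernel for a currently-unused port, then immediately release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


# Racy pattern: between pick_free_port() returning and the service binding the
# port, another process can pick and bind the same port.
port = pick_free_port()
# ... service process starts up here ...
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", port))  # may collide under heavy parallel startup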

As a temporary workaround, something as simple as a random pause before creating each environment may strike the right balance between reducing the risk of error and not taking too long:

import random
import time

@ray.remote
def spawn():
    time.sleep(random.random() * 10)  # stagger environment creation across workers
    ...  # then create and use the environment as before

What exactly is the symptom of this error? Do the environments that log this message stop working? I'm looking at the source that logs the first error you provided, and the severity of the problem isn't clear to me.

Cheers,
Chris

@ChrisCummins (Contributor)

Possibly relevant: grpc/grpc#10755

@uduse (Contributor, Author) commented Feb 17, 2022

The sleeping slows things down but doesn't solve the problem. My program spawns environments in multiple places repeatedly, so there's still a chance of too many requests clashing at the same time. In terms of symptoms: most of the time the program treats this as a warning and proceeds normally (I'm not sure whether the bug affects the correctness of future environments), occasionally it hangs a process, and occasionally an error is thrown to Python telling me that I'm using an environment that's already closed. The behavior is very inconsistent, and I'm not even sure the error messages I showed above actually point to the place where something went wrong.

Since the current workarounds (I also tried a multi-process semaphore) are too slow, my approach right now is just to try to get lucky and discard the runs that hang or throw an exception.
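
For what it's worth, the discard logic looks roughly like this (a sketch only; it assumes the spawn task from my earlier example, and the 60-second timeout and broad exception handling are arbitrary choices):

import ray

TIMEOUT_S = 60  # give up on a run that hangs for this long

refs = [spawn.remote() for _ in range(5000)]
results, discarded = [], 0
for ref in refs:
    try:
        # ray.get() with a timeout raises GetTimeoutError if the task hangs.
        results.append(ray.get(ref, timeout=TIMEOUT_S))
    except ray.exceptions.GetTimeoutError:
        ray.cancel(ref, force=True)  # kill the hung worker
        discarded += 1
    except Exception:
        discarded += 1  # the task raised; discard this run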

@ChrisCummins (Contributor)

I've started looking into this, I'll let you know how I get on.

In the meantime, here's another possible workaround I was thinking about. If the problem is a race condition in free port assignment, you could allocate a distinct port to each process so that there shouldn't be any conflicts. This is still a total hack, but it might help you get unblocked:

from multiprocessing import Process
import compiler_gym
from compiler_gym.service import ConnectionOpts
from tqdm import tqdm

def spawn(port: int):
    opts = ConnectionOpts(script_args=[f"--port={port}"])
    with compiler_gym.make("llvm-v0", connection_settings=opts) as env:
        env.reset()
        env.step(0)

START_PORT = 10000
NUM_PROC = 1000

processes = [Process(target=spawn, args=(START_PORT + i,)) for i in range(NUM_PROC)]
for p in tqdm(processes):
    p.start()
for p in tqdm(processes):
    p.join()

Also, do you need to keep creating new environments? compiler_gym.make() is much more expensive than env.reset(). If the number of simultaneous environments you need is static, you can allocate them all once and reuse them.
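
Something along these lines, for example (just a sketch; the pool size and the episode loop are placeholders):

import compiler_gym

NUM_ENVS = 16  # however many simultaneous environments you actually need

# Pay the compiler_gym.make() cost once per environment...
envs = [compiler_gym.make("llvm-ic-v0") for _ in range(NUM_ENVS)]
try:
    # ...then reuse them across episodes with the much cheaper env.reset().
    for _ in range(1000):
        for env in envs:
            env.reset()
            env.step(env.action_space.sample())
finally:
    for env in envs:
        env.close()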

Cheers,
Chris

ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Feb 17, 2022
This corrects my use of the boost::process API whereby llvm-size child
processes could become detached from the parent environment service
and be left as zombies.

I found this issue while attempting to reproduce facebookresearch#572. I'm not sure if
this bug is relevant.
ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Feb 18, 2022
This corrects my use of the boost::process API whereby llvm-size child
processes could become detached from the parent environment service
and be left as zombies.

I found this issue while attempting to reproduce facebookresearch#572. I'm not sure if
this bug is relevant.
@ChrisCummins (Contributor)

Hi @uduse, can you please let me know what version of gRPC you have?

$ python -m pip freeze | grep grpcio

I'm seeing some more logging errors from gRPC 1.44 and think this may be relevant: ray-project/ray#22518

Cheers,
Chris

ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Feb 22, 2022
@uduse (Contributor, Author) commented Feb 23, 2022

I got

grpcio==1.42.0
grpcio-tools==1.43.0

@uduse (Contributor, Author) commented Feb 25, 2022

I couldn't reproduce the bug with multiprocessing. I think it might be a bug in ray or some weird interaction between ray and compiler_gym since both of them use gRPC.

@ChrisCummins (Contributor) commented Mar 1, 2022

I couldn't reproduce the bug with multiprocessing. I think it might be a bug in ray or some weird interaction between ray and compiler_gym since both of them use gRPC.

Oo interesting, thanks for following up! I have been working on a big update to gym.make() that uses an object pool for caching environments, enabling gRPC services to be re-used across environments. It's not ready for merging just yet but it should help with the issue you're seeing, and also speeds up environment construction >100x...

Cheers,
Chris

@uduse (Contributor, Author) commented Mar 1, 2022

I couldn't reproduce the bug with multiprocessing. I think it might be a bug in ray or some weird interaction between ray and compiler_gym since both of them use gRPC.

Oo interesting, thanks for following up! I have been working on a big update to gym.make() that uses an object pool for caching environments, enabling gRPC services to be re-used across environments. It's not ready for merging just yet but it should help with the issue you're seeing, and also speeds up environment construction >100x...

Cheers, Chris

🤣 I just implemented this kind of caching pool in my own project (called EnvSupervisor), and I kept thinking it would be nice to have it handled on compiler_gym's side.

To be more specific, my class:

  1. Puts aside no-longer-needed environments and resets them when a new environment is needed.
  2. Keeps track of all spawned environment instances, so the user can close all of them with one function call.
  3. Automatically puts an environment aside when its accessor is dereferenced.
  4. Is slightly faster (~4%) in terms of overall system speed (my algorithms don't create many environments now, so the speedup comes from recycling no-longer-needed ones).

This tidies up my algorithm implementations since it removes environment management from them. Roughly, the idea looks like the sketch below.
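
(A stripped-down sketch of the idea, not the actual EnvSupervisor code; the class and method names here are made up for illustration:)

import compiler_gym


class EnvPool:
    """Toy version of the recycling idea: reuse idle environments instead of make()."""

    def __init__(self, env_id: str):
        self.env_id = env_id
        self.idle = []      # environments put aside, ready to be recycled
        self.all_envs = []  # every environment ever spawned, for bulk close()

    def acquire(self):
        # Recycle an idle environment if one exists, otherwise pay for make().
        env = self.idle.pop() if self.idle else self._make()
        env.reset()
        return env

    def release(self, env):
        # Put the environment aside instead of closing it.
        self.idle.append(env)

    def close_all(self):
        for env in self.all_envs:
            env.close()

    def _make(self):
        env = compiler_gym.make(self.env_id)
        self.all_envs.append(env)
        return env

Usage is pool.acquire() / pool.release(env) around each episode, and pool.close_all() at shutdown.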

@ChrisCummins (Contributor)

This sounds great. The change that I am working on caches only the service connections (the env.service object), not the entire environment, but otherwise sounds similar.

Glad it's working out for you :)

ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Mar 6, 2022
This corrects my use of the boost::process API whereby llvm-size child
processes could become detached from the parent environment service
and be left as zombies.

I found this issue while attempting to reproduce facebookresearch#572. I'm not sure if
this bug is relevant.
ChrisCummins added a commit to ChrisCummins/CompilerGym that referenced this issue Mar 6, 2022