E0216 from backup_poller.cc or ev_epollex_linux.cc #572
Comments
Hi Zeyi,

Thanks for the detailed bug report. It sure looks like a race condition in environment startup (or shutdown) that is causing the gRPC services to get confused. Unfortunately I haven't been able to reproduce the problem on a similar 24-core Linux machine. How many CPU cores do you have?

Does the error reproduce if you only create the environments without interacting with them?

```python
import compiler_gym
import ray


@ray.remote
def spawn():
    with compiler_gym.make("llvm-ic-v0") as env:
        pass
```

Workaround

This will massively slow down environment startup, but as a quick workaround you could use an inter-process lock to make sure that only a single environment is created at a time. Far from ideal, but a quick "fix":

```python
import compiler_gym
import ray
from fasteners import InterProcessLock


def locked_compiler_gym_make(*args, **kwargs):
    """Wrapper around compiler_gym.make() that guarantees synchronous calls."""
    with InterProcessLock("/dev/shm/test.lock"):
        return compiler_gym.make(*args, **kwargs)


@ray.remote
def spawn():
    with locked_compiler_gym_make("llvm-ic-v0") as env:
        env.reset()
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            raise RuntimeError


ray.get([spawn.remote() for _ in range(5000)])
```

Cheers,
In my case, the machine has 72 CPU cores.
Moreover, one difficulty of dealing with this problem is that no exception is thrown to my Python process. If I could catch and handle that error, I would happily throw away the ~3 problematic environments out of 5000 in order to retain the ability to create environments in parallel. Similarly, the environment could report this through a return value.
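If the failure did surface as a Python exception, a wrapper along these lines could implement that "discard the few bad ones" strategy. This is only a sketch: `make_env_with_retries` and the `reset()` smoke test are hypothetical, and the broad `except Exception` stands in for whatever error type the service would actually raise.

```python
import compiler_gym


def make_env_with_retries(env_id: str, attempts: int = 3):
    """Hypothetical helper: retry environment creation a few times, discarding
    environments that fail a quick smoke test instead of failing the whole run."""
    last_error = None
    for _ in range(attempts):
        try:
            env = compiler_gym.make(env_id)
            try:
                env.reset()  # smoke test: a broken service should fail here
                return env
            except Exception:
                env.close()
                raise
        except Exception as e:  # stand-in for the (currently unraised) service error
            last_error = e
    raise RuntimeError(f"Could not create a working environment: {last_error}")
```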
Okay, 72 isn't crazy; there's no reason we should expect CompilerGym not to be able to support that. My hunch is that there is a data race in the gRPC service startup that can cause more than one environment service to be bound to the same port when the system is launching a bunch of services in parallel. We let the host operating system pick an unused port to launch a new service on, so I would start by looking for race opportunities in that code.

As a temporary workaround, something as simple as a random pause may be the right balance between reducing the risk of error and not taking too long:

```python
import random
import time

import compiler_gym
import ray


@ray.remote
def spawn():
    time.sleep(random.random() * 10)  # stagger startup to reduce the chance of a clash
    with compiler_gym.make("llvm-ic-v0") as env:
        pass
```

What exactly is the symptom of this error? Do the environments that log this error message stop working? I'm looking at the source that logs the first error you provided and it's not clear to me how severe the problem is.

Cheers,
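To illustrate the kind of race being described, here is a sketch of the common "ask the OS for a free port" idiom. This is not CompilerGym's actual port-selection code, just the general pattern that is vulnerable to this race.

```python
import socket


def pick_unused_port() -> int:
    """Sketch of the usual free-port idiom (not CompilerGym's actual code)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))  # port 0: let the OS choose a currently-free port
    port = s.getsockname()[1]
    s.close()
    # Race window: between close() and the service actually binding this port,
    # another process running the same code can be handed the same number.
    return port
```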
Possibly relevant: grpc/grpc#10755
The sleeping slows things down but doesn't solve the problem. My program spawns environments in multiple places repeatedly, so there's still a chance of too many requests clashing at the same time. In terms of symptoms, most of the time the program treats this as a warning and proceeds normally (I'm not sure whether the bug affects the correctness of future environments), occasionally it hangs a process, and occasionally an error is thrown to Python telling me that I'm using an environment that's already closed. The behavior is very inconsistent, and I'm not even sure the error message I showed above actually points to the place where something went wrong. Since the current workarounds (I also tried a multi-process semaphore) are too slow, my approach right now is just to try to get lucky and discard the runs that hang or throw an exception.
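One way to automate that "get lucky and discard" approach, assuming each run is a Ray task like the `spawn` above, is to collect whatever finishes within a deadline and cancel the rest. The 600-second deadline and the blanket `except Exception` are placeholders, not recommendations:

```python
import ray

refs = [spawn.remote() for _ in range(5000)]

# Wait up to a deadline for as many tasks as possible to finish.
done, not_done = ray.wait(refs, num_returns=len(refs), timeout=600)

# Discard (cancel) runs that are still hanging after the deadline.
for ref in not_done:
    ray.cancel(ref, force=True)

# Keep the results of runs that completed cleanly; drop the ones that errored.
results = []
for ref in done:
    try:
        results.append(ray.get(ref))
    except Exception:
        pass
```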
I've started looking into this; I'll let you know how I get on. In the meantime, here is another possible workaround I was thinking about. If the problem is a race condition in free port assignment, you could allocate a single port value to each process so that there shouldn't be any conflicts. This is still a total hack but might help you get unblocked:

```python
from multiprocessing import Process

import compiler_gym
from compiler_gym.service import ConnectionOpts
from tqdm import tqdm


def spawn(port: int):
    opts = ConnectionOpts(script_args=[f"--port={port}"])
    with compiler_gym.make("llvm-v0", connection_settings=opts) as env:
        env.reset()
        env.step(0)


START_PORT = 10000
NUM_PROC = 1000

processes = [Process(target=spawn, args=(START_PORT + i,)) for i in range(NUM_PROC)]
for p in tqdm(processes):
    p.start()
for p in tqdm(processes):
    p.join()
```

Also, do you need to keep creating new environments?

Cheers,
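Since the runs in this thread use Ray rather than multiprocessing, the same fixed-port idea could be adapted along these lines. This is only a sketch that reuses the `ConnectionOpts(script_args=[...])` trick from the snippet above; the port range and task count are arbitrary.

```python
import compiler_gym
import ray
from compiler_gym.service import ConnectionOpts

START_PORT = 10000


@ray.remote
def spawn(i: int):
    # Give every task its own port so two services never race for the same one.
    opts = ConnectionOpts(script_args=[f"--port={START_PORT + i}"])
    with compiler_gym.make("llvm-ic-v0", connection_settings=opts) as env:
        env.reset()
        env.step(env.action_space.sample())


ray.get([spawn.remote(i) for i in range(1000)])
```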
This corrects my use of the boost::process API whereby llvm-size child processes could become detached from the parent environment service and be left as zombies. I found this issue while attempting to reproduce facebookresearch#572. I'm not sure if this bug is relevant.
Hi @uduse, can you please let me know what version of gRPC you have? I'm seeing some more logging errors from gRPC 1.44 and think this may be relevant: ray-project/ray#22518

Cheers,
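For reference, one way to check the installed gRPC version, assuming the Python `grpcio` package is the one in use:

```python
import grpc

print(grpc.__version__)  # the same version is reported by `pip show grpcio`
```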
I got
I couldn't reproduce the bug with
Oo interesting, thanks for following up! I have been working on a big update to the way environments connect to their backend services.

Cheers,
🤣 I just implemented such a pool for caching environments in my own project. This tidies up my algorithm implementations, as it removes environment management from them.
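As a rough illustration (a sketch, not the poster's actual class), a minimal environment-caching pool might look something like this; the name `EnvironmentPool` and its interface are hypothetical:

```python
from collections import defaultdict
from contextlib import contextmanager

import compiler_gym


class EnvironmentPool:
    """Hypothetical sketch: reuse environments instead of recreating them, so the
    costly (and apparently racy) service startup happens far less often.
    NOTE: not thread- or process-safe; single-process use only."""

    def __init__(self):
        self._free = defaultdict(list)  # cached idle environments, keyed by env id

    @contextmanager
    def environment(self, env_id: str = "llvm-ic-v0"):
        env = self._free[env_id].pop() if self._free[env_id] else compiler_gym.make(env_id)
        try:
            yield env
        finally:
            self._free[env_id].append(env)  # return to the pool instead of closing


pool = EnvironmentPool()
with pool.environment("llvm-ic-v0") as env:
    env.reset()
```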
This sounds great. The change that I am working on caches only the service connections, not the environments themselves.

Glad it's working out for you :)
🐛 Bug
This happens when I spawn multiple CompilerGym sessions in parallel. The error comes in two forms, but I think they are similar:
either:
or:
To Reproduce
Steps to reproduce the behavior:
I managed to create this example that consistently recreates this behavior:
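(The script here is a sketch based on the snippet quoted in the first reply above; the original may differ in details such as the number of tasks.)

```python
import compiler_gym
import ray

ray.init()


@ray.remote
def spawn():
    with compiler_gym.make("llvm-ic-v0") as env:
        env.reset()
        _, _, done, _ = env.step(env.action_space.sample())
        if done:
            raise RuntimeError


ray.get([spawn.remote() for _ in range(5000)])
```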
Example output from the above script:
Sometimes the script above leaves orphan processes behind (related: #326), even though it seems all the environments created above should be closed properly.
This means that right after the script above, sometimes doing

```sh
ps aux | grep compiler_gym-llvm-service | grep -v grep | awk '{print $2}' | wc -l
```

yields a number that's not 0.
Environment

Please fill in this checklist:
compiler-gym==0.2.2
ray==1.10.0