Ungraceful shutdown of python script can leave orphan processes #326

Open
ChrisCummins opened this issue Jul 20, 2021 · 6 comments

@ChrisCummins
Contributor

ChrisCummins commented Jul 20, 2021

🐛 Bug

CompilerGym uses a client/service architecture. Every time a CompilerEnv object is created, a CompilerService subprocess is started. The lifetime of the subprocess is managed by the CompilerEnv. Calling CompilerEnv.close() terminates the service.

If for some reason CompilerEnv.close() is not called (either through a system or user error), the CompilerService will not be killed and will remain dormant indefinitely.

To Reproduce

In one terminal, open a Python interpreter and start a CompilerGym environment. Make a note of the process IDs of the interpreter and of the environment's service:

In [1]: import os

In [2]: os.getpid()
Out[2]: 5425

In [3]: import compiler_gym

In [4]: env = compiler_gym.make("llvm-v0")

In [5]: env.service.connection.process
Out[5]: <subprocess.Popen at 0x7fbe10855790>

In [6]: env.service.connection.process.pid
Out[6]: 5809

In another terminal, kill the interpreter process, and observe that the CompilerGym environment's service is still running:

$ kill -9 5425

$ ps aux | grep 5809
cummins           6087   0.0  0.0  4408696    864 s002  S+    1:16PM   0:00.00 grep --color=auto 5809
cummins           5809   0.0  0.0  4499680  14628 s000  S     1:15PM   0:00.02 ./compiler_gym-llvm-service --working_dir=/Users/cummins/.cache/compiler_gym/s/0720T131545-167414-6660

That process will remain dormant until explicitly killed, or the machine is rebooted.

Expected behavior

After some period of inactivity, the service should realize that it is no longer being used and should shut down gracefully.

To the best of my understanding, it is not possible to guarantee that a subprocess shutdown routine can be called by the parent process, so the proposed workaround is to have a 'time to live' timer on each subprocess which will shut itself down if that period of inactivity is reached.
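A minimal sketch of the 'time to live' idea, written in Python purely for illustration. The names `InactivityWatchdog`, `touch`, and `on_expire` are hypothetical and not part of CompilerGym; the assumption is that the service would call `touch()` on every request it handles and pass its own shutdown routine as `on_expire`:

```python
import threading


class InactivityWatchdog:
    """Calls `on_expire` if `touch()` is not called for `ttl_seconds`.

    Hypothetical helper for illustration; not a CompilerGym API.
    """

    def __init__(self, ttl_seconds, on_expire):
        self._ttl = ttl_seconds
        self._on_expire = on_expire
        self._lock = threading.Lock()
        self._timer = None
        self.touch()  # Start the first countdown.

    def touch(self):
        """Record activity: cancel the pending countdown and start a new one."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._ttl, self._on_expire)
            self._timer.daemon = True  # Do not keep the process alive.
            self._timer.start()

    def cancel(self):
        """Stop the countdown, e.g. during a normal, explicit shutdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

Because the timer lives inside the subprocess, it does not rely on the parent running any cleanup code, so it would still fire after the parent is killed with `kill -9`.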

Workaround

If you suspect that there are dormant LLVM CompilerGym services and you are not currently running any CompilerGym python scripts, you can manually kill them using:

ps aux | grep compiler_gym-llvm-service | grep -v grep | awk '{print $2}' | xargs --no-run-if-empty kill

although this does not tidy up any temporary cache files that the environments have created.

Environment

Please fill in this checklist:

  • CompilerGym: v0.1.9
  • How you installed CompilerGym (conda, pip, source): n/a
  • OS: n/a
  • Python version: n/a

Additional context

See the documentation for more background on CompilerGym's client/service architecture.

@ChrisCummins added the Bug (Something isn't working) label Jul 20, 2021
@ChrisCummins added this to the v0.1.11 milestone Jul 20, 2021
@ChrisCummins self-assigned this Jul 20, 2021
@ChrisCummins
Contributor Author

cc @hughleat

@uduse
Contributor

uduse commented Feb 7, 2022

I noticed this problem when I implemented an algorithm that calls env.fork() a lot; sometimes I forgot to call env.close(), or the program was interrupted. What I would really like, to address this problem (indirectly), is a context manager for the compiler environment class.

e.g.,

with compiler_gym.make("llvm-v0") as env:
    # do something here
    with env.fork() as fork:
        # do more things here

contextlib.closing can partly do this job already, but it would be nice to have native support. If we had such support, it should be the recommended way of managing CompilerGym environments, just as with open(...) as ... is THE way of opening files in Python.
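For reference, a minimal sketch of the contextlib.closing approach. `FakeEnv` is a hypothetical stand-in so that the snippet runs without CompilerGym installed; the only assumption is that the real environment object exposes close() and fork():

```python
from contextlib import closing


class FakeEnv:
    """Hypothetical stand-in for an environment; only close() and fork() matter."""

    def __init__(self):
        self.closed = False

    def fork(self):
        return FakeEnv()

    def close(self):
        self.closed = True


# closing() calls .close() on exit from the block, even if an exception is raised.
with closing(FakeEnv()) as env:
    with closing(env.fork()) as fork:
        pass  # do work with env and fork here
```

This covers forgotten close() calls and ordinary exceptions, but not a process that is killed outright, since no Python cleanup code runs in that case.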

@ChrisCummins
Contributor Author

Hi @uduse, I believe what you want is already implemented in the base gym API. CompilerGym environments created by make() and fork() can be used as context managers. Adapting your example:

import os
import compiler_gym

def is_running(pid):
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False

# Demo make():
with compiler_gym.make("llvm-v0") as env:
    env_pid = env.service.connection.process.pid
    print("Environment", ("running" if is_running(env_pid) else "exited"))
    # Demo fork():
    with env.fork() as fkd:
        fkd_pid = fkd.service.connection.process.pid
        print("Forked", ("running" if is_running(fkd_pid) else "exited"))

# Check whether environments have closed:
print()
print("Environment", ("running" if is_running(env_pid) else "exited"))
print("Forked", ("running" if is_running(fkd_pid) else "exited"))

Produces output:

Environment running
Forked running

Environment exited
Forked exited

See this colab notebook.

Cheers,
Chris

@uduse
Contributor

uduse commented Feb 14, 2022

@ChrisCummins Oh thanks for letting me know! I checked the source code, didn't find __enter__, and totally ignored the possibility that it's defined in gym's base class 😛 Guess I should use dir(var) to check whether an implementation exists.

@uduse
Contributor

uduse commented May 9, 2022

[Screenshot of orphan processes]

How about the orphan executables like these? Should we also kill them manually (using the grep, xargs, and kill commands)?

@ChrisCummins
Contributor Author

How about the orphan executables like these?

Yes, those are safe to kill. Those processes are cBench binaries which are used to compute runtime. If they do not complete within a specified timeout (default 5 min), the process will be abandoned. Adding a proper subprocess timeout would be a nice feature. The code is here:

https://github.com/facebookresearch/CompilerGym/blob/development/compiler_gym/util/Subprocess.cc#L62-L66
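As an illustration of what a proper subprocess timeout could look like, here is a rough sketch in Python using only the standard library. It is not the code linked above, and `run_with_timeout` is a hypothetical helper name; the idea is simply to kill and reap the child when the timeout expires rather than abandoning it:

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return its exit code, or None if it was killed on timeout.

    Hypothetical helper for illustration; not a CompilerGym API.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # Forcefully stop the child (SIGKILL on POSIX).
        proc.wait()  # Reap the child so it does not linger as a zombie.
        return None
```

Note that this only kills the direct child. If the binary spawns its own subprocesses, those would need separate handling, for example by running the child in its own process group.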

Cheers,
Chris
