Make SubprocVecEnv Thread Safe #218

Merged: 11 commits merged into hill-a:master on Mar 10, 2019

Conversation

AdamGleave
Collaborator

As discussed in issue #217, SubprocVecEnv uses the non-thread-safe fork method of multiprocessing on Unix systems, which can cause issues, e.g. when the parent process has an extant TensorFlow session. This PR changes SubprocVecEnv to use the thread-safe spawn method.
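For background, here is a minimal sketch (not code from this PR) of selecting a start method explicitly with Python's multiprocessing. With spawn, the child starts a fresh interpreter instead of inheriting a copy of the parent's state (such as an open TensorFlow session):

import multiprocessing as mp

def worker(remote):
    # Runs in the child process; with 'spawn' the child did not inherit the
    # parent's memory, so none of the parent's TensorFlow state exists here.
    remote.send('ready')
    remote.close()

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # or 'fork' / 'forkserver'
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child_conn,))
    proc.start()
    print(parent_conn.recv())  # -> 'ready'
    proc.join()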

@AdamGleave
Collaborator Author

The only real downside of spawn is that it's slower than fork. On my system, it introduces around a 5-second fixed cost to environment setup; this might be larger for complex projects with more imports. I expect this overhead to be negligible for RL training, but it could be moderately annoying for interactive use. Users could always switch to DummyVecEnv in such cases (see the sketch below); we could also make the method multiprocessing uses configurable.
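For illustration only (not part of this PR), DummyVecEnv avoids the subprocess start-up cost entirely by running the environments sequentially in the main process:

import gym
from stable_baselines.common.vec_env import DummyVecEnv

# No subprocesses are created, so there is no spawn/forkserver start-up cost;
# the trade-off is that the environments step sequentially instead of in parallel.
venv = DummyVecEnv([lambda: gym.make('SeaquestNoFrameskip-v4') for _ in range(8)])
venv.reset()
venv.close()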

Before PR:

0 rollouts
Time:  [0.3692049980163574, 0.39535021781921387, 0.31310319900512695]
1000 rollouts
Time:  [2.1021463871002197, 2.0643362998962402, 2.264810800552368]
10000 rollouts
Time:  [19.19278883934021, 19.433793306350708, 19.497289896011353]  

After PR:

0 rollouts
Time:  [5.425152778625488, 5.514942646026611, 5.524834871292114]
1000 rollouts
Time:  [7.168581247329712, 7.421223163604736, 7.134859561920166]
10000 rollouts
Time:  [23.813771963119507, 23.359588384628296, 23.930598735809326]

Generated using:

import time

import gym
from stable_baselines.common.vec_env import SubprocVecEnv

ENV_NAME = 'SeaquestNoFrameskip-v4'
N_ENVS = 8

def benchmark(n_steps):
    def make_env():
        return gym.make(ENV_NAME)

    # Start N_ENVS copies of the environment, step them with random actions
    # for n_steps iterations, then shut the workers down.
    venv = SubprocVecEnv([make_env] * N_ENVS)
    venv.reset()
    for i in range(n_steps):
        action = [venv.action_space.sample() for _ in range(N_ENVS)]
        venv.step(action)
    venv.close()


if __name__ == '__main__':
    for n in [0, 1000, 10000]:
        print('{} rollouts'.format(n))
        times = []
        for rep in range(3):
            start = time.time()
            benchmark(n)
            end = time.time()
            times.append(end - start)
        print('Time: ', times)

@araffin
Collaborator

araffin commented Mar 3, 2019

Hello Adam,
Thank you for your contribution.
I'll be AFK until Wednesday, so unless @hill-a or @erniejunior have time before then, you'll need to wait a bit before getting some feedback ;)

@AdamGleave
Collaborator Author

AdamGleave commented Mar 4, 2019

OK, thanks for letting me know; I don't think this is time-sensitive.

Using forkserver on platforms where it's available improves performance and is still thread safe. It's still slower than using fork, though. Benchmark:

0 rollouts                                                                                                                                                                                            
Time:  [3.565499782562256, 3.5964715480804443, 3.493802785873413]                                                                                                                                     
1000 rollouts
Time:  [5.70629620552063, 5.803895950317383, 5.861036539077759]
10000 rollouts
Time:  [27.23869800567627, 27.16290044784546, 26.949447870254517] 
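For reference, a quick way to check which start methods the current platform supports (forkserver is only available on Unix):

import multiprocessing

# e.g. ['fork', 'spawn', 'forkserver'] on Linux, ['spawn'] on Windows
print(multiprocessing.get_all_start_methods())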

@hill-a
Owner

hill-a commented Mar 6, 2019

Hey,

I think you should add a hyperparameter to SubprocVecEnv, such as subprocess_type=None: if it is None, keep the auto-detection (forkserver if available, else spawn); otherwise use the given type.
This is mainly because the environment might need a global variable; forkserver and spawn might then lead to unexpected behaviour, so it would make much more sense to empower the user (ref). See the sketch below.
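A minimal sketch of that suggested behaviour (the helper name and exact fallback logic are illustrative, not necessarily what was merged):

import multiprocessing

def choose_context(subprocess_type=None):
    # Auto-detect: prefer forkserver where available (thread safe and cheaper
    # to start than spawn), otherwise fall back to spawn. An explicit
    # subprocess_type always takes precedence.
    if subprocess_type is None:
        methods = multiprocessing.get_all_start_methods()
        subprocess_type = 'forkserver' if 'forkserver' in methods else 'spawn'
    return multiprocessing.get_context(subprocess_type)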

@AdamGleave
Collaborator Author

@hill-a thanks for the feedback, I agree this is an improvement, incorporated.

I've also written a test case, but it depends on some helper functions I wrote for PR #217. I'm planning on adding that test case into this PR once PR #217 is merged; if that doesn't happen, I'll backport the tests.

@araffin
Collaborator

araffin commented Mar 8, 2019

@AdamGleave you meant PR #207 ?

It's a good thing you added a test.

I'm planning on adding that test case into this PR once PR #217 is merged; if that doesn't happen, I'll backport the tests.

There is a high probability your two PRs will be merged ;)
It just takes some time because this is a side project for all three maintainers, and I would say it is better to be sure about a feature than to merge something too fast and break the master branch.

@AdamGleave
Collaborator Author

Yes I meant PR #207 -- sorry. No problem, I understand things can take a while. I'm happy using my fork until things get merged, so there's no rush.

@AdamGleave
Collaborator Author

Updated with latest changes to master and added the test case.

@araffin
Collaborator

araffin commented Mar 9, 2019

The failure on Travis is due to a memory error. I suspect it is related to what I experienced when trying to do hyperparameter search in the rl zoo.
My solution was to kill the different processes to avoid taking up all the RAM.

See fix here
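In generic multiprocessing terms, the workaround amounts to something like this (an illustrative sketch, not the referenced commit):

import multiprocessing as mp

def worker():
    pass  # stand-in for an environment worker loop

if __name__ == '__main__':
    procs = [mp.Process(target=worker) for _ in range(4)]
    for p in procs:
        p.start()
    # ... use the workers ...
    for p in procs:
        p.terminate()  # force the subprocess to exit
        p.join()       # reap it so its memory is actually released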

@araffin
Collaborator

araffin commented Mar 9, 2019

You can try to cherry-pick this commit: 987cfdf
(on the gailt-test branch)

@AdamGleave
Collaborator Author

AdamGleave commented Mar 9, 2019

Thanks @araffin. I'm a little confused why we need to kill the processes explicitly -- shouldn't this happen when venv.close() is called? I ran into a similar problem with processes not being terminated that seems related to an OpenMPI bug, see HumanCompatibleAI/adversarial-policies#3, perhaps it's the same issue. If we can't figure out the underlying cause then 987cfdf seems like a good workaround.

@araffin
Collaborator

araffin commented Mar 9, 2019

I'm a little confused why we need to kill the processes explicitly

Good remark, I had not noticed that method (I thought it was only related to the render() method). Could you try calling it in the different run_atari.py scripts (as in my commit)? This is what takes a lot of memory.

EDIT: apparently, my fix was not enough :/ (see https://travis-ci.com/hill-a/stable-baselines/builds/103783111)

@araffin
Collaborator

araffin commented Mar 9, 2019

I made the changes here: 6e655a7
Let's hope it will solve the issue.

@AdamGleave
Collaborator Author

@araffin thanks, closing the VecEnv after my tests has resolved the memory issue for this PR.
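The resulting pattern is roughly the following (an illustrative test, not the exact one added in this PR): always release the vectorized env in the teardown so repeated tests do not exhaust RAM.

import gym
from stable_baselines.common.vec_env import SubprocVecEnv

def test_subproc_vec_env_runs():
    venv = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(2)])
    try:
        obs = venv.reset()
        assert obs.shape[0] == 2  # one observation per worker env
    finally:
        venv.close()  # terminate the worker subprocesses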

Review comment on docs/misc/changelog.rst (outdated, resolved)
araffin
araffin previously approved these changes Mar 10, 2019
Collaborator

@araffin araffin left a comment

Apart from the minor typo in the changelog, LGTM, thanks =)

@araffin araffin merged commit 9d9fbcc into hill-a:master Mar 10, 2019
@araffin
Collaborator

araffin commented Mar 11, 2019

Hello @AdamGleave,

Because of your good contributions, and as we need more people like you to help us improve stable-baselines, we would like to invite you to join us and become a maintainer of this repository.

You will have the ability to create branches and do code reviews, and you will obviously be credited in the README ;)

With @hill-a and @erniejunior in CC


EDIT related to your PR
I'm experiencing several issues with the new behavior on different machines (pytest hangs, weird segfault errors, Python asking me to put everything in __main__ or to call freeze_support()), so we may switch back to the previous behavior (to avoid breaking people's code) and add a warning to the documentation instead.

The cool thing with your PR is that the user still has the ability to change that behavior.

EDIT: good write-up of the dilemma found by Ashley https://codewithoutrules.com/2018/09/04/python-multiprocessing/

@AdamGleave
Collaborator Author

Hi @araffin,

Thanks for the offer of joining the team! Potentially interested depending on time constraints -- I'll send an e-mail to discuss this further in private.

In terms of the PR, I'd be happy for the default behaviour to be switched to fork if we feel this will break fewer things by default. The requirement to put stuff in __main__ is expected behaviour, but the other problems "should not" happen.

Were the problems you were experiencing related to MPI at all? In particular, if the segfault had anything involving pmix, it's probably MPI related. I also experienced strange hangs with MPI. I think MPI in general tends to interact poorly with Python multiprocessing, they're very different programming paradigms. I can't see many use cases for both SubprocVecEnv and MPI-based RL algorithms, so if this is the problem then one workaround might be to only ever import mpi4py lazily when it's actually needed.
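Such a lazy import could look like this (a sketch only, not what stable-baselines currently does):

def get_mpi():
    # Import mpi4py only at the point an MPI-based algorithm actually needs it,
    # so users of SubprocVecEnv alone never trigger MPI initialisation.
    try:
        from mpi4py import MPI
    except ImportError:
        return None
    return MPI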

@araffin
Collaborator

araffin commented Mar 12, 2019

In terms of the PR, I'd be happy for the default behaviour to be switched to fork if we feel this will break fewer things by default.

Yes, that's my main concern: keeping things easy for users (and not breaking their code), even though it has some risks (which are hopefully rare).

Were the problems you were experiencing related to MPI at all?

Yes, some were with MPI and some were not, but surprisingly enough, it depends on the machine I'm working with.

so if this is the problem then one workaround might be to only ever import mpi4py lazily when it's actually needed.

I think that's what they did recently in the Baselines repo. I assume that the number of users affected by those problems is low (we have not had any issues yet) compared to the complexity of making several tweaks to import mpi4py only when needed.

@AdamGleave AdamGleave mentioned this pull request Aug 5, 2019