SubprocVecEnv performance compared to gym.vector.async_vector_env #121
Comments
Hello, take a look at the multiprocessing tutorial, especially this part: https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/3_multiprocessing.ipynb
Thank you @araffin for the links. I get the same speed with …, but I cannot understand why a higher number of threads, using all the cores, is slower. Thank you
I see similar behavior with stable-baselines (TensorFlow). It might be that most of the time is spent by the "learner" rather than by the worker/experience-collecting processes, but I might be wrong.
We believe it is PyTorch trying to parallelize every single computation with the given number of threads, but if the computations are small (which they are, if you are using a small env with MlpPolicy), all the overhead just slows you down. We have seen ~5x speedups from setting `torch.set_num_threads(1)`.
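A minimal sketch of that setting (the env id, env count, and timestep budget here are placeholders, not values from this issue):

```python
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # Limit PyTorch's intra-op parallelism: with a small MlpPolicy the
    # per-op work is tiny, so extra threads mostly add scheduling overhead.
    torch.set_num_threads(1)

    # 8 worker processes collect experience in parallel.
    vec_env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=100_000)
```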
The overhead comes from inter-process communication, which dominates when the environments themselves are fast and the data passed around is small. Edit: One should bear in mind that multiple envs have other effects than sampling speed alone. More environments -> more samples from different states of the environment -> better estimates of the expectations. Generally one should see stabler learning with more environments, but not necessarily better sample efficiency.
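That IPC cost can be measured directly by timing raw stepping through `DummyVecEnv` (all envs in the main process) against `SubprocVecEnv` (one process per env). A rough sketch, with `CartPole-v1` standing in for a fast, small-observation env:

```python
import time
import numpy as np
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv

def fps(vec_env_cls, n_envs=8, n_steps=1_000):
    # Time raw experience collection with random actions; no learning involved.
    vec_env = make_vec_env("CartPole-v1", n_envs=n_envs, vec_env_cls=vec_env_cls)
    vec_env.reset()
    start = time.perf_counter()
    for _ in range(n_steps):
        actions = np.array([vec_env.action_space.sample() for _ in range(n_envs)])
        vec_env.step(actions)
    elapsed = time.perf_counter() - start
    vec_env.close()
    return n_envs * n_steps / elapsed

if __name__ == "__main__":
    # In-process stepping avoids the pipe round-trip SubprocVecEnv pays per step.
    print(f"DummyVecEnv:   {fps(DummyVecEnv):,.0f} FPS")
    print(f"SubprocVecEnv: {fps(SubprocVecEnv):,.0f} FPS")
```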
We have a tutorial about that ;)
@AlessandroZavoli @Miffyli @araffin I will put the plots here once training is finished.
These are the training curves; the gray (lower) curve is 2 envs. Training with a higher number of envs is very time-consuming, but I see that the 16-env case can reach the same reward value much faster in wall-clock time (27m compared to 1h 2m).
On my PC I got a similar result, but 8 cores was the optimal choice. I think we should distinguish between sample-collection time (which depends on the custom env) and training time (spent in SGD, for example).
Yes, using more environments (with the same n_steps -> more samples) is expected to result in stabler and/or faster learning, sometimes even in terms of env steps. I tried to find a paper with experiments on this very topic but cannot find it for the life of me. The closest thing I have to share is the OpenAI Dota 2 paper, where Figure 5 compares different batch sizes. The training times should be minuscule and only have a real effect on training speed if you can reach thousands of FPS with your environment (e.g. basic control tasks).
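To make the sample count concrete: per PPO update, the rollout buffer holds n_envs × n_steps transitions. Assuming SB3 PPO's default n_steps = 2048 (the runs above may use different values), the 2-env run collects 2 × 2048 = 4096 samples per update while the 16-env run collects 16 × 2048 = 32768, drawn from 16 independently evolving copies of the environment.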
Hi,
I'm trying to use `SubprocVecEnv` to create a vectorized environment and use it in my own PPO implementation. I have a couple of questions about the performance of this vectorization and about hyperparameters.

I have a 28-core CPU and an RTX 2080 Ti GPU. When I use `gym.vector.async_vector_env` to create vectorized envs, it is 3 to 6 times faster than `SubprocVecEnv` from stable_baselines3.

With `SubprocVecEnv`, when I set the number of threads using `torch.set_num_threads(28)`, all the cores are involved, but it is almost two times slower than using `torch.set_num_threads(10)`.

I ran all the comparisons with 100 parallel envs. I'm not sure how I should choose the number of envs and the torch thread count; I suspect the slower performance compared to `gym.vector.async_vector_env` comes from bad hyperparameters on my side.

Which parameters can I tune to get the best performance, and which ones matter most?
Thank you
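For reference, the comparison described above could be reproduced along these lines (env id, env count, and step count are placeholders; `AsyncVectorEnv` is the class behind `gym.vector.async_vector_env`, and only raw collection throughput is measured, with no learning):

```python
import time
import numpy as np
import gym
from stable_baselines3.common.vec_env import SubprocVecEnv

N_ENVS, N_STEPS = 16, 1_000  # placeholders; the setup above used 100 envs on 28 cores

def make_env():
    return gym.make("CartPole-v1")  # placeholder env with a Discrete(2) action space

def fps(vec_env):
    # Drive the vectorized env with random actions and report transitions/second.
    vec_env.reset()
    start = time.perf_counter()
    for _ in range(N_STEPS):
        vec_env.step(np.random.randint(0, 2, size=N_ENVS))
    throughput = N_ENVS * N_STEPS / (time.perf_counter() - start)
    vec_env.close()
    return throughput

if __name__ == "__main__":
    print(f"SubprocVecEnv:  {fps(SubprocVecEnv([make_env] * N_ENVS)):,.0f} FPS")
    print(f"AsyncVectorEnv: {fps(gym.vector.AsyncVectorEnv([make_env] * N_ENVS)):,.0f} FPS")
```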