ray[RLlib]: Windows fatal exception: access violation #24955
Comments
Let me look into this.
So the error can't be reproduced on your machine. Mine is a Dell G3 15 laptop: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (12 cores). OS:
I see. Note that I don't have any GPU on my Azure Windows VM. It's Windows 10 Pro 20H2.
So there might be a serious problem once the cluster gets updated?
A random guess: the error message seems to say `(DQNTrainer pid=7004) ModuleNotFoundError: No module named 'pyglet'`
Yeah, but that wasn't the reason for the error. Here is the updated console output after installing the package.
I could not reproduce this with the latest ray HEAD. I did need to remove the
With the nightly version, all 3 parallel tune runs start. The access violation does not occur, but another unspecific error does: an actor died unexpectedly. Console output
The error files of each worker are the same:
Edit: (mattip) put the error log into a
@Peter-P779 did you change anything in the script or install instructions? Which nightly did you use?
I didn't change anything in the script except deleting `pip uninstall -y ray`. The version is the Windows Python 3.9 nightly:
Hello all, I can reproduce the crash on my Windows desktop on both the current nightly and the PyPI release. I stumbled over this issue while investigating an unexpected crash using only Ray Core, which occurs exclusively on my home desktop.
TL;DR: I could not reproduce. If someone can still reproduce this, please report what you did using the comment below as a template, starting from a vanilla Python installation. And in too much detail: here is the script I used ("Modified script").
Here is what I did
I then get a number of diagnostic messages on startup with hints to improve the script.

Startup messages:

```
2022-11-15 15:32:12,844 INFO worker.py:1519 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
d:\temp\issue24955\lib\site-packages\ray\tune\tune.py:523: UserWarning: Consider boosting PBT performance by enabling `reuse_actors` as well as implementing `reset_config` for Trainable.
  warnings.warn(
2022-11-15 15:32:14,520 WARNING trial_runner.py:1604 -- You are trying to access _search_alg interface of TrialRunner in TrialScheduler, which is being restricted. If you believe it is reasonable for your scheduler to access this TrialRunner API, please reach out to Ray team on GitHub. A more strict API access pattern would be enforced starting 1.12s.0
(DQN pid=10472) 2022-11-15 15:32:19,748 INFO algorithm.py:2303 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
(DQN pid=10472) 2022-11-15 15:32:19,748 INFO simple_q.py:307 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
(DQN pid=10472) 2022-11-15 15:32:19,748 INFO algorithm.py:457 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=1228) d:\temp\issue24955\lib\site-packages\gym\envs\registration.py:505: UserWarning: WARN: The environment CartPole-v0 is out of date. You should consider upgrading to version `v1` with the environment ID `CartPole-v1`.
(RolloutWorker pid=1228)   logger.warn(
(RolloutWorker pid=10248) d:\temp\issue24955\lib\site-packages\gym\envs\registration.py:505: UserWarning: WARN: The environment CartPole-v0 is out of date. You should consider upgrading to version `v1` with the environment ID `CartPole-v1`.
(RolloutWorker pid=10248)   logger.warn(
(RolloutWorker pid=6740) d:\temp\issue24955\lib\site-packages\gym\envs\registration.py:505: UserWarning: WARN: The environment CartPole-v0 is out of date. You should consider upgrading to version `v1` with the environment ID `CartPole-v1`.
(RolloutWorker pid=6740)   logger.warn(
(RolloutWorker pid=1228) 2022-11-15 15:32:24,547 WARNING env.py:159 -- Your env reset() method appears to take 'seed' or 'return_info' arguments. Note that these are not yet supported in RLlib. Seeding will take place using 'env.seed()' and the info dict will not be returned from reset.
== Status ==
```

The script runs, and I can see the resource usage on the dashboard. There are 8 RolloutWorker actors and 2 DQN actors. The processes seem to take up to 14.3GB of RAM. The script runs for much more than 90 seconds: I stopped it after ~10 minutes by pressing CTRL-C, and it stopped cleanly.
Hey, sorry for the somewhat unspecific response. It has been a while, but I remember, after cross-examining my working and non-working systems, that the issue only occurred with a specific Python 3.9 patch version. Switching to a previous patch resolved my problems completely.
Perhaps your machine has 16GB of RAM, which is enough on Linux but not sufficient on Windows to run this experiment.
Closing this, as we seem to lack a reproduction; it may be related to Python versioning.
What happened + What you expected to happen
Expectation: Training CartPole
What happens: Windows fatal exception: access violation
Versions / Dependencies
ray, version 1.12.0
Python 3.9.12
gym 0.21.0
Reproduction script
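The original reproduction script did not survive this extract. From the console output quoted in the thread (a DQN trainer on `CartPole-v0`, three parallel `ray.tune` trials with a PBT scheduler, `framework='tf'`), a minimal sketch of the kind of configuration involved might look like this. All specific values below (`num_workers`, the stop criterion, `num_samples`) are illustrative assumptions, not the reporter's actual settings:

```python
# Hypothetical sketch of the configuration the thread implies -- NOT the
# reporter's original script. "CartPole-v0" and framework="tf" come from
# the logged warnings; everything else is an illustrative assumption.
config = {
    "env": "CartPole-v0",
    "framework": "tf",    # static-graph mode, per the startup log
    "num_workers": 2,     # assumption: a few RolloutWorker actors per trial
}

stop = {"time_total_s": 90}  # the thread mentions a ~90-second run

# With ray[rllib] 1.12.0 and gym 0.21.0 installed, this would be launched
# roughly as (assumed API shape for Ray 1.x):
#   from ray import tune
#   tune.run("DQN", config=config, stop=stop, num_samples=3)

print(sorted(config))
```

On Windows, running something along these lines with Ray 1.12.0 is what triggered the reported access violation for the original poster.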
Issue Severity
High: It blocks me from completing my task.