[tune] Added pbt with keras on cifar10 dataset example #1729
Conversation
Test PASSed.
The model comes from: https://zhuanlan.zhihu.com/p/29214791,
and it gets to about 87% validation accuracy in 100 epochs.

Note that the script cannot init cuda in parallel, hence it
Can you tell me more about this?
Yes. I wrote the script and ran it on a local machine (with 2 GPUs), and I found that all trials report "failed call to cuInit: CUDA_ERROR_NO_DEVICE". You can run the script and reproduce the problem.
Note that tf.keras shares a global graph and session, which might result in conflicts when training in parallel on a local machine.
I think this would work if you also set "resources": {"gpu": 1} in the train_spec. By default, if GPU resources are not requested, then CUDA_VISIBLE_DEVICES will be set to the empty string, which disables GPU access.
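For illustration, a minimal sketch of that suggestion; the trainable name and the surrounding keys are placeholders based on this discussion, not the exact code in the PR:

```python
# Hypothetical sketch: request one GPU per trial so Ray populates
# CUDA_VISIBLE_DEVICES. "train_cifar10" stands in for the trainable
# registered by this example; the other keys are placeholders.
train_spec = {
    "run": "train_cifar10",
    "resources": {"cpu": 1, "gpu": 1},  # without "gpu", CUDA_VISIBLE_DEVICES is ""
    "stop": {"training_iteration": 100},
    "config": {"lr": 0.001},
}
```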
@ericl So Ray always resets CUDA_VISIBLE_DEVICES? I'd like to give it a try later.
Yeah, CUDA_VISIBLE_DEVICES will be assigned to be consistent with ray.get_gpu_ids(). So if you request 1 GPU it gets set to some GPU id, 2 GPUs gives a list of two ids, and 0 means the empty string.
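A small illustrative sketch of that behavior; ray.get_gpu_ids() and CUDA_VISIBLE_DEVICES are real names, but the task itself is hypothetical:

```python
import os
import ray

@ray.remote(num_gpus=1)  # request 1 GPU for this task
def show_assignment():
    # Ray sets CUDA_VISIBLE_DEVICES to match the ids it assigned,
    # e.g. get_gpu_ids() -> [0] and CUDA_VISIBLE_DEVICES == "0".
    # With num_gpus=0 the variable would be the empty string.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")
```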
from ray.tune.pbt import PopulationBasedTraining

config = tf.ConfigProto(log_device_placement=True)
This can be moved into the if __name__ == "__main__" block, right?
Because tf.keras shares a global setter, I'm not sure whether it is a good idea to place the config under main. I'd prefer placing it in the _setup method; however, if Ray doesn't allow assigning the same GPU to different trials, the config is useless and can be removed.
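For reference, a rough sketch of what placing the session config inside _setup could look like; the class name, the import path, the _setup signature, and the allow_growth line are assumptions layered on the old Trainable API, not code from this PR:

```python
import tensorflow as tf
from ray.tune import Trainable  # import path may differ by Ray version

class Cifar10Model(Trainable):  # hypothetical name for this example's trainable
    def _setup(self):  # signature may differ across Ray versions
        # Configure the TF session per trial process instead of at module level.
        config = tf.ConfigProto(log_device_placement=True)
        config.gpu_options.allow_growth = True  # assumption: let processes share a GPU
        tf.keras.backend.set_session(tf.Session(config=config))
```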
@ericl Thanks, requesting the GPU resource did work. However, it seems that ray-tune cannot assign the same GPU to different trials, although tensorflow can share one GPU with multiple processes. The trials fail with:

74/781 [=>............................] - ETA: 456s - loss: 2.1866 - acc: 0.1924
Error starting runner, abort: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ray/tune/trial_runner.py", line 203, in _launch_trial
    trial.start()
  File "/usr/lib/python2.7/site-packages/ray/tune/trial.py", line 133, in start
    self._setup_runner()
  File "/usr/lib/python2.7/site-packages/ray/tune/trial.py", line 363, in _setup_runner
    logger_creator=logger_creator)
  File "/usr/lib/python2.7/site-packages/ray/actor.py", line 728, in remote
    resources, ray.worker.global_worker)
  File "/usr/lib/python2.7/site-packages/ray/actor.py", line 361, in export_actor
    resources.get("GPU", 0), worker.redis_client)
  File "/usr/lib/python2.7/site-packages/ray/utils.py", line 301, in select_local_scheduler
    "information is {}.".format(local_schedulers))
Exception: Could not find a node with enough GPUs or other resources to create this actor. The local scheduler information is [{'ClientType': 'local_scheduler', 'Deleted': False, 'LocalSchedulerSocketName': '/tmp/scheduler79148953', 'AuxAddress': '127.0.0.1:51998', u'GPU': 2.0, u'CPU': 32.0, 'DBClientID': 'f1fa40a0c96401c698245c83b5791ee7d85a8035'}].
Test PASSed.
Ray doesn't currently support fractional GPU assignment: #402. One workaround is to make "fake" GPUs and manually set CUDA_VISIBLE_DEVICES. For example, you could do --num-gpus=4, and the
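A hedged sketch of that workaround, assuming the 2-physical-GPU machine from this thread is advertised as 4 GPUs; the mapping back to physical devices is manual and not something Ray does for you:

```python
import os
import ray

# Advertise more GPUs than physically exist (equivalent to --num-gpus=4),
# then map each "fake" id back onto a real device inside the trial.
ray.init(num_gpus=4)

NUM_PHYSICAL_GPUS = 2  # assumption: the 2-GPU machine discussed above

def pin_to_real_gpu():
    # Override the CUDA_VISIBLE_DEVICES value Ray assigned, so two trials
    # holding different "fake" ids can share one physical GPU.
    fake_id = ray.get_gpu_ids()[0]
    os.environ["CUDA_VISIBLE_DEVICES"] = str(int(fake_id) % NUM_PHYSICAL_GPUS)
```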
@ericl @richardliaw Thanks for your help. I think it's not feasible to use Ray with fractional GPUs now, so I changed the GPU resource requirement to 4. In fact, because cifar10 is a small dataset, it's OK to run all trials in CPU / GPU mode.
Test PASSed.

Test PASSed.

Test PASSed.
ericl left a comment
LGTM. Do you have any results that go with the example? It could be good to update the PBT docs to reference this.
Filed #1777 to fix the lint error.
* commit 'f69cbd35d4e86f2a3c2ace875aaf8166edb69f5d': (64 commits)
  Bump version to 0.4.0. (ray-project#1745)
  Fix monitor.py bottleneck by removing excess Redis queries. (ray-project#1786)
  Convert the ObjectTable implementation to a Log (ray-project#1779)
  Acquire worker lock when importing actor. (ray-project#1783)
  Introduce a log interface for the new GCS (ray-project#1771)
  [tune] Fix linting error (ray-project#1777)
  [tune] Added pbt with keras on cifar10 dataset example (ray-project#1729)
  Add a GCS table for the xray task flatbuffer (ray-project#1775)
  [tune] Change tune resource request syntax to be less confusing (ray-project#1764)
  Remove from X import Y convention in RLlib ES. (ray-project#1774)
  Check if the provider is external before getting the config. (ray-project#1743)
  Request and cancel notifications in the new GCS API (ray-project#1758)
  Fix resource bookkeeping for blocked actor methods. (ray-project#1766)
  Fix bug when connecting another driver in local case. (ray-project#1760)
  Define string prefixes for all tables in the new GCS API (ray-project#1755)
  [rllib] Update RLlib to work with new actor scheduling behavior (ray-project#1754)
  Redirect output of all processes by default. (ray-project#1752)
  Add API for getting total cluster resources. (ray-project#1736)
  Always send actor creation tasks to the global scheduler. (ray-project#1757)
  Print error when actor takes too long to start, and refactor error me… (ray-project#1747)
  ...

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py
#	python/ray/rllib/dqn/dqn_evaluator.py
#	python/ray/rllib/dqn/dqn_replay_evaluator.py
#	python/ray/rllib/optimizers/__init__.py
#	python/ray/rllib/tuned_examples/pong-dqn.yaml
Hi, I wrote a simple example to test PBT with Keras on the cifar10 dataset. Feel free to merge or close it.