
Conversation

@facaiy (Contributor) commented Mar 17, 2018

Hi, I wrote a simple example to test PBT with Keras on the CIFAR-10 dataset. Feel free to merge or close it.

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4369/

The model comes from: https://zhuanlan.zhihu.com/p/29214791,
and it gets to about 87% validation accuracy in 100 epochs.

Note that the script cannot init CUDA in parallel, hence it
Contributor

Can you tell me more about this?

Contributor Author
Yes. I wrote the script and ran it on a local machine (with 2 GPUs), and I found that all trials report "failed call to cuInit: CUDA_ERROR_NO_DEVICE". You can run the script to reproduce the problem.

Note that tf.keras shares a global graph and session, which might result in conflicts when training in parallel on a local machine.

@ericl (Contributor) Mar 17, 2018

I think this would work if you also set "resources": {"gpu": 1} in the train_spec. By default, if GPU resources are not requested, CUDA_VISIBLE_DEVICES will be set to the empty string, which disables GPU access.
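
For reference, here is a minimal sketch of a train_spec with the suggested resources entry; the other keys (trainable name, stopping condition, hyperparameters) are illustrative assumptions, not taken from the PR:

train_spec = {
    "run": "train_cifar10",              # hypothetical trainable name
    "resources": {"cpu": 1, "gpu": 1},   # request a GPU so CUDA_VISIBLE_DEVICES is populated
    "stop": {"training_iteration": 100}, # assumed stopping condition
    "config": {"batch_size": 64},        # assumed hyperparameters
}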

@facaiy (Contributor Author) Mar 18, 2018

@ericl So Ray always resets CUDA_VISIBLE_DEVICES? I'd like to give it a try later.

@ericl (Contributor) Mar 18, 2018

Yeah, CUDA_VISIBLE_DEVICES will be assigned to be consistent with ray.get_gpu_ids(). So if you request 1 GPU it gets set to a single GPU id, requesting 2 GPUs gives a list of two ids, and 0 means the empty string.
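
A minimal sketch (not part of the PR) illustrating that behaviour: a task that requests one GPU sees CUDA_VISIBLE_DEVICES populated to match ray.get_gpu_ids().

import os
import ray

ray.init(num_gpus=2)

@ray.remote(num_gpus=1)
def show_gpu_assignment():
    # Ray sets CUDA_VISIBLE_DEVICES for this worker to match ray.get_gpu_ids();
    # a task that requests no GPUs would instead see the empty string.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(show_gpu_assignment.remote()))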

from ray.tune.pbt import PopulationBasedTraining


config = tf.ConfigProto(log_device_placement=True)
Contributor
This can be moved into the if __name__ == "__main__" block, right?

Contributor Author
Because tf.keras shares a global session setter, I'm not sure whether it's a good idea to place the config under main. I'd prefer to place it in the _setup method; however, if Ray doesn't allow assigning the same GPU to different trials, I think the config is useless and can be removed.
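
A rough sketch of the alternative being discussed, moving the session config into the trainable's _setup method. The class name, method names, and import path are assumptions based on the Tune Trainable API of that era, and the TF 1.x session API is assumed:

import tensorflow as tf
from ray.tune import Trainable

class Cifar10Trainable(Trainable):  # hypothetical class, not the one in the PR
    def _setup(self):
        # Configure this trial's own TF session here instead of at module import
        # time, so each trial process sets up its session independently (TF 1.x).
        sess_config = tf.ConfigProto(log_device_placement=True)
        tf.keras.backend.set_session(tf.Session(config=sess_config))
        self.model = None  # build and compile the Keras model here

    def _train(self):
        # One PBT step: train for an epoch and report validation accuracy.
        return {"mean_accuracy": 0.0}  # placeholder result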

@facaiy (Contributor Author) commented Mar 19, 2018

@ericl Thanks, the GPU resource setting did work. However, it seems that Ray Tune cannot assign the same GPU to different trials, even though TensorFlow can share one GPU among multiple processes by setting config.gpu_options.per_process_gpu_memory_fraction = 0.25, right?

 74/781 [=>............................] - ETA: 456s - loss: 2.1866 - acc: 0.1924Error starting runner, abort: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ray/tune/trial_runner.py", line 203, in _launch_trial
    trial.start()
  File "/usr/lib/python2.7/site-packages/ray/tune/trial.py", line 133, in start
    self._setup_runner()
  File "/usr/lib/python2.7/site-packages/ray/tune/trial.py", line 363, in _setup_runner
    logger_creator=logger_creator)
  File "/usr/lib/python2.7/site-packages/ray/actor.py", line 728, in remote
    resources, ray.worker.global_worker)
  File "/usr/lib/python2.7/site-packages/ray/actor.py", line 361, in export_actor
    resources.get("GPU", 0), worker.redis_client)
  File "/usr/lib/python2.7/site-packages/ray/utils.py", line 301, in select_local_scheduler
    "information is {}.".format(local_schedulers))
Exception: Could not find a node with enough GPUs or other resources to create this actor. The local scheduler information is [{'ClientType': 'local_scheduler', 'Deleted': False, 'LocalSchedulerSocketName': '/tmp/scheduler79148953', 'AuxAddress': '127.0.0.1:51998', u'GPU': 2.0, u'CPU': 32.0, 'DBClientID': 'f1fa40a0c96401c698245c83b5791ee7d85a8035'}].
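
For context, the TensorFlow-side sharing mentioned above would look roughly like this; a sketch assuming the TF 1.x session API, not code from the PR:

import tensorflow as tf

# Let each process claim only a quarter of the GPU's memory, so that up to
# four trials could in principle share one physical GPU (TF 1.x session API).
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.25
tf.keras.backend.set_session(tf.Session(config=config))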

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4406/

@ericl (Contributor) commented Mar 19, 2018

Ray doesn't currently support fractional GPU assignment: #402

One workaround is to create "fake" GPUs and set CUDA_VISIBLE_DEVICES manually. For example, you could start Ray with --num-gpus=4 and then do
os.environ["CUDA_VISIBLE_DEVICES"] = str(ray.get_gpu_ids()[0] % 4)
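
Fleshed out, that workaround might look roughly like the sketch below. It assumes 2 physical GPUs (as in this thread), so the modulus is 2 rather than the 4 in the comment above; the task body is illustrative:

import os
import ray

# Advertise more "GPUs" than physically exist, e.g. 4 fake GPUs on a 2-GPU box.
ray.init(num_gpus=4)

@ray.remote(num_gpus=1)
def train_trial():
    # Map the fake GPU id Ray assigned back onto one of the 2 real devices.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(int(ray.get_gpu_ids()[0]) % 2)
    # ... build the Keras model and train on that device ...
    return os.environ["CUDA_VISIBLE_DEVICES"]

print(ray.get([train_trial.remote() for _ in range(4)]))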

@facaiy (Contributor Author) commented Mar 19, 2018

@ericl @richardliaw Thanks for your help. I think it's not feasible to use Ray with fractional GPUs for now, so I changed the GPU resource requirement to 4. In fact, because CIFAR-10 is a small dataset, it's fine to run all trials in either CPU or GPU mode.

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4413/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4414/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4416/

@ericl (Contributor) left a comment

LGTM. Do you have any results that go with the example? It could be good to update the PBT docs to reference this.

@ericl ericl merged commit 6b1e592 into ray-project:master Mar 25, 2018
@ericl (Contributor) commented Mar 26, 2018

Filed #1777 to fix the lint error.

@facaiy facaiy deleted the DOC/example_pbt_with_keras branch March 28, 2018 01:51
royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* commit 'f69cbd35d4e86f2a3c2ace875aaf8166edb69f5d': (64 commits)
  Bump version to 0.4.0. (ray-project#1745)
  Fix monitor.py bottleneck by removing excess Redis queries. (ray-project#1786)
  Convert the ObjectTable implementation to a Log (ray-project#1779)
  Acquire worker lock when importing actor. (ray-project#1783)
  Introduce a log interface for the new GCS (ray-project#1771)
  [tune] Fix linting error (ray-project#1777)
  [tune] Added pbt with keras on cifar10 dataset example (ray-project#1729)
  Add a GCS table for the xray task flatbuffer (ray-project#1775)
  [tune] Change tune resource request syntax to be less confusing (ray-project#1764)
  Remove from X import Y convention in RLlib ES. (ray-project#1774)
  Check if the provider is external before getting the config. (ray-project#1743)
  Request and cancel notifications in the new GCS API (ray-project#1758)
  Fix resource bookkeeping for blocked actor methods. (ray-project#1766)
  Fix bug when connecting another driver in local case. (ray-project#1760)
  Define string prefixes for all tables in the new GCS API (ray-project#1755)
  [rllib] Update RLlib to work with new actor scheduling behavior (ray-project#1754)
  Redirect output of all processes by default. (ray-project#1752)
  Add API for getting total cluster resources. (ray-project#1736)
  Always send actor creation tasks to the global scheduler. (ray-project#1757)
  Print error when actor takes too long to start, and refactor error me… (ray-project#1747)
  ...

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py
#	python/ray/rllib/dqn/dqn_evaluator.py
#	python/ray/rllib/dqn/dqn_replay_evaluator.py
#	python/ray/rllib/optimizers/__init__.py
#	python/ray/rllib/tuned_examples/pong-dqn.yaml