
Conversation

@facaiy (Contributor) commented Mar 17, 2018

Hi, I wrote a simple example to test PBT with Keras on the CIFAR-10 dataset. Feel free to merge or close it.

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4369/

The model comes from: https://zhuanlan.zhihu.com/p/29214791,
and it gets to about 87% validation accuracy in 100 epochs.

Note that the script cannot init CUDA in parallel, hence it
Contributor

Can you tell me more about this?

Contributor Author
Yes. I wrote the script and ran it on a local machine (with 2 GPUs), and I found that all trials report "failed call to cuInit: CUDA_ERROR_NO_DEVICE". You can run the script to reproduce the problem.

Note that tf.keras shares a global graph and session, which might result in conflicts when training in parallel on a local machine.

@ericl (Contributor) Mar 17, 2018

I think this would work if you also set "resources": {"gpu": 1} in the train_spec. By default, if GPU resources are not requested, CUDA_VISIBLE_DEVICES will be set to the empty string, which disables GPU access.
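
For reference, here is a minimal sketch of a train_spec with the suggested resources entry; the other keys (trainable name, stopping condition, hyperparameters) are illustrative assumptions, not taken from the PR:

train_spec = {
    "run": "train_cifar10",              # hypothetical trainable name
    "resources": {"cpu": 1, "gpu": 1},   # request a GPU so CUDA_VISIBLE_DEVICES is populated
    "stop": {"training_iteration": 100}, # assumed stopping condition
    "config": {"batch_size": 64},        # assumed hyperparameters
}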

@facaiy (Contributor Author) Mar 18, 2018

@ericl So Ray always resets CUDA_VISIBLE_DEVICES? I'd like to give it a try later.

@ericl (Contributor) Mar 18, 2018

Yeah, CUDA_VISIBLE_DEVICES will be assigned to be consistent with ray.get_gpu_ids(). So if you request 1 GPU it gets set to a single GPU id, requesting 2 GPUs gives a list of two ids, and 0 means the empty string.
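
A minimal sketch (not part of the PR) illustrating that behaviour: a task that requests one GPU sees CUDA_VISIBLE_DEVICES populated to match ray.get_gpu_ids().

import os
import ray

ray.init(num_gpus=2)

@ray.remote(num_gpus=1)
def show_gpu_assignment():
    # Ray sets CUDA_VISIBLE_DEVICES for this worker to match ray.get_gpu_ids();
    # a task that requests no GPUs would instead see the empty string.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(show_gpu_assignment.remote()))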

from ray.tune.pbt import PopulationBasedTraining


config = tf.ConfigProto(log_device_placement=True)
Contributor
This can be moved into the if __name__ == "__main__" block, right?

Contributor Author
Because tf.keras shares a global session setter, I'm not sure whether it's a good idea to place the config under main. I'd prefer to place it in the _setup method; however, if Ray doesn't allow assigning the same GPU to different trials, I think the config is useless and can be removed.
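
A rough sketch of the alternative being discussed, moving the session config into the trainable's _setup method. The class name, method names, and import path are assumptions based on the Tune Trainable API of that era, and the TF 1.x session API is assumed:

import tensorflow as tf
from ray.tune import Trainable

class Cifar10Trainable(Trainable):  # hypothetical class, not the one in the PR
    def _setup(self):
        # Configure this trial's own TF session here instead of at module import
        # time, so each trial process sets up its session independently (TF 1.x).
        sess_config = tf.ConfigProto(log_device_placement=True)
        tf.keras.backend.set_session(tf.Session(config=sess_config))
        self.model = None  # build and compile the Keras model here

    def _train(self):
        # One PBT step: train for an epoch and report validation accuracy.
        return {"mean_accuracy": 0.0}  # placeholder result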

@facaiy (Contributor Author) commented Mar 19, 2018

@ericl Thanks, the GPU resource setting did work. However, it seems that Ray Tune cannot assign the same GPU to different trials, even though TensorFlow can share one GPU among multiple processes by setting config.gpu_options.per_process_gpu_memory_fraction = 0.25, right?

 74/781 [=>............................] - ETA: 456s - loss: 2.1866 - acc: 0.1924Error starting runner, abort: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ray/tune/trial_runner.py", line 203, in _launch_trial
    trial.start()
  File "/usr/lib/python2.7/site-packages/ray/tune/trial.py", line 133, in start
    self._setup_runner()
  File "/usr/lib/python2.7/site-packages/ray/tune/trial.py", line 363, in _setup_runner
    logger_creator=logger_creator)
  File "/usr/lib/python2.7/site-packages/ray/actor.py", line 728, in remote
    resources, ray.worker.global_worker)
  File "/usr/lib/python2.7/site-packages/ray/actor.py", line 361, in export_actor
    resources.get("GPU", 0), worker.redis_client)
  File "/usr/lib/python2.7/site-packages/ray/utils.py", line 301, in select_local_scheduler
    "information is {}.".format(local_schedulers))
Exception: Could not find a node with enough GPUs or other resources to create this actor. The local scheduler information is [{'ClientType': 'local_scheduler', 'Deleted': False, 'LocalSchedulerSocketName': '/tmp/scheduler79148953', 'AuxAddress': '127.0.0.1:51998', u'GPU': 2.0, u'CPU': 32.0, 'DBClientID': 'f1fa40a0c96401c698245c83b5791ee7d85a8035'}].
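
For context, the TensorFlow-side sharing mentioned above would look roughly like this; a sketch assuming the TF 1.x session API, not code from the PR:

import tensorflow as tf

# Let each process claim only a quarter of the GPU's memory, so that up to
# four trials could in principle share one physical GPU (TF 1.x session API).
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.25
tf.keras.backend.set_session(tf.Session(config=config))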

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4406/

@ericl (Contributor) commented Mar 19, 2018

Ray doesn't currently support fractional GPU assignment: #402

One workaround is to create "fake" GPUs and set CUDA_VISIBLE_DEVICES manually. For example, you could start Ray with --num-gpus=4 and then do
os.environ["CUDA_VISIBLE_DEVICES"] = str(ray.get_gpu_ids()[0] % 4)
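
Fleshed out, that workaround might look roughly like the sketch below. It assumes 2 physical GPUs (as in this thread), so the modulus is 2 rather than the 4 in the comment above; the task body is illustrative:

import os
import ray

# Advertise more "GPUs" than physically exist, e.g. 4 fake GPUs on a 2-GPU box.
ray.init(num_gpus=4)

@ray.remote(num_gpus=1)
def train_trial():
    # Map the fake GPU id Ray assigned back onto one of the 2 real devices.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(int(ray.get_gpu_ids()[0]) % 2)
    # ... build the Keras model and train on that device ...
    return os.environ["CUDA_VISIBLE_DEVICES"]

print(ray.get([train_trial.remote() for _ in range(4)]))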

@facaiy (Contributor Author) commented Mar 19, 2018

@ericl @richardliaw Thanks for your help. I think it's not feasible to use Ray with fractional GPUs for now, so I changed the GPU resource requirement to 4. In fact, because CIFAR-10 is a small dataset, it's fine to run all trials in either CPU or GPU mode.

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4413/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4414/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4416/

@ericl (Contributor) left a comment

LGTM. Do you have any results that go with the example? It could be good to update the PBT docs to reference this.

@ericl ericl merged commit 6b1e592 into ray-project:master Mar 25, 2018
@ericl (Contributor) commented Mar 26, 2018

Filed #1777 to fix the lint error.

@facaiy facaiy deleted the DOC/example_pbt_with_keras branch March 28, 2018 01:51
royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* commit 'f69cbd35d4e86f2a3c2ace875aaf8166edb69f5d': (64 commits)
  Bump version to 0.4.0. (ray-project#1745)
  Fix monitor.py bottleneck by removing excess Redis queries. (ray-project#1786)
  Convert the ObjectTable implementation to a Log (ray-project#1779)
  Acquire worker lock when importing actor. (ray-project#1783)
  Introduce a log interface for the new GCS (ray-project#1771)
  [tune] Fix linting error (ray-project#1777)
  [tune] Added pbt with keras on cifar10 dataset example (ray-project#1729)
  Add a GCS table for the xray task flatbuffer (ray-project#1775)
  [tune] Change tune resource request syntax to be less confusing (ray-project#1764)
  Remove from X import Y convention in RLlib ES. (ray-project#1774)
  Check if the provider is external before getting the config. (ray-project#1743)
  Request and cancel notifications in the new GCS API (ray-project#1758)
  Fix resource bookkeeping for blocked actor methods. (ray-project#1766)
  Fix bug when connecting another driver in local case. (ray-project#1760)
  Define string prefixes for all tables in the new GCS API (ray-project#1755)
  [rllib] Update RLlib to work with new actor scheduling behavior (ray-project#1754)
  Redirect output of all processes by default. (ray-project#1752)
  Add API for getting total cluster resources. (ray-project#1736)
  Always send actor creation tasks to the global scheduler. (ray-project#1757)
  Print error when actor takes too long to start, and refactor error me… (ray-project#1747)
  ...

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py
#	python/ray/rllib/dqn/dqn_evaluator.py
#	python/ray/rllib/dqn/dqn_replay_evaluator.py
#	python/ray/rllib/optimizers/__init__.py
#	python/ray/rllib/tuned_examples/pong-dqn.yaml