Multi-GPUs in tensorflow #3

hixiaye · 2019-05-14T13:40:18Z

Hi @dstamoulis, thanks for your code!
I switched the TPU setup to GPU by using tf.estimator.Estimator and tf.estimator.RunConfig, and training on a single GPU works.
However, when I pass "MirroredStrategy" to tf.estimator.RunConfig for multi-GPU training, it fails.
The error is:
I0514 20:11:40.999713 139768726693632 tf_logging.py:115] Error reported to Coordinator:
Traceback (most recent call last):
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 783, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1168, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/data/project/tensorflow/FACE/SinglePath_NAS/single-path-nas-master_multi_gpus/nas-search/search_main.py", line 361, in nas_model_fn
train_op = ema.apply(ema_vars)
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py", line 431, in apply
self._averages[var], var, decay, zero_debias=zero_debias))
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py", line 84, in assign_moving_average
with ops.colocate_with(variable):
File "/usr/local/miniconda3/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4092, in _colocate_with_for_gradient
with self.colocate_with(op, ignore_existing):
File "/usr/local/miniconda3/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4144, in colocate_with
op = internal_convert_to_tensor_or_indexed_slices(op, as_ref=True).op
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1305, in internal_convert_to_tensor_or_indexed_slices
value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1144, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/miniconda3/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/values.py", line 447, in _tensor_conversion_mirrored
assert not as_ref
AssertionError

Any help would be appreciated, thank you!

fabbrimatteo · 2019-05-17T19:22:06Z

Hi @sxs11,
I also want to get multi-GPU training working. Can you share your code? Maybe we can help each other.

hixiaye · 2019-05-18T06:09:53Z

I made these replacements:
tf.contrib.tpu.TPUEstimatorSpec() -> tf.estimator.EstimatorSpec()
tf.contrib.tpu.RunConfig() -> tf.estimator.RunConfig()
tf.contrib.tpu.TPUEstimator() -> tf.estimator.Estimator()
Other points:
I deleted the flags 'use_tpu', 'tpu', 'gcp_project', and 'tpu_zone', and set the default of 'data_dir' to None. (I only use fake data for debugging.)

I use MirroredStrategy() for multi-GPU training:
NUM_GPUS = 2
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
gpu_options = tf.GPUOptions(allow_growth=True)
session_config = tf.ConfigProto(gpu_options=gpu_options)

Both distribution and session_config are passed as arguments to tf.estimator.RunConfig().
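
For readers following along, the wiring described above might look roughly like the sketch below. It assumes TensorFlow 1.x (tf.contrib was removed in TensorFlow 2) and that the strategy is passed through RunConfig's train_distribute keyword. The function name build_run_config and the model_dir parameter are hypothetical, chosen only for illustration.

```python
def build_run_config(num_gpus=2, model_dir=None):
    """Sketch: build a RunConfig that mirrors training across GPUs (TF 1.x)."""
    # Imported inside the function so this file still loads on machines
    # without TensorFlow 1.x installed.
    import tensorflow as tf

    # Same settings as in the comment above.
    distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)
    session_config = tf.ConfigProto(
        gpu_options=tf.GPUOptions(allow_growth=True))

    # train_distribute tells the Estimator which strategy to train under.
    return tf.estimator.RunConfig(
        model_dir=model_dir,
        train_distribute=distribution,
        session_config=session_config)
```

The returned object would then be passed as config= when constructing tf.estimator.Estimator().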

fabbrimatteo · 2019-05-30T01:54:15Z

I solved it by disabling moving_average_decay (setting its default to 0).

It seems that moving_average_decay is not compatible with multi-GPU training.
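
As background, here is a small plain-Python sketch (no TensorFlow) of the kind of exponential moving average this flag controls, and of skipping it when the flag is 0. The function names are hypothetical, and the `decay > 0` guard is an assumption about how the training script reads the flag, not something confirmed in this thread. With the average disabled, the ema.apply(...) call from the traceback would presumably never be reached.

```python
def ema_update(shadow, value, decay):
    """One averaging step: shadow <- decay * shadow + (1 - decay) * value."""
    return decay * shadow + (1.0 - decay) * value


def maybe_track_average(values, moving_average_decay):
    """Return the final moving average of `values`, or None when disabled."""
    # Assumed guard: a decay of 0 (or less) switches averaging off entirely.
    if moving_average_decay <= 0:
        return None
    # Seed the shadow value with the first observation, then fold in the rest.
    shadow = values[0]
    for value in values[1:]:
        shadow = ema_update(shadow, value, moving_average_decay)
    return shadow
```

For example, maybe_track_average([1.0, 3.0], 0.5) averages the two values, while any call with moving_average_decay=0 returns None without touching them.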

iamweiweishi · 2019-07-23T02:10:41Z

@sxs11 @fabbrimatteo Hi, I replaced 'tf.contrib.tpu.TPUEstimatorSpec' with 'tf.estimator.EstimatorSpec', but the latter does not have the 'host_call' parameter. How should I handle this? Many thanks.

QueeneTam · 2019-09-26T11:53:50Z

Hello, I ran into this problem while trying to reproduce this work. Could you share your code? It would be much appreciated! My email is queene_tam@163.com. Thanks a lot!

QueeneTam · 2019-10-08T10:17:45Z

Hello, I ran into this problem while trying to reproduce this work. Could you share your code? It would be much appreciated! My email is queene_tam@163.com. Thanks a lot!

marsggbo · 2019-10-20T10:27:43Z

Hello, I found a way to solve this problem. Reading the source code of TPUEstimatorSpec, I found that it has a method, as_estimator_spec, so you only need to make the following modification and it will work on GPUs:

def model_fn():
    ...
    spec = TPUEstimatorSpec(
        ...
        host_call=host_call
        ...
    )
    # as_estimator_spec is a method, so it must be called to get the EstimatorSpec.
    return spec.as_estimator_spec()

fabbrimatteo mentioned this issue May 30, 2019

GPU support? #1

Open
