Description
Using latest develop-upstream branch and latest benchmarks master.
Running the tf_cnn_benchmarks.py code like so:
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu
This eventually produces the following message during warmup:
terminate called after throwing an instance of 'Kalmar::runtime_exception'
what(): HCC unpinned copy engine error
Aborted (core dumped)
If you set --local_parameter_device=gpu instead, the problem doesn't manifest.
However, the problem happens again even with --local_parameter_device=gpu during distributed training. Running 1 worker and 1 server like so:
# worker
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=worker --task_index=0 --server_protocol=grpc
# ps
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=ps --task_index=0 --server_protocol=grpc
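The worker and ps invocations above differ only in --job_name, so for anyone reproducing this, a small script can generate both from one set of flags and avoid typos. This is only a sketch: the flag names and values are copied from the commands above, while `COMMON_FLAGS` and `build_command` are hypothetical names made up for illustration and are not part of tf_cnn_benchmarks.

```python
import shlex

# Flags shared by both roles, copied from the worker/ps commands in this report.
COMMON_FLAGS = {
    "local_parameter_device": "gpu",
    "num_gpus": 4,
    "batch_size": 64,
    "model": "resnet50",
    "variable_update": "distributed_replicated",
    "ps_hosts": "prj47-rack-05:50000",
    "worker_hosts": "prj47-rack-02:50001",
    "server_protocol": "grpc",
}


def build_command(job_name, task_index=0):
    """Build the command line for one role (hypothetical helper, not a benchmarks API)."""
    # Add the two per-role flags on top of the shared ones.
    flags = dict(COMMON_FLAGS, job_name=job_name, task_index=task_index)
    args = ["python", "tf_cnn_benchmarks.py"]
    args += ["--{}={}".format(name, value) for name, value in flags.items()]
    # Quote each argument so the result can be pasted into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    for role in ("worker", "ps"):
        print("# " + role)
        print(build_command(role))
```

Run the printed worker line on prj47-rack-02 and the ps line on prj47-rack-05, matching the host lists above.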
At least for distributed training, my guess is that tensors are being copied from GPU to CPU before being packed into protobufs and shipped via grpc. I'm not sure why this also happens during warmup in the single-node case, except that I specified the parameter device to be CPU, which forces a device-to-host copy to store the params.
misc system info
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
lscpu
AMD EPYC 7551 32-Core Processor
uname -a
Linux prj47-rack-02 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
LD_LIBRARY_PATH /home/jdaily/openmpi-3.1.0-install/lib
DYLD_LIBRARY_PATH is unset
rocm-clang-ocl/Ubuntu 16.04,now 0.3.0-c1b678e amd64 [installed,automatic]
rocm-dev/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-device-libs/Ubuntu 16.04,now 0.0.1 amd64 [installed]
rocm-dkms/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-libs/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-opencl/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-opencl-dev/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-profiler/Ubuntu 16.04,now 5.4.6797 amd64 [installed]
rocm-smi/Ubuntu 16.04,now 1.0.0-42-g0ae1c36 amd64 [installed,automatic]
rocm-utils/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocminfo/now 1.0.7 amd64 [installed,local]