errors in pin-in-place path in HCC unpinned copy engine  #27

@jeffdaily

I am using the latest develop-upstream branch and the latest benchmarks master.
Running tf_cnn_benchmarks.py like so:

python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu

During warmup, this eventually produces the following message:

terminate called after throwing an instance of 'Kalmar::runtime_exception'
  what():  HCC unpinned copy engine error
Aborted (core dumped)

If you set --local_parameter_device=gpu instead, the problem doesn't manifest.

However, the problem reappears during distributed training, even with --local_parameter_device=gpu. I ran 1 worker and 1 parameter server (ps) like so:

# worker
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=worker --task_index=0 --server_protocol=grpc
# ps
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=ps --task_index=0 --server_protocol=grpc
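
The worker and ps commands differ only in --job_name, so anyone reproducing this may find it easier to generate both from one set of flags than to retype them. The following is a minimal sketch, not part of the benchmark suite: the helper name `build_command` and the `COMMON_FLAGS` dictionary are hypothetical, and the flag values are simply copied from the commands in this report.

```python
import shlex

# Flags shared by both roles, copied from the commands in this report.
# COMMON_FLAGS and build_command are hypothetical names for illustration.
COMMON_FLAGS = {
    "local_parameter_device": "gpu",
    "num_gpus": 4,
    "batch_size": 64,
    "model": "resnet50",
    "variable_update": "distributed_replicated",
    "ps_hosts": "prj47-rack-05:50000",
    "worker_hosts": "prj47-rack-02:50001",
    "server_protocol": "grpc",
}


def build_command(job_name, task_index=0):
    """Return the tf_cnn_benchmarks.py argv list for one role ('worker' or 'ps')."""
    if job_name not in ("worker", "ps"):
        raise ValueError("job_name must be 'worker' or 'ps'")
    # Add the two role-specific flags on top of the shared ones.
    flags = dict(COMMON_FLAGS, job_name=job_name, task_index=task_index)
    return ["python", "tf_cnn_benchmarks.py"] + [
        "--{}={}".format(name, value) for name, value in flags.items()
    ]


if __name__ == "__main__":
    # Print one shell-quoted command line per role.
    for role in ("worker", "ps"):
        print("# " + role)
        print(" ".join(shlex.quote(arg) for arg in build_command(role)))
```

Run the printed worker line on the worker host and the ps line on the ps host; the returned list can also be passed directly to `subprocess.run`.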

For distributed training at least, my guess is that tensors are copied from GPU to CPU before being packed into protobufs and shipped via gRPC. I'm not sure why this also happens during warmup in the single-node case, except that I set the parameter device to CPU, which forces a device-to-host copy to store the parameters.

misc system info

c++ (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

lscpu

AMD EPYC 7551 32-Core Processor

uname -a
Linux prj47-rack-02 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

LD_LIBRARY_PATH /home/jdaily/openmpi-3.1.0-install/lib
DYLD_LIBRARY_PATH is unset

rocm-clang-ocl/Ubuntu 16.04,now 0.3.0-c1b678e amd64 [installed,automatic]
rocm-dev/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-device-libs/Ubuntu 16.04,now 0.0.1 amd64 [installed]
rocm-dkms/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-libs/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-opencl/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-opencl-dev/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-profiler/Ubuntu 16.04,now 5.4.6797 amd64 [installed]
rocm-smi/Ubuntu 16.04,now 1.0.0-42-g0ae1c36 amd64 [installed,automatic]
rocm-utils/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocminfo/now 1.0.7 amd64 [installed,local]
