/opt/rocm-3.1.0/hip/../include/hip/hcc_detail/hip_runtime_api.h:48:10: fatal error: 'hsa/hsa.h' file not found #927

Closed
sumannelli-Ib opened this issue Apr 10, 2020 · 10 comments

@sumannelli-Ib

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04

  • TensorFlow version:1.15.3

  • Python version:3.6.9

  • Installed using virtualenv? pip? conda?:pip

  • GCC/Compiler version (if compiling from source): 4:7.4.0-1ubuntu2.3

  • ROCm/MIOpen version:3.1

  • GPU model and memory:AMD radeon vega vii 16GB

Training a model on a simple MNIST dataset works fine. However, training with the TensorFlow Object Detection API on tensorflow-rocm==1.15.3 fails. tensorflow-rocm==1.15.0 works with ROCm 2.10, but it is much slower than the CPU.

I get the logs below.

warning: :0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
warning: :0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
INFO:tensorflow:global_step/sec: 0
I0409 23:03:14.482075 139763101185792 supervisor.py:1099] global_step/sec: 0
INFO:tensorflow:Recording summary at step 412.
I0409 23:03:24.583948 139763092793088 supervisor.py:1050] Recording summary at step 412.
In file included from gridwise_convolution_implicit_gemm_v4_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:4:
In file included from ./config.hpp:4:
In file included from /opt/rocm-3.1.0/hip/../include/hip/hip_runtime.h:56:
In file included from /opt/rocm-3.1.0/hip/../include/hip/hcc_detail/hip_runtime.h:57:
In file included from /opt/rocm-3.1.0/hip/../include/hip/hip_runtime_api.h:342:
/opt/rocm-3.1.0/hip/../include/hip/hcc_detail/hip_runtime_api.h:48:10: fatal error: 'hsa/hsa.h' file not found
#include <hsa/hsa.h>
^~~~~~~~~~~
1 error generated.
In file included from gridwise_convolution_implicit_gemm_v4_nchw_kcyx_nkhw_lds_double_buffer.cpp:1:
In file included from ./common_header.hpp:4:
In file included from ./config.hpp:4:
In file included from /opt/rocm-3.1.0/hip/../include/hip/hip_runtime.h:56:
In file included from /opt/rocm-3.1.0/hip/../include/hip/hcc_detail/hip_runtime.h:57:
In file included from /opt/rocm-3.1.0/hip/../include/hip/hip_runtime_api.h:342:
/opt/rocm-3.1.0/hip/../include/hip/hcc_detail/hip_runtime_api.h:48:10: fatal error: 'hsa/hsa.h' file not found
#include <hsa/hsa.h>
^~~~~~~~~~~
1 error generated.
MIOpen Error: /root/driver/MLOpen/src/tmp_dir.cpp:47: Can't execute cd /tmp/miopen-gridwise_convolution_implicit_gemm_v4_nchw_kcyx_nkhw_lds_double_buffer.cpp-5b3a-efef-e8ae-4c89; KMOPTLLC="-mattr=+enable-ds128 -amdgpu-enable-global-sgpr-addr --amdgpu-spill-vgpr-to-agpr=0" /opt/rocm-3.1.0/hcc/bin/hcc -DCK_PARAM_PROBLEM_K=64 -DCK_PARAM_PROBLEM_C=256 -DCK_PARAM_PROBLEM_HI=2 -DCK_PARAM_PROBLEM_WI=2 -DCK_PARAM_PROBLEM_HO=2 -DCK_PARAM_PROBLEM_WO=2 -std=c++14 -DCK_PARAM_PROBLEM_CONV_DIRECTION_FORWARD=0 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_DATA=0 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_WEIGHT=1 -DCK_PARAM_PROBLEM_N=10 -DCK_PARAM_PROBLEM_Y=1 -DCK_PARAM_PROBLEM_X=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_H=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_W=1 -DCK_PARAM_PROBLEM_CONV_DILATION_H=1 -DCK_PARAM_PROBLEM_CONV_DILATION_W=1 -DCK_PARAM_TUNABLE_BLOCK_SIZE=64 -DCK_PARAM_TUNABLE_B_PER_BLOCK=16 -DCK_PARAM_TUNABLE_K_PER_BLOCK=32 -DCK_PARAM_TUNABLE_E_PER_BLOCK=4 -DCK_PARAM_DEPENDENT_GRID_SIZE=4 -DCK_PARAM_GEMM_N_REPEAT=2 -DCK_PARAM_GEMM_M_PER_THREAD_SUB_C=4 -DCK_PARAM_GEMM_N_PER_THREAD_SUB_C=4 -DCK_PARAM_GEMM_M_LEVEL0_CLUSTER=1 -DCK_PARAM_GEMM_N_LEVEL0_CLUSTER=4 -DCK_PARAM_GEMM_M_LEVEL1_CLUSTER=4 -DCK_PARAM_GEMM_N_LEVEL1_CLUSTER=4 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_E=4 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_N1=1 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_B=16 -DCK_PARAM_IN_BLOCK_COPY_CLUSTER_LENGTHS_N2=1 -DCK_PARAM_IN_BLOCK_COPY_SRC_DATA_PER_READ_B=1 -DCK_PARAM_IN_BLOCK_COPY_DST_DATA_PER_WRITE_N2=4 -DCK_PARAM_WEI_BLOCK_COPY_CLUSTER_LENGTHS_E=4 -DCK_PARAM_WEI_BLOCK_COPY_CLUSTER_LENGTHS_K=16 -DCK_PARAM_WEI_BLOCK_COPY_SRC_DATA_PER_READ_E=1 -DCK_PARAM_WEI_BLOCK_COPY_DST_DATA_PER_WRITE_K=2 -DCK_PARAM_EPACK_LENGTH=1 -DCK_THREADWISE_GEMM_USE_AMD_INLINE_ASM=1 -DCK_USE_AMD_INLINE_ASM=1 -D__HIP_PLATFORM_HCC__=1 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -mcpu=gfx906 -Wno-everything 
-amdgpu-target=gfx906 -Wno-unused-command-line-argument -I. -isystem /opt/rocm-3.1.0/hip/../include -isystem /opt/rocm-3.1.0/hip/include -isystem /opt/rocm/include -hc -Wl,--enable-new-dtags -hc -L /opt/rocm/lib -Wl,-rpath /opt/rocm/lib -Wl,--whole-archive -hc -fPIC -std=c++14 -isystem /opt/rocm/include -isystem /opt/rocm/include -ldl -hc -fPIC -std=c++14 -isystem /opt/rocm/include -isystem /opt/rocm/include -Wl,--no-whole-archive -ldl -lm -hc -fPIC -std=c++14 -isystem /opt/rocm/include -isystem /opt/rocm/include gridwise_convolution_implicit_gemm_v4_nchw_kcyx_nkhw_lds_double_buffer.cpp -o /tmp/miopen-gridwise_convolution_implicit_gemm_v4_nchw_kcyx_nkhw_lds_double_buffer.cpp-5b3a-efef-e8ae-4c89/gridwise_convolution_implicit_gemm_v4_nchw_kcyx_nkhw_lds_double_buffer.cpp.o
2020-04-09 23:04:00.090867: F tensorflow/stream_executor/rocm/rocm_dnn.cc:2778] Check failed: status == miopenStatusSuccess (7 vs. 0)Unable to find a suitable algorithm for doing backward filter convolution
Fatal Python error: Aborted

Thread 0x00007f1904baa700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 295 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 551 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 311 in wait_for_stop
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 293 in _close_on_stop
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f19053ab700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1ce4ff9700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 295 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 551 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 311 in wait_for_stop
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 293 in _close_on_stop
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1ce57fa700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1ce5ffb700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1ce67fc700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1ce6ffd700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1ce77fe700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1ce7fff700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1d22ffd700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1d237fe700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287 in _single_operation_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 864 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1d227fc700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 299 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 551 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 311 in wait_for_stop
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 493 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1d21ffb700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 299 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 551 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 311 in wait_for_stop
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 493 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1d217fa700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/supervisor.py", line 1045 in run_loop
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 495 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1ebccbe700 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 295 in wait
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/queue.py", line 164 in get
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f1f2ccb8740 (most recent call first):
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 490 in train_step
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 775 in train
File "/home/suman/tensorflow/models/research/object_detection/legacy/trainer.py", line 417 in train
File "./legacy/train.py", line 181 in main
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324 in new_func
File "/home/suman/.local/lib/python3.6/site-packages/absl/app.py", line 250 in _run_main
File "/home/suman/.local/lib/python3.6/site-packages/absl/app.py", line 299 in run
File "/home/suman/anaconda3/envs/tfff/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
File "./legacy/train.py", line 185 in
Aborted (core dumped)
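
For anyone hitting the same `'hsa/hsa.h' file not found` error, a minimal sketch for checking whether the header exists in any of the include directories that appear as `-isystem` flags in the failing compile command above. The function name `find_header` is hypothetical and used only for illustration; the directory list is copied from the log and may differ on other installations.

```python
from pathlib import Path

# Include directories copied from the -isystem flags in the log above.
# Adjust these if your ROCm installation lives elsewhere.
INCLUDE_DIRS = [
    "/opt/rocm-3.1.0/hip/../include",
    "/opt/rocm-3.1.0/hip/include",
    "/opt/rocm/include",
]


def find_header(header, include_dirs):
    """Return the include directories that contain `header` (hypothetical helper)."""
    return [d for d in include_dirs if (Path(d) / header).is_file()]


if __name__ == "__main__":
    hits = find_header("hsa/hsa.h", INCLUDE_DIRS)
    if hits:
        print("hsa/hsa.h found in:", ", ".join(hits))
    else:
        print("hsa/hsa.h not found in any searched include directory")
```

If the header is reported as missing, the compile failure is more likely an installation problem than a problem in the training script.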

@sumannelli-Ib
Author

sumannelli-Ib commented Apr 10, 2020

Hi,
I just figured out part of the issue: training works if I keep the batch size between 1 and 5. But that is not a usable setup. My GPU has 16 GB of memory, while my other system, which runs on CPU with 8 GB of memory, handles a batch size of 24.
With a batch size of 5 the model doesn't learn anything. With so few images per step the gradients are too noisy, so it never finds a local minimum and the loss never converges.

@sunway513

Hi @sumannelli-Ib, thanks for creating this issue.
tensorflow-rocm 1.15.3 was built to work with ROCm 3.3. It is not compatible with the ROCm 3.1.0 on your system.
For compatibility details, please check the following document:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md
To downgrade to tensorflow-rocm1.15.2:
pip3 install tensorflow-rocm==1.15.2
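
A minimal sketch of checking a tensorflow-rocm / ROCm pairing before training. The table below contains only pairings mentioned in this thread (the 1.15.2 entry is implied by the downgrade suggestion rather than stated outright, so treat it as an assumption); the linked release document remains the authoritative source. The names `BUILT_FOR`, `major_minor`, and `check_pairing` are hypothetical.

```python
# Pairings taken from this thread only; verify against the release document.
BUILT_FOR = {
    "1.15.3": "3.3",   # stated above: built to work with ROCm 3.3
    "1.15.2": "3.1",   # assumption: implied by the downgrade suggestion above
    "1.15.0": "2.10",  # combination the reporter describes as working
}


def major_minor(version):
    """Reduce a version such as '3.1.0' to its major.minor part, '3.1'."""
    return ".".join(version.split(".")[:2])


def check_pairing(tf_rocm_version, rocm_version, table=BUILT_FOR):
    """Return 'ok', 'mismatch', or 'unknown' for a wheel/ROCm combination."""
    expected = table.get(tf_rocm_version)
    if expected is None:
        return "unknown"
    return "ok" if major_minor(rocm_version) == expected else "mismatch"


if __name__ == "__main__":
    # The combination reported in this issue.
    print(check_pairing("1.15.3", "3.1.0"))
```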

@sumannelli-Ib
Author

sumannelli-Ib commented Apr 12, 2020

Hi sunway513,
Thanks for the reply.
Yes, I tried the combinations from the reference, as follows.
With tensorflow-rocm==1.15.2 and ROCm 3.1, the ROCm drivers are not detected and an error is thrown. So I installed tensorflow-rocm==1.15.3; this time the drivers are detected, but it throws the error mentioned above.
I also tried tensorflow-rocm==1.15.3 with ROCm 3.3 and get the same error.
As mentioned in my second comment, training works when batch_size is between 1 and 5. With anything larger, it throws the error.

@sunway513

Hi @sumannelli-Ib, at this point we still cannot confirm whether this issue is due to your local system configuration or is a real issue that is worth further investigation.
Could you please try your workload inside the following Docker container? If it still fails, let us know the exact steps to reproduce the problem and we can take it from there:
rocm/tensorflow:rocm3.3-tf1.15-dev
For instructions to setup docker container, please refer to the following documents:
https://github.com/RadeonOpenCompute/ROCm-docker/blob/master/quick-start.md
https://hub.docker.com/r/rocm/tensorflow
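
A minimal sketch of assembling a `docker run` command for the image named above. The `--device=/dev/kfd`, `--device=/dev/dri`, and `--group-add video` flags reflect the usual way ROCm containers are given GPU access, but please verify them against the linked quick-start before relying on them. The helper name `docker_run_command` and the `/workspace` mount point are hypothetical.

```python
import shlex

IMAGE = "rocm/tensorflow:rocm3.3-tf1.15-dev"


def docker_run_command(image, workdir_mount=None):
    """Build an argument list for `docker run` (hypothetical helper)."""
    cmd = [
        "docker", "run", "-it",
        "--device=/dev/kfd",     # ROCm compute device node
        "--device=/dev/dri",     # GPU render nodes
        "--group-add", "video",  # group that typically owns the device nodes
    ]
    if workdir_mount:
        # /workspace is an arbitrary mount point chosen for this example.
        cmd += ["-v", f"{workdir_mount}:/workspace"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable shell command.
    print(" ".join(shlex.quote(arg) for arg in docker_run_command(IMAGE)))
```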

@sumannelli-Ib
Author

Hi sunway513,

I followed the references above and got the error below. The issue seems to be related to HIP memory.

2020-04-15 15:49:47.266809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libhip_hcc.so
2020-04-15 15:49:47.266921: E tensorflow/stream_executor/rocm/rocm_driver.cc:1031] could not retrieve ROCM device count: hipError_t(100)
2020-04-15 15:49:47.267145: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2020-04-15 15:49:47.288336: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3403500000 Hz
2020-04-15 15:49:47.288750: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55768b5909f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-15 15:49:47.288774: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-15 15:49:47.288846: E tensorflow/stream_executor/rocm/rocm_driver.cc:1031] could not retrieve ROCM device count: hipError_t(100)
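
The `could not retrieve ROCM device count: hipError_t(100)` line suggests the runtime sees no GPU at all (100 appears to correspond to `hipErrorNoDevice` in the HIP headers, though that is worth confirming for the installed version). A minimal sketch, assuming the standard ROCm device nodes `/dev/kfd` and `/dev/dri`, for checking whether they are visible to the current process, for example inside a container. The function name `check_device_nodes` is hypothetical.

```python
import os


def check_device_nodes(kfd="/dev/kfd", dri="/dev/dri"):
    """Report whether the ROCm device nodes are visible (hypothetical helper)."""
    report = {}
    if not os.path.exists(kfd):
        report[kfd] = "missing"
    elif not os.access(kfd, os.R_OK | os.W_OK):
        report[kfd] = "present but not readable/writable by this user"
    else:
        report[kfd] = "ok"
    # /dev/dri is a directory holding the render nodes; only its presence is checked.
    report[dri] = "ok" if os.path.isdir(dri) else "missing"
    return report


if __name__ == "__main__":
    for path, status in check_device_nodes().items():
        print(f"{path}: {status}")
```

A "missing" result inside a container usually means the device nodes were not passed through when the container was started; a permissions result usually points at group membership.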

@sumannelli-Ib
Author

Hi Sunway513,
I cleaned my setup and reinstalled everything. It now works with ROCm 2.10.0 and TensorFlow 1.15.0, but it is slower than my CPU.

Please advise on this.

What is the 2.10.0-hipclang version? How does it differ from 2.10.0? Can I use it for the TFOD API?

Thanks
Suman Nelli

@sunway513

Hi @sumannelli-Ib, hipclang is a new compiler infrastructure that ROCm will migrate to in the near future. The branch you mentioned was created to prepare the needed changes in the TensorFlow code base; it is not directly related to TFOD API support.

@jerryyin
Member

jerryyin commented May 4, 2020

@sunway513 Thanks for taking a look at this issue. I'm assigning the issue to you now so that it has an assignee. Please feel free to re-assign/close if you feel this is out of scope.

@sunway513

@jerryyin , what we can do is very limited without reproducible steps inside Docker containers.

@jerryyin
Member

jerryyin commented May 4, 2020

@sunway513 That's reasonable.

@sumannelli-Ib Judging that the original compiler issue is fixed, I'm closing this now. If you continue to see the performance issue, please feel free to submit a new issue with detailed steps to reproduce.

@jerryyin jerryyin closed this as completed May 4, 2020