Multi-GPU training problem. #11
Comments
Hi @ginsongsong, Thank you for reporting this. Our understanding was that multi-GPU was working okay. That being said, we will try to reproduce this specific issue. How many GPUs did you test with? Best, Jeff |
Hi @parallelo, maybe MIOpen can't yet directly reduce the memory footprint the way cuDNN does. I use rocm-smi to set the GPU clock and GPU memory clock to the top level:
GPU DID Temp AvgPwr SCLK MCLK Fan Perf OverDrive ECC
But when training AlexNet on two MI25 GPUs, the clock level drops back to the base clock level:
GPU DID Temp AvgPwr SCLK MCLK Fan Perf OverDrive ECC
Hi @ginsongsong, Thanks for the extra details. Initially, let's try to focus on multi-GPU AlexNet, and then we can move from there. Can you please provide these further details?
Also, note that we'll have another ROCm point release coming pretty soon to test. Best, Jeff PS - I'll be out of town until Monday evening, but afterwards I'll be able to focus on this specific issue. |
Hi @parallelo, thanks for your kind reply. For kfd information: All of the ROCm libs were installed from the Debian packages. I saw a lot of hip-api error messages from the hipPointerGetAttributes function. For the full hipCaffe result log: MI25 lspci log: Thanks for your help.
Hi again @ginsongsong, I just tried ROCm 1.6.1 with the internal MIOpen repo built from source. Multi-GPU AlexNet and GoogleNet ran without error. An update to the public MIOpen repo is expected soon, and you'll need those changes. To build MIOpen from source, please follow these instructions:
Then, set this environment variable (as a temp workaround):
For AlexNet, try something like this:
For GoogleNet, try something like this:
Hopefully this will help. Either way, let us know how it goes, and we'll get it figured out. Best, Jeff |
Thanks @parallelo, the problem is solved. bvlc_Alexnet iteration=1000 bvlc_Googlenet iteration=1000 Thank you for your kind assistance.
FWIW I'm having this issue on a fresh install of Ubuntu 16.04.3 and the ROCm 1.6.3 stack, which was performed yesterday. Following the HipCaffe Quickstart, single-GPU training worked flawlessly with all the examples given. But running with the
Following @parallelo's advice, I hit an issue while building MIOpen, seems I'm missing
jamil@fridge:~/code/MIOpen/build (master%=) % make -j$(nproc) && make package -j$(nproc)
[ 6%] Built target addkernels
[ 8%] Linking CXX shared library ../lib/libMIOpen.so
ld: cannot find -lOpenSSL::Crypto
/opt/rocm/hcc-1.0/bin/hcc(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x1674f1a]
/opt/rocm/hcc-1.0/bin/hcc(_ZN4llvm3sys17RunSignalHandlersEv+0x3e)[0x1672fbe]
/opt/rocm/hcc-1.0/bin/hcc[0x167310c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fd93835b390]
[0x7fd93878ba10]
Stack dump:
0. Program arguments: /opt/rocm/hcc-1.0/bin/hcc -fPIC -O3 -DNDEBUG -shared -Wl,-soname,libMIOpen.so.1 -o ../lib/libMIOpen.so.1 CMakeFiles/MIOpen.dir/convolution.cpp.o CMakeFiles/MIOpen.dir/convolution_api.cpp.o CMakeFiles/MIOpen.dir/convolution_fft.cpp.o CMakeFiles/MIOpen.dir/errors.cpp.o CMakeFiles/MIOpen.dir/load_file.cpp.o CMakeFiles/MIOpen.dir/pooling_api.cpp.o CMakeFiles/MIOpen.dir/kernel_warnings.cpp.o CMakeFiles/MIOpen.dir/logger.cpp.o CMakeFiles/MIOpen.dir/lrn_api.cpp.o CMakeFiles/MIOpen.dir/activ_api.cpp.o CMakeFiles/MIOpen.dir/handle_api.cpp.o CMakeFiles/MIOpen.dir/softmax_api.cpp.o CMakeFiles/MIOpen.dir/batch_norm.cpp.o CMakeFiles/MIOpen.dir/batch_norm_api.cpp.o CMakeFiles/MIOpen.dir/tensor.cpp.o CMakeFiles/MIOpen.dir/tensor_api.cpp.o CMakeFiles/MIOpen.dir/tmp_dir.cpp.o CMakeFiles/MIOpen.dir/binary_cache.cpp.o CMakeFiles/MIOpen.dir/md5.cpp.o CMakeFiles/MIOpen.dir/activ.cpp.o CMakeFiles/MIOpen.dir/kernel_cache.cpp.o CMakeFiles/MIOpen.dir/lrn.cpp.o CMakeFiles/MIOpen.dir/mlo_dir_conv.cpp.o CMakeFiles/MIOpen.dir/ocl/activ_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/batchnormocl.cpp.o CMakeFiles/MIOpen.dir/ocl/convolutionocl.cpp.o CMakeFiles/MIOpen.dir/ocl/convolutionocl_fft.cpp.o CMakeFiles/MIOpen.dir/ocl/lrn_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/mloNeuron.cpp.o CMakeFiles/MIOpen.dir/ocl/mloNorm.cpp.o CMakeFiles/MIOpen.dir/ocl/mloPooling.cpp.o CMakeFiles/MIOpen.dir/ocl/pooling_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/tensorocl.cpp.o CMakeFiles/MIOpen.dir/ocl/softmaxocl.cpp.o CMakeFiles/MIOpen.dir/ocl/utilocl.cpp.o CMakeFiles/MIOpen.dir/ocl/gcn_asm_utils.cpp.o CMakeFiles/MIOpen.dir/pooling.cpp.o CMakeFiles/MIOpen.dir/__/db.cpp.o CMakeFiles/MIOpen.dir/__/kernel.cpp.o CMakeFiles/MIOpen.dir/gemm.cpp.o CMakeFiles/MIOpen.dir/gemm_geometry.cpp.o CMakeFiles/MIOpen.dir/hip/hiperrors.cpp.o CMakeFiles/MIOpen.dir/hip/handlehip.cpp.o CMakeFiles/MIOpen.dir/hipoc/hipoc_kernel.cpp.o CMakeFiles/MIOpen.dir/hipoc/hipoc_program.cpp.o -lstdc++ -amdgpu-target=gfx803 -amdgpu-target=gfx900 
-Wno-unused-command-line-argument /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hcc-1.0/lib/libhc_am.so /opt/rocm/miopengemm/lib/libmiopengemm.so -lOpenSSL::Crypto -lboost_filesystem -lboost_system -hc -L /opt/rocm/hcc-1.0/lib -Wl,-rpath /opt/rocm/hcc-1.0/lib -Wl,--whole-archive /opt/rocm/hcc-1.0/lib/libmcwamp.a -lunwind -Wl,--no-whole-archive -ldl -lm /opt/rocm/lib/libhsa-runtime64.so -lpthread /opt/rocm/opencl/lib/x86_64/libOpenCL.so -Wl,-rpath,/opt/rocm/hip/lib:/opt/rocm/hcc-1.0/lib:/opt/rocm/miopengemm/lib:/opt/rocm/lib:/opt/rocm/opencl/lib/x86_64:
Error running link command: Segmentation fault
src/CMakeFiles/MIOpen.dir/build.make:1295: recipe for target 'lib/libMIOpen.so.1' failed
make[2]: *** [lib/libMIOpen.so.1] Error 1
CMakeFiles/Makefile2:424: recipe for target 'src/CMakeFiles/MIOpen.dir/all' failed
make[1]: *** [src/CMakeFiles/MIOpen.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2
System details:
Figured I would post this here as another data point. I'm hoping these kinks get worked out soon... itching to use fp16 packed math for training my models! |
@jamilbk Can you try installing libssl-dev and check again? |
Yeah I had installed Perhaps I need to symlink the Apologies for my lacking Linux linker skills! |
Please paste the output of |
jamil@fridge:~/tmp/pmbw-0.6.2 % dpkg -l | grep libssl
ii libssl-dev:amd64 1.0.2g-1ubuntu4.8 amd64 Secure Sockets Layer toolkit - development files
ii libssl-doc 1.0.2g-1ubuntu4.8 all Secure Sockets Layer toolkit - development documentation
ii libssl1.0.0:amd64 1.0.2g-1ubuntu4.8 amd64 Secure Sockets Layer toolkit - shared libraries
ii libsslcommon2:amd64 0.16-9ubuntu2 amd64 enterprise messaging system - common SSL libraries
ii libsslcommon2-dev:amd64 0.16-9ubuntu2 amd64 enterprise messaging system - common SSL development files
I had installed
@pfultz2 do you have any insights? |
@jamilbk Did you try removing your build directory and starting again? |
No, this is an imported target in cmake that is defined by |
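The `-lOpenSSL::Crypto` token in the failing link command earlier in the thread is a CMake target name that reached the linker as if it were a plain library name. A minimal sketch for spotting that symptom in captured build output follows. The helper name `find_leaked_targets` is hypothetical and not part of MIOpen or CMake, and it only pattern-matches text; it does not explain why the target was left undefined.

```shell
#!/bin/sh
# Hypothetical helper: read build output on stdin and print every
# "-lNamespace::Target" token, one per line, without duplicates.
find_leaked_targets() {
    tr ' ' '\n' \
        | { grep -E '^-l[A-Za-z0-9_]+::[A-Za-z0-9_]+$' || true; } \
        | sort -u
}

# Example use against a verbose build of a CMake-generated Makefile:
#   make VERBOSE=1 2>&1 | find_leaked_targets
```

An empty result means no such token was found in the text that was piped in.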
@pfultz2 Apologies for the delay. I've removed the
jamil@fridge:~/dl/MIOpen (master=) % gst
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
jamil@fridge:~/dl/MIOpen (master=) % mkdir -p build && cd build
jamil@fridge:~/dl/MIOpen/build (master=) % CXX=`which hcc` cmake -DHIP_OC_COMPILER=/opt/rocm/bin/clang-ocl -DCMAKE_PREFIX_PATH="/opt/rocm/hcc;/opt/rocm/hip" -DOPENCL_INCLUDE_DIRS="/opt/rocm/opencl/include" ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is Clang 5.0.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rocm/bin/hcc
-- Check for working CXX compiler: /opt/rocm/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- hip compiler: /opt/rocm/bin/clang-ocl
-- HIP backend selected.
-- AMDGCN assembler: /opt/rocm/opencl/bin/x86_64/clang
-- Build with miopengemm
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libssl.so;/usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.0.2g")
-- Boost version: 1.58.0
-- Found the following Boost libraries:
-- filesystem
-- system
-- Clang tidy not found
-- Clang tidy checks: *,-cert-err60-cpp,-cert-msc30-c,-cert-msc50-cpp,-clang-analyzer-alpha.core.CastToStruct,-clang-analyzer-optin.performance.Padding,-clang-diagnostic-deprecated-declarations,-clang-diagnostic-extern-c-compat,-cppcoreguidelines-pro-bounds-array-to-pointer-decay,-cppcoreguidelines-pro-bounds-constant-array-index,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-cppcoreguidelines-pro-type-member-init,-cppcoreguidelines-pro-type-reinterpret-cast,-cppcoreguidelines-pro-type-union-access,-cppcoreguidelines-pro-type-vararg,-cppcoreguidelines-special-member-functions,-google-explicit-constructor,-google-readability-braces-around-statements,-google-readability-todo,-google-runtime-int,-google-runtime-references,-hicpp-explicit-conversions,-hicpp-special-member-functions,-hicpp-use-equals-default,-hicpp-use-override,-llvm-header-guard,-llvm-include-order,-misc-macro-parentheses,-misc-misplaced-const,-misc-misplaced-widening-cast,-modernize-loop-convert,-modernize-pass-by-value,-modernize-use-default-member-init,-modernize-use-emplace,-modernize-use-equals-default,-modernize-use-transparent-functors,-performance-unnecessary-value-param,-readability-braces-around-statements,-readability-else-after-return,-readability-implicit-bool-cast,-readability-misleading-indentation,-readability-named-parameter,-modernize-use-override,-readability-non-const-parameter
-- Could NOT find LATEX (missing: LATEX_COMPILER)
Latex builder not found. Latex builder is required only for building the PDF documentation for MIOpen and is not necessary for building the library, or any other components. To build PDF documentation run make in /home/jamil/dl/MIOpen/doc/pdf, once a latex builder is installed.
-- MIOpen_VERSION= 1.1.1
-- CMAKE_BUILD_TYPE= Release
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
WARNING: Target "MIOpenDriver" has EXCLUDE_FROM_ALL set and will not be built by default but an install rule has been provided for it. CMake does not define behavior for this case.
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:
OPENCL_INCLUDE_DIRS
-- Build files have been written to: /home/jamil/dl/MIOpen/build
jamil@fridge:~/dl/MIOpen/build (master%=) % make -j$(nproc) && make package -j$(nproc)
Scanning dependencies of target addkernels
[ 2%] Building CXX object addkernels/CMakeFiles/addkernels.dir/include_inliner.cpp.o
[ 4%] Building CXX object addkernels/CMakeFiles/addkernels.dir/addkernels.cpp.o
[ 6%] Linking CXX executable ../bin/addkernels
[ 6%] Built target addkernels
[ 8%] Inlining MIOpen kernels
Scanning dependencies of target MIOpen
[ 10%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution_fft.cpp.o
[ 12%] Building CXX object src/CMakeFiles/MIOpen.dir/errors.cpp.o
[ 14%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution_api.cpp.o
[ 18%] Building CXX object src/CMakeFiles/MIOpen.dir/load_file.cpp.o
[ 18%] Building CXX object src/CMakeFiles/MIOpen.dir/logger.cpp.o
[ 20%] Building CXX object src/CMakeFiles/MIOpen.dir/activ_api.cpp.o
[ 22%] Building CXX object src/CMakeFiles/MIOpen.dir/lrn_api.cpp.o
[ 26%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution.cpp.o
[ 26%] Building CXX object src/CMakeFiles/MIOpen.dir/pooling_api.cpp.o
[ 28%] Building CXX object src/CMakeFiles/MIOpen.dir/batch_norm.cpp.o
[ 30%] Building CXX object src/CMakeFiles/MIOpen.dir/softmax_api.cpp.o
[ 34%] Building CXX object src/CMakeFiles/MIOpen.dir/batch_norm_api.cpp.o
[ 34%] Building CXX object src/CMakeFiles/MIOpen.dir/handle_api.cpp.o
[ 38%] Building CXX object src/CMakeFiles/MIOpen.dir/kernel_warnings.cpp.o
[ 38%] Building CXX object src/CMakeFiles/MIOpen.dir/tmp_dir.cpp.o
[ 40%] Building CXX object src/CMakeFiles/MIOpen.dir/tensor.cpp.o
[ 42%] Building CXX object src/CMakeFiles/MIOpen.dir/mlo_dir_conv.cpp.o
[ 44%] Building CXX object src/CMakeFiles/MIOpen.dir/tensor_api.cpp.o
[ 46%] Building CXX object src/CMakeFiles/MIOpen.dir/binary_cache.cpp.o
[ 48%] Building CXX object src/CMakeFiles/MIOpen.dir/kernel_cache.cpp.o
[ 52%] Building CXX object src/CMakeFiles/MIOpen.dir/md5.cpp.o
[ 52%] Building CXX object src/CMakeFiles/MIOpen.dir/activ.cpp.o
[ 54%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/activ_ocl.cpp.o
[ 56%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/batchnormocl.cpp.o
[ 58%] Building CXX object src/CMakeFiles/MIOpen.dir/lrn.cpp.o
[ 60%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloNeuron.cpp.o
[ 62%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloNorm.cpp.o
[ 64%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/pooling_ocl.cpp.o
[ 66%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloPooling.cpp.o
[ 68%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/lrn_ocl.cpp.o
[ 70%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/convolutionocl.cpp.o
[ 72%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/convolutionocl_fft.cpp.o
In file included from /home/jamil/dl/MIOpen/src/md5.cpp:2:
In file included from /usr/include/openssl/md5.h:62:
/usr/include/openssl/e_os2.h:56:10: fatal error: 'openssl/opensslconf.h' file not found
#include <openssl/opensslconf.h>
^~~~~~~~~~~~~~~~~~~~~~~
[ 74%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/tensorocl.cpp.o
[ 76%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/softmaxocl.cpp.o
1 error generated.
src/CMakeFiles/MIOpen.dir/build.make:543: recipe for target 'src/CMakeFiles/MIOpen.dir/md5.cpp.o' failed
make[2]: *** [src/CMakeFiles/MIOpen.dir/md5.cpp.o] Error 1
make[2]: *** Deleting file 'src/CMakeFiles/MIOpen.dir/md5.cpp.o'
make[2]: *** Waiting for unfinished jobs....
@jamilbk Seems like you are installing
CXX=/opt/rocm/hcc/bin/hcc cmake -DMIOPEN_BACKEND=HIP -DCMAKE_PREFIX_PATH="/opt/rocm/hcc;/opt/rocm/hip" -DCMAKE_CXX_FLAGS="-isystem /usr/include/x86_64-linux-gnu/" ..
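The compile error earlier in the thread was a missing `openssl/opensslconf.h`, and the command above works around it by adding `/usr/include/x86_64-linux-gnu/` as a system include directory. Below is a small sketch for checking which include root actually holds that header before choosing the flag. `find_opensslconf_root` is a hypothetical helper, and the candidate directories in the usage comment are assumptions taken from the paths that appear in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the first include root (from the arguments,
# checked in order) that contains openssl/opensslconf.h.
# Returns non-zero if none of the roots contains the header.
find_opensslconf_root() {
    for root in "$@"; do
        if [ -f "$root/openssl/opensslconf.h" ]; then
            printf '%s\n' "$root"
            return 0
        fi
    done
    return 1
}

# Example use (directories are assumptions, adjust for your system):
#   if root=$(find_opensslconf_root /usr/include /usr/include/x86_64-linux-gnu); then
#       echo "opensslconf.h found under: $root"
#   else
#       echo "opensslconf.h not found in the listed directories"
#   fi
```

If the header turns up only under the architecture-specific directory, that matches the situation the `-isystem` flag above addresses.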
Thanks @dagamayank, that fixed the compile. Now I'm back to the original problem noted in this thread.
Issue summary
I successfully trained the bvlc-alexnet and bvlc-googlenet models on a single MI25 GPU.
When I changed the number of training GPUs from 1 to all, Caffe showed the messages below.
CPU memory:256GB swap:16GB
db:imagenet lmdb
batchsize:64
bvlc_alexnet:
I0719 10:51:50.941951 2540 solver.cpp:279] Solving AlexNet
I0719 10:51:50.941956 2540 solver.cpp:280] Learning Rate Policy: step
I0719 10:51:50.955250 2540 solver.cpp:337] Iteration 0, Testing net (#0)
I0719 10:54:02.507711 2540 solver.cpp:404] Test net output #0: accuracy = 0.00109375
I0719 10:54:02.508229 2540 solver.cpp:404] Test net output #1: loss = 6.91062 (* 1 = 6.91062 loss)
Memory access fault by GPU node-2 on address 0x422ea6b000. Reason: Page not present or supervisor privilege.
*** Aborted at 1500432842 (unix time) try "date -d @1500432842" if you are using GNU date ***
PC: @ 0x7f64489dc428 gsignal
*** SIGABRT (@0x9ec) received by PID 2540 (TID 0x7f642c526700) from PID 2540; stack trace: ***
@ 0x7f644ddd0390 (unknown)
@ 0x7f64489dc428 gsignal
@ 0x7f64489de02a abort
@ 0x7f644d9401c9 (unknown)
@ 0x7f644d9464e5 (unknown)
@ 0x7f644d91e9d7 (unknown)
@ 0x7f644ddc66ba start_thread
@ 0x7f6448aae3dd clone
@ 0x0 (unknown)
db:imagenet lmdb
batchsize:32
bvlc_googlenet:
I0719 00:12:28.380522 7405 solver.cpp:279] Solving GoogleNet
I0719 00:12:28.380544 7405 solver.cpp:280] Learning Rate Policy: step
Memory access fault by GPU node-2 on address 0x42309ba000. Reason: Page not present or supervisor privilege.
*** Aborted at 1500394348 (unix time) try "date -d @1500394348" if you are using GNU date ***
PC: @ 0x7f4078d7a428 gsignal
*** SIGABRT (@0x1CED) received by PID 7405 (TID 0x7f405c8c4700) from PID 7405; stack trace: ***
@ 0x7f407e16e390 (unknown)
@ 0x7f4078d7a428 gsignal
@ 0x7f4078d7c02a abort
@ 0x7f407dcde1c9 (unknown)
@ 0x7f407dce44e5 (unknown)
@ 0x7f407dcbc9d7 (unknown)
@ 0x7f407e1646ba start_thread
@ 0x7f4078e4c3dd clone
@ 0x0 (unknown)
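To compare runs, it can help to pull just the faulting node and address out of logs like the two above. A minimal sketch, assuming the fault line keeps exactly the wording shown in these logs; `extract_gpu_faults` is a hypothetical helper and `train.log` is a placeholder file name.

```shell
#!/bin/sh
# Hypothetical helper: read a training log on stdin and print
# "<node> <address>" for every GPU memory access fault line.
extract_gpu_faults() {
    sed -n 's/^Memory access fault by GPU node-\([0-9][0-9]*\) on address \(0x[0-9a-fA-F][0-9a-fA-F]*\)\..*$/\1 \2/p'
}

# Example use (train.log is a placeholder):
#   extract_gpu_faults < train.log
```

Lines that do not match the fault pattern, such as the solver and stack-trace lines, are ignored.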
Steps to reproduce
Using the latest ROCm from debian packages.
My caffe configuration:
USE_CUDNN := 0
USE_MIOPEN := 1
USE_LMDB := 1
BLAS := open
BLAS_INCLUDE := /opt/openBlas/include
BLAS_LIB := /opt/openBlas/lib
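A quick way to double-check that a build is configured for MIOpen rather than cuDNN is to read the values back from the config file. This is a sketch only: `config_value` is a hypothetical helper, and it assumes the settings live in Caffe's `Makefile.config` in the simple `KEY := value` form shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value assigned to KEY in a file that uses
# "KEY := value" lines. Commented-out lines are ignored, and if the key is
# assigned more than once the last assignment wins.
#   $1 = key name, $2 = path to the config file
config_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*:=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$2" \
        | tail -n 1
}

# Example use (the file name is an assumption):
#   [ "$(config_value USE_MIOPEN Makefile.config)" = "1" ] && echo "MIOpen enabled"
```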
Your system configuration
Operating system: Ubuntu 16.04.2 LTS with 4.9.0-kfd-compute-rocm-rel-1.6-77
Compiler: GCC v5.4.0, HCC clang 5.0
CUDA version (if applicable): not applicable
CUDNN version (if applicable): not applicable
BLAS: OpenBlas
Python or MATLAB version (for pycaffe and matcaffe respectively): not applicable