Multi-GPU training problem. #11

ginsongsong · 2017-07-20T01:57:14Z

Issue summary

I successfully trained the bvlc-alexnet and bvlc-googlenet models on a single MI25 GPU.
When I changed the number of training GPUs from 1 to all, Caffe showed the messages below.
CPU memory:256GB swap:16GB
db:imagenet lmdb
batchsize:64
bvlc_alexnet:

I0719 10:51:50.941951 2540 solver.cpp:279] Solving AlexNet
I0719 10:51:50.941956 2540 solver.cpp:280] Learning Rate Policy: step
I0719 10:51:50.955250 2540 solver.cpp:337] Iteration 0, Testing net (#0)
I0719 10:54:02.507711 2540 solver.cpp:404] Test net output #0: accuracy = 0.00109375
I0719 10:54:02.508229 2540 solver.cpp:404] Test net output #1: loss = 6.91062 (* 1 = 6.91062 loss)
Memory access fault by GPU node-2 on address 0x422ea6b000. Reason: Page not present or supervisor privilege.
*** Aborted at 1500432842 (unix time) try "date -d @1500432842" if you are using GNU date ***
PC: @ 0x7f64489dc428 gsignal
*** SIGABRT (@0x9ec) received by PID 2540 (TID 0x7f642c526700) from PID 2540; stack trace: ***
@ 0x7f644ddd0390 (unknown)
@ 0x7f64489dc428 gsignal
@ 0x7f64489de02a abort
@ 0x7f644d9401c9 (unknown)
@ 0x7f644d9464e5 (unknown)
@ 0x7f644d91e9d7 (unknown)
@ 0x7f644ddc66ba start_thread
@ 0x7f6448aae3dd clone
@ 0x0 (unknown)

db:imagenet lmdb
batchsize:32
bvlc_googlenet:

I0719 00:12:28.380522 7405 solver.cpp:279] Solving GoogleNet
I0719 00:12:28.380544 7405 solver.cpp:280] Learning Rate Policy: step
Memory access fault by GPU node-2 on address 0x42309ba000. Reason: Page not present or supervisor privilege.
*** Aborted at 1500394348 (unix time) try "date -d @1500394348" if you are using GNU date ***
PC: @ 0x7f4078d7a428 gsignal
*** SIGABRT (@0x1CED) received by PID 7405 (TID 0x7f405c8c4700) from PID 7405; stack trace: ***
@ 0x7f407e16e390 (unknown)
@ 0x7f4078d7a428 gsignal
@ 0x7f4078d7c02a abort
@ 0x7f407dcde1c9 (unknown)
@ 0x7f407dce44e5 (unknown)
@ 0x7f407dcbc9d7 (unknown)
@ 0x7f407e1646ba start_thread
@ 0x7f4078e4c3dd clone
@ 0x0 (unknown)

Steps to reproduce

Using the latest ROCm from debian packages.

My caffe configuration:

USE_CUDNN := 0
USE_MIOPEN := 1
USE_LMDB := 1
BLAS := open
BLAS_INCLUDE := /opt/openBlas/include
BLAS_LIB := /opt/openBlas/lib

Your system configuration

Operating system: Ubuntu 16.04.2 LTS with 4.9.0-kfd-compute-rocm-rel-1.6-77
Compiler: GCC v5.4.0, HCC clang 5.0
CUDA version (if applicable): not applicable
CUDNN version (if applicable): not applicable
BLAS: OpenBlas
Python or MATLAB version (for pycaffe and matcaffe respectively): not applicable

parallelo · 2017-07-20T19:11:46Z

Hi @ginsongsong,

Thank you for reporting this. Our understanding was that multi-GPU was working okay. That being said, we will try to reproduce this specific issue.

How many GPUs did you test with?

Best,

Jeff

ginsongsong · 2017-07-21T02:05:19Z

Hi @parallelo,
I used two MI25 GPUs to train the bvlc_caffe models,
and the cifar10_quick example does run on two MI25 GPUs.

Maybe MIOpen can't yet reduce the memory footprint the way cuDNN does.
My ResNet-50 model can use batchsize=6 on a P100-PCIE 16GB GPU, but on an MI25 16GB GPU I can't use any batch size in hipCaffe.
resnet.zip

I used rocm-smi to set the GPU clock and the GPU memory clock to the top level.
The following information was captured from rocm-smi during single-GPU AlexNet training.

GPU DID Temp AvgPwr SCLK MCLK Fan Perf OverDrive ECC
1 6860 69.0c 177.0W 1500Mhz 945Mhz 0.0% manual 0% N/A
2 6860 52.0c 68.0W 1500Mhz 945Mhz 0.0% manual 0% N/A

But during two-GPU AlexNet training on the MI25s, the clocks drop to the base clock level.

GPU DID Temp AvgPwr SCLK MCLK Fan Perf OverDrive ECC
1 6860 36.0c 66.0W 852Mhz 167Mhz 0.0% manual 0% N/A
2 6860 38.0c 68.0W 825Mhz 167Mhz 0.0% manual 0% N/A

parallelo · 2017-07-21T04:16:44Z

Hi @ginsongsong,

Thanks for the extra details. Initially, let's try to focus on multi-GPU AlexNet, and then we can move from there.

Can you please provide these further details?

Can you confirm that all of your components are from ROCm 1.6? (both kfd and user-level components)
Are you building any libs from source? (e.g. MIOpen, rocBLAS, etc)
Please try dropping the AlexNet batch size down, and see if this changes the situation.
Please set both of these environment variables, re-run, and report the full hipCaffe run log:
- export HIP_TRACE_API=1
- export HCC_SERIALIZE_KERNEL=1
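
One way to capture the requested log is to set both variables for a single command and tee its output to a file. This is a minimal sketch: `run_with_trace` is a hypothetical helper name, not part of hipCaffe or ROCm, and only the two variable names come from the request above.

```shell
#!/usr/bin/env bash
# run_with_trace is a hypothetical helper, not part of hipCaffe or ROCm.
# Usage: run_with_trace LOG_FILE COMMAND [ARGS...]
run_with_trace() {
    log_file="$1"
    shift
    # Set the two debug variables for this one command only, so the
    # calling shell's environment is left unchanged. stderr is merged
    # into stdout so the log contains the full run output.
    HIP_TRACE_API=1 HCC_SERIALIZE_KERNEL=1 "$@" 2>&1 | tee "$log_file"
}
```

For example, `run_with_trace alexnet_2gpu_trace.log ./build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu 0,1` (the log file name is arbitrary). Note that the pipeline's exit status is that of `tee`, not of the traced command.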

Also, note that we'll have another ROCm point release coming pretty soon to test.

Best,

Jeff

PS - I'll be out of town until Monday evening, but afterwards I'll be able to focus on this specific issue.

ginsongsong · 2017-07-24T02:04:29Z

Hi @parallelo ,

Thanks for your kind reply.

For kfd information :
root@AMD:/home/gin/hipCaffe# uname -a
Linux AMD 4.9.0-kfd-compute-rocm-rel-1.6-77 #1 SMP Wed Jun 28 07:30:27 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux

All of the ROCm libs were installed from the Debian packages.
cxlactivitylogger is already the newest version (5.1.6386).
hcc is already the newest version (1.0.17262).
miopen-hip is already the newest version (1.0.0).
miopengemm is already the newest version (1.0.1).
rocblas is already the newest version (0.5.2.0).
rocm is already the newest version (1.6.77).
rocm-libs is already the newest version (1.6.77).
rocm-opencl is already the newest version (1.2.0-1424893).
rocm-opencl-dev is already the newest version (1.2.0-1424893).
rocm-profiler is already the newest version (5.1.6386).
rocm-utils is already the newest version (1.0.0).

I saw a lot of HIP API error messages from the hipPointerGetAttributes function;
maybe the P2P path can't get the device pointer over PCIe?

For the full hipCaffe result log:
Single GPU:
http://122.147.187.124/resultMI25/Result-MI25x1_PCIe16_17_07_24_15_04_1000_ITER_BS64_ALEXNET.txt
Multi-GPU:
http://122.147.187.124/resultMI25/Result-MI25x2_PCIe16_17_07_24_15_11_1000_ITER_BS64_ALEXNET.txt

MI25 lspci log
lspci_MI25.txt

Thanks for your help.

parallelo · 2017-07-27T23:57:34Z

Hi again @ginsongsong,

I just tried ROCm 1.6.1 with the internal MIOpen repo built from source. Multi-GPU AlexNet and GoogleNet ran without error.

There's expected to be an update soon to the public MIOpen repo, and you'll need those changes.

To build MIOpen from source, please follow these instructions:

# Install rocm-cmake (needed by miopen)
cd ~
git clone https://github.com/RadeonOpenCompute/rocm-cmake.git
cd rocm-cmake
mkdir build && cd build && cmake .. && make -j$(nproc) && make -j$(nproc) package
sudo dpkg -i ./rocm-cmake*.deb

# Install MIOpen 
cd ~
git clone https://github.com/ROCmSoftwarePlatform/MIOpen.git
cd MIOpen
mkdir -p build && cd build && \
    CXX=$HCC_HOME/bin/hcc cmake -DHIP_OC_COMPILER=/opt/rocm/bin/clang-ocl -DCMAKE_PREFIX_PATH="$HCC_HOME;$HIP_PATH" -DOPENCL_INCLUDE_DIRS="$OPENCL_ROOT/include" ..
make -j$(nproc) && make package -j$(nproc)
sudo dpkg -i ./MIOpen*.deb

Then, set this environment variable (as a temp workaround):

export HCC_UNPINNED_COPY_MODE=2

For AlexNet, try something like this:

cd $CAFFE_ROOT

# Params to be set by the user
gpuids="0,1"
batchsize_per_gpu=128
iterations=500
model_path=./models/bvlc_alexnet

# Update the train_val prototxt's batch size
train_val_prototxt=${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
cp ${model_path}/train_val.prototxt ${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
sed -i "s|batch_size: 256|batch_size: ${batchsize_per_gpu}|g" ./${train_val_prototxt}

# Update the solver prototxt's max_iter and train_val prototxt path
solver_prototxt=${model_path}/solver_short.prototxt
cp ${model_path}/solver.prototxt ${solver_prototxt}
sed -i "s|max_iter: 10000000|max_iter: ${iterations}|g" ${solver_prototxt}
sed -i "s|${model_path}/train_val.prototxt|${train_val_prototxt}|g" ${solver_prototxt}

# Run on ImageNet data
ngpus=$(( 1 + $(grep -o "," <<< "$gpuids" | wc -l) ))
train_log=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}.log
train_log_sec=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}_sec.log
./build/tools/caffe train --solver=${solver_prototxt} --gpu ${gpuids} 2>&1 | tee ${train_log}
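
The GPU count used in the log file names is derived from the comma-separated ID list. As a hedged sketch, the same step can be written as a small function that is easy to check by itself; `count_gpus` is a hypothetical name and the empty-list handling is an added assumption.

```shell
#!/usr/bin/env bash
# count_gpus is a hypothetical helper: it prints how many IDs are in a
# comma-separated list such as "0,1".
count_gpus() {
    ids="$1"
    # Assumption: an empty list means zero GPUs.
    if [ -z "$ids" ]; then
        echo 0
        return 0
    fi
    # Keep only the commas, count them, and add one.
    commas=$(printf '%s' "$ids" | tr -cd ',' | wc -c)
    echo $((commas + 1))
}

gpuids="0,1"
ngpus=$(count_gpus "$gpuids")
echo "training on ${ngpus} GPU(s): ${gpuids}"
```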

For GoogleNet, try something like this:

cd $CAFFE_ROOT

# Params to be set by the user
gpuids="0,1"
batchsize_per_gpu=16
iterations=500
model_path=./models/bvlc_googlenet

# Update the train_val prototxt's batch size
train_val_prototxt=${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
cp ${model_path}/train_val.prototxt ${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
sed -i "s|batch_size: 32|batch_size: ${batchsize_per_gpu}|g" ./${train_val_prototxt}

# Update the solver prototxt's max_iter and train_val prototxt path
solver_prototxt=${model_path}/solver_short.prototxt
cp ${model_path}/solver.prototxt ${solver_prototxt}
sed -i "s|max_iter: 10000000|max_iter: ${iterations}|g" ${solver_prototxt}
sed -i "s|${model_path}/train_val.prototxt|${train_val_prototxt}|g" ${solver_prototxt}

# Run on ImageNet data
ngpus=$(( 1 + $(grep -o "," <<< "$gpuids" | wc -l) ))
train_log=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}.log
train_log_sec=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}_sec.log
./build/tools/caffe train --solver=${solver_prototxt} --gpu ${gpuids} 2>&1 | tee ${train_log}
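
Both scripts rely on sed finding an exact string such as `batch_size: 256` or `max_iter: 10000000`. sed exits successfully even when nothing matches, so a prototxt with different values would be copied unchanged without any warning. The sketch below adds a check for that case. `replace_or_fail` is a hypothetical helper, not part of Caffe, and it assumes the search and replacement strings contain no `|` or `&` characters.

```shell
#!/usr/bin/env bash
# replace_or_fail is a hypothetical helper, not part of Caffe.
# Usage: replace_or_fail SRC DST OLD NEW
# Writes a copy of SRC to DST with every OLD replaced by NEW, and fails
# if OLD does not occur in SRC at all.
replace_or_fail() {
    src="$1"
    dst="$2"
    old="$3"
    new="$4"
    # grep -F treats OLD as a fixed string; stop early if it is absent.
    if ! grep -qF -- "$old" "$src"; then
        echo "pattern '${old}' not found in ${src}" >&2
        return 1
    fi
    # '|' is the sed delimiter, as in the scripts above, so values that
    # contain '/' (file paths) need no escaping.
    sed "s|${old}|${new}|g" "$src" > "$dst"
}
```

With the variables from the scripts above, a call could look like `replace_or_fail "${model_path}/solver.prototxt" "${solver_prototxt}" "max_iter: 10000000" "max_iter: ${iterations}"`.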

Hopefully this will help. Either way, let us know how it goes, and we'll get it figured out.

Best,

Jeff

ginsongsong · 2017-07-31T03:42:14Z

Thanks @parallelo, the problem is solved.
I removed hipCaffe and cloned it again.
After following your steps to rebuild hipCaffe, multi-GPU AlexNet and GoogleNet work on my MI25s.
The attachments below are the results captured from hipCaffe.

bvlc_Alexnet iteration=1000
Result-MI25x2_PCIe16_17_07_31_10_50_10000ITER_BS64_ALEXNET.txt

bvlc_Googlenet iteration=1000
Result-MI25x2_PCIe16_17_07_31_11_13_10000ITER_BS32_GOOGLENET.txt

Thank you for your kind assistance.

jamilbk · 2017-09-26T14:15:37Z

FWIW I'm having this issue on a fresh install of Ubuntu 16.04.3 and the ROCm 1.6.3 stack which was performed yesterday. Following the HipCaffe Quickstart, single-GPU training worked flawlessly with all the examples given. But running with the --gpu "0,1" flag caused the same issue @ginsongsong had above. Running with any combination of gpus besides 0 caused the issue for me.

Following @parallelo's advice, I hit an issue while building MIOpen, seems I'm missing OpenSSL::Crypto. Still trying to figure out which package provides that on Ubuntu 16.04. Here's the crash log:

jamil@fridge:~/code/MIOpen/build (master%=) % make -j$(nproc) && make package -j$(nproc)
[  6%] Built target addkernels
[  8%] Linking CXX shared library ../lib/libMIOpen.so
ld: cannot find -lOpenSSL::Crypto
/opt/rocm/hcc-1.0/bin/hcc(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x1674f1a]
/opt/rocm/hcc-1.0/bin/hcc(_ZN4llvm3sys17RunSignalHandlersEv+0x3e)[0x1672fbe]
/opt/rocm/hcc-1.0/bin/hcc[0x167310c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fd93835b390]
[0x7fd93878ba10]
Stack dump:
0.	Program arguments: /opt/rocm/hcc-1.0/bin/hcc -fPIC -O3 -DNDEBUG -shared -Wl,-soname,libMIOpen.so.1 -o ../lib/libMIOpen.so.1 CMakeFiles/MIOpen.dir/convolution.cpp.o CMakeFiles/MIOpen.dir/convolution_api.cpp.o CMakeFiles/MIOpen.dir/convolution_fft.cpp.o CMakeFiles/MIOpen.dir/errors.cpp.o CMakeFiles/MIOpen.dir/load_file.cpp.o CMakeFiles/MIOpen.dir/pooling_api.cpp.o CMakeFiles/MIOpen.dir/kernel_warnings.cpp.o CMakeFiles/MIOpen.dir/logger.cpp.o CMakeFiles/MIOpen.dir/lrn_api.cpp.o CMakeFiles/MIOpen.dir/activ_api.cpp.o CMakeFiles/MIOpen.dir/handle_api.cpp.o CMakeFiles/MIOpen.dir/softmax_api.cpp.o CMakeFiles/MIOpen.dir/batch_norm.cpp.o CMakeFiles/MIOpen.dir/batch_norm_api.cpp.o CMakeFiles/MIOpen.dir/tensor.cpp.o CMakeFiles/MIOpen.dir/tensor_api.cpp.o CMakeFiles/MIOpen.dir/tmp_dir.cpp.o CMakeFiles/MIOpen.dir/binary_cache.cpp.o CMakeFiles/MIOpen.dir/md5.cpp.o CMakeFiles/MIOpen.dir/activ.cpp.o CMakeFiles/MIOpen.dir/kernel_cache.cpp.o CMakeFiles/MIOpen.dir/lrn.cpp.o CMakeFiles/MIOpen.dir/mlo_dir_conv.cpp.o CMakeFiles/MIOpen.dir/ocl/activ_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/batchnormocl.cpp.o CMakeFiles/MIOpen.dir/ocl/convolutionocl.cpp.o CMakeFiles/MIOpen.dir/ocl/convolutionocl_fft.cpp.o CMakeFiles/MIOpen.dir/ocl/lrn_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/mloNeuron.cpp.o CMakeFiles/MIOpen.dir/ocl/mloNorm.cpp.o CMakeFiles/MIOpen.dir/ocl/mloPooling.cpp.o CMakeFiles/MIOpen.dir/ocl/pooling_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/tensorocl.cpp.o CMakeFiles/MIOpen.dir/ocl/softmaxocl.cpp.o CMakeFiles/MIOpen.dir/ocl/utilocl.cpp.o CMakeFiles/MIOpen.dir/ocl/gcn_asm_utils.cpp.o CMakeFiles/MIOpen.dir/pooling.cpp.o CMakeFiles/MIOpen.dir/__/db.cpp.o CMakeFiles/MIOpen.dir/__/kernel.cpp.o CMakeFiles/MIOpen.dir/gemm.cpp.o CMakeFiles/MIOpen.dir/gemm_geometry.cpp.o CMakeFiles/MIOpen.dir/hip/hiperrors.cpp.o CMakeFiles/MIOpen.dir/hip/handlehip.cpp.o CMakeFiles/MIOpen.dir/hipoc/hipoc_kernel.cpp.o CMakeFiles/MIOpen.dir/hipoc/hipoc_program.cpp.o -lstdc++ -amdgpu-target=gfx803 -amdgpu-target=gfx900 
-Wno-unused-command-line-argument /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hcc-1.0/lib/libhc_am.so /opt/rocm/miopengemm/lib/libmiopengemm.so -lOpenSSL::Crypto -lboost_filesystem -lboost_system -hc -L /opt/rocm/hcc-1.0/lib -Wl,-rpath /opt/rocm/hcc-1.0/lib -Wl,--whole-archive /opt/rocm/hcc-1.0/lib/libmcwamp.a -lunwind -Wl,--no-whole-archive -ldl -lm /opt/rocm/lib/libhsa-runtime64.so -lpthread /opt/rocm/opencl/lib/x86_64/libOpenCL.so -Wl,-rpath,/opt/rocm/hip/lib:/opt/rocm/hcc-1.0/lib:/opt/rocm/miopengemm/lib:/opt/rocm/lib:/opt/rocm/opencl/lib/x86_64: 
Error running link command: Segmentation fault
src/CMakeFiles/MIOpen.dir/build.make:1295: recipe for target 'lib/libMIOpen.so.1' failed
make[2]: *** [lib/libMIOpen.so.1] Error 1
CMakeFiles/Makefile2:424: recipe for target 'src/CMakeFiles/MIOpen.dir/all' failed
make[1]: *** [src/CMakeFiles/MIOpen.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

System details:

Threadripper 1950x
Gigabyte Gaming 7 x399
4x Vega FE

Figured I would post this here as another data point. I'm hoping these kinks get worked out soon... itching to use fp16 packed math for training my models!

dagamayank · 2017-09-26T14:42:14Z

I hit an issue while building MIOpen, seems I'm missing OpenSSL::Crypto. Still trying to figure out which package provides that on Ubuntu 16.04.

@jamilbk Can you try installing libssl-dev and check again? sudo apt-get install libssl-dev.

jamilbk · 2017-09-26T19:25:44Z

Yeah I had installed libssl-dev because of a previous OpenSSL error. Then I ran into this issue which I fixed by symlinking opensslconf.h into /usr/lib/openssl and then I hit this error.

Perhaps I need to symlink the OpenSSL::Crypto library file as well? This all seems to be caused by libssl-dev being installed into the 64-bit specific dirs and MIOpen not looking there?

Apologies for my lacking Linux linker skills!

dagamayank · 2017-09-26T19:39:20Z

Please paste the output of dpkg -l | grep libssl

jamilbk · 2017-09-26T19:54:14Z

jamil@fridge:~/tmp/pmbw-0.6.2 % dpkg -l | grep libssl
ii  libssl-dev:amd64                                   1.0.2g-1ubuntu4.8                                amd64        Secure Sockets Layer toolkit - development files
ii  libssl-doc                                         1.0.2g-1ubuntu4.8                                all          Secure Sockets Layer toolkit - development documentation
ii  libssl1.0.0:amd64                                  1.0.2g-1ubuntu4.8                                amd64        Secure Sockets Layer toolkit - shared libraries
ii  libsslcommon2:amd64                                0.16-9ubuntu2                                    amd64        enterprise messaging system - common SSL libraries
ii  libsslcommon2-dev:amd64                            0.16-9ubuntu2                                    amd64        enterprise messaging system - common SSL development files

I had installed libsslcommon2 on the off chance it was related somehow.

dagamayank · 2017-09-27T16:32:56Z

@pfultz2 do you have any insights?

pfultz2 · 2017-09-29T23:43:12Z

@jamilbk Did you try removing your build directory and starting again?

pfultz2 · 2017-09-29T23:48:02Z

Perhaps I need to symlink the OpenSSL::Crypto library file as well?

No, this is an imported target in cmake that is defined by find_package(OpenSSL). The fact that it shows up as -lOpenSSL::Crypto means it was not defined because it was not found the first time running cmake. Clearing out the build directory, and re-running cmake should fix it.
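
A minimal sketch of clearing the build directory so that cmake runs its package detection from scratch instead of reusing the cached result in CMakeCache.txt. `reset_build_dir` is a hypothetical helper name, and the guard against empty or root paths is an added safety assumption.

```shell
#!/usr/bin/env bash
# reset_build_dir is a hypothetical helper name.
# It removes a CMake build directory, including its CMakeCache.txt,
# and recreates it empty.
reset_build_dir() {
    build_dir="$1"
    # Refuse empty or root paths so a typo cannot remove the wrong tree.
    if [ -z "$build_dir" ] || [ "$build_dir" = "/" ]; then
        echo "refusing to remove '${build_dir}'" >&2
        return 1
    fi
    rm -rf -- "$build_dir"
    mkdir -p -- "$build_dir"
}
```

After something like `reset_build_dir ~/MIOpen/build && cd ~/MIOpen/build`, re-run the cmake command from the build instructions earlier in the thread.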

jamilbk · 2017-10-01T04:24:01Z

@pfultz2 Apologies for the delay. I've removed the build dir and retried parallelo's commands for building MIOpen with the same result -- 'openssl/opensslconf.h' file not found -- though it exists at /usr/include/x86_64-linux-gnu/openssl/opensslconf.h. I have rocm, libssl-dev, and all the packages installed listed on the hipCaffe quickstart. Here's the full log:

jamil@fridge:~/dl/MIOpen (master=) % gst
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
jamil@fridge:~/dl/MIOpen (master=) % mkdir -p build && cd build
jamil@fridge:~/dl/MIOpen/build (master=) % CXX=`which hcc` cmake -DHIP_OC_COMPILER=/opt/rocm/bin/clang-ocl -DCMAKE_PREFIX_PATH="/opt/rocm/hcc;/opt/rocm/hip" -DOPENCL_INCLUDE_DIRS="/opt/rocm/opencl/include" ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is Clang 5.0.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rocm/bin/hcc
-- Check for working CXX compiler: /opt/rocm/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- hip compiler: /opt/rocm/bin/clang-ocl
-- HIP backend selected.
-- AMDGCN assembler: /opt/rocm/opencl/bin/x86_64/clang
-- Build with miopengemm
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libssl.so;/usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.0.2g")
-- Boost version: 1.58.0
-- Found the following Boost libraries:
--   filesystem
--   system
-- Clang tidy not found
-- Clang tidy checks: *,-cert-err60-cpp,-cert-msc30-c,-cert-msc50-cpp,-clang-analyzer-alpha.core.CastToStruct,-clang-analyzer-optin.performance.Padding,-clang-diagnostic-deprecated-declarations,-clang-diagnostic-extern-c-compat,-cppcoreguidelines-pro-bounds-array-to-pointer-decay,-cppcoreguidelines-pro-bounds-constant-array-index,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-cppcoreguidelines-pro-type-member-init,-cppcoreguidelines-pro-type-reinterpret-cast,-cppcoreguidelines-pro-type-union-access,-cppcoreguidelines-pro-type-vararg,-cppcoreguidelines-special-member-functions,-google-explicit-constructor,-google-readability-braces-around-statements,-google-readability-todo,-google-runtime-int,-google-runtime-references,-hicpp-explicit-conversions,-hicpp-special-member-functions,-hicpp-use-equals-default,-hicpp-use-override,-llvm-header-guard,-llvm-include-order,-misc-macro-parentheses,-misc-misplaced-const,-misc-misplaced-widening-cast,-modernize-loop-convert,-modernize-pass-by-value,-modernize-use-default-member-init,-modernize-use-emplace,-modernize-use-equals-default,-modernize-use-transparent-functors,-performance-unnecessary-value-param,-readability-braces-around-statements,-readability-else-after-return,-readability-implicit-bool-cast,-readability-misleading-indentation,-readability-named-parameter,-modernize-use-override,-readability-non-const-parameter
-- Could NOT find LATEX (missing:  LATEX_COMPILER)
Latex builder not found. Latex builder is required only for building the PDF documentation for MIOpen and is not necessary for building the library, or any other components. To build PDF documentation run make in /home/jamil/dl/MIOpen/doc/pdf, once a latex builder is installed.
-- MIOpen_VERSION= 1.1.1
-- CMAKE_BUILD_TYPE= Release
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
WARNING: Target "MIOpenDriver" has EXCLUDE_FROM_ALL set and will not be built by default but an install rule has been provided for it.  CMake does not define behavior for this case.
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    OPENCL_INCLUDE_DIRS


-- Build files have been written to: /home/jamil/dl/MIOpen/build
jamil@fridge:~/dl/MIOpen/build (master%=) % make -j$(nproc) && make package -j$(nproc)
Scanning dependencies of target addkernels
[  2%] Building CXX object addkernels/CMakeFiles/addkernels.dir/include_inliner.cpp.o
[  4%] Building CXX object addkernels/CMakeFiles/addkernels.dir/addkernels.cpp.o
[  6%] Linking CXX executable ../bin/addkernels
[  6%] Built target addkernels
[  8%] Inlining MIOpen kernels
Scanning dependencies of target MIOpen
[ 10%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution_fft.cpp.o
[ 12%] Building CXX object src/CMakeFiles/MIOpen.dir/errors.cpp.o
[ 14%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution_api.cpp.o
[ 18%] Building CXX object src/CMakeFiles/MIOpen.dir/load_file.cpp.o
[ 18%] Building CXX object src/CMakeFiles/MIOpen.dir/logger.cpp.o
[ 20%] Building CXX object src/CMakeFiles/MIOpen.dir/activ_api.cpp.o
[ 22%] Building CXX object src/CMakeFiles/MIOpen.dir/lrn_api.cpp.o
[ 26%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution.cpp.o
[ 26%] Building CXX object src/CMakeFiles/MIOpen.dir/pooling_api.cpp.o
[ 28%] Building CXX object src/CMakeFiles/MIOpen.dir/batch_norm.cpp.o
[ 30%] Building CXX object src/CMakeFiles/MIOpen.dir/softmax_api.cpp.o
[ 34%] Building CXX object src/CMakeFiles/MIOpen.dir/batch_norm_api.cpp.o
[ 34%] Building CXX object src/CMakeFiles/MIOpen.dir/handle_api.cpp.o
[ 38%] Building CXX object src/CMakeFiles/MIOpen.dir/kernel_warnings.cpp.o
[ 38%] Building CXX object src/CMakeFiles/MIOpen.dir/tmp_dir.cpp.o
[ 40%] Building CXX object src/CMakeFiles/MIOpen.dir/tensor.cpp.o
[ 42%] Building CXX object src/CMakeFiles/MIOpen.dir/mlo_dir_conv.cpp.o
[ 44%] Building CXX object src/CMakeFiles/MIOpen.dir/tensor_api.cpp.o
[ 46%] Building CXX object src/CMakeFiles/MIOpen.dir/binary_cache.cpp.o
[ 48%] Building CXX object src/CMakeFiles/MIOpen.dir/kernel_cache.cpp.o
[ 52%] Building CXX object src/CMakeFiles/MIOpen.dir/md5.cpp.o
[ 52%] Building CXX object src/CMakeFiles/MIOpen.dir/activ.cpp.o
[ 54%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/activ_ocl.cpp.o
[ 56%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/batchnormocl.cpp.o
[ 58%] Building CXX object src/CMakeFiles/MIOpen.dir/lrn.cpp.o
[ 60%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloNeuron.cpp.o
[ 62%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloNorm.cpp.o
[ 64%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/pooling_ocl.cpp.o
[ 66%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloPooling.cpp.o
[ 68%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/lrn_ocl.cpp.o
[ 70%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/convolutionocl.cpp.o
[ 72%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/convolutionocl_fft.cpp.o
In file included from /home/jamil/dl/MIOpen/src/md5.cpp:2:
In file included from /usr/include/openssl/md5.h:62:
/usr/include/openssl/e_os2.h:56:10: fatal error: 'openssl/opensslconf.h' file not found
#include <openssl/opensslconf.h>
         ^~~~~~~~~~~~~~~~~~~~~~~
[ 74%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/tensorocl.cpp.o
[ 76%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/softmaxocl.cpp.o
1 error generated.
src/CMakeFiles/MIOpen.dir/build.make:543: recipe for target 'src/CMakeFiles/MIOpen.dir/md5.cpp.o' failed
make[2]: *** [src/CMakeFiles/MIOpen.dir/md5.cpp.o] Error 1
make[2]: *** Deleting file 'src/CMakeFiles/MIOpen.dir/md5.cpp.o'
make[2]: *** Waiting for unfinished jobs....

dagamayank · 2017-10-01T16:43:18Z

@jamilbk It seems you installed libssl using apt-get. Please follow the instructions in the README on how to fix the above error. Review this particular section:
"An example cmake step can be:

OpenSSL installed using apt-get on Ubuntu v16? Yes.

CXX=/opt/rocm/hcc/bin/hcc cmake -DMIOPEN_BACKEND=HIP -DCMAKE_PREFIX_PATH="/opt/rocm/hcc;/opt/rocm/hip" -DCMAKE_CXX_FLAGS="-isystem /usr/include/x86_64-linux-gnu/" ..
"
You need to add -DCMAKE_CXX_FLAGS="-isystem /usr/include/x86_64-linux-gnu/" to your cmake step.
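
Before changing the cmake step, it can help to confirm where opensslconf.h actually lives. This is a rough sketch under stated assumptions: `find_opensslconf_dir` is a hypothetical helper, and the two candidate directories are simply the default include path and the multiarch path mentioned in this thread.

```shell
#!/usr/bin/env bash
# find_opensslconf_dir is a hypothetical helper name.
# Given candidate include roots, it prints the first one that contains
# openssl/opensslconf.h and returns non-zero if none does.
find_opensslconf_dir() {
    for dir in "$@"; do
        if [ -f "${dir}/openssl/opensslconf.h" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate roots: the default include path, then the multiarch path.
inc_dir=$(find_opensslconf_dir /usr/include /usr/include/x86_64-linux-gnu) || inc_dir=""
case "$inc_dir" in
    "")           echo "opensslconf.h not found in the candidate directories" >&2 ;;
    /usr/include) echo "header is on the default include path; no extra flag needed" ;;
    *)            echo "add to cmake: -DCMAKE_CXX_FLAGS=\"-isystem ${inc_dir}/\"" ;;
esac
```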

jamilbk · 2017-10-03T04:56:31Z

Thanks @dagamayank, that fixed the compile. Now I'm back to the original problem noted in this thread ("Page not present or supervisor privilege"), but I suspect it's because I need to set the batch size as @parallelo pointed out and download some ImageNet data to test with. I'll keep you updated if that doesn't fix the issue.

ginsongsong closed this as completed Jul 31, 2017

jamilbk mentioned this issue Oct 24, 2017

Memory error on Vega 10 plaidml/plaidml#34

Closed
