
unix-cpu MKL/MKL-DNN Test Time #18244

Closed
szha opened this issue May 5, 2020 · 14 comments

Comments

@szha
Member

szha commented May 5, 2020

Description

In #18146 we introduced parallel testing in CI with the goal of reducing test time. During that effort, however, we noticed that the MKL and MKLDNN builds run tests slower than builds without MKL or MKLDNN. This issue summarizes the setup and the current time differences.

Setup

The results in this issue come from this CI run; the time differences are similar in master branch validation.

To show the results, we compare the following test nodes.

Python 3: CPU

Build

build_ubuntu_cpu_openblas() {
    set -ex
    cd /work/build
    CXXFLAGS="-Wno-error=strict-overflow" CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_TVM_OP=ON \
        -DUSE_CPP_PACKAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_DIST_KVSTORE=ON \
        -DBUILD_CYTHON_MODULES=ON \
        -G Ninja /work/mxnet
    ninja
}

Python 3: MKL-CPU

Build

build_ubuntu_cpu_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}

Python 3: MKLDNN-CPU

Build

build_ubuntu_cpu_mkldnn() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_CPP_PACKAGE=ON \
        -G Ninja /work/mxnet
    ninja
}

Python 3: MKLDNN-MKL-CPU

Build

build_ubuntu_cpu_mkldnn_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}

Tests

Each test node runs one of the following two test functions.

python3_ut

unittest_ubuntu_python3_cpu() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
}

python_ut_mkldnn

unittest_ubuntu_python3_cpu_mkldnn() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
}

Test steps

To show fine-grained timing results, I break each test node down into the following steps:

  • docker launch: the time it takes to prepare and launch the Docker container
  • parallel unittest: pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
  • serial unittest: pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
  • quantization test: pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
  • mkl test: pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl

Results

The unit for all of the following results is seconds. The Python 3: CPU results are the baseline.

| Test | docker launch | parallel unittest | serial unittest | quantization | mkl test |
| --- | --- | --- | --- | --- | --- |
| Python 3: CPU | 892 | 981 | 951 | 39 | N/A |
| Python 3: MKL-CPU | 874 | 4996 (+409%) | 4380 (+361%) | 67 | N/A |
| Python 3: MKLDNN-CPU | 892 | 5096 (+419%) | 3314 (+248%) | N/A | 1080 |
| Python 3: MKLDNN-MKL-CPU | 901 | 3899 (+297%) | 3507 (+269%) | N/A | 1210 |
@szha
Member Author

szha commented May 5, 2020

cc @TaoLv @PatricZhao

@szha szha mentioned this issue May 5, 2020
@pengzhao-intel
Contributor

Thanks for raising this concern. We will take a look soon.

@TaoLv
Member

TaoLv commented May 6, 2020

I notice MXNET_MKLDNN_DEBUG is already disabled. I suspect some of the overhead comes from primitive creation: the backend is designed for real-world usage, where primitives are created and cached in the first iteration and then reused in subsequent ones. In unit tests, however, most primitives are created and used only once (sketched below).
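A minimal sketch of that caching effect (a hypothetical timing harness; it assumes an MKL-DNN build of MXNet is importable). The first call with a given shape pays the primitive-creation cost; repeated calls with identical shapes hit the cache:

python3 - <<'EOF'
import time
import mxnet as mx

# Fixed shapes: only the first Convolution call should create the
# MKL-DNN primitive; later identical calls reuse the cached one.
x = mx.nd.random.uniform(shape=(32, 3, 224, 224))
w = mx.nd.random.uniform(shape=(64, 3, 3, 3))
for i in range(3):
    t0 = time.time()
    y = mx.nd.Convolution(data=x, weight=w, no_bias=True,
                          kernel=(3, 3), num_filter=64)
    y.wait_to_read()  # block until the async engine finishes
    print("call %d: %.4fs" % (i, time.time() - t0))
EOF

In unit tests the shapes change from test to test, so almost every call behaves like a first iteration.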

@szha Do you have a detailed report where we can find the test time for each test or test file?

@szha
Member Author

szha commented May 6, 2020

@TaoLv for the CI run I linked, you can find the log in which the 50 most time-consuming tests are listed for each run.

@szha
Member Author

szha commented May 16, 2020

Any update on this?

szha added a commit that referenced this issue May 17, 2020
* run operator tests with naive engine

* fix take tests

* update skip mark

* fix cuda error reset

* adjust tests

* disable parallel testing and naive engine for mkl/mkldnn #18244
@szha
Member Author

szha commented May 17, 2020

For now I have disabled parallel testing and the naive-engine change for MKL/MKLDNN due to the excessive test time these builds take. For testing, revert this commit: 9b7b7e2

@bgawrych
Contributor

@szha I've done some testing in Docker for the parallel tests and have an overview. For testing I used about 35 long-running tests.
pytest -n 4 spawns many more threads for the MKL/MKLDNN build, and CPU usage stays at 100% for the entire run (24 cores used out of 24), so it's probably a thread-starvation problem.
Threads spawned:
MKL/MKLDNN build: about 250 most of the time, with a peak of 265 shortly before the end
CPU build: about 140 (peak 150)

I tested the MKLDNN build with OMP_NUM_THREADS=n/4 (n = number of cores)
and it was significantly faster (about 1.7x); a sketch of that setting follows below.

I'm going to do some testing with MKL and OMP flags to find the best configuration, and also figure out the 'serial' case.
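A sketch of the workaround (assumed values; n is read from nproc):

# Cap OpenMP threads per pytest worker so that 4 workers together use
# roughly the core count instead of oversubscribing it 4x.
export OMP_NUM_THREADS=$(($(nproc) / 4))
pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest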

cc: @TaoLv

@bgawrych
Contributor

A second problem is that on the CI Docker image, MXNet built with CMake is probably not using Intel OpenMP but GCC's OpenMP instead.

The strange thing is that the ldd command shows different linking for the two build systems (a runtime check is sketched below):

  • for make: ldd lib/libmxnet.so | grep omp shows:
    libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
  • for cmake: ldd build/libmxnet.so | grep omp shows:
    libomp.so => /work/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
    (removing 3rdparty/openmp didn't help)
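To double-check at runtime rather than link time (a sketch; the grep pattern simply matches any of libgomp/libomp/libiomp5 in the process's memory map):

# ldd shows link-time resolution; /proc/<pid>/maps shows what the live
# process actually loaded after importing mxnet.
python3 -c "import mxnet, os; os.system('grep omp /proc/%d/maps' % os.getpid())"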

• MXNet built with the make build system is about 8x faster:
  o test_operator.py::test_lstm_bidirectional – cmake build execution time: 305 sec
  o test_operator.py::test_lstm_bidirectional – make build execution time: 34.05 sec

• I've tried lots of build combinations with different environment variables
(e.g. setting LD_LIBRARY_PATH, etc.);
nothing has helped so far, so I'm still trying and digging more

@leezu
Contributor

leezu commented May 28, 2020

The cmake build uses MKL by default if it is available; you can check the cmake configuration output. To force MKL, set cmake -DUSE_BLAS=MKL. Reference: https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/cmake/ChooseBlas.cmake#L21-L32

If MKL BLAS is used, the bundled OpenMP will not be built:

https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/CMakeLists.txt#L402-L427

There may be a bug in these two references, or your build environment may not be set up correctly.
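A quick way to confirm which BLAS a configured build actually selected (a sketch; cache entry names can vary between MXNet versions):

# Inspect the CMake cache of an existing build directory for BLAS choices.
grep -i blas /work/build/CMakeCache.txt
# Then reconfigure with MKL forced explicitly, as suggested above.
cmake -DUSE_BLAS=MKL -GNinja /work/mxnet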

@bgawrych
Contributor

I discovered that the slowdown is caused by the ENABLE_TESTCOVERAGE=ON flag, not by OpenMP.
I think the overhead comes from the injected instrumentation code locking shared resources, because
OMP_NUM_THREADS=1 improves the speed of the tests.

With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 305s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 479.93s

With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:

  • test_operator.py::test_lstm_bidirectional – test time: 236.45s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 336.66s

Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 40.93s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 58.14s

Summary:

  • setting OMP_NUM_THREADS=1 helps
  • the ENABLE_TESTCOVERAGE flag is the root cause of the slowdown
  • setting OMP_NUM_THREADS=n/4 for parallel tests helps (as does the first bullet); a control build without the flag is sketched below
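ENABLE_TESTCOVERAGE typically enables gcc-style coverage instrumentation (along the lines of -fprofile-arcs -ftest-coverage), whose per-branch counter updates can contend across threads; that would also explain why OMP_NUM_THREADS=1 narrows the gap. A control build without the flag, mirroring the MKLDNN build function above (a sketch, with the flag set trimmed for brevity):

# Same as build_ubuntu_cpu_mkldnn, but with coverage instrumentation off.
cd /work/build
CC=gcc-7 CXX=g++-7 cmake \
    -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
    -DENABLE_TESTCOVERAGE=OFF \
    -DUSE_MKL_IF_AVAILABLE=OFF \
    -DUSE_MKLDNN=ON \
    -DUSE_CUDA=OFF \
    -G Ninja /work/mxnet
ninja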

CC: @TaoLv , @leezu , @szha

@leezu
Contributor

leezu commented May 29, 2020

Thank you for looking into this! Do you have any insights into why this only affects MKL builds?

@bgawrych
Contributor

bgawrych commented Jun 1, 2020

@leezu

Thank you for looking into this! Do you have any insights into why this only affects MKL builds?

This affects the CPU (OpenBLAS) build as well.
CPU build measurements:

With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 105.87s and 112.04s

With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:

  • test_operator.py::test_lstm_bidirectional – test time: 89.02s and 91.87s

Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 86.25s and 86.75s

Without ENABLE_TESTCOVERAGE & OMP_NUM_THREADS=12:

  • test_operator.py::test_lstm_bidirectional – test time: 64.68s and 65.76s

The last case shows that the OpenBLAS build has some problems utilizing all available threads.
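One knob worth separating out here (a sketch; OPENBLAS_NUM_THREADS is OpenBLAS's own thread-pool setting, independent of OpenMP):

# Pin both thread pools explicitly before running the slow test.
export OMP_NUM_THREADS=12          # OpenMP regions inside MXNet operators
export OPENBLAS_NUM_THREADS=12     # OpenBLAS's internal GEMM thread pool
pytest --verbose tests/python/unittest/test_operator.py::test_lstm_bidirectional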

AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this issue Jul 6, 2020
* run operator tests with naive engine

* fix take tests

* update skip mark

* fix cuda error reset

* adjust tests

* disable parallel testing and naive engine for mkl/mkldnn apache#18244
@bgawrych
Contributor

@szha Can we close this issue, since the root cause was found?

@szha
Member Author

szha commented Jul 27, 2020

@bgawrych yes, thank you for tracking this down.

@szha szha closed this as completed Jul 27, 2020