
unix-cpu MKL/MKL-DNN Test Time #18244

Closed
szha opened this issue May 5, 2020 · 14 comments

Comments

@szha
Member

szha commented May 5, 2020

Description

In #18146 we introduced parallel testing in CI with the goal of reducing test time. During that effort, however, we noticed that the MKL and MKLDNN builds run tests slower than builds without MKL or MKLDNN. This issue summarizes the setup and the current time differences.

Setup

The results in this issue come from this CI run; the time differences are similar in master branch validation.

To show the results, we compare the following test nodes.

Python 3: CPU

Build

build_ubuntu_cpu_openblas() {
    set -ex
    cd /work/build
    CXXFLAGS="-Wno-error=strict-overflow" CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_TVM_OP=ON \
        -DUSE_CPP_PACKAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_DIST_KVSTORE=ON \
        -DBUILD_CYTHON_MODULES=ON \
        -G Ninja /work/mxnet
    ninja
}

Python 3: MKL-CPU

Build

build_ubuntu_cpu_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}

Python 3: MKLDNN-CPU

Build

build_ubuntu_cpu_mkldnn() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_CPP_PACKAGE=ON \
        -G Ninja /work/mxnet
    ninja
}

Python 3: MKLDNN-MKL-CPU

Build

build_ubuntu_cpu_mkldnn_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}

Tests

Each test node runs one of the following two test functions.

python3_ut

unittest_ubuntu_python3_cpu() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
}

python_ut_mkldnn

unittest_ubuntu_python3_cpu_mkldnn() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
}

Test steps

To show fine-grained timing results, I break each test node down into the following steps:

  • docker launch: the time it takes to prepare and launch the Docker container
  • parallel unittest: pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
  • serial unittest: pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
  • quantization test: pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
  • mkl test: pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl

Results

The unit for all of the following results is seconds. The Python 3: CPU results are the baseline.

| Test | docker launch | parallel unittest | serial unittest | quantization | mkl test |
| --- | --- | --- | --- | --- | --- |
| Python 3: CPU | 892 | 981 | 951 | 39 | N/A |
| Python 3: MKL-CPU | 874 | 4996 (+409%) | 4380 (+361%) | 67 | N/A |
| Python 3: MKLDNN-CPU | 892 | 5096 (+419%) | 3314 (+248%) | N/A | 1080 |
| Python 3: MKLDNN-MKL-CPU | 901 | 3899 (+297%) | 3507 (+269%) | N/A | 1210 |
@szha
Member Author

szha commented May 5, 2020

cc @TaoLv @PatricZhao

@szha szha mentioned this issue May 5, 2020
@pengzhao-intel
Contributor

Thanks for raising this concern. We will take a look soon.

@TaoLv
Member

TaoLv commented May 6, 2020

I notice MXNET_MKLDNN_DEBUG is already disabled. I suspect some of the overhead comes from primitive creation: the backend is designed for real-world usage, where primitives are created and cached in the first iteration and then reused in subsequent ones. In unit tests, however, most primitives are created and used only once (sketched below).
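A minimal sketch of that caching effect (a hypothetical timing harness; it assumes an MKL-DNN build of MXNet is importable). The first call with a given shape pays the primitive-creation cost; repeated calls with identical shapes hit the cache:

python3 - <<'EOF'
import time
import mxnet as mx

# Fixed shapes: only the first Convolution call should create the
# MKL-DNN primitive; later identical calls reuse the cached one.
x = mx.nd.random.uniform(shape=(32, 3, 224, 224))
w = mx.nd.random.uniform(shape=(64, 3, 3, 3))
for i in range(3):
    t0 = time.time()
    y = mx.nd.Convolution(data=x, weight=w, no_bias=True,
                          kernel=(3, 3), num_filter=64)
    y.wait_to_read()  # block until the async engine finishes
    print("call %d: %.4fs" % (i, time.time() - t0))
EOF

In unit tests the shapes change from test to test, so almost every call behaves like a first iteration.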

@szha Do you have a detailed report where we can find the test time for each test or test file?

@szha
Member Author

szha commented May 6, 2020

@TaoLv for the CI run I linked, you can find the log in which the 50 most time-consuming tests are listed for each run.

@szha
Member Author

szha commented May 16, 2020

Any update on this?

szha added a commit that referenced this issue May 17, 2020
* run operator tests with naive engine

* fix take tests

* update skip mark

* fix cuda error reset

* adjust tests

* disable parallel testing and naive engine for mkl/mkldnn #18244
@szha
Member Author

szha commented May 17, 2020

For now I have disabled parallel testing and the naive-engine change for MKL/MKLDNN due to the excessive test time these builds take. For testing, revert this commit: 9b7b7e2

@bgawrych
Contributor

@szha I've done some testing in Docker for the parallel tests and have an overview. For testing I used about 35 long-running tests.
pytest -n 4 spawns many more threads for the MKL/MKLDNN build, and CPU usage stays at 100% for the entire run (24 cores used out of 24), so it's probably a thread-starvation problem.
Threads spawned:
MKL/MKLDNN build: about 250 most of the time, with a peak of 265 shortly before the end
CPU build: about 140 (peak 150)

I tested the MKLDNN build with OMP_NUM_THREADS=n/4 (n = number of cores)
and it was significantly faster (about 1.7x); a sketch of that setting follows below.

I'm going to do some testing with MKL and OMP flags to find the best configuration, and also figure out the 'serial' case.
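A sketch of the workaround (assumed values; n is read from nproc):

# Cap OpenMP threads per pytest worker so that 4 workers together use
# roughly the core count instead of oversubscribing it 4x.
export OMP_NUM_THREADS=$(($(nproc) / 4))
pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest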

cc: @TaoLv

@bgawrych
Contributor

A second problem is that on the CI Docker image, MXNet built with CMake is probably not using Intel OpenMP but GCC's OpenMP instead.

The strange thing is that the ldd command shows different linking for the two build systems (a runtime check is sketched below):

  • for make: ldd lib/libmxnet.so | grep omp shows:
    libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
  • for cmake: ldd build/libmxnet.so | grep omp shows:
    libomp.so => /work/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
    (removing 3rdparty/openmp didn't help)
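To double-check at runtime rather than link time (a sketch; the grep pattern simply matches any of libgomp/libomp/libiomp5 in the process's memory map):

# ldd shows link-time resolution; /proc/<pid>/maps shows what the live
# process actually loaded after importing mxnet.
python3 -c "import mxnet, os; os.system('grep omp /proc/%d/maps' % os.getpid())"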

• MXNet built with the make build system is about 8x faster:
  o test_operator.py::test_lstm_bidirectional – cmake build execution time: 305 sec
  o test_operator.py::test_lstm_bidirectional – make build execution time: 34.05 sec

• I've tried lots of build combinations with different environment variables
(e.g. setting LD_LIBRARY_PATH, etc.);
nothing has helped so far, so I'm still trying and digging more

@leezu
Contributor

leezu commented May 28, 2020

The cmake build uses MKL by default if it is available; you can check the cmake configuration output. To force MKL, set cmake -DUSE_BLAS=MKL. Reference: https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/cmake/ChooseBlas.cmake#L21-L32

If MKL BLAS is used, the bundled OpenMP will not be built:

https://github.com/apache/incubator-mxnet/blob/382279e0d5b88b2bd05e6c00105fe1e9eb2bc600/CMakeLists.txt#L402-L427

There may be a bug in these two references, or your build environment may not be set up correctly.
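A quick way to confirm which BLAS a configured build actually selected (a sketch; cache entry names can vary between MXNet versions):

# Inspect the CMake cache of an existing build directory for BLAS choices.
grep -i blas /work/build/CMakeCache.txt
# Then reconfigure with MKL forced explicitly, as suggested above.
cmake -DUSE_BLAS=MKL -GNinja /work/mxnet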

@bgawrych
Contributor

I discovered that the slowdown is caused by the ENABLE_TESTCOVERAGE=ON flag, not by OpenMP.
I think the overhead comes from the injected instrumentation code locking shared resources, because
OMP_NUM_THREADS=1 improves the speed of the tests.

With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 305s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 479.93s

With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:

  • test_operator.py::test_lstm_bidirectional – test time: 236.45s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 336.66s

Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 40.93s
  • tests/python/unittest/test_gluon.py::test_slice_pooling2d_slice_pooling2d - test time: 58.14s

Summary:

  • setting OMP_NUM_THREADS=1 helps
  • the ENABLE_TESTCOVERAGE flag is the root cause of the slowdown
  • setting OMP_NUM_THREADS=n/4 for parallel tests helps (as does the first bullet); a control build without the flag is sketched below
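ENABLE_TESTCOVERAGE typically enables gcc-style coverage instrumentation (along the lines of -fprofile-arcs -ftest-coverage), whose per-branch counter updates can contend across threads; that would also explain why OMP_NUM_THREADS=1 narrows the gap. A control build without the flag, mirroring the MKLDNN build function above (a sketch, with the flag set trimmed for brevity):

# Same as build_ubuntu_cpu_mkldnn, but with coverage instrumentation off.
cd /work/build
CC=gcc-7 CXX=g++-7 cmake \
    -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
    -DENABLE_TESTCOVERAGE=OFF \
    -DUSE_MKL_IF_AVAILABLE=OFF \
    -DUSE_MKLDNN=ON \
    -DUSE_CUDA=OFF \
    -G Ninja /work/mxnet
ninja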

CC: @TaoLv , @leezu , @szha

@leezu
Contributor

leezu commented May 29, 2020

Thank you for looking into this! Do you have any insights into why this only affects MKL builds?

@bgawrych
Contributor

bgawrych commented Jun 1, 2020

@leezu

Thank you for looking into this! Do you have any insights into why this only affects MKL builds?

This affects the CPU (OpenBLAS) build as well.
CPU build measurements:

With ENABLE_TESTCOVERAGE=1 & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 105.87s and 112.04s

With ENABLE_TESTCOVERAGE=1 & OMP_NUM_THREADS=1:

  • test_operator.py::test_lstm_bidirectional – test time: 89.02s and 91.87s

Without ENABLE_TESTCOVERAGE & unset OMP_NUM_THREADS:

  • test_operator.py::test_lstm_bidirectional – test time: 86.25s and 86.75s

Without ENABLE_TESTCOVERAGE & OMP_NUM_THREADS=12:

  • test_operator.py::test_lstm_bidirectional – test time: 64.68s and 65.76s

The last case shows that the OpenBLAS build has some problems utilizing all available threads.
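One knob worth separating out here (a sketch; OPENBLAS_NUM_THREADS is OpenBLAS's own thread-pool setting, independent of OpenMP):

# Pin both thread pools explicitly before running the slow test.
export OMP_NUM_THREADS=12          # OpenMP regions inside MXNet operators
export OPENBLAS_NUM_THREADS=12     # OpenBLAS's internal GEMM thread pool
pytest --verbose tests/python/unittest/test_operator.py::test_lstm_bidirectional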

AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this issue Jul 6, 2020
* run operator tests with naive engine

* fix take tests

* update skip mark

* fix cuda error reset

* adjust tests

* disable parallel testing and naive engine for mkl/mkldnn apache#18244
@bgawrych
Contributor

@szha Can we close this issue, since the root cause was found?

@szha
Member Author

szha commented Jul 27, 2020

@bgawrych yes, thank you for tracking this down.

@szha szha closed this as completed Jul 27, 2020