Basic CPU Kernel OMP selection based upon whether GPU has been used #7854
Conversation
src/engine/threaded_engine.h
Outdated
#if MXNET_USE_CUDA
    if (run_ctx.ctx.dev_mask() == gpu::kDevMask) {
      // Signify to kernel that GPU is being used
      mxnet::op::mxnet_op::KernelState::SetUsingGPU(true);
this should be done in gpu's lazy alloc queue
ok
The queue doesn't know that it's executing on the GPU, right? Are you suggesting we make the queue aware so that it calls this function? That seems kind of messy, right?
This needs to be combined with launching more CPU workers when using the GPU in order to be useful.
The complex logic to determine how many workers are needed is coming in a separate PR.
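The engine-side flag under discussion can be sketched as a process-wide atomic. This is a hedged sketch: the name `KernelState::SetUsingGPU` comes from the diff above, but the body and the memory-ordering choice are assumptions, not the merged implementation.

```cpp
#include <atomic>

// Hypothetical sketch of the GPU-used flag: the engine flips it the first
// time a GPU context executes, and CPU kernels read it to pick an OMP policy.
class KernelState {
 public:
  static void SetUsingGPU(bool using_gpu) {
    using_gpu_.store(using_gpu, std::memory_order_relaxed);
  }
  static bool UsingGPU() {
    return using_gpu_.load(std::memory_order_relaxed);
  }

 private:
  static std::atomic<bool> using_gpu_;
};

std::atomic<bool> KernelState::using_gpu_{false};
```

Relaxed ordering suffices here because the flag is a monotonic hint for a heuristic, not a synchronization point.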
src/operator/mxnet_op.h
Outdated
@@ -221,12 +236,23 @@ template<typename OP>
 struct Kernel<OP, cpu> {
   template<typename ...Args>
   inline static void Launch(mshadow::Stream<cpu> *s, int N, Args... args) {
-#if (MXNET_USE_CUDA == 0)
+#if MXNET_USE_CUDA == 0
we don't even need this, right?
Probably not
* Disabling the test_CSVIter for now: this test causes random failures when running on Windows, so it is disabled until we fix it. A GitHub issue has been created to track it.
* Update test_io.py
* Update test_io.py
src/operator/mxnet_op.cc
Outdated
@@ -0,0 +1,31 @@
+/*
remove file
done
src/engine/threaded_engine.h
Outdated
@@ -293,6 +301,19 @@ class ThreadedEngine : public Engine {
     finished_cv_.notify_all();
   }

   static int DefaultOMPThreadsPerWorker() {
     int cores = std::thread::hardware_concurrency();
physical core or logical core?
I am changing it in another branch to use the OMP number-of-processors call.
src/engine/threaded_engine.h
Outdated
      cores = omp_get_num_threads();
    } else {
      // By default, leave one core to run the engine
      --cores;
we may need to leave more threads
actually for CPU only case 1 is enough
What is "case 1"?
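The default being debated ("leave one core to run the engine") amounts to something like the sketch below. Note `hardware_concurrency()` reports logical cores, which is part of the physical-vs-logical question above; the clamp to a minimum of one thread is my assumption, not the PR's exact code.

```cpp
#include <algorithm>
#include <thread>

// Sketch: reserve one hardware thread for the engine's dispatch loop and
// give the rest to OMP workers, never returning fewer than one thread.
static int DefaultOMPThreadsPerWorker() {
  // hardware_concurrency() counts logical cores and may return 0 if unknown.
  const int cores = static_cast<int>(std::thread::hardware_concurrency());
  return std::max(1, cores - 1);
}
```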
src/operator/mxnet_op.h
Outdated
        OP::Map(i, args...);
      }
    } else {
      #pragma omp parallel for num_threads(omp_cores - 1)
why -1?
I am changing it to use omp_threads only; I was leaving a thread for the engine. However, in this PR I only want to keep the same OMP behavior as before, aside from making behavior consistent between CPU and GPU builds.
Leaving 1 for the engine.
Eventually the OMP thread count will probably be divided by, for example, the number of concurrent ops running. I am also working on tuning, which will likely come into play as well; that's for a later PR early next week.
Please be advised this change is only meant to preserve the previous CPU-build behavior when running a GPU build with the GPU not used. More elegant OMP behavior is forthcoming in a later PR next week.
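The CPU `Kernel::Launch` shape being reviewed, with its serial/OMP split, can be sketched roughly as follows. Everything except `OP::Map` is an invented name for illustration; if the compiler has no OpenMP, the pragma is ignored and the loop simply runs serially, so the result is the same either way.

```cpp
#include <vector>

// Sketch of the CPU kernel launch dispatch: run OP::Map serially when only
// one thread is warranted, otherwise let OMP split the index range.
template <typename OP, typename... Args>
void LaunchCPU(const int N, const int omp_threads, Args... args) {
  if (omp_threads <= 1) {
    for (int i = 0; i < N; ++i) {
      OP::Map(i, args...);
    }
  } else {
    #pragma omp parallel for num_threads(omp_threads)
    for (int i = 0; i < N; ++i) {
      OP::Map(i, args...);
    }
  }
}

// Example op (hypothetical): doubles each element in place.
struct DoubleOp {
  static void Map(int i, float *data) { data[i] *= 2.0f; }
};
```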
…. This is not changed from the master branch. Trying a different format.
src/engine/threaded_engine.h
Outdated
    // TODO(cjolivier01): Programmatically obtain hyperthreading count (if supported)
    // Taking max including omp_get_max_threads() in case this implementation of OMP
    // accounts for hyperthreading
    return std::max(omp_get_max_threads(), omp_get_num_procs());
why max?
It may have been set by the environment variable, or set elsewhere to something lower. A previous call to set_max... in some library may have reduced it, but we want to use either a larger number from the environment (i.e. the user wishes to use the hyperthreading count times the number of procs) or the number of procs.
More OMP tuning is coming in a later PR (including recursion depth, etc.).
How would a user opt to use fewer threads than the number of cores?
Model serving where you have a separate webserver process.
Changed to allow an environment override. omp_get_max_threads() can be implementation-specific and may take hyperthreading into account. Otherwise, we use the number of procs (per Eric).
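The environment override mentioned here might look like the sketch below. The function name is hypothetical; the real change parses the environment directly because OMP_NUM_THREADS may have odd formatting (e.g. "3, 2"), which is why only the leading integer is used.

```cpp
#include <cstdlib>
#include <thread>

// Sketch: honor an explicit OMP_NUM_THREADS (e.g. for model serving next to
// a webserver process), otherwise fall back to the processor count.
static int OMPThreadCountWithOverride() {
  if (const char *env = std::getenv("OMP_NUM_THREADS")) {
    // The value may be oddly formatted (e.g. "3, 2"); atoi stops at the
    // first non-numeric character, keeping only the leading count.
    const int n = std::atoi(env);
    if (n > 0) {
      return n;
    }
  }
  return static_cast<int>(std::thread::hardware_concurrency());
}
```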
This is an example of CI leaving artifacts from previous builds in the workspace on the build machine (i.e. lua-package/).
…to optimize_basic_omp
Trigger build
apache#7854 Unit test framework for C++ timing of generic operators. Activation operator converted to Kernel from mshadow. Performance improves (see below; the rows are labeled "Fully connected" in the benchmark output). Timing: 50 iterations of 10 calls (500 passes each), average ms per call:

| Shape | Forward (old) | Backward (old) | Forward (new) | Backward (new) |
|---|---|---|---|---|
| (1,1,28,28) | 0.11243 | 0.138644 | 0.088222 | 0.00168 |
| (1,3,28,28) | 0.048374 | 0.067596 | 0.032186 | 0.002838 |
| (50,1,18,32) | 0.196438 | 0.071866 | 0.275764 | 0.07789 |
| (50,3,18,32) | 0.693474 | 0.120282 | 0.680322 | 0.136512 |
| (20,3,128,128) | 7.21567 | 0.77545 | 6.93005 | 0.645824 |
Note: I have a couple of PRs stacked up behind this one...
…pache#7854)

* Basic CPU Kernel OMP selection based upon whether GPU has been used
* lint
* Disabling the test_CSVIter for now (apache#7829): this test causes random failures when running on Windows; disabled until we fix it, with a GitHub issue to track it
* Update test_io.py (x2)
* Use OMP thread count as test in Kernel, set count for Kernel loop
* lint
* removed
* Remove assert
* Adjust DefaultOMPThreadsPerWorker
* remove -1 from omp_cores
* Trigger build
* It is not clear why pylint claims that this is re-imported (it is not; this is unchanged from the master branch); trying a different format
* lint (x2)
* Change getter/setter naming style
* allow env override
* check environment directly, since OMP_NUM_THREADS may have odd formatting (e.g. "3, 2")
* CR comments
* Squashed commit of the following (Olivier <coolivie@amazon.com>, Mon Sep 25 2017): fix formatting (ec704f1), splitting unary ops (0218c49), split unary (9abbba1)
* Update mxnet_predict0.cc (x2)
* fix oversight with bracket
* Binary scatter working on CPU and GPU
* return unchanged
* Disable an unreliable test case; I can't even tell what's wrong on the CI build because so many errors come from this test
* inconsequential cleanup
* Update test_kvstore.py
* Update CMakeLists.txt (x2, trigger build)
* force fail
* remove forced error
* test clean every make
* Test
* Copy Jenkinsfile from upstream/master to fix the build
* logic was reversed
* Update threaded_engine.h (trigger build)
* Trigger rebuild
* Trigger build (x2)
#8232)

* GPROF update; also include include/mxnet/*.h as sources for CLion
* Added FindGperftools.cmake
* Add option USE_GPERFTOOLS (x3)
* USE_GPERFTOOLS off by default for now
* Add Apache license to FindGperftools.cmake
* Update CMakeLists.txt: try to use GPerftools or JEmalloc by default
* Update CMakeLists.txt: off by default for now
* internal labeling
* gperftools and jemalloc
* gperftools and jemalloc on by default
* Fixing the Caught error (#8199)
* Temporarily disable some unit tests to fix the build (#8253): test_rms (re-enable once #8230 is fixed) and test_autograd_save_memory (re-enable once #8211 is fixed)
* OMP num threads 0->1
* remove check
* Update documentation links to point to mxnet.incubator.apache.org
* add export to gluon (#8212): add export, fix, add test, fix nnvm, fix
* ReleaseFeedback: License Files (#8247): updating license headers, license changes
* Sequential aug (#8243): add sequentialAug, add type for castaug, modify docs
* Basic CPU Kernel OMP selection based upon whether GPU has been used (#7854), including its full squashed commit history
* Multiplatform docker based builds (#7792): add dockerized multi-architecture build files, add android arm64 build
* Operators for sum(csr, axis=0) and sum(csr, axis=1) (#8174): add infer storage for sparse slice operator, remove unused files, indentation fix and GPU fallback test, change sum builtin to py_sum, add sum_axis(csr,axis=0)=dense and sum(csr,axis=1)=dense operators, documentation changes for sparse, fallback unittest for keepdims and exclude, PR review changes (fix CHECK_NE, change in_stype to int, use const int, initialize mid with the start, generalizing)
* OMP num threads 0->1
* remove check
First iteration for performance enhancements
If GPU isn't used, then use OMP for running CPU kernels
GPU usage is triggered by the ThreadedEngine or NaiveEngine
Currently, the intended net effect of this PR is to allow for normal OMP behavior for GPU builds when the GPU is not used. More robust OMP thread management is forthcoming.
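Putting the PR's pieces together, the intended selection policy is roughly the following. This is a hedged sketch: names and exact thread counts are assumptions, and the "more robust OMP thread management" mentioned above would replace these hard-coded choices.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>

// Sketch of the overall policy: once any GPU op has run, CPU kernels assume
// the engine is running extra CPU workers and keep their OMP fan-out minimal;
// in a pure-CPU run, one worker uses most cores via OMP.
static std::atomic<bool> g_gpu_was_used{false};

static int OMPThreadsForCPUKernel() {
  if (g_gpu_was_used.load(std::memory_order_relaxed)) {
    // GPU in use: the CPU side is a helper, so do not fan out with OMP.
    return 1;
  }
  // CPU-only: use the cores, leaving one for the engine loop.
  const int cores = static_cast<int>(std::thread::hardware_concurrency());
  return std::max(1, cores - 1);
}
```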