-
-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the new TBB interface and use tbb::task_arena #2261
Conversation
Signed-off-by: Hamada S. Badr <hamada.s.badr@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Merge once tests pass. Thank you.
task_arena is what services should use to run one chain... |
Oh right, I tried that once. Let me dig that up and take another look. |
@wds15 @rok-cesnovar |
@rok-cesnovar No, I don't think so. You can now create a math/stan/math/prim/core/init_threadpool_tbb.hpp Lines 87 to 95 in 861ce93
Also, check lines 227-236 of the new
And here's the relevant part of the functionality replacement table:
|
I think that task arena is also available in the version we are currently using. Anyhow, should be simple to verify with a quick test.But its getting late here, so will do that tomorrow. |
@rok-cesnovar I see. #1949 is a different, but related, issue though. This PR fixes the new interface that is built on the current internal threading code. Update: Now that I understand the cause of #1949, |
As I understood the TBB, if you you run code in a custom task_arena with a defined concurrency level, then that is what this-task-arena-max_concurrency will report to you just that. |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
The arena should be initialized to apply the concurrency limit. If there's no arena instance initialized, Also, by testing, |
@hsbadr will get to this in a couple of days. Sorry for the delay. |
Thanks for the heads up! It isn't urgent as I'm merging this in my fork, but wanted the fixes to be in the |
See RcppCore/RcppParallel#141, stan-dev/math/pull/2257, and stan-dev/math/pull/2261. Signed-off-by: Hamada S. Badr <hamada.s.badr@gmail.com>
Hi @rok-cesnovar, Any update? Sorry to bother you. I'd like to drop my fork but waiting on this. We can merge it to fix the new TBB code and test #1949 later, preferably after the internal TBB code gets updated. Thanks! Happy holidays! |
Sorry, been busy. Should get to it in a day pr two. |
Sorry @hsbadr only getting back to this now. Can you give me some help testing this. Lets say I take this release and place it in "~/Stan/newTBB" What do I need to set to make it work? |
By this release, I meant https://github.com/oneapi-src/oneTBB/releases/tag/v2021.1.1 |
@rok-cesnovar No worries. Hope you’re having an enjoyable holiday break! I added testing instructions in the description:
You may download OneAPI from here. |
Just build it, or install the binaries, and set the TBB environment variables. Then, with #2261, you can run |
@rok-cesnovar For installing
|
Thanks so much! |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
It's great that you've successfully built On a related note, do I need to do anything for
|
I've added your test in
Your test Edit: @rok-cesnovar I see what you mean now; the printed number is larger, It could be a problem in the test itself since I'm using threading with |
Yes, it passes for me as well, because in order for the test to pass the number of threads used just needs to be greater than 1. The problem is that the number printed there is 33. However it should be at most 4 (due to STAN_NUM_THREADS=4).
Yes, but with the current TBB we use in Stan Math we can limit the number of threads used with reduce sum. With the new interface we can not as shown by this test (or rather the print in the test). By merging the current state of this branch we would indeed allow the use of the latest versions of TBB, which is great and thanks for working on that! The problem is that the users would have no control over the number of threads used which they do right now. And I am not sure I like that. I was under the impression you solved this with the global control thing. @wds15 your thoughts on this? I am guessing you agree we dont want to merge this without a way to control the number of threads.
I have no problem with using threading with cmdstan as well. The problem is that I have no control over the number of threads used. Not in this test and not in Cmdstan.
I agree that this is useful. I see that RcppParallel allows using system installed TBB so this has its purpose. No doubt about that. But I think we do need it to respect the user-set limit of threads. |
Agreed. I'm looking into it. You missed this question:
|
Ah, sorry. I dont know much about rstan, what I do now is that the parser requires no flags. And AFAIK rstan has STAN_THREADS set by default so that should be it I think. |
The use of the environment variable STAN_NUM_THREADS is in a way a workaround and we should actually have the stan layer be aware of threads by chain, but the Stan layer does not know about threads at all right now. So for math it means that we need to limit the number of threads used to STAN_NUM_THREADS as a default, yes. In a future when the stan services supports controlling threads we can hopefully get rid of STAN_NUM_THREADS, but that will still take some time. However, for this PR this means that whenever STAN_NUM_THREADS is set, then it needs to be respected. |
@rok-cesnovar @wds15 I've found the culprit. In your test above, and most code in In sum, |
@rok-cesnovar Here's a fixed version of your test: #ifdef STAN_THREADS
std::vector<int> threading_test_global;
struct threading_test_lpdf {
template <typename T1>
inline auto operator()(const std::vector<T1>&, std::size_t start,
std::size_t end, std::ostream* msgs) const {
tbb::task_arena& tbb_arena = stan::math::init_threadpool_tbb();
tbb_arena.execute([&]() {
threading_test_global[start] = tbb::this_task_arena::current_thread_index();
});
return stan::return_type_t<T1>(0);
}
};
TEST(StanMathRev_reduce_sum, threading) {
threading_test_global = std::vector<int>(10000, 0);
std::vector<stan::math::var> data(threading_test_global.size(), 0);
stan::math::reduce_sum<threading_test_lpdf>(data, 1, nullptr);
auto uniques = std::set<int>(threading_test_global.begin(),
threading_test_global.end());
EXPECT_GT(uniques.size(), 1);
std::cout << uniques.size() << std::endl;
stan::math::recover_memory();
}
#endif The key change is to join the initialized arena and execute: tbb::task_arena& tbb_arena = stan::math::init_threadpool_tbb();
tbb_arena.execute([this]() {
...
}); Applying this in any code will respect threading options in |
@rok-cesnovar Please check the fixed test above. Do you want me to deprecate |
Thanks @hsbadr. That makes sense. But that only fixes the issue with the test, but not the actual behavior of reduce_sum which would still not respect STAN_NUM_THREADS. |
Yes, I'm working on this. |
This allows `init_threadpool_tbb()` to effectively limit the total number of worker threads that can be active in the task scheduler defined by `STAN_NUM_THREADS`. Note that maximum allowed parallelism will be from 1 to `STAN_NUM_THREADS` threads.
@rok-cesnovar 5e2a557 should fix the global control to restrict the total number of threads respecting |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
This adds support for within-chain threading in the new version of `rstan >= 2.25 | Stan >= 2.25`. It has been tested with stan-dev#887, paul-buerkner/brms#1074, and stan-dev/math#2261. A new `rsran` option `threads_per_chain` has been added to control the within-chain number of threads (`threads$threads`): ```r rstan::rstan_options(threads_per_chain = threads$threads) ``` If the model is compiled with threading support, the number of threads to use in parallelized sections _within_ an MCMC chain (e.g., when using the `Stan` functions `reduce_sum()` or `map_rect()`). The actual number of CPU cores used is `chains * threads_per_chain` where `chains` is the number of parallel chains. For an example of using threading, see [Reduce Sum: A Minimal Example](https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html)
@rok-cesnovar I've tested this for within-chain threading (using |
Hey, this is good to go, but we are waiting on the feature freeze for 2.26 to pass. So I will merge this on Tuesday if all goes well. Thank you for working on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested again and the number of threads is now respected for all cases. Good to go. Thanks @hsbadr!
Summary
This builds on #2257 and fixes the related tests when the TBB environment variables are set correctly to use an externally updated library, including oneTBB. It fixes the
LDFLAGS_TBB
compiler flags and usestbb::task_arena
in replacement of the old (and now removed)tbb::task_schedular_init
.tbb::task_arena
andtbb::global_control
to manage the task scheduler arena.oneTBB
.This is related to #2257 and RcppCore/RcppParallel#141.
Tests
For example, installing oneTBB on Linux 64-bit (
x86_64
) to$HOME
directory (change if needed!):TBB
for the installation prefix,TBB_INC
for the directory that includes the header files, andTBB_LIB
for the libraries directory).For example, installing oneTBB on Linux 64-bit (
x86_64
) to$HOME
directory (change if needed!):Note that the test names for the new TBB interface include a
tbb_new_
prefix while the default tests usetbb_
prefix. So, make sure that you run the new tests (intel_tbb_new_init
andintel_tbb_new_late_init
reported in the test results) or otherwise your build has used the internal TBB source code (i.e., the environment isn't set correctly for using the external TBB library).Side Effects
N/A
Release notes
Added support for the new TBB interface and allowed using an external TBB library.
Checklist
Math issue N/A
Copyright holder: Hamada S. Badr hamada.s.badr@gmail.com
the basic tests are passing
./runTests.py test/unit
)make test-headers
)make test-math-dependencies
)make doxygen
)make cpplint
)the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested