Fix the new TBB interface and use tbb::task_arena #2261

Merged
merged 13 commits into from
Jan 27, 2021

Conversation

hsbadr
Member

@hsbadr hsbadr commented Dec 15, 2020

Summary

This builds on #2257 and fixes the related tests when the TBB environment variables are set correctly to use an externally updated library, including oneTBB. It fixes the LDFLAGS_TBB flags and uses tbb::task_arena in place of the old (and now removed) tbb::task_scheduler_init.

  • Using tbb::task_arena and tbb::global_control to manage the task scheduler arena.
  • Fixed compilation errors when linking against an external Intel oneTBB / OneAPI.
  • Fixed global control to restrict the maximum allowed number of threads.
  • Updated compiler flags to fix unit tests with oneTBB.
  • Updated Intel TBB instructions in the README.

This is related to #2257 and RcppCore/RcppParallel#141.

Tests

For example, installing oneTBB on Linux 64-bit (x86_64) to $HOME directory (change if needed!):

    TBB_VERSION="2021.1.1"

    wget https://github.com/oneapi-src/oneTBB/releases/download/v2021.1.1/oneapi-tbb-$TBB_VERSION-lin.tgz
    tar zxvf oneapi-tbb-$TBB_VERSION-lin.tgz -C $HOME

    export TBB="$HOME/oneapi-tbb-$TBB_VERSION"
  • Set the TBB environment variables (specifically: TBB for the installation prefix, TBB_INC for the directory that includes the header files, and TBB_LIB for the libraries directory).

For example, continuing with the oneTBB installation above:

    source $TBB/env/vars.sh intel64

    export TBB_INC="$TBB/include"
    export TBB_LIB="$TBB/lib/intel64/gcc4.8"
    #export LD_LIBRARY_PATH="$TBB_LIB:$LD_LIBRARY_PATH"

    mkdir -p ~/.config/stan
    echo TBB_INTERFACE_NEW=true>> ~/.config/stan/make.local

    # Checks:
    ls -lAh $TBB_INC
    # should list `oneapi` and `tbb` directories that include the headers
    ls -lAh $TBB_LIB
    # should list the TBB libraries, including `libtbb.so`
  • Run TBB tests:
> ./runTests.py test/unit/math/prim/core/init_threadpool_tbb_test.cpp
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from intel_tbb_new_init
[ RUN      ] intel_tbb_new_init.check_status
[       OK ] intel_tbb_new_init.check_status (1 ms)
[----------] 1 test from intel_tbb_new_init (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2 ms total)
[  PASSED  ] 1 test.
> ./runTests.py test/unit/math/prim/core/init_threadpool_tbb_late_test.cpp
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from intel_tbb_new_late_init
[ RUN      ] intel_tbb_new_late_init.check_status
[       OK ] intel_tbb_new_late_init.check_status (0 ms)
[----------] 1 test from intel_tbb_new_late_init (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 1 test.

Note that the test names for the new TBB interface include a tbb_new_ prefix, while the default tests use a tbb_ prefix. So make sure that the new tests ran (intel_tbb_new_init and intel_tbb_new_late_init should be reported in the test results); otherwise, your build used the internal TBB source code (i.e., the environment isn't set correctly for using the external TBB library).
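The naming check described in this note can be automated on captured test output. The following is a minimal C++ sketch under that assumption; `ran_new_tbb_tests` is a hypothetical helper name and is not part of Stan Math or googletest.

```cpp
#include <string>

// Hypothetical helper (not part of Stan Math): returns true when captured
// googletest output mentions a suite whose name carries the tbb_new_ prefix,
// i.e. the build picked up the external TBB library.
inline bool ran_new_tbb_tests(const std::string& gtest_output) {
  return gtest_output.find("intel_tbb_new_") != std::string::npos;
}
```

Feeding it the text printed by `./runTests.py` should be enough to tell the two builds apart, assuming the suite names stay as shown in the results above.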

Side Effects

N/A

Release notes

Added support for the new TBB interface and allowed using an external TBB library.

Checklist

  • Math issue N/A

  • Copyright holder: Hamada S. Badr hamada.s.badr@gmail.com

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

Signed-off-by: Hamada S. Badr <hamada.s.badr@gmail.com>
rok-cesnovar
rok-cesnovar previously approved these changes Dec 15, 2020
Member

@rok-cesnovar rok-cesnovar left a comment

Looks good. Merge once tests pass. Thank you.

@wds15
Contributor

wds15 commented Dec 15, 2020

task_arena is what services should use to run one chain...

@rok-cesnovar
Member

Oh right, I tried that once. Let me dig that up and take another look.

@rok-cesnovar rok-cesnovar dismissed their stale review December 15, 2020 21:02

Need to do a local test.

@hsbadr
Member Author

hsbadr commented Dec 15, 2020

task_arena is what services should use to run one chain...

@wds15 @rok-cesnovar tbb::task_arena replaces tbb::task_scheduler_init in the new interface. Check the tbbrevamp.pdf here.

@rok-cesnovar
Member

Yes, but I believe that, the way we use TBB, the threading functions would always use the maximum concurrency and would not respect what was set at initialization. This is what @wds15 reminded me of.

See #1949

Will double-check this.

@hsbadr
Member Author

hsbadr commented Dec 15, 2020

@rok-cesnovar No, I don't think so. You can now create a task_arena with certain concurrency limits. init_threadpool_tbb() will initialize the requested number of threads.

inline tbb::task_arena& init_threadpool_tbb() {
  int tbb_max_threads = internal::get_num_threads();
  static tbb::task_arena tbb_arena(tbb_max_threads, 1,
                                   tbb::task_arena::priority::normal);
  tbb_arena.initialize();
  return tbb_arena;
}

Also, check lines 227-236 of the new tbb/task_arena.h header

    //! Creates task_arena with certain concurrency limits
    /** Sets up settings only, real construction is deferred till the first method invocation
     *  @arg max_concurrency specifies total number of slots in arena where threads work
     *  @arg reserved_for_masters specifies number of slots to be used by master threads only.
     *       Value of 1 is default and reflects behavior of implicit arenas.
     **/
    task_arena(int max_concurrency_ = automatic, unsigned reserved_for_masters = 1,
               priority a_priority = priority::normal)
        : task_arena_base(max_concurrency_, reserved_for_masters, a_priority)
    {}

And here's the relevant part of the functionality replacement table:

The following table summarizes the TBB functionality that significantly duplicates other existing functionality or has little practical usage.

| TBB functionality | Replacement |
| --- | --- |
| tbb::task_scheduler_init | tbb::task_arena, tbb::global_control (it will be extended to support the blocking terminate preview functionality) |

@rok-cesnovar
Member

I think that task arena is also available in the version we are currently using. Anyhow, it should be simple to verify with a quick test. But it's getting late here, so I will do that tomorrow.

@hsbadr
Member Author

hsbadr commented Dec 15, 2020

I think that task arena is also available in the version we are currently using. Anyhow, it should be simple to verify with a quick test. But it's getting late here, so I will do that tomorrow.

@rok-cesnovar I see. #1949 is a different, but related, issue though. This PR fixes the new interface that is built on the current internal threading code.

Update: Now that I understand the cause of #1949, tbb::this_task_arena::max_concurrency() should only be used to get the default number of threads that the hardware supports. To limit the number of threads, a task_arena instance should be created with custom parameters (including max_concurrency). Such an instance can be attached to the internal arena currently used by the calling thread via tbb::task_arena(tbb::task_arena::attach()), preferably together with a global task_scheduler_observer.

@wds15
Contributor

wds15 commented Dec 16, 2020

As I understand TBB, if you run code in a custom task_arena with a defined concurrency level, then that level is exactly what tbb::this_task_arena::max_concurrency() will report.

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.41 | 3.43 | 0.99 | -0.77% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -2.03% slower |
| eight_schools/eight_schools.stan | 0.11 | 0.12 | 0.96 | -4.06% slower |
| gp_regr/gp_regr.stan | 0.15 | 0.16 | 0.97 | -3.14% slower |
| irt_2pl/irt_2pl.stan | 5.45 | 5.46 | 1.0 | -0.23% slower |
| performance.compilation | 88.96 | 85.81 | 1.04 | 3.55% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.41 | 8.38 | 1.0 | 0.31% faster |
| pkpd/one_comp_mm_elim_abs.stan | 30.23 | 29.29 | 1.03 | 3.09% faster |
| sir/sir.stan | 144.47 | 142.75 | 1.01 | 1.19% faster |
| gp_regr/gen_gp_data.stan | 0.05 | 0.04 | 1.01 | 1.17% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.92 | 2.92 | 1.0 | 0.0% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.39 | 0.4 | 0.96 | -3.7% slower |
| arK/arK.stan | 2.5 | 2.52 | 0.99 | -0.89% slower |
| arma/arma.stan | 0.59 | 0.58 | 1.0 | 0.23% faster |
| garch/garch.stan | 0.61 | 0.6 | 1.0 | 0.18% faster |
Mean result: 0.997067772134

Jenkins Console Log
Blue Ocean
Commit hash: a3aa7a1


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@hsbadr
Member Author

hsbadr commented Dec 16, 2020

As I understand TBB, if you run code in a custom task_arena with a defined concurrency level, then that level is exactly what tbb::this_task_arena::max_concurrency() will report.

The arena should be initialized to apply the concurrency limit. If there's no arena instance initialized, tbb::this_task_arena::max_concurrency() returns the number of cores anyway.

Also, by testing, tbb::global_control::max_allowed_parallelism parameter defaults to zero until it's set using tbb::global_control(tbb::global_control::max_allowed_parallelism, <number of threads>) to limit the total number of worker threads that can be active in the task scheduler.

@rok-cesnovar
Member

@hsbadr will get to this in a couple of days. Sorry for the delay.

@hsbadr
Member Author

hsbadr commented Dec 18, 2020

@hsbadr will get to this in a couple of days. Sorry for the delay.

Thanks for the heads up! It isn't urgent, as I'm merging this in my fork, but I wanted the fixes to be in the develop branch in case anyone else would like to try the new interface. I use Intel oneAPI and I have to opt in to the new interface; otherwise, the system headers conflict with the internal TBB headers and cause compilation errors.

hsbadr added a commit to hsbadr/rstan that referenced this pull request Dec 18, 2020
@hsbadr
Member Author

hsbadr commented Dec 23, 2020

@hsbadr will get to this in a couple of days. Sorry for the delay.

Hi @rok-cesnovar,

Any update? Sorry to bother you. I'd like to drop my fork but waiting on this. We can merge it to fix the new TBB code and test #1949 later, preferably after the internal TBB code gets updated. Thanks!

Happy holidays!
Hamada

@rok-cesnovar
Member

Sorry, been busy. Should get to it in a day or two.

@rok-cesnovar
Member

Sorry @hsbadr, only getting back to this now. Can you give me some help testing this?

Let's say I take this release and place it in "~/Stan/newTBB". What do I need to set to make it work?

@hsbadr
Member Author

hsbadr commented Dec 29, 2020

Sorry @hsbadr, only getting back to this now. Can you give me some help testing this?

Let's say I take this release and place it in "~/Stan/newTBB". What do I need to set to make it work?

@rok-cesnovar No worries. Hope you’re having an enjoyable holiday break!

I added testing instructions in the description:

To test the new TBB interface:

  • Install a new version of TBB library, oneTBB, or the binaries from OneAPI.
  • Set the TBB environment variables (specifically: TBB for the installation prefix, TBB_INC for the directory that includes the header files, TBB_LIB for the libraries directory, and TBB_INTERFACE_NEW=true in the user-defined variables in ~/.config/stan/make.local or make/local). For example, source /opt/intel/oneapi/setvars.sh or
    export TBB="/opt/intel/oneapi/tbb/latest"
    export TBB_INC="/opt/intel/oneapi/tbb/latest/include"
    export TBB_LIB="/opt/intel/oneapi/tbb/latest/lib/intel64/gcc4.8"

and

    mkdir -p ~/.config/stan
    echo TBB_INTERFACE_NEW=true>> ~/.config/stan/make.local
  • Run TBB tests:
> ./runTests.py test/unit/math/prim/core/init_threadpool_tbb_test.cpp
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from intel_tbb_new_init
[ RUN      ] intel_tbb_new_init.check_status
[       OK ] intel_tbb_new_init.check_status (1 ms)
[----------] 1 test from intel_tbb_new_init (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2 ms total)
[  PASSED  ] 1 test.

You may download OneAPI from here.

@hsbadr
Member Author

hsbadr commented Dec 29, 2020

By this release, I meant https://github.com/oneapi-src/oneTBB/releases/tag/v2021.1.1

Just build it, or install the binaries, and set the TBB environment variables. Then, with #2261, you can run math and/or stan tests as usual.

@hsbadr
Member Author

hsbadr commented Dec 29, 2020

@rok-cesnovar For installing oneTBB on Linux 64-bit (x86_64) to the $HOME directory (change if needed!):

    TBB_VERSION="2021.1.1"

    wget https://github.com/oneapi-src/oneTBB/releases/download/v2021.1.1/oneapi-tbb-$TBB_VERSION-lin.tgz
    tar zxvf oneapi-tbb-$TBB_VERSION-lin.tgz -C $HOME
    export TBB="$HOME/oneapi-tbb-$TBB_VERSION"
    export TBB_INC="$TBB/include"
    export TBB_LIB="$TBB/lib/intel64/gcc4.8"
    export LD_LIBRARY_PATH="$TBB_LIB:$LD_LIBRARY_PATH"

    mkdir -p ~/.config/stan
    echo TBB_INTERFACE_NEW=true>> $HOME/.config/stan/make.local

    # Checks:
    ls -lAh $TBB_INC
    # should list `oneapi` and `tbb` directories that include the headers
    ls -lAh $TBB_LIB
    # should list the TBB libraries, including `libtbb.so`
    grep TBB_INTERFACE_NEW $HOME/.config/stan/make.local
    # should list `TBB_INTERFACE_NEW=true`

The test names for the new TBB interface include a tbb_new_ prefix, while the default tests use a tbb_ prefix. So make sure that the new tests ran (intel_tbb_new_init and intel_tbb_new_late_init should be reported in the test results); otherwise, your build used the internal TBB source code (i.e., the environment isn't set correctly for using the external TBB library).

@rok-cesnovar
Member

Thanks so much!

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.37 | 3.4 | 0.99 | -0.87% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.97 | -2.87% slower |
| eight_schools/eight_schools.stan | 0.11 | 0.11 | 0.97 | -2.83% slower |
| gp_regr/gp_regr.stan | 0.15 | 0.15 | 1.04 | 3.45% faster |
| irt_2pl/irt_2pl.stan | 5.25 | 5.17 | 1.02 | 1.53% faster |
| performance.compilation | 90.3 | 88.41 | 1.02 | 2.1% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.62 | 8.66 | 1.0 | -0.47% slower |
| pkpd/one_comp_mm_elim_abs.stan | 28.59 | 31.15 | 0.92 | -8.97% slower |
| sir/sir.stan | 143.16 | 135.3 | 1.06 | 5.49% faster |
| gp_regr/gen_gp_data.stan | 0.04 | 0.04 | 0.98 | -2.28% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.05 | 3.04 | 1.0 | 0.2% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.38 | 0.41 | 0.92 | -9.07% slower |
| arK/arK.stan | 2.53 | 2.54 | 1.0 | -0.18% slower |
| arma/arma.stan | 0.6 | 0.61 | 0.98 | -2.55% slower |
| garch/garch.stan | 0.67 | 0.69 | 0.97 | -2.7% slower |
Mean result: 0.98822052253

Jenkins Console Log
Blue Ocean
Commit hash: a3aa7a1


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@hsbadr
Member Author

hsbadr commented Jan 17, 2021

It's great that you've successfully built math with external oneTBB. That's the purpose of this PR. I think that the threading issue (and #1949) needs more work in other places. I'll look into it, but we could merge this to fix the new interface when using an external TBB library, which is broken on develop (note that this PR doesn't affect the default build options with the internal TBB source code).

On a related note, do I need to do anything for rstan 2.25+ (stan-dev/rstan#887) to support threading per chain with reduce_sum besides exposing STAN_NUM_THREADS with a new argument threads_per_chain and modifying the generated Stan code? Is there an argument to let stanc generate the code compatible with per-chain threading? If so, how to pass it via V8::v8()? Thanks!

Ok, it seems that the TBB_INTERFACE_NEW does not actually restrict the number of threads.

This is the test I ran. STAN_THREADS=true set in make/local and TBB_INTERFACE_NEW=true:

#ifdef STAN_THREADS	
std::vector<int> threading_test_global;	
struct threading_test_lpdf {	
  template <typename T1>	
  inline auto operator()(const std::vector<T1>&, std::size_t start,	
                         std::size_t end, std::ostream* msgs) const {	
    threading_test_global[start] = tbb::this_task_arena::current_thread_index();	

    return stan::return_type_t<T1>(0);	
  }	
};	

TEST(StanMathRev_reduce_sum, threading) {	
  stan::math::init_threadpool_tbb();
  threading_test_global = std::vector<int>(10000, 0);	
  std::vector<stan::math::var> data(threading_test_global.size(), 0);	

  stan::math::reduce_sum<threading_test_lpdf>(data, 1, nullptr);	

  auto uniques = std::set<int>(threading_test_global.begin(),	
                          threading_test_global.end());	

  EXPECT_GT(uniques.size(), 1);	
  std::cout << uniques.size() << std::endl;
  stan::math::recover_memory();	
}	
#endif

Then ran:
STAN_NUM_THREADS=2 python3 runTests.py test/unit/math/rev/functor/reduce_sum_test.cpp
STAN_NUM_THREADS=4 python3 runTests.py test/unit/math/rev/functor/reduce_sum_test.cpp

The test prints the number of threads used. That number should usually equal STAN_NUM_THREADS and should definitely never be larger than STAN_NUM_THREADS. That is true with what is on develop but isn't with TBB_INTERFACE_NEW, where I get values around 6-7.
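
The bound described here (never more distinct threads than STAN_NUM_THREADS) could be asserted rather than read off a printed number. The following is a minimal, TBB-free C++ sketch of such a check over recorded thread indices; `count_distinct_threads` and `respects_thread_limit` are hypothetical helper names, not part of Stan Math.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Number of distinct thread indices recorded across all slices.
inline std::size_t count_distinct_threads(const std::vector<int>& thread_index) {
  return std::set<int>(thread_index.begin(), thread_index.end()).size();
}

// True when at least one and at most `max_threads` distinct threads were seen.
// A single thread is accepted because the scheduler is free to use fewer
// threads than the configured limit.
inline bool respects_thread_limit(const std::vector<int>& thread_index,
                                  std::size_t max_threads) {
  const std::size_t used = count_distinct_threads(thread_index);
  return used >= 1 && used <= max_threads;
}
```

In a googletest case this might be used as `EXPECT_TRUE(respects_thread_limit(threading_test_global, limit))`, with `limit` taken from however the test obtains the configured thread count.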

@hsbadr
Member Author

hsbadr commented Jan 17, 2021

I've added your test in test/unit/math/rev/functor/reduce_sum_test.cpp and rerun the tests; here's what I get:

STAN_NUM_THREADS=4 python3 runTests.py test/unit/math/rev/functor/reduce_sum_test.cpp

------------------------------------------------------------
make -j1 test/unit/math/rev/functor/reduce_sum_test
make: 'test/unit/math/rev/functor/reduce_sum_test' is up to date.
------------------------------------------------------------
test/unit/math/rev/functor/reduce_sum_test --gtest_output="xml:test/unit/math/rev/functor/reduce_sum_test.xml"
Running main() from lib/benchmark_1.5.1/googletest/googletest/src/gtest_main.cc
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from StanMathRev_reduce_sum
[ RUN      ] StanMathRev_reduce_sum.no_args
[       OK ] StanMathRev_reduce_sum.no_args (0 ms)
[ RUN      ] StanMathRev_reduce_sum.value
[       OK ] StanMathRev_reduce_sum.value (13 ms)
[ RUN      ] StanMathRev_reduce_sum.gradient
[       OK ] StanMathRev_reduce_sum.gradient (2 ms)
[ RUN      ] StanMathRev_reduce_sum.grainsize
[       OK ] StanMathRev_reduce_sum.grainsize (5 ms)
[ RUN      ] StanMathRev_reduce_sum.nesting_gradient
[       OK ] StanMathRev_reduce_sum.nesting_gradient (3 ms)
[ RUN      ] StanMathRev_reduce_sum.grouped_gradient
[       OK ] StanMathRev_reduce_sum.grouped_gradient (3 ms)
[ RUN      ] StanMathRev_reduce_sum.grouped_gradient_eigen
[       OK ] StanMathRev_reduce_sum.grouped_gradient_eigen (2 ms)
[ RUN      ] StanMathRev_reduce_sum.slice_group_gradient
[       OK ] StanMathRev_reduce_sum.slice_group_gradient (2 ms)
[ RUN      ] StanMathRev_reduce_sum.linked_args
[       OK ] StanMathRev_reduce_sum.linked_args (0 ms)
[ RUN      ] StanMathRev_reduce_sum.threading
33
[       OK ] StanMathRev_reduce_sum.threading (1 ms)
[----------] 10 tests from StanMathRev_reduce_sum (31 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (31 ms total)
[  PASSED  ] 10 tests.

Your test StanMathRev_reduce_sum.threading has passed as well.

Edit: @rok-cesnovar I see what you mean now; the printed number is larger. It could be a problem in the test itself, since I'm using threading with CmdStan + oneTBB with no problem. I'll look into it anyway, but I need this merged to fix my environment and not rely on my fork. Its purpose, as I mentioned earlier, is to fix building math with oneTBB.

@rok-cesnovar
Member

Yes, it passes for me as well, because in order for the test to pass the number of threads used just needs to be greater than 1. The problem is that the number printed there is 33. However it should be at most 4 (due to STAN_NUM_THREADS=4).

I think that the threading issue (and #1949) needs more work in other places.

Yes, but with the current TBB we use in Stan Math we can limit the number of threads used with reduce_sum. With the new interface we cannot, as shown by this test (or rather the print in the test).

By merging the current state of this branch we would indeed allow the use of the latest versions of TBB, which is great and thanks for working on that! The problem is that users would have no control over the number of threads used, which they do have right now. And I am not sure I like that. I was under the impression you solved this with the global control thing.

@wds15 your thoughts on this? I am guessing you agree we don't want to merge this without a way to control the number of threads.

see what you mean now; the printed number is larger, It could be a problem in the test itself since I'm using threading with CmdStan + oneTBB with no problem.

I have no problem with using threading with cmdstan as well. The problem is that I have no control over the number of threads used. Not in this test and not in Cmdstan.

I'll look into it anyway, but I need this merged to fix my environment and not rely on my fork. Its purpose, as I mentioned earlier, is to fix building math with oneTBB

I agree that this is useful. I see that RcppParallel allows using system installed TBB so this has its purpose. No doubt about that. But I think we do need it to respect the user-set limit of threads.

@hsbadr
Member Author

hsbadr commented Jan 17, 2021

Yes, it passes for me as well, because in order for the test to pass the number of threads used just needs to be greater than 1. The problem is that the number printed there is 33. However it should be at most 4 (due to STAN_NUM_THREADS=4).

I think that the threading issue (and #1949) needs more work in other places.

Yes, but with the current TBB we use in Stan Math we can limit the number of threads used with reduce_sum. With the new interface we cannot, as shown by this test (or rather the print in the test).

By merging the current state of this branch we would indeed allow the use of the latest versions of TBB, which is great and thanks for working on that! The problem is that users would have no control over the number of threads used, which they do have right now. And I am not sure I like that. I was under the impression you solved this with the global control thing.

@wds15 your thoughts on this? I am guessing you agree we don't want to merge this without a way to control the number of threads.

see what you mean now; the printed number is larger, It could be a problem in the test itself since I'm using threading with CmdStan + oneTBB with no problem.

I have no problem with using threading with cmdstan as well. The problem is that I have no control over the number of threads used. Not in this test and not in Cmdstan.

I'll look into it anyway, but I need this merged to fix my environment and not rely on my fork. Its purpose, as I mentioned earlier, is to fix building math with oneTBB

I agree that this is useful. I see that RcppParallel allows using system installed TBB so this has its purpose. No doubt about that. But I think we do need it to respect the user-set limit of threads.

Agreed. I'm looking into it.

You missed this question:

On a related note, do I need to do anything for rstan 2.25+ (stan-dev/rstan#887) to support threading per chain with reduce_sum besides exposing STAN_NUM_THREADS with a new argument threads_per_chain and modifying the generated Stan code? Is there an argument to let stanc generate the code compatible with per-chain threading? If so, how to pass it via V8::v8()? Thanks!

@rok-cesnovar
Member

Ah, sorry.

I don't know much about rstan; what I do know is that the parser requires no flags. And AFAIK rstan has STAN_THREADS set by default, so that should be it, I think.

@wds15
Contributor

wds15 commented Jan 17, 2021

The use of the environment variable STAN_NUM_THREADS is in a way a workaround, and we should actually have the Stan layer be aware of threads by chain, but the Stan layer does not know about threads at all right now. So for math it means that we need to limit the number of threads used to STAN_NUM_THREADS as a default, yes. In the future, when the Stan services support controlling threads, we can hopefully get rid of STAN_NUM_THREADS, but that will still take some time.

However, for this PR this means that whenever STAN_NUM_THREADS is set, then it needs to be respected.
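
To make "respecting STAN_NUM_THREADS" concrete, here is a minimal C++ sketch of turning the variable's text into a thread count. It is not the Stan Math implementation of `internal::get_num_threads()`. The names `parse_num_threads` and `num_threads_from_env` are hypothetical, and the conventions used (unset means 1 thread, -1 means all hardware threads, anything else non-positive or non-numeric is an error) are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: converts the text of STAN_NUM_THREADS to a count.
//   nullptr (unset)  -> 1 thread
//   positive integer -> that many threads
//   -1               -> all hardware threads (assumed convention)
// Any other value throws std::invalid_argument.
inline int parse_num_threads(const char* value) {
  if (value == nullptr) {
    return 1;
  }
  const std::string text(value);
  int n = 0;
  std::size_t pos = 0;
  try {
    n = std::stoi(text, &pos);
  } catch (const std::exception&) {
    throw std::invalid_argument("STAN_NUM_THREADS is not an integer");
  }
  if (pos != text.size()) {
    throw std::invalid_argument("STAN_NUM_THREADS has trailing characters");
  }
  if (n > 0) {
    return n;
  }
  if (n == -1) {
    // hardware_concurrency() may return 0 when the count is unknown.
    const unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? static_cast<int>(hw) : 1;
  }
  throw std::invalid_argument("STAN_NUM_THREADS must be positive or -1");
}

// Reads the environment variable and applies the rules above.
inline int num_threads_from_env() {
  return parse_num_threads(std::getenv("STAN_NUM_THREADS"));
}
```

The resulting count is what would then be passed as the concurrency limit when the arena and the global control are set up.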

@hsbadr
Member Author

hsbadr commented Jan 17, 2021

@rok-cesnovar @wds15 I've found the culprit. In your test above, and most code in math, you use tbb::this_task_arena (e.g., tbb::this_task_arena::current_thread_index() in the above test) without executing the code inside the arena created by stan::math::init_threadpool_tbb();. I've fixed the above test and am scanning the affected code and will push a fix that will likely fix #1949 too.

In sum, tbb::this_task_arena gets the correct max_concurrency and current_thread_index only if it's executed inside the initialized arena (otherwise, it uses the defaults), using execute(). More soon!

@hsbadr
Member Author

hsbadr commented Jan 17, 2021

@rok-cesnovar Here's a fixed version of your test:

#ifdef STAN_THREADS
std::vector<int> threading_test_global;
struct threading_test_lpdf {
  template <typename T1>
  inline auto operator()(const std::vector<T1>&, std::size_t start,
                         std::size_t end, std::ostream* msgs) const {

    tbb::task_arena& tbb_arena = stan::math::init_threadpool_tbb();
    tbb_arena.execute([&]() {
      threading_test_global[start] = tbb::this_task_arena::current_thread_index();
    });

    return stan::return_type_t<T1>(0);
  }
};

TEST(StanMathRev_reduce_sum, threading) {
  threading_test_global = std::vector<int>(10000, 0);
  std::vector<stan::math::var> data(threading_test_global.size(), 0);

  stan::math::reduce_sum<threading_test_lpdf>(data, 1, nullptr);

  auto uniques = std::set<int>(threading_test_global.begin(),
                          threading_test_global.end());

  EXPECT_GT(uniques.size(), 1);
  std::cout << uniques.size() << std::endl;
  stan::math::recover_memory();
}
#endif

The key change is to join the initialized arena and execute:

    tbb::task_arena& tbb_arena = stan::math::init_threadpool_tbb();
    tbb_arena.execute([&]() {
      ...
    });

Code that applies this pattern will respect the threading options set in init_threadpool_tbb (currently, the number of threads).

@hsbadr
Member Author

hsbadr commented Jan 17, 2021

@rok-cesnovar Please check the fixed test above. Do you want me to deprecate STAN_NUM_THREADS in this PR? It's an unrelated topic, though. This works as expected for me, and the tests should join the arena and execute as demonstrated above.

@rok-cesnovar
Member

Thanks @hsbadr. That makes sense.

But that only fixes the issue with the test, not the actual behavior of reduce_sum, which would still not respect STAN_NUM_THREADS.

@hsbadr
Member Author

hsbadr commented Jan 17, 2021

But that only fixes the issue with the test, not the actual behavior of reduce_sum, which would still not respect STAN_NUM_THREADS.

Yes, I'm working on this.

This allows `init_threadpool_tbb()` to effectively limit the total number of worker threads that can be active in the task scheduler defined by `STAN_NUM_THREADS`.

Note that maximum allowed parallelism will be from 1 to `STAN_NUM_THREADS` threads.
@hsbadr
Member Author

hsbadr commented Jan 17, 2021

@rok-cesnovar 5e2a557 should fix the global control to restrict the total number of threads respecting STAN_NUM_THREADS, but it can take values from 1 to STAN_NUM_THREADS threads. So, your "custom" test may fail randomly, if it gets a single thread. Please test now and let me know if this fixes the other issues too. All unit tests have passed, both for the new and old TBB interface.

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.44 | 3.39 | 1.01 | 1.45% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 1.0 | -0.34% slower |
| eight_schools/eight_schools.stan | 0.11 | 0.11 | 0.98 | -2.23% slower |
| gp_regr/gp_regr.stan | 0.16 | 0.15 | 1.05 | 4.84% faster |
| irt_2pl/irt_2pl.stan | 5.15 | 5.18 | 0.99 | -0.75% slower |
| performance.compilation | 92.25 | 90.14 | 1.02 | 2.28% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.62 | 8.63 | 1.0 | -0.04% slower |
| pkpd/one_comp_mm_elim_abs.stan | 28.82 | 29.58 | 0.97 | -2.66% slower |
| sir/sir.stan | 135.61 | 138.36 | 0.98 | -2.03% slower |
| gp_regr/gen_gp_data.stan | 0.05 | 0.05 | 1.0 | -0.49% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.09 | 3.07 | 1.01 | 0.6% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.39 | 0.38 | 1.03 | 3.13% faster |
| arK/arK.stan | 2.52 | 2.51 | 1.01 | 0.68% faster |
| arma/arma.stan | 0.6 | 0.6 | 1.01 | 0.54% faster |
| garch/garch.stan | 0.57 | 0.57 | 1.01 | 1.09% faster |
Mean result: 1.00446372031

Jenkins Console Log
Blue Ocean
Commit hash: 669c670


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

hsbadr added a commit to hsbadr/rstan that referenced this pull request Jan 23, 2021
This adds support for within-chain threading in the new version of `rstan >= 2.25 | Stan >= 2.25`. It has been tested with stan-dev#887, paul-buerkner/brms#1074, and stan-dev/math#2261.

A new `rstan` option `threads_per_chain` has been added to control the within-chain number of threads (`threads$threads`):
```r
rstan::rstan_options(threads_per_chain = threads$threads)
```

If the model is compiled with threading support, this option sets the number of threads to use in parallelized sections _within_ an MCMC chain (e.g., when using the `Stan` functions `reduce_sum()` or `map_rect()`). The actual number of CPU cores used is `chains * threads_per_chain`, where `chains` is the number of parallel chains. For an example of using threading, see [Reduce Sum: A Minimal Example](https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html).
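
As a rough illustration of the `chains * threads_per_chain` arithmetic, here is a minimal C++ sketch. The helpers `total_threads` and `oversubscribed` are hypothetical names and are not part of rstan or Stan Math; the oversubscription check is an added convenience, not something this commit implements.

```cpp
#include <thread>

// Total threads requested when `chains` chains run in parallel and each
// chain uses `threads_per_chain` threads.
inline unsigned total_threads(unsigned chains, unsigned threads_per_chain) {
  return chains * threads_per_chain;
}

// True when the request exceeds the number of hardware threads.
// hardware_concurrency() may return 0 when the count is unknown; in that
// case no oversubscription is reported.
inline bool oversubscribed(
    unsigned chains, unsigned threads_per_chain,
    unsigned hardware_threads = std::thread::hardware_concurrency()) {
  return hardware_threads != 0
         && total_threads(chains, threads_per_chain) > hardware_threads;
}
```

For instance, 4 chains with 2 threads each request 8 threads in total, which would oversubscribe a machine that reports 4 hardware threads.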
@hsbadr
Member Author

hsbadr commented Jan 23, 2021

@rok-cesnovar I've tested this for within-chain threading (using reduce_sum() / map_rect() functions) with both cmdstanr and rstan 2.26 and it works as expected (i.e., users can control the number of threads, respecting STAN_NUM_THREADS). Are you waiting for something else to be tested?

@rok-cesnovar
Member

Hey, this is good to go, but we are waiting on the feature freeze for 2.26 to pass. So I will merge this on Tuesday if all goes well. Thank you for working on this.

Member

@rok-cesnovar rok-cesnovar left a comment

Tested again and the number of threads is now respected for all cases. Good to go. Thanks @hsbadr!

@rok-cesnovar rok-cesnovar merged commit 121535e into stan-dev:develop Jan 27, 2021
@hsbadr hsbadr deleted the tbb_interface branch January 27, 2021 13:44
5 participants