
R parallel computing deadlock with openblas #75506

Closed
ywwry66 opened this issue Apr 19, 2021 · 17 comments

Labels
bug Reproducible Homebrew/homebrew-core bug · outdated PR was locked due to age · stale No recent activity

Comments
@ywwry66
Contributor

ywwry66 commented Apr 19, 2021

brew gist-logs <formula> link OR brew config AND brew doctor output

brew config:

HOMEBREW_VERSION: 3.1.2-51-g0e65d96
ORIGIN: https://github.com/Homebrew/brew
HEAD: 0e65d96e54f788b9284f92c26dc845abddffdc7f
Last commit: 5 hours ago
Core tap ORIGIN: https://github.com/Homebrew/homebrew-core
Core tap HEAD: 566d44698e4d69c602d5ada68acee5f084edb071
Core tap last commit: 5 hours ago
Core tap branch: master
HOMEBREW_PREFIX: /usr/local
HOMEBREW_CASK_OPTS: []
HOMEBREW_DISPLAY: /private/tmp/com.apple.launchd.tx3DikcsUF/org.xquartz:0
HOMEBREW_EDITOR: emacs
HOMEBREW_MAKE_JOBS: 4
Homebrew Ruby: 2.6.3 => /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/bin/ruby
CPU: quad-core 64-bit kabylake
Clang: 12.0 build 1200
Git: 2.31.1 => /usr/local/bin/git
Curl: 7.64.1 => /usr/bin/curl
macOS: 11.2.3-x86_64
CLT: 12.4.0.0.1.1610135815
Xcode: N/A


brew doctor:
Your system is ready to brew.

  • I ran brew update and am still able to reproduce my issue.
  • I have resolved all warnings from brew doctor and that did not fix my problem.

What were you trying to do (and why)?

Use the mclapply function in R for parallel computing, which essentially calls fork(2).

What happened (include all command output)?

The 4 child processes forked by R use 0% CPU and nothing is returned to the R console.

What did you expect to happen?

R should do the computation properly and print the returned value to the console.

Step-by-step reproduction instructions (by running brew commands)

brew install R
R

then in R console:

library(parallel)
num_cores <- detectCores()

a <- replicate(1000, rnorm(1000))
determinant(a)

a <- list()
for (i in 1:10) a[[i]] <- replicate(1000, rnorm(1000))
mclapply(a, determinant, mc.cores = num_cores)
@ywwry66 ywwry66 added the bug Reproducible Homebrew/homebrew-core bug label Apr 19, 2021
@ywwry66 ywwry66 changed the title R parallel computing deadlock with openMP R parallel computing deadlock with openblas Apr 19, 2021
@ywwry66
Contributor Author

ywwry66 commented Apr 19, 2021

In my Homebrew discussions post, I thought this was resolved by a new release of R, but it came back later. After some research (OpenMathLib/OpenBLAS#294), I figured this is caused by OpenMP after linking openblas to R (strictly speaking, by GNU's OpenMP implementation, libgomp). There are three ways to solve this issue, as I tested on my machine:

  1. Use clang instead of gcc to compile openblas, because libomp seems to work for me. I built openblas successfully with the following:
CC=/usr/local/opt/llvm/bin/clang make USE_OPENMP=1 LDFLAGS="-L/usr/local/opt/libomp/lib"

The downside of this method is the extra dependency on libomp.

  2. Don't use USE_OPENMP=1 when building openblas. This will use pthread(3) instead of OpenMP. I'm not sure how pthreads compares to OpenMP in terms of performance, but I don't see other caveats with this fix.

  3. Don't link openblas at all. This is of course not good, since openblas can greatly boost the performance of R. (r: use openblas #68015)

@ywwry66
Contributor Author

ywwry66 commented Apr 21, 2021

Just did more tests with method 1 above, and it actually does not work with the current openblas 0.3.13, giving the following compilation error:

Undefined symbols for architecture x86_64:
  "___kmpc_for_static_fini", referenced from:
      _.omp_outlined. in libopenblasp-r0.3.13.a(blas_server.o)
  "___kmpc_for_static_init_8", referenced from:
      _.omp_outlined. in libopenblasp-r0.3.13.a(blas_server.o)
  "___kmpc_fork_call", referenced from:
      _exec_blas in libopenblasp-r0.3.13.a(blas_server.o)
  "___kmpc_global_thread_num", referenced from:
      _exec_blas in libopenblasp-r0.3.13.a(blas_server.o)
  "___kmpc_push_num_threads", referenced from:
      _exec_blas in libopenblasp-r0.3.13.a(blas_server.o)
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make[1]: *** [libopenblasp-r0.3.13.dylib] Error 1
make: *** [shared] Error 2

But the pre-release 0.3.14 works just fine. So if we are to use clang in conjunction with gfortran, we may need to wait until the next release of openblas.

@danielnachun
Member

I think we shouldn't be using OpenMP at all on macOS or Linux for openblas. pthreads is much simpler and more 'native' to Unix systems than OpenMP. The main purpose of OpenMP, as I understand it, was to create a parallel interface that works on both *nix and Windows, which we don't really care about given that we are a *nix-only project (Homebrew on Linux works great in Windows Subsystem for Linux, btw).

We should still build openblas with gcc because it requires gfortran. It might be possible to build the C parts with clang and the Fortran parts with gfortran, but as you've already seen, this can lead to some weird build issues.

If you can test disabling OpenMP entirely and still build with gcc, that will tell us whether that is a viable solution. In that case we will need to make sure that everything builds and links correctly, and that a) multithreading works and b) the performance is acceptable.

One useful comparison you could also make is installing the cask version of R, which uses the Accelerate framework by default instead of openblas. Ideally the multithreading performance of the formula R using openblas should be equal to or better than the cask R using Accelerate.

@danielnachun
Member

danielnachun commented Apr 23, 2021

library(parallel)
num_cores <- detectCores()

a <- replicate(1000, rnorm(1000))
determinant(a)

a <- list()
for (i in 1:10) a[[i]] <- replicate(1000, rnorm(1000))
mclapply(a, determinant, mc.cores = num_cores)

I'm a little unclear on why you are calling determinant inside of a call to mclapply. If determinant is a multithreaded call, then you should just be doing it one matrix at a time and letting OpenBLAS use all available threads on your system. Right now what I expect would happen is that you will initiate determinant on each of the matrices simultaneously, and then for each call to determinant, OpenBLAS will try to use all available threads.

This will end up spawning far more threads than your CPU can actually run in parallel, and it will put most of the threads to sleep (state S in htop) as the CPU tries to timeshare between them. This would actually create more overhead than just calling determinant on one matrix at a time, where for each call OpenBLAS uses all available threads. It's probably not a bad idea to make multiple calls so you can average the performance, but I think they should be done sequentially, not in parallel as you currently do.

@ywwry66
Contributor Author

ywwry66 commented Apr 26, 2021

I am calling determinant inside mclapply simply as a minimal example to reveal the bug. In practice, determinant would be replaced by some custom function, which may contain multithreaded as well as single-threaded code. In that case, I believe it becomes a matter of trading off the number of threads for openblas against the number of child processes for fork in order to achieve optimal performance (I am not a pro, but this is my understanding). There is another bug in the current build of openblas: setting OMP_NUM_THREADS inside an R session with

Sys.setenv(OMP_NUM_THREADS=1)

does not work. Every time I call fork from R, I have to set OMP_NUM_THREADS in the shell instead, which affects the whole R session. This makes it hard to balance the trade-off I mentioned above.
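A minimal sketch of the symptom (my reading of the report, not a new measurement): the variable is visible on the R side, but the libgomp-backed OpenBLAS ignores it once the OpenMP runtime has initialized.

```r
# Setting the variable from inside the session succeeds as far as R can see...
Sys.setenv(OMP_NUM_THREADS = "1")
print(Sys.getenv("OMP_NUM_THREADS"))   # prints [1] "1"
# ...but the libgomp-backed OpenBLAS keeps the thread count it read at load
# time, so the limit has to be exported in the shell before R starts,
# e.g. `OMP_NUM_THREADS=1 R`.
```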

I have tested openblas 0.3.14 built against pthreads and OpenMP. I cannot compare to the Apple Accelerate framework because the cask version of R uses the built-in BLAS, which is the slowest (at least on my system). I cannot even find the corresponding library on my disk to link manually. Anyway, I modified the code in #68015 a little:

require(Matrix)
set.seed(1)
a <- rnorm(2800 * 2800)
dim(a) <- c(2800, 2800)
timing <- numeric(10)
for (i in 1:10) {
    invisible(gc())
    timing[i] <- system.time(
        b <- crossprod(a)
    )[3]
}
sum(timing)

On my dual-core, quad-thread CPU, the result for pthreads is 3.025 s (R uses 2 threads), while the result for OpenMP (R uses 4 threads) is 3.058 s. So I don't think there is a significant performance difference. Interestingly, with pthreads the number of R threads never goes above 2 during my testing. I will test this on an M1 Mac and update the results later.

@ywwry66
Contributor Author

ywwry66 commented Apr 26, 2021

On M1, the result for pthreads is 1.847 s (R uses 8 threads), while the result for OpenMP (R uses 8 threads) is 1.833 s.

Another thing I just realized: pthreads only solves the deadlock; it does not solve the Sys.setenv(OMP_NUM_THREADS=1) bug. Only using clang with OpenMP solves both.

@danielnachun
Member

On my dual-core, quad-thread CPU, the result for pthreads is 3.025 s (R uses 2 threads), while the result for OpenMP (R uses 4 threads) is 3.058 s.
On M1, the result for pthreads is 1.847 s (R uses 8 threads), while the result for OpenMP (R uses 8 threads) is 1.833 s.

It's great to see that pthread offers comparable performance on both CPU architectures. This to me is a good reason for us to disable OpenMP support entirely.

Another thing I just realized: pthreads only solves the deadlock; it does not solve the Sys.setenv(OMP_NUM_THREADS=1) bug. Only using clang with OpenMP solves both.

To set the number of threads for OpenBLAS in R, you should check out RhpcBLASctl: https://cran.r-project.org/web/packages/RhpcBLASctl/index.html. I've found it works great for me. Alternatively, one could use OPENBLAS_NUM_THREADS, which sets the number of threads regardless of which parallel backend is used (whereas OMP_NUM_THREADS only works with the OpenMP backend).
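A short sketch of the in-session approach, assuming the RhpcBLASctl package has been installed from CRAN (it is not part of base R):

```r
# Assumption: install.packages("RhpcBLASctl") has already been run.
library(RhpcBLASctl)
blas_get_num_procs()     # number of threads the BLAS currently uses
blas_set_num_threads(1)  # cap OpenBLAS at one thread for this session
```

Since this talks to the BLAS library's own control interface rather than an OpenMP environment variable, it should apply whichever backend OpenBLAS was built with.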

I believe it becomes art of trade-off between the number of threads for openblas and number of child processes for fork in order to achieve optimal performance (I am not a pro but this is my understanding).

Usually the tradeoff is that you don't want to end up spawning more threads than you have cores. When there are more threads than cores, the CPU has to keep moving threads in and out of sleep, which incurs overhead.

Handling multithreading with linear algebra libraries is tricky. I've usually found the best approach is to use mclapply to parallelize the parts of my R code which are not making BLAS/LAPACK calls, and then drop down to single-threaded R code for the BLAS/LAPACK calls, so that OpenBLAS can handle its own multithreading.
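That division of labor can be sketched as follows (the functions and sizes are hypothetical placeholders, not from this thread):

```r
library(parallel)

# Phase 1: pure-R work that makes no BLAS/LAPACK calls, parallelized
# across forked workers with mclapply.
prep <- mclapply(1:4, function(i) {
  matrix(as.numeric(i), nrow = 2, ncol = 2)  # hypothetical preprocessing
}, mc.cores = 2)

# Phase 2: the BLAS-heavy step, run sequentially in the main process so
# that OpenBLAS is free to apply its own multithreading to each call.
res <- lapply(prep, crossprod)
print(sapply(res, sum))   # sums are 8, 32, 72, 128
```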

@danielnachun
Member

Feel free to open a pull request to disable OpenMP for OpenBLAS. I think the testing you've done is sufficient to show that there is no performance difference when using pthreads instead, and this appears to fix a bug, so I'd be happy to review it.

@carlocab
Member

Feel free to open a pull request to disable OpenMP for OpenBLAS. I think the testing you've done is sufficient to show that there is no performance difference when using pthreads instead, and this appears to fix a bug, so I'd be happy to review it.

I don't think this is a good idea. It could have an impact on the way other formulae use OpenBLAS, or on the way current users expect to be able to use OpenBLAS. I'm not convinced that one microbenchmark on one machine is enough to establish this is a good idea to do everywhere as well.

Since this bug appears to be fixed in the next version of OpenBLAS, I think the correct thing to do here would be to apply the patches that fix it.

@danielnachun
Member

Feel free to open a pull request to disable OpenMP for OpenBLAS. I think the testing you've done is sufficient to show that there is no performance difference when using pthreads instead, and this appears to fix a bug, so I'd be happy to review it.

I don't think this is a good idea. It could have an impact on the way other formulae use OpenBLAS, or on the way current users expect to be able to use OpenBLAS. I'm not convinced that one microbenchmark on one machine is enough to establish this is a good idea to do everywhere as well.

Since this bug appears to be fixed in the next version of OpenBLAS, I think the correct thing to do here would be to apply the patches that fix it.

Point well taken. My understanding is that calls to OpenBLAS are generally agnostic to which parallel backend is used, but I'll defer to your experience with this type of issue!

@ywwry66
Contributor Author

ywwry66 commented Apr 27, 2021

Since this bug appears to be fixed in the next version of OpenBLAS, I think the correct thing to do here would be to apply the patches that fix it.

@carlocab Is using clang+gfortran for compiling an acceptable approach in Homebrew?

@carlocab
Member

We'd need libomp (from LLVM) to provide OpenMP in that case, and that doesn't seem like a good thing to do: #50252 (comment)

@carlocab
Member

carlocab commented May 3, 2021

Updating OpenBLAS in #76474. Unfortunately, it fails to build on Mojave.

@ywwry66
Contributor Author

ywwry66 commented May 3, 2021

@carlocab You might have misread my post. Upgrading OpenBLAS alone does not solve the OpenMP issue; it only fixes building OpenBLAS with clang.

The core problem with OpenMP is that libgomp is not fork-safe, while libomp is. It is actually possible to use gcc+libomp (as they do in Anaconda, see #50252 (comment)), but that requires some hacking. Per the comment by @fxcoudert you mentioned, the switch to libomp may cause other problems and inconvenience, so I don't really see a viable solution at this point. That will probably change when Flang is out. But for now, you may close this issue if you wish.

@carlocab
Member

carlocab commented May 3, 2021

You're right, I did misread your post. My apologies. (I have a bad habit of only skimming posts in Homebrew/core; it's the only way I'm able to manage the number of things there are to review. 😄)

Ok, it seems to me that:

  • We avoid gcc + libomp to prevent speculative problems (i.e. we're not aware of any specific problems from using this setup).
  • There are known problems with relying on libgomp.

Using libomp for OpenBLAS might be the only viable solution here. (Assuming it breaks nothing else. If it does, we may just have to live with the breakage here.)

One concern with this approach is that we'll probably have to change all the formulae that depend on OpenBLAS to also use libomp. (Hopefully none of them also depend on llvm, which also ships libomp.)

Of course, if @fxcoudert still disagrees with my assessment here, then I'm deferring to them. Thoughts, @Homebrew/core?

@danielnachun
Member

These issues really demonstrate why OpenMP support is such a pain. As I mentioned above, I believe it was only created to provide a unified parallel interface that would also work on Windows. From what I understand about how OpenBLAS is used, the software actually making the BLAS/LAPACK calls doesn't even know whether the backend is multithreaded, let alone whether it uses OpenMP or pthreads. It would be great if we could test some formulae which depend on OpenBLAS to see if there is any effect from disabling OpenMP support and just using pthreads.

As I see it, this is no more speculative than switching to libomp; in fact, it may be more conservative because we would not be mixing two different toolchains. However, I'm curious to hear other opinions here.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale No recent activity label May 25, 2021
@github-actions github-actions bot closed this as completed Jun 1, 2021
@github-actions github-actions bot added the outdated PR was locked due to age label Jul 2, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 2, 2021