This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

partr thread support for openblas #32786

Closed
ViralBShah opened this issue Aug 4, 2019 · 15 comments
Labels
linear algebra Linear algebra multithreading Base.Threads and related functionality

Comments

@ViralBShah
Member

ViralBShah commented Aug 4, 2019

Here are some notes from digging into the openblas codebase (with @stevengj) to enable partr threading support.

  1. exec_blas is called by all the routines. The usual pattern is to set up the work queue and call exec_blas, which does all the work (in the OpenMP backend, through an OpenMP pragma).
  2. The exception is the LAPACK routines, which also use the exec_blas_async functions.
  3. The OpenMP backend doesn't seem to implement the async variants, so I believe it will not multi-thread the LAPACK calls.
  4. Windows has its own threading backend.

The easiest way may be to modify the OpenMP threading backend, which seems amenable to something like the FFTW partr backend. To start with, we should ignore LAPACK threading. We could probably just implement an exec_blas_async fallback that calls exec_blas (and make exec_blas_async_wait a no-op).

All of this should work on Windows too, although going through the OpenMP build route may need some work on the makefiles.

The patch to FFTW should be indicative of what needs to be done for the OpenBLAS build.

@ViralBShah ViralBShah added linear algebra Linear algebra multithreading Base.Threads and related functionality labels Aug 4, 2019
@ViralBShah ViralBShah added this to the 1.3 milestone Aug 13, 2019
@JeffBezanson JeffBezanson modified the milestones: 1.3, 1.4 Aug 15, 2019
@ChrisRackauckas
Member

ChrisRackauckas commented Aug 16, 2019

We now have algorithms in DifferentialEquations.jl which utilize simultaneous implicit methods to enhance the parallelizability of small stiff ODEs and DAEs (i.e. <= 20 ODEs). Right now we'll just document that the user should probably set the BLAS threads to 1, but once this PR is in, these algorithms can serve as a very good test case and showcase of why PARTR mixed into BLAS is useful.

SciML/OrdinaryDiffEq.jl#872

@ViralBShah
Member Author

This is a fairly straightforward project for someone who doesn't mind diving in and seeing how it was done in FFTW. I will certainly try it out if nobody gives it a shot in a few weeks.

@stevengj
Member

stevengj commented Aug 16, 2019

In the long run, it would be good if partr had a documented C API for spawn/wait, which would give us a lot more flexibility in integrating it with external libraries like this.

@nalimilan
Member

Do you think this is something that will require changes to OpenBLAS upstream and/or compiling OpenBLAS with specific options? Just checking from a packager's perspective.

@vchuravy
Member

> Do you think this is something that will require changes to OpenBLAS upstream and/or compiling OpenBLAS with specific options? Just checking from a packager's perspective.

Yes, we will probably have to work with OpenBLAS upstream.

@stevengj
Member

stevengj commented Sep 5, 2019

I'm also implementing the FFTW strategy of a pluggable threading backend for Blosc (Blosc/c-blosc2#81).

I think we can make a strong argument to upstream developers that their libraries should use this kind of strategy where possible, because it allows easy composability not only with Julia's partr, but also with Intel's TBB and other threading schedulers. It also seems possible to do this with minimal patches in cases where they have already implemented their own threading.

@stevengj
Member

stevengj commented Sep 12, 2019

I think it's attractive to implement this as a runtime option, in addition to existing threading options rather than instead of them, as I did for FFTW and Blosc. That is, we add a single if statement to the existing exec_blas functions:

void exec_blas(int num, blas_queue_t *queue) {
    if (threads_callback) {
        // pass work to the callback function
        return;
    }
    // parallelize normally
}

This has three advantages:

  • You can install a single library on your system, and it can be used both by programs that need a custom threading backend and programs that don't.
  • If OpenBLAS is using things like pthread or win32 mutex locks to make access to shared resources thread-safe, those will continue to work.
  • We don't need to add a new configuration option to OpenBLAS … the optional plug-in backend will be used automatically.

@stevengj
Member

Regarding exec_blas_async and exec_blas_async_wait, my hope is that the LAPACK code that calls these could be refactored. My understanding is that it looks something like:

exec_blas_async(queue);
// do some other work
exec_blas_async_wait(queue);

I'm not sure why the "other work" can't simply be added to the queue of parallel tasks, letting the runtime worry about load balancing.

@stevengj
Member

I posted a very early draft of the requisite changes at OpenMathLib/OpenBLAS#2255

@stevengj
Member

Actually, I thought of an even easier way to implement exec_blas_async: the Julia callback can just spawn the tasks and return. The parallel tasks can set pthread mutex values to indicate that they are complete, just as they do now, and exec_blas_async_wait can wait on those mutexes as it does now, without modification to it or the LAPACK source code.

@KristofferC KristofferC removed this from the 1.4 milestone Nov 28, 2019
@KristofferC
Member

Removing the milestone, since this certainly wasn't release-blocking for 1.3, and it won't be for 1.4 or 1.x either.

@AzamatB
Contributor

AzamatB commented Nov 28, 2019

I'm confused. I thought that now that we've switched to a time-based release schedule with 1.x releases, nothing is release-blocking. Shouldn't all the remaining issues be removed from the 1.4 milestone as well?

@chriscoey

Friendly bump on this one. New AMD processors have a ton of threads, but I can't take much advantage of PARTR until it works nicely with OpenBLAS, since my loops all contain various LAPACK calls (and I also have standalone LAPACK calls outside of loops that ought to still use all threads).

@ViralBShah
Member Author

Increasingly, a lot of libraries in Yggdrasil (BinaryBuilder) are using OpenMP, and many of them call BLAS. I suspect we are going to see more and more multi-threading clashes between Julia threads, pthreaded libraries (OpenBLAS), and OpenMP. The fewer of these we can use, the better! I also learned that if MKL enters the picture, it brings yet another threading library: TBB.

@ViralBShah
Member Author

cc @kpamnany

@JuliaLang JuliaLang locked and limited conversation to collaborators Jan 30, 2022
@ViralBShah ViralBShah converted this issue into discussion #43984 Jan 30, 2022
@JuliaLang JuliaLang unlocked this conversation Jan 30, 2022
@JuliaLang JuliaLang locked and limited conversation to collaborators Jan 30, 2022
