Limit initial OpenBLAS thread count #46844
Also GOTO_NUM_THREADS? And I think you can recover the test for this from |
I don't see how that is testing this functionality; that's testing that we limit the number of threads to a maximum of 8, but that's not what we do anymore; and in fact, our logic is much closer to what OpenBLAS itself natively does, so it's difficult to test that this is working. |
Ah, true. Now that we are later also explicitly changing it, we would need to do that check in |
|
No; so what happens is, if a thread count is already set, neither this logic nor the LinearAlgebra |
I think I agree with Jeff that this will break |
Oh, of course, my bad. I totally misunderstood the question, and confused myself. One possibility is to unset the environment variable after |
I would perhaps rather just patch openblas to default to 1 thread at startup, though this might harm other apps that use it |
Okay, I'll just add a short-circuit to OpenBLAS's default thread calculation then. X-ref: JuliaPackaging/Yggdrasil#5555 |
This allows Julia to set a default number of threads (usually `1`) to be used when no other thread counts are specified [0], to short-circuit the default OpenBLAS thread initialization routine that spins up a different number of threads than Julia would otherwise choose. The reason to add a new environment variable is that we want to be able to configure OpenBLAS to avoid performing its initial memory allocation/thread startup, as that can consume significant amounts of memory, but we still want to be sensitive to legacy codebases that set things like `OMP_NUM_THREADS` or `GOTOBLAS_NUM_THREADS`. Creating a new environment variable that is openblas-specific and is not already publicly used to control the overall number of threads of programs like Julia seems to be the best way forward. [0] JuliaLang/julia#46844
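To make the intended precedence concrete, here is a minimal Python sketch of the lookup order this commit message describes: any explicitly set thread-count variable wins, the new default-only variable is consulted next, and the CPU-count fallback comes last. The variable name `OPENBLAS_DEFAULT_NUM_THREADS`, the order among the explicit variables, and the helper `resolve_thread_count` are assumptions for illustration only; the real logic lives in OpenBLAS's C initialization code and may differ in detail (for example, it also caps the result at a compile-time maximum, which is not modeled here).

```python
import os

# Variables that express an explicit user choice (order is an assumption).
EXPLICIT_VARS = ("OPENBLAS_NUM_THREADS", "GOTO_NUM_THREADS", "OMP_NUM_THREADS")
# Assumed name of the new default-only variable.
DEFAULT_VAR = "OPENBLAS_DEFAULT_NUM_THREADS"


def _positive_int(value):
    """Return value as a positive int, or None if it is unset or invalid."""
    try:
        n = int(value)
    except (TypeError, ValueError):
        return None
    return n if n > 0 else None


def resolve_thread_count(env, cpu_count):
    """Pick the initial thread count from an environment mapping."""
    # Explicit variables are checked first, then the default-only variable.
    for name in (*EXPLICIT_VARS, DEFAULT_VAR):
        n = _positive_int(env.get(name))
        if n is not None:
            return n
    # Nothing usable was set: fall back to one thread per CPU.
    return cpu_count


if __name__ == "__main__":
    print(resolve_thread_count(os.environ, os.cpu_count() or 1))
```

Under this sketch, a legacy setting such as `OMP_NUM_THREADS=8` still takes effect even when the default-only variable is set to `1`, which is the backwards-compatibility property the commit message asks for.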
I am attempting to upstream this patch here: OpenMathLib/OpenBLAS#3773 |
I'm excited to see this merged so I can test/use it on master, since its only point is faster Julia startup, right? How much faster is it expected to be? So, is the "1 failing" check a false alarm? Does the test need to be rerun? At the very start for freebsd64:
|
Sadly, I can confirm that while this does help a little, it's not very significant. Here are the "commit charge" graphs for the start of the testset for both:
master: [graph not preserved]
this branch: [graph not preserved]
Isolation tests
With … In the above screenshot, PID 5820 is this branch, whereas PID 2052 is |
Alright, we finally managed to do some more testing on this on some large core-count Windows machines, and this helps significantly, so I'm going to merge. |
Just for my understanding, for a large core-count machine this would at most save half of the memory compared to before? With this change, we set OpenBLAS to 1 thread, then
and set the number of OpenBLAS threads to If so, even with this, it is probably worth doing something like #46844 (setting the number of openblas threads to 1 for most of the test suite). |
Yeah, ref JuliaCI/julia-buildkite#247 for changes specific to CI. |
Actually, something seemed fishy to me about how much this helped the internal workload, and I just tried the following on Julia v1.8.2:
So it appears that
So what this means is that we actually already try to restrict OpenBLAS to 1 thread on CI, but because OpenBLAS used to start up and immediately set a number of threads, we would have the problem of initially starting up and consuming a bunch of memory, then never letting go of it. So all that being said, I believe that with this PR merged, we've actually solved the root problem. We should watch it closely on CI, but I think we've only ever tried to exercise single-threaded BLAS on CI so far, so we should just continue to do that. :) |
That has to run after … I was actually about to open an issue saying that we need to set the env var for OpenBLAS when spawning distributed workers, for exactly the reasons given above. Is there a way to query OpenBLAS for how many buffers it currently has allocated? |
Ah, you're right; with this PR we have moved from:
to now:
So perhaps we need to change |
Yes, or set env var when spawning workers unless OpenBLAS threading is explicitly enabled. |
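The suggestion above can be sketched as follows. This is a hedged Python illustration of the idea only, not code from Julia's Distributed: when launching a local worker process, copy the parent environment and add the default-only variable unless the user has already chosen a thread count explicitly. The variable name `OPENBLAS_DEFAULT_NUM_THREADS` and the helpers `worker_env` and `spawn_worker` are hypothetical names introduced for this example.

```python
import os
import subprocess

# Variables treated as "OpenBLAS threading explicitly enabled" (an assumption).
EXPLICIT_VARS = ("OPENBLAS_NUM_THREADS", "GOTO_NUM_THREADS", "OMP_NUM_THREADS")
# Assumed name of the default-only variable.
DEFAULT_VAR = "OPENBLAS_DEFAULT_NUM_THREADS"


def worker_env(parent_env):
    """Return a copy of parent_env suitable for a worker process."""
    env = dict(parent_env)  # never mutate the caller's mapping
    explicitly_set = any(env.get(name) for name in EXPLICIT_VARS)
    if not explicitly_set:
        # Only supply a default; keep any value the user already chose.
        env.setdefault(DEFAULT_VAR, "1")
    return env


def spawn_worker(argv):
    """Start a worker process with the adjusted environment."""
    return subprocess.Popen(argv, env=worker_env(os.environ))
```

Because only the environment of the child is changed, the parent process keeps whatever thread count it already uses, and a worker can still raise its own count later.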
I would be totally okay with that, but I'm not sure exactly where to put that, as I have a hard time disentangling the different cluster managers in Distributed, to find the local Distributed code paths, as opposed to the remote/SSH ones. |
Is there any way we could be lazier and not set the number of BLAS threads explicitly until a BLAS operation is actually executed? |
There is always the option of adding an |
Can it be done with |
Thanks for merging!
It also helps on my (16-vthread) Linux machine, taking 4.4 ms off (down to a 152.7 ms minimum; the best mean time is 2.5 ms worse, though, likely because I got very lucky with the max value), and every ms counts for some benchmarking (where we're really close to other languages). I've cut startup in half with a non-default sysimage on an older version, a much larger effect, but this would help too (yes, it's only 2.8% faster startup; with a sysimage the gain should be amplified to about 5.6%).
I ran hyperfine many times and could never get the min that far down without the ENV setting, although with it the result was often higher than the min without it. I don't think I'm just measuring noise; there seems to be some real effect, though the machine likely heats up while benchmarking. I tried to minimize all noise and closed the web browser, but the load didn't go down to 0. |
* Limit initial OpenBLAS thread count
We set OpenBLAS's initial thread count to `1` to prevent runaway allocation within OpenBLAS's initial thread startup. LinearAlgebra will later call `BLAS.set_num_threads()` to the actual value we require.
* Support older names
(cherry picked from commit 58b559f)
…#48064) This was added to OpenBLAS in OpenMathLib/OpenBLAS#3773 and was supposed to be used in #46844, but was likely typo'd. (cherry picked from commit 75bc5ee)
We set OpenBLAS's initial thread count to `1` to prevent runaway allocation within OpenBLAS's initial thread startup. LinearAlgebra will later call `BLAS.set_num_threads()` to the actual value we require.