
Optimize run_spin_precession! and run_spin_excitation! for CPU #443

Merged — 26 commits merged into master on Jul 19, 2024

Conversation

@rkierulf (Collaborator) commented on Jul 11, 2024

I've updated the `run_spin_precession!` function to no longer use a matrix of size `nspins x seq.Δt`, in order to reduce memory use and (hopefully) simulation time as well. Instead, it now allocates arrays of size `1 x nspins` and steps through time to update them. The arrays are also pre-allocated at the beginning of `run_sim_time_iter`, so they can be re-used from block to block.
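To make the shape of that change concrete, here is a minimal sketch of the time-stepping idea, assuming hypothetical names (`precess_step!`, `Gx`, `Δw`, etc.) rather than the actual KomaMRICore internals:

```julia
# Sketch only: ϕ, Bz, x, y, z, Δw are pre-allocated per-spin vectors (length nspins);
# Gx, Gy, Gz, Δt are per-time-step vectors (length nt); γ is a scalar.
function precess_step!(ϕ, Bz, x, y, z, Gx, Gy, Gz, Δt, Δw, γ)
    for t in eachindex(Δt)
        # Effective field at step t, broadcast over all spins
        Bz .= x .* Gx[t] .+ y .* Gy[t] .+ z .* Gz[t] .+ Δw ./ (2π * γ)
        # Accumulate precession phase in place — no (nspins × nt) matrix is ever allocated
        ϕ .+= (-2π * γ) .* Bz .* Δt[t]
    end
    return ϕ
end
```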

Interestingly, running the MRI Lab benchmark on my computer, I did see significantly faster simulation times on the CPU (roughly 4x faster) and a ~80% decrease in memory usage. However, GPU performance for Metal was much slower, going from ~6 seconds for the MRI Lab benchmark to ~46 seconds. I haven't looked deeply into why yet, but I suspect this time-stepping approach is better suited to the CPU than to the GPU. When I write a new `BlochKernel` simulation method using KernelAbstractions, I think we should switch back to the matrix approach, so we will have one method specialized for the CPU and another for the GPU. For the GPU method, we can have one thread for each entry in the `nspins x seq.Δt` matrix and compute the effective field all at once. There are parts that need to be sequential in time, but I think we can reduce across the time dimension in O(log(seq.Δt)) steps.
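As a hedged illustration of that O(log) claim (not the eventual `BlochKernel` implementation): per-step phase contributions combine associatively, so a tree reduction can collapse the time dimension in logarithmic depth. The sketch below does this for accumulated phase, where addition is commutative; composing spinors would work the same way, except the pairing must stay adjacent because spinor composition is non-commutative.

```julia
# Tree reduction over the time dimension (sketch, hypothetical layout):
# ϕ is an (nspins × nt) matrix of per-step phase increments.
function total_phase_logdepth!(ϕ::AbstractMatrix)
    nt = size(ϕ, 2)
    while nt > 1
        half = nt ÷ 2
        # Each column pair can be combined in parallel — one GPU thread per entry
        ϕ[:, 1:half] .+= ϕ[:, (half + 1):(2half)]
        isodd(nt) && (ϕ[:, half] .+= ϕ[:, nt])  # fold in the odd leftover column
        nt = half                                # ⇒ O(log(nt)) iterations total
    end
    return ϕ[:, 1]                               # total accumulated phase per spin
end
```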

I was originally planning to have a separate pull request for optimizing `run_spin_excitation!`, but I think it's simpler to include it in this pull request and merge them together, so I'll work on that next. Other issues fixed:

  • In Benchmark.yml, the benchmark job is now skipped for external pull requests, since the Buildkite action does not run for them
  • If the user tries to run on the GPU but forgot to load the GPU backends, the simulation now falls back to the CPU with `Threads.nthreads()` threads, not just 1 thread
  • If the user tries to run on the CPU and `Threads.nthreads() == 1`, they see the message: "Simulation will be run on the CPU with only 1 thread. To enable multi-threading, start julia with --threads=auto."

@rkierulf linked an issue on Jul 11, 2024 that may be closed by this pull request
@cncastillo (Member) left a comment:

Generally, allowing scalar indexing makes the GPU suffer.

Review comments (outdated, resolved) on: KomaMRICore/Project.toml, KomaMRICore/ext/KomaAMDGPUExt.jl, KomaMRICore/ext/KomaCUDAExt.jl, KomaMRICore/ext/KomaMetalExt.jl, KomaMRICore/ext/KomaoneAPIExt.jl
@rkierulf (Collaborator, Author) commented:
When stepping through the sequence in time, we do need to access arrays such as `seq.Gx` / `Gy` / `Gz`, `seq.ADC`, and `sig` at a specific index for each time step, hence the scalar indexing. My understanding from https://discourse.julialang.org/t/overcoming-slow-scalar-operations-on-gpu-arrays/49554/5 is that this is OK, since the number of spins is expected to be much greater than the length of `seq.Δt` inside each block, and `nspins` is the dimension we broadcast along. That said, as I mentioned earlier, the for-loop stepping through time is probably inefficient in general for the GPU, and we should change back to using a matrix for a future `BlochKernel` sim method.

At one point, I did have it so that each scalar index was turned into an array and broadcast, for example:

```julia
Bz_2 .= x .* seq.Gx[seq_idx,:] .+ y .* seq.Gy[seq_idx,:] .+ z .* seq.Gz[seq_idx,:] .+ p.Δw / T(2π * γ)
```

instead of:

```julia
Bz_2 .= x .* seq.Gx[seq_idx] .+ y .* seq.Gy[seq_idx] .+ z .* seq.Gz[seq_idx] .+ p.Δw / T(2π * γ)
```

But this was slightly slower (both on the CPU and GPU) than just using scalar indexing, so I changed it.

@rkierulf (Collaborator, Author) commented:
Another possible approach for `run_spin_precession!` would be to keep the matrix-based approach, but pre-allocate the matrix with size `nspins x seq.Δt` for the longest block in the sequence, and then, for blocks with a smaller `seq.Δt`, use a view of this matrix. I think this would still cut memory usage down by a lot.
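A minimal sketch of that view-based alternative, assuming hypothetical names (`seq_blocks`, `Bz_buf`, `nspins`):

```julia
# Allocate once for the widest block, then reuse a view for every other block.
nt_max = maximum(length(blk.Δt) for blk in seq_blocks)
Bz_buf = zeros(Float32, nspins, nt_max)       # single up-front allocation
for blk in seq_blocks
    Bz = @view Bz_buf[:, 1:length(blk.Δt)]    # no per-block allocation
    # ... fill Bz and run the precession for this block ...
end
```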

I'm pretty sure the main reason the time is faster on the CPU is that we now calculate `Mxy` only when `seq.ADC` is true for that time step, whereas before we calculated it for every time step, and this is the most expensive calculation. We could still do this with a matrix, however. I can play around a bit with this approach and see how the time compares.
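For concreteness, a sketch of that ADC gating under simplified assumptions (constant `Bz` within the block; names hypothetical):

```julia
# Accumulate phase every step, but evaluate the expensive transverse-signal
# sum only on steps where the ADC actually samples.
function acquire!(sig, ϕ, Mxy0, Bz, Δt, adc, γ)
    sample = 1
    for t in eachindex(Δt)
        ϕ .+= (-2π * γ) .* Bz .* Δt[t]          # cheap per-step phase update
        if adc[t]                               # Bool per time step
            sig[sample] = sum(Mxy0 .* cis.(ϕ))  # expensive, gated on the ADC
            sample += 1
        end
    end
    return sig
end
```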

@cncastillo (Member) commented:
Yeah, I'm not sure what's best :/. Let's discuss it tomorrow!

codecov bot commented on Jul 16, 2024

Codecov Report

Attention: Patch coverage is 98.91304% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.75%. Comparing base (1de7627) to head (34a3a03).


```
@@            Coverage Diff             @@
##           master     #443      +/-   ##
==========================================
+ Coverage   90.52%   90.75%   +0.22%
==========================================
  Files          51       53       +2
  Lines        2786     2855      +69
==========================================
+ Hits         2522     2591      +69
  Misses        264      264
```
| Flag | Coverage Δ |
|---|---|
| base | 88.20% <ø> (ø) |
| core | 91.79% <98.91%> (+1.43%) ⬆️ |
| files | 93.55% <ø> (ø) |
| komamri | 93.98% <ø> (ø) |
| plots | 89.30% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| ...RICore/src/simulation/SimMethods/Bloch/BlochCPU.jl | 100.00% <100.00%> (ø) |
| ...e/src/simulation/SimMethods/BlochDict/BlochDict.jl | 87.50% <100.00%> (ø) |
| ...c/simulation/SimMethods/BlochSimple/BlochSimple.jl | 100.00% <100.00%> (ø) |
| ...ICore/src/simulation/SimMethods/KernelFunctions.jl | 100.00% <ø> (ø) |
| ...MRICore/src/simulation/SimMethods/Magnetization.jl | 93.33% <100.00%> (ø) |
| ...Core/src/simulation/SimMethods/SimulationMethod.jl | 100.00% <100.00%> (ø) |
| KomaMRICore/src/simulation/SimulatorCore.jl | 94.77% <100.00%> (+0.17%) ⬆️ |
| KomaMRICore/src/datatypes/Spinor.jl | 88.23% <0.00%> (-5.52%) ⬇️ |

@rkierulf (Collaborator, Author) commented:

I've updated the file structure of KomaMRICore to match what we discussed. The default `run_spin_precession!` and `run_spin_excitation!` functions are now in a new file, SimulationMethod.jl, and are dispatched at the most abstract level for the `sim_method`, `backend`, and `prealloc` parameters. The specialized CPU implementation of `run_spin_precession!` is in a separate file, BlochCPU.jl, and is dispatched to for `sim_method::Bloch`, `backend::KA.CPU`, and `prealloc::BlochCPUPrealloc`. I also added an optimized implementation of `run_spin_excitation!`, which improved memory usage a bit more for the MRI Lab benchmark but didn't seem to affect the simulation time.
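The dispatch layering described above looks roughly like this (signatures abbreviated and illustrative, not copied from the source):

```julia
# Generic fallback in SimulationMethod.jl: any method, any backend, any prealloc.
function run_spin_precession!(p, seq, sig, M,
        sim_method::SimulationMethod, backend::KA.Backend, prealloc)
    # matrix-based implementation
end

# Specialized CPU path in BlochCPU.jl, selected automatically by multiple dispatch.
function run_spin_precession!(p, seq, sig, M,
        sim_method::Bloch, backend::KA.CPU, prealloc::BlochCPUPrealloc)
    # time-stepping implementation with reused buffers
end
```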

Let me know if there are any more changes you want made before merging!

@rkierulf changed the title from "Optimize run_spin_precession for CPU" to "Optimize run_spin_precession! and run_spin_excitation! for CPU" on Jul 17, 2024
@rkierulf (Collaborator, Author) commented on Jul 18, 2024

Doing some more profiling this morning, I was surprised to see some lines allocating more than expected. It turns out that there were quite a few statements that were not fully broadcast. Apparently, even when multiplying or dividing by a scalar, it still helps to use `./` or `.*` instead of `/` or `*`; I think this makes it easier for the compiler to fuse the entire statement into one broadcast. Making these changes, I was able to get memory usage to decrease by a factor of ~15 from where it was immediately before! However, simulation time only decreased by about 10%, so at this point we are clearly compute-bound rather than memory-bound.
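To illustrate the fusion point (a sketch mirroring the earlier `Bz_2` line; `Δw` is assumed to be a per-spin vector here):

```julia
# Undotted `/` materializes Δw / (2π * γ) as a temporary array before the broadcast:
Bz_2 .= x .* Gx .+ y .* Gy .+ z .* Gz .+ Δw / (2π * γ)
# Dotted `./` lets the whole right-hand side fuse into one allocation-free loop:
Bz_2 .= x .* Gx .+ y .* Gy .+ z .* Gz .+ Δw ./ (2π * γ)
```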

@cncastillo (Member) commented on Jul 18, 2024

Hi, it looks good!

There are a few details it would be nice to change before merging:

(1) Can we call the previous Bloch method for `::SimulationMethod` (fallback) but also for `::BlochSimple`? And move it to its own folder for discoverability.

This is not for now, but:

(2) To organize the preallocations, maybe we can put them in structs (see the sketch after this list). For example, I think it makes sense to have:

  • M::Mag (or another spin state)
    • M.xy: complex array, Nspins x 1
    • M.z: real array, Nspins x 1
  • B_new::Mag
    • B_new.xy: complex array, Nspins x 1
    • B_new.z: real array, Nspins x 1
  • B_old::Mag
    • B_old.xy: complex array, Nspins x 1
    • B_old.z: real array, Nspins x 1
  • Rot::Spinor
    • Rot.alpha: complex array, Nspins x 1
    • Rot.beta: complex array, Nspins x 1

That way all variables have a physical meaning; basically, we're storing the current SimulationState.

For the excitation, we only use B_new for now (I would like to add trapezoidal integration later).

For the precession, I think the exp/cis(ϕ) term is the same as the Spinor alpha, with beta = 0.

Mag is not a very good name for the struct (as it could also be used for B_new and B_old); it is really a vector in the SU(2) representation, so something like VectorSU2 would fit better.

https://onlinelibrary.wiley.com/doi/pdf/10.1002/9783527630097.app1#:~:text=We%20define%20a%20spinor%20as,two%2D%20dimensional%20complex%20variable%20space.

By putting them inside structs, we can use functions already defined for them, for example mul!(spinor, mag) or abs(vectorsu2), so we don't need to rewrite those functions inside the simulation.
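A hedged sketch of the proposed layout (illustrative only; KomaMRICore already has its own Mag and Spinor types, and the names and fields here are not a final design):

```julia
struct VectorSU2{T}            # suggested rename of Mag for this use
    xy::Vector{Complex{T}}     # transverse component, Nspins x 1
    z::Vector{T}               # longitudinal component, Nspins x 1
end

struct SpinorBuf{T}            # named to avoid clashing with the existing Spinor type
    α::Vector{Complex{T}}      # SU(2) rotation parameters, Nspins x 1
    β::Vector{Complex{T}}
end

struct SimulationState{T}      # "basically we're storing the current SimulationState"
    M::VectorSU2{T}            # spin state
    B_new::VectorSU2{T}        # effective field at the current step
    B_old::VectorSU2{T}        # effective field at the previous step
    Rot::SpinorBuf{T}          # accumulated rotation
end
```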

@cncastillo (Member) commented:
Just waiting for the CI to pass to merge!

@cncastillo merged commit e09ba44 into master on Jul 19, 2024 (18 of 19 checks passed) and deleted the optimize-precession branch on July 19, 2024 at 20:20.
Linked issue that may be closed by this pull request: Profile Bloch Simulation Method