
Optimize run_spin_precession! and run_spin_excitation! for CPU #443

Merged — 26 commits merged into master on Jul 19, 2024

Conversation

@rkierulf (Collaborator) commented on Jul 11, 2024

I've updated the `run_spin_precession!` function to no longer use a matrix of size `nspins x seq.Δt`, in order to reduce memory use and (hopefully) simulation time as well. Instead, it now allocates arrays of size `1 x nspins` and steps through time to update them. The arrays are also pre-allocated at the beginning of `run_sim_time_iter`, so they can be re-used from block to block.
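To make the shape of that change concrete, here is a minimal sketch of the time-stepping idea, assuming hypothetical names (`precess_step!`, `Gx`, `Δw`, etc.) rather than the actual KomaMRICore internals:

```julia
# Sketch only: ϕ, Bz, x, y, z, Δw are pre-allocated per-spin vectors (length nspins);
# Gx, Gy, Gz, Δt are per-time-step vectors (length nt); γ is a scalar.
function precess_step!(ϕ, Bz, x, y, z, Gx, Gy, Gz, Δt, Δw, γ)
    for t in eachindex(Δt)
        # Effective field at step t, broadcast over all spins
        Bz .= x .* Gx[t] .+ y .* Gy[t] .+ z .* Gz[t] .+ Δw ./ (2π * γ)
        # Accumulate precession phase in place — no (nspins × nt) matrix is ever allocated
        ϕ .+= (-2π * γ) .* Bz .* Δt[t]
    end
    return ϕ
end
```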

Interestingly, running the MRI Lab benchmark on my computer, I did see significantly faster simulation times on the CPU (roughly 4x faster) and a ~80% decrease in memory usage. However, GPU performance for Metal was much slower, going from ~6 seconds for the MRI Lab benchmark to ~46 seconds. I haven't looked deeply into why yet, but I suspect this time-stepping approach is better suited to the CPU than to the GPU. When I write a new `BlochKernel` simulation method using KernelAbstractions, I think we should switch back to the matrix approach, so we will have one method specialized for the CPU and another for the GPU. For the GPU method, we can have one thread for each entry in the `nspins x seq.Δt` matrix and compute the effective field all at once. There are parts that need to be sequential in time, but I think we can reduce across the time dimension in O(log(seq.Δt)) steps.
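As a hedged illustration of that O(log) claim (not the eventual `BlochKernel` implementation): per-step phase contributions combine associatively, so a tree reduction can collapse the time dimension in logarithmic depth. The sketch below does this for accumulated phase, where addition is commutative; composing spinors would work the same way, except the pairing must stay adjacent because spinor composition is non-commutative.

```julia
# Tree reduction over the time dimension (sketch, hypothetical layout):
# ϕ is an (nspins × nt) matrix of per-step phase increments.
function total_phase_logdepth!(ϕ::AbstractMatrix)
    nt = size(ϕ, 2)
    while nt > 1
        half = nt ÷ 2
        # Each column pair can be combined in parallel — one GPU thread per entry
        ϕ[:, 1:half] .+= ϕ[:, (half + 1):(2half)]
        isodd(nt) && (ϕ[:, half] .+= ϕ[:, nt])  # fold in the odd leftover column
        nt = half                                # ⇒ O(log(nt)) iterations total
    end
    return ϕ[:, 1]                               # total accumulated phase per spin
end
```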

I was originally planning to have a separate pull request for optimizing `run_spin_excitation!`, but I think it's simpler to include it in this pull request and merge them together, so I'll work on that next. Other issues fixed:

  • In Benchmark.yml, the benchmark job is now skipped for external pull requests, since the Buildkite action does not run for them
  • If the user tries to run on the GPU but forgot to load the GPU backends, the simulation now falls back to the CPU with `Threads.nthreads()` threads, not just 1 thread
  • If the user tries to run on the CPU and `Threads.nthreads() == 1`, they see the message: "Simulation will be run on the CPU with only 1 thread. To enable multi-threading, start julia with --threads=auto."

@rkierulf linked an issue on Jul 11, 2024 that may be closed by this pull request
@cncastillo (Member) left a comment:

Generally, allowing scalar indexing makes the GPU suffer.

Review comments (outdated, resolved) on: KomaMRICore/Project.toml, KomaMRICore/ext/KomaAMDGPUExt.jl, KomaMRICore/ext/KomaCUDAExt.jl, KomaMRICore/ext/KomaMetalExt.jl, KomaMRICore/ext/KomaoneAPIExt.jl
@rkierulf (Collaborator, Author) commented:
When stepping through the sequence in time, we do need to access arrays such as `seq.Gx` / `Gy` / `Gz`, `seq.ADC`, and `sig` at a specific index for each time step, hence the scalar indexing. My understanding from https://discourse.julialang.org/t/overcoming-slow-scalar-operations-on-gpu-arrays/49554/5 is that this is OK, since the number of spins is expected to be much greater than the length of `seq.Δt` inside each block, and `nspins` is the dimension we broadcast along. That said, as I mentioned earlier, the for-loop stepping through time is probably inefficient in general for the GPU, and we should change back to using a matrix for a future `BlochKernel` sim method.

At one point, I did have it so that each scalar index was turned into an array and broadcast, for example:

```julia
Bz_2 .= x .* seq.Gx[seq_idx,:] .+ y .* seq.Gy[seq_idx,:] .+ z .* seq.Gz[seq_idx,:] .+ p.Δw / T(2π * γ)
```

instead of:

```julia
Bz_2 .= x .* seq.Gx[seq_idx] .+ y .* seq.Gy[seq_idx] .+ z .* seq.Gz[seq_idx] .+ p.Δw / T(2π * γ)
```

But this was slightly slower (both on the CPU and GPU) than just using scalar indexing, so I changed it.

@rkierulf (Collaborator, Author) commented:
Another possible approach for `run_spin_precession!` would be to keep the matrix-based approach, but pre-allocate the matrix with size `nspins x seq.Δt` for the longest block in the sequence, and then, for blocks with a smaller `seq.Δt`, use a view of this matrix. I think this would still cut memory usage down by a lot.
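A minimal sketch of that view-based alternative, assuming hypothetical names (`seq_blocks`, `Bz_buf`, `nspins`):

```julia
# Allocate once for the widest block, then reuse a view for every other block.
nt_max = maximum(length(blk.Δt) for blk in seq_blocks)
Bz_buf = zeros(Float32, nspins, nt_max)       # single up-front allocation
for blk in seq_blocks
    Bz = @view Bz_buf[:, 1:length(blk.Δt)]    # no per-block allocation
    # ... fill Bz and run the precession for this block ...
end
```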

I'm pretty sure the main reason the time is faster on the CPU is that we now calculate `Mxy` only when `seq.ADC` is true for that time step, whereas before we calculated it for every time step, and this is the most expensive calculation. We could still do this with a matrix, however. I can play around a bit with this approach and see how the time compares.
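For concreteness, a sketch of that ADC gating under simplified assumptions (constant `Bz` within the block; names hypothetical):

```julia
# Accumulate phase every step, but evaluate the expensive transverse-signal
# sum only on steps where the ADC actually samples.
function acquire!(sig, ϕ, Mxy0, Bz, Δt, adc, γ)
    sample = 1
    for t in eachindex(Δt)
        ϕ .+= (-2π * γ) .* Bz .* Δt[t]          # cheap per-step phase update
        if adc[t]                               # Bool per time step
            sig[sample] = sum(Mxy0 .* cis.(ϕ))  # expensive, gated on the ADC
            sample += 1
        end
    end
    return sig
end
```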

@cncastillo (Member) commented:
Yeah, I'm not sure what's best :/. Let's discuss it tomorrow!

codecov bot commented on Jul 16, 2024

Codecov Report

Attention: Patch coverage is 98.91304% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.75%. Comparing base (1de7627) to head (34a3a03).


```
@@            Coverage Diff             @@
##           master     #443      +/-   ##
==========================================
+ Coverage   90.52%   90.75%   +0.22%
==========================================
  Files          51       53       +2
  Lines        2786     2855      +69
==========================================
+ Hits         2522     2591      +69
  Misses        264      264
```
| Flag | Coverage Δ |
|---|---|
| base | 88.20% <ø> (ø) |
| core | 91.79% <98.91%> (+1.43%) ⬆️ |
| files | 93.55% <ø> (ø) |
| komamri | 93.98% <ø> (ø) |
| plots | 89.30% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| ...RICore/src/simulation/SimMethods/Bloch/BlochCPU.jl | 100.00% <100.00%> (ø) |
| ...e/src/simulation/SimMethods/BlochDict/BlochDict.jl | 87.50% <100.00%> (ø) |
| ...c/simulation/SimMethods/BlochSimple/BlochSimple.jl | 100.00% <100.00%> (ø) |
| ...ICore/src/simulation/SimMethods/KernelFunctions.jl | 100.00% <ø> (ø) |
| ...MRICore/src/simulation/SimMethods/Magnetization.jl | 93.33% <100.00%> (ø) |
| ...Core/src/simulation/SimMethods/SimulationMethod.jl | 100.00% <100.00%> (ø) |
| KomaMRICore/src/simulation/SimulatorCore.jl | 94.77% <100.00%> (+0.17%) ⬆️ |
| KomaMRICore/src/datatypes/Spinor.jl | 88.23% <0.00%> (-5.52%) ⬇️ |

@rkierulf (Collaborator, Author) commented:

I've updated the file structure of KomaMRICore to match what we discussed. The default `run_spin_precession!` and `run_spin_excitation!` functions are now in a new file, SimulationMethod.jl, and are dispatched at the most abstract level for the `sim_method`, `backend`, and `prealloc` parameters. The specialized CPU implementation of `run_spin_precession!` is in a separate file, BlochCPU.jl, and is dispatched to for `sim_method::Bloch`, `backend::KA.CPU`, and `prealloc::BlochCPUPrealloc`. I also added an optimized implementation of `run_spin_excitation!`, which improved memory usage a bit more for the MRI Lab benchmark but didn't seem to affect the simulation time.
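The dispatch layering described above looks roughly like this (signatures abbreviated and illustrative, not copied from the source):

```julia
# Generic fallback in SimulationMethod.jl: any method, any backend, any prealloc.
function run_spin_precession!(p, seq, sig, M,
        sim_method::SimulationMethod, backend::KA.Backend, prealloc)
    # matrix-based implementation
end

# Specialized CPU path in BlochCPU.jl, selected automatically by multiple dispatch.
function run_spin_precession!(p, seq, sig, M,
        sim_method::Bloch, backend::KA.CPU, prealloc::BlochCPUPrealloc)
    # time-stepping implementation with reused buffers
end
```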

Let me know if there are any more changes you want made before merging!

@rkierulf changed the title from "Optimize run_spin_precession for CPU" to "Optimize run_spin_precession! and run_spin_excitation! for CPU" on Jul 17, 2024
@rkierulf (Collaborator, Author) commented on Jul 18, 2024

Doing some more profiling this morning, I was surprised to see some lines allocating more than expected. It turns out that there were quite a few statements that were not fully broadcast. Apparently, even when multiplying or dividing by a scalar, it still helps to use `./` or `.*` instead of `/` or `*`; I think this makes it easier for the compiler to fuse the entire statement into one broadcast. Making these changes, I was able to get memory usage to decrease by a factor of ~15 from where it was immediately before! However, simulation time only decreased by about 10%, so at this point we are clearly compute-bound rather than memory-bound.
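To illustrate the fusion point (a sketch mirroring the earlier `Bz_2` line; `Δw` is assumed to be a per-spin vector here):

```julia
# Undotted `/` materializes Δw / (2π * γ) as a temporary array before the broadcast:
Bz_2 .= x .* Gx .+ y .* Gy .+ z .* Gz .+ Δw / (2π * γ)
# Dotted `./` lets the whole right-hand side fuse into one allocation-free loop:
Bz_2 .= x .* Gx .+ y .* Gy .+ z .* Gz .+ Δw ./ (2π * γ)
```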

@cncastillo (Member) commented on Jul 18, 2024

Hi, it looks good!

There are a few details it would be nice to change before merging:

(1) Can we call the previous Bloch method for `::SimulationMethod` (fallback) but also for `::BlochSimple`? And move it to its own folder for discoverability.

This is not for now, but:

(2) To organize the preallocations, maybe we can put them in structs (see the sketch after this list). For example, I think it makes sense to have:

  • M::Mag (or another spin state)
    • M.xy: complex array, Nspins x 1
    • M.z: real array, Nspins x 1
  • B_new::Mag
    • B_new.xy: complex array, Nspins x 1
    • B_new.z: real array, Nspins x 1
  • B_old::Mag
    • B_old.xy: complex array, Nspins x 1
    • B_old.z: real array, Nspins x 1
  • Rot::Spinor
    • Rot.alpha: complex array, Nspins x 1
    • Rot.beta: complex array, Nspins x 1

That way all variables have a physical meaning; basically, we're storing the current SimulationState.

For the excitation, we only use B_new for now (I would like to add trapezoidal integration later).

For the precession, I think the exp/cis(ϕ) term is the same as the Spinor alpha, with beta = 0.

Mag is not a very good name for the struct (as it could also be used for B_new and B_old); it is really a vector in the SU(2) representation, so something like VectorSU2 would fit better.

https://onlinelibrary.wiley.com/doi/pdf/10.1002/9783527630097.app1#:~:text=We%20define%20a%20spinor%20as,two%2D%20dimensional%20complex%20variable%20space.

By putting them inside structs, we can use functions already defined for them, for example mul!(spinor, mag) or abs(vectorsu2), so we don't need to rewrite those functions inside the simulation.
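A hedged sketch of the proposed layout (illustrative only; KomaMRICore already has its own Mag and Spinor types, and the names and fields here are not a final design):

```julia
struct VectorSU2{T}            # suggested rename of Mag for this use
    xy::Vector{Complex{T}}     # transverse component, Nspins x 1
    z::Vector{T}               # longitudinal component, Nspins x 1
end

struct SpinorBuf{T}            # named to avoid clashing with the existing Spinor type
    α::Vector{Complex{T}}      # SU(2) rotation parameters, Nspins x 1
    β::Vector{Complex{T}}
end

struct SimulationState{T}      # "basically we're storing the current SimulationState"
    M::VectorSU2{T}            # spin state
    B_new::VectorSU2{T}        # effective field at the current step
    B_old::VectorSU2{T}        # effective field at the previous step
    Rot::SpinorBuf{T}          # accumulated rotation
end
```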

@cncastillo (Member) commented:
Just waiting for the CI to pass to merge!

@cncastillo merged commit e09ba44 into master on Jul 19, 2024 (18 of 19 checks passed) and deleted the optimize-precession branch on July 19, 2024 at 20:20.
Linked issue that may be closed by this pull request: Profile Bloch Simulation Method