
Non-atomic pairwise force summation kernels. #133

Merged
36 commits merged on Sep 6, 2023

Conversation

JaydevSR
Contributor

Benchmark results before any changes

Job Properties

  • Time of benchmark: 7 Jun 2023 - 19:30
  • Package commit: cc575d
  • Julia commit: 8e6305
  • Julia command flags: None
  • Environment variables: JULIA_NUM_THREADS => 16

Results

Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.

| ID | time | GC time | memory | allocations |
| --- | --- | --- | --- | --- |
| ["interactions", "Coulomb energy"] | 8.585 ns (5%) | | | |
| ["interactions", "Coulomb force"] | 10.210 ns (5%) | | | |
| ["interactions", "HarmonicBond energy"] | 13.589 ns (5%) | | | |
| ["interactions", "HarmonicBond force"] | 20.126 ns (5%) | | | |
| ["interactions", "LennardJones energy"] | 14.764 ns (5%) | | | |
| ["interactions", "LennardJones force"] | 16.534 ns (5%) | | | |
| ["protein", "CPU parallel NL"] | 2.109 s (5%) | 26.741 ms | 504.39 MiB (1%) | 7430 |
| ["simulation", "CPU NL"] | 136.302 ms (5%) | | 34.51 MiB (1%) | 11749 |
| ["simulation", "CPU f32 NL"] | 131.337 ms (5%) | | 22.23 MiB (1%) | 11748 |
| ["simulation", "CPU f32"] | 723.609 ms (5%) | | 18.42 MiB (1%) | 11057 |
| ["simulation", "CPU parallel NL"] | 148.578 ms (5%) | | 99.92 MiB (1%) | 41661 |
| ["simulation", "CPU parallel f32 NL"] | 87.299 ms (5%) | | 59.08 MiB (1%) | 41654 |
| ["simulation", "CPU parallel f32"] | 150.689 ms (5%) | | 49.74 MiB (1%) | 36257 |
| ["simulation", "CPU parallel"] | 171.384 ms (5%) | | 90.58 MiB (1%) | 36267 |
| ["simulation", "CPU"] | 655.085 ms (5%) | | 30.69 MiB (1%) | 11058 |
| ["simulation", "GPU NL"] | 93.107 ms (5%) | | 13.66 MiB (1%) | 86877 |
| ["simulation", "GPU f32 NL"] | 87.481 ms (5%) | | 10.43 MiB (1%) | 81447 |
| ["simulation", "GPU f32"] | 90.161 ms (5%) | | 9.78 MiB (1%) | 70666 |
| ["simulation", "GPU"] | 93.585 ms (5%) | | 12.99 MiB (1%) | 76096 |
| ["spatial", "vector"] | 11.215 ns (5%) | | | |
| ["spatial", "vector_1D"] | 8.266 ns (5%) | | | |

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

  • ["interactions"]
  • ["protein"]
  • ["simulation"]
  • ["spatial"]

Julia versioninfo

Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 20.04.6 LTS
  uname: Linux 5.4.0-146-generic #163-Ubuntu SMP Fri Mar 17 18:26:02 UTC 2023 x86_64 x86_64
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz: 
                 speed         user         nice          sys         idle          irq
       #1-12  1200 MHz   41107039 s     172497 s    5408181 s  518383963 s          0 s
  Memory: 377.7989311218262 GB (363221.26953125 MB free)
  Uptime: 4.72816548e6 sec
  Load Avg:  4.25  3.26  2.61
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, broadwell)
  Threads: 16 on 12 virtual cores

@JaydevSR
Contributor Author

JaydevSR commented Jun 16, 2023

I have now changed the implementation so that it uses two kernels: one for the neighbor list case and one without it.

  1. For the kernel without a neighbor list, I was initially working on something that would tile the N^2 iterations into warp-sized tiles and then evaluate the tiles diagonally to avoid access conflicts (as OpenMM does). This also allows data reuse, since the atom data can be shuffled between the threads in a warp.
    • Problem 1 (easy): shuffles are not supported for how atoms and coords are currently stored; the shuffled data has to be floating-point numbers or arrays. This requires processing the input data before passing it to the kernel.
    • Problem 2 (unsure): I am facing some issues, most probably with synchronization in the kernel, which give wrong results for the forces. I still have to figure this out.
    • As of now I have written a simple kernel for this where each thread sums the forces for one atom using a for loop (a minimal sketch of this idea is included after this list). This is not very performant: it is about 1.5 times slower than the current atomic implementation (although I have only tried it on small systems, where there should not be many conflicts during atomic access to begin with). I think it can be improved by doing a binary reduction over the rows using multiple threads.
  2. For the kernel with a neighbor list, I see two possible ways:
    • Use an adjacency bit matrix. The problem with this is that, because of sparsity, many threads in a block will be idle if you simply broadcast, and I am not sure how we can manage this. So I probably won't go ahead with it.
    • Use a sparse adjacency matrix in CSC/CSR format. Here I have an idea similar to the non-neighbor-list implementation, where each row is treated as a separate block and we do a combination of a loop and a binary reduction within the block. If the threads per block are fewer than the number of non-zero interactions in the row, that can be handled with a summation loop.
    • The main issue lies with constructing such a sparse matrix on the GPU. As of now I have no idea :/ I have implemented such a sparse matrix from scratch, but it will probably be discarded later, as the best way may be to use something like https://github.com/j-fu/ExtendableSparse.jl instead?
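
For concreteness, a minimal self-contained sketch of the simple kernel described in point 1 (one thread per atom, plain loop over j, no atomics) could look like the following. This is not Molly's actual code: toy_force is a stand-in for the real pairwise interaction, and units and boundary handling are omitted.

```julia
using CUDA, StaticArrays

# Toy placeholder interaction, standing in for the real pairwise force.
@inline toy_force(ci, cj) = (dr = ci - cj; dr / (sum(abs2, dr) + 1f-6))

# One thread per atom i, looping over every j and accumulating the force in a
# register before a single write. No atomics are needed because only this
# thread ever writes forces[i].
function simple_nonl_kernel!(forces, coords, n_atoms)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n_atoms
        f = zero(SVector{3, Float32})
        for j in 1:n_atoms
            if j != i
                f += toy_force(coords[i], coords[j])
            end
        end
        forces[i] = f
    end
    return nothing
end

# Example launch (illustrative):
# n_atoms = 3200
# coords = CuArray(rand(SVector{3, Float32}, n_atoms))
# forces = CUDA.fill(zero(SVector{3, Float32}), n_atoms)
# @cuda threads=256 blocks=cld(n_atoms, 256) simple_nonl_kernel!(forces, coords, n_atoms)
```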

@codecov

codecov bot commented Jun 16, 2023

Codecov Report

Patch coverage: 7.31% and project coverage change: -0.87% ⚠️

Comparison is base (17f20ed) 73.12% compared to head (841d291) 72.25%.

❗ Current head 841d291 differs from pull request most recent head b349ac1. Consider uploading reports for the commit b349ac1 to get more accurate results

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
- Coverage   73.12%   72.25%   -0.87%     
==========================================
  Files          35       35              
  Lines        5142     5212      +70     
==========================================
+ Hits         3760     3766       +6     
- Misses       1382     1446      +64     
```
| Files Changed | Coverage | Δ |
| --- | --- | --- |
| src/types.jl | 72.45% <ø> | (ø) |
| src/cuda.jl | 1.39% <5.00%> | (+1.39%) ⬆️ |
| src/setup.jl | 92.27% <100.00%> | (+0.01%) ⬆️ |


@JaydevSR JaydevSR changed the title Non-atomic force summation kernels Non-atomic pairwise force summation kernels. Jun 16, 2023
@jgreener64
Collaborator

Looks like a good start. With regards to benchmarking I would use a system of 1000-5000 atoms and use @benchmark on the force function for now.

> I am facing some issues, most probably with synchronization in the kernel, which give wrong results for the forces.

This stuff can be hard to debug. I would use @cushow and @cuprintln on a small system to print out relevant data and check it is correct and gets reduced correctly. Also, you could try a sync_threads() after setting up and zeroing the shared memory, or before the final reduction.
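
As a tiny illustration of that debugging pattern (not Molly code), something along these lines can be used to check each step of a reduction on a small input:

```julia
using CUDA

# Zero the shared memory, synchronize, then print intermediate values from a
# small system so each step of the reduction can be checked by eye.
function debug_reduce_kernel!(out, vals)
    tid = threadIdx().x
    shared = CuStaticSharedArray(Float32, 32)
    shared[tid] = 0.0f0
    sync_threads()              # make the zeroing visible to all threads

    shared[tid] = vals[tid]
    sync_threads()

    @cuprintln("thread $tid holds $(shared[tid])")  # inspect per-thread values

    if tid == 1                 # serial sum so the printed values can be checked
        s = 0.0f0
        for k in 1:blockDim().x
            s += shared[k]
        end
        out[1] = s
    end
    return nothing
end

# vals = CUDA.rand(Float32, 8); out = CUDA.zeros(Float32, 1)
# @cuda threads=8 debug_reduce_kernel!(out, vals)
```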

It might be worth asking on the #gpu Slack channel about sparse matrices. CUDA has sparse matrices but I guess the issue is that these don't work in kernels?

I would also be interested to see the speed of the OpenMM atom block strategy even with the dense adjacency matrix. Another idea is to use a dense adjacency matrix, but create a sparse data structure with the first thread inside the kernel itself. You could have two shared memory int vectors for the interacting indices. Threads would then loop over those pairs.
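
As a rough sketch (not Molly code), that "compact inside the kernel" idea could look something like the following. This variation uses one block per atom i, so only one shared index vector plus a count is needed; MAX_NEIGHBORS, THREADS_PER_BLOCK and toy_force are assumptions for illustration.

```julia
using CUDA, StaticArrays

const MAX_NEIGHBORS = 512          # assumed upper bound, illustration only
const THREADS_PER_BLOCK = 128

@inline toy_force(ci, cj) = (dr = ci - cj; dr / (sum(abs2, dr) + 1f-6))

# One block per atom i. Thread 1 scans row i of the dense Boolean `neighbors`
# matrix and compacts the interacting j indices into shared memory; after a
# sync, all threads loop over that compact list and the partial forces are
# reduced before a single non-atomic write to forces[i].
function compacted_nl_kernel!(forces, coords, neighbors, n_atoms)
    i = blockIdx().x
    tid = threadIdx().x
    js = CuStaticSharedArray(Int32, MAX_NEIGHBORS)
    count = CuStaticSharedArray(Int32, 1)
    partial = CuStaticSharedArray(SVector{3, Float32}, THREADS_PER_BLOCK)

    if tid == 1                             # thread 1 builds the compact index list
        c = Int32(0)
        for j in 1:n_atoms
            if neighbors[i, j] && c < MAX_NEIGHBORS   # real code must handle overflow
                c += Int32(1)
                js[c] = Int32(j)
            end
        end
        count[1] = c
    end
    sync_threads()

    f = zero(SVector{3, Float32})
    for k in tid:blockDim().x:count[1]      # block-stride loop over the compact list
        f += toy_force(coords[i], coords[js[k]])
    end
    partial[tid] = f
    sync_threads()

    if tid == 1                             # serial reduction to keep the sketch short
        total = zero(SVector{3, Float32})
        for t in 1:blockDim().x
            total += partial[t]
        end
        forces[i] = total
    end
    return nothing
end

# @cuda threads=THREADS_PER_BLOCK blocks=n_atoms compacted_nl_kernel!(forces, coords, neighbors, n_atoms)
```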

@ejmeitz
Collaborator

ejmeitz commented Jun 17, 2023

Not the best CUDA programmer, but I thought I'd throw this out there. I'm sure I'm missing some intricacy, as I have literally spent 0 minutes thinking about force kernels on a GPU, but the CUDA part of Molly is something I'd like to understand more. What are the downsides of using a kernel like this, besides not manually choosing the block sizes etc.? Too many kernel invocations?

```julia
function kernel(atom_idx, interacting_atoms)
    return mapreduce(j -> force(atom_idx, j), +, interacting_atoms)
end
```

@JaydevSR
Contributor Author

JaydevSR commented Jun 17, 2023

@ejmeitz Right now the kernel is essentially doing what you have pointed out. Maybe I will try and see how the mapreduce compares to the current implementation. But I think the problem with just using mapreduce is that it is best suited to reductions where the reduced-over indices are mostly independent and no extra optimizations can be made anyway.

But in our case we are dealing with N^2 interactions, where we can certainly benefit from data reuse and operation ordering. This can only happen if the kernel works at the warp level, so that we can shuffle data around while going through the reduction and force calculation at once. I am working on this right now and trying to iron out a few bugs. I also have little experience with CUDA beyond examples for now; as far as I understand, that is the main motivation for beginning with this barebones implementation. I am not sure how much these optimizations will help, but it will surely be worth something 😄 (unless the neighbor data is so sparse that the extra overhead of the loops takes over, but that can be dealt with as another case?)

@JaydevSR
Contributor Author

JaydevSR commented Jun 20, 2023

Benchmark results for the forces function (non-tiled). The implementation loops over all the j values for a particular i using a block stride, and the force summation is done over shared memory using a binary reduction. Each block calculates one row of the interaction pairs, so there are n_atoms blocks (a sketch of this pattern is included after the table below).

Device: Tesla V100-PCIE-16GB

| Simulation | Atoms | Total Pairs | Min (ms) | Median (ms) | Mean (ms) | Mean per 10000 pairs (μs) |
| --- | --- | --- | --- | --- | --- | --- |
| NONL | 3200 | 5118400 | 2.450 | 2.577 | 2.576 | 5.03 |
| NONL f32 | 3200 | 5118400 | 0.854 | 1.098 | 1.093 | 2.13 |
| NL (atomic) | 3200 | 335323 | 0.215 | 0.225 | 0.226 | 6.74 |
| NL f32 (atomic) | 3200 | 334149 | 0.193 | 0.210 | 0.211 | 6.31 |
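
For reference, a hedged sketch of this pattern (one block per atom, block-stride loop over j, binary reduction over shared memory) might look like the code below. It is not the PR's actual kernel: toy_force is a placeholder, and units, boundary conditions and neighbor handling are omitted.

```julia
using CUDA, StaticArrays

@inline toy_force(ci, cj) = (dr = ci - cj; dr / (sum(abs2, dr) + 1f-6))

# One block per atom i. Each thread accumulates part of row i with a
# block-stride loop over j, the partial sums go into shared memory, and a
# binary (tree) reduction leaves the total in shared[1], which thread 1 writes
# to forces[i]. Assumes blockDim().x is a power of two (128 here).
function rowwise_nonl_kernel!(forces, coords, n_atoms)
    i = blockIdx().x
    tid = threadIdx().x
    shared = CuStaticSharedArray(SVector{3, Float32}, 128)

    f = zero(SVector{3, Float32})
    for j in tid:blockDim().x:n_atoms       # block-stride loop over the row
        if j != i
            f += toy_force(coords[i], coords[j])
        end
    end
    shared[tid] = f
    sync_threads()

    stride = blockDim().x ÷ 2
    while stride > 0                        # binary reduction over shared memory
        if tid <= stride
            shared[tid] += shared[tid + stride]
        end
        sync_threads()
        stride ÷= 2
    end

    if tid == 1
        forces[i] = shared[1]
    end
    return nothing
end

# @cuda threads=128 blocks=n_atoms rowwise_nonl_kernel!(forces, coords, n_atoms)
```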

@ejmeitz
Collaborator

ejmeitz commented Jun 20, 2023

Btw I have access to a cluster of A40s and another of 2080s if you want to test on distributed systems or just on a different single GPU.

@JaydevSR

This comment was marked as outdated.

@JaydevSR
Contributor Author

JaydevSR commented Jun 22, 2023

Benchmark results on the forces function using a tiled kernel without a neighbor list

  1. Approach 0 - Atomic addition on Molly.jl/main
  2. Approach 1 - The tiling is done over one whole block and each warp loops over a complete row of tiles horizontally. Each block computes a section of the pairs that is of size (n_threads, n_threads).
  3. Approach 2 - The tiling is done such that each warp only computes one tile. Each block computes a section of the pairs that is of size (WARPSIZE, n_threads). (A small demo of the warp-level shuffle pattern that these tiled approaches rely on is included at the end of this comment.)

| Simulation | Atoms | Number of Pairs | Min time (ms) | Median time (ms) | Mean time (ms) | Mean time per 10000 pairs (μs) |
| --- | --- | --- | --- | --- | --- | --- |
| Approach 0 | 3072 | 4717056 | 5.193 | 5.331 | 5.339 | 11.31 |
| Approach 0 f32 | 3072 | 4717056 | 5.306 | 5.466 | 5.484 | 11.63 |
| Approach 1 | 3072 | 4717056 | 1.402 | 1.572 | 1.568 | 3.32 |
| Approach 1 f32 | 3072 | 4717056 | 0.870 | 1.065 | 1.061 | 2.24 |
| Approach 2 | 3072 | 4717056 | 2.015 | 2.213 | 2.220 | 4.71 |
| Approach 2 f32 | 3072 | 4717056 | 0.502 | 0.512 | 0.513 | 1.08 |

Takeaway: getting rid of the internal loop gives a 2x improvement for f32, but we pay the price with worse f64 performance. I think this could be because approach 2 increases the number of blocks by a factor of n_threads / WARPSIZE, which makes dispatching the f64 calculations to the SMs on the GPU slower? If that is the case, it could be improved, if we want, by having each warp calculate 2 or 4 tiles at a time (without a loop, just multiple lines of the same code).
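
As a standalone illustration of the warp-level data reuse that approaches 1 and 2 rely on (not the PR's kernel), here is a minimal shuffle "rotation" pattern where each lane loads one value and the warp passes it around with shfl_sync:

```julia
using CUDA

# Each lane loads one value; the warp then rotates the values around with
# shfl_sync so every lane sees every other lane's data without any further
# global or shared memory reads. Assumes a single full warp of 32 threads.
function warp_rotate_sum_kernel!(out, xs)
    lane = threadIdx().x
    val = xs[lane]
    acc = 0.0f0
    for k in 0:31
        src = mod1(lane + k, 32)
        acc += shfl_sync(0xffffffff, val, src)   # stand-in for a force contribution
    end
    out[lane] = acc
    return nothing
end

# xs = CUDA.rand(Float32, 32); out = CUDA.zeros(Float32, 32)
# @cuda threads=32 warp_rotate_sum_kernel!(out, xs)
```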

@jgreener64
Collaborator

Cool, I would focus on f32 performance, since force calculation is usually safe to run in f32. I know other software has severe slowdown with f64; it might be unavoidable.

Am I right in interpreting that the f32 median time before this PR was 1.098 ms for no NL / 0.210 ms for NL and now is 0.512 ms for no NL? That's getting somewhere.

I would aim to move to the NL version soon, particularly if there are things you have learned with the no NL case that can be used there. Can approach 2 above be applied in a performant way to the case where interactions are skipped if they are not in a dense neighbour matrix?

@JaydevSR
Contributor Author

JaydevSR commented Jun 22, 2023

In its current state the kernel is quite minimal, and the only major optimization opportunity I can see is to also shuffle the atom and coordinate data, which will certainly require some preprocessing that I haven't yet worked out how to integrate with the whole interface.

The NONL version gives quite some motivation for a similar implementation for the NL case as well, if we can think of a way to construct dense tiles out of the sparse adjacency matrix. I am not quite sure about this, but it would work quite naturally if we were dealing with cell lists!

@jgreener64 No, 1.098 ms is the median time for the kernel in which we simply have a for loop over all j's for a particular i (using a block stride and then a binary reduction over shared memory). For NL I haven't touched anything, so yes, that is the same.

Looking at the benchmarks of the atomic version (before the PR) that I have just included, you can see this kernel is 10x faster than that. If we can achieve similar performance with a NL it would certainly be something to aim for.

EDIT: Also, the current kernel only works if the number of atoms is a multiple of n_threads and n_threads is a multiple of WARPSIZE. This removes the need for if conditions on i and j, but the general case would introduce quite some complexity, which would certainly impact performance. What should be done in this case?

@jgreener64
Collaborator

> this kernel is 10x faster than that

Great, even better.

Yes I think the NL kernel would be based on this new no NL one, the current NL kernel is probably too simple to speed up a lot.

How easy would it be to have the above approach 2 but with another input, neighbors, the dense adjacency matrix, and just have an if neighbors[i, j] check before the force calculation? Does that have any hope of being fast? Might be worth benchmarking if it isn't too hard. You might get use out of @inbounds around indexing expressions too.

> What should be done in this case?

I would try adding the i and j conditions in the simplest way you can and benchmark the performance. It may not be that bad? The alternative is to pad the inputs, but that adds its own complexity.

@JaydevSR
Contributor Author

Okay, I will try both things and benchmark them 👍

@JaydevSR
Contributor Author

Yes, the current benchmarks are just for the forces function, which calls the kernel after doing some setup. I am using this script for the benchmarks:

```julia
using Molly
using BenchmarkTools
using CUDA

function setup_sim(nl::Bool, f32::Bool)
    local n_atoms = 3219
    atom_mass = f32 ? 10.0f0u"u" : 10.0u"u"
    boundary = f32 ? CubicBoundary(6.0f0u"nm") : CubicBoundary(6.0u"nm")
    starting_coords = place_atoms(n_atoms, boundary; min_dist=0.2u"nm")
    starting_velocities = [random_velocity(atom_mass, 1.0u"K") for i in 1:n_atoms]
    starting_coords_f32 = [Float32.(c) for c in starting_coords]
    starting_velocities_f32 = [Float32.(c) for c in starting_velocities]

    simulator = VelocityVerlet(dt=f32 ? 0.02f0u"ps" : 0.02u"ps")

    neighbor_finder = NoNeighborFinder()
    cutoff = DistanceCutoff(f32 ? 1.0f0u"nm" : 1.0u"nm")
    pairwise_inters = (LennardJones(use_neighbors=false, cutoff=cutoff),)
    if nl
        neighbor_finder = DistanceNeighborFinder(
            eligible=CuArray(trues(n_atoms, n_atoms)),
            n_steps=10,
            dist_cutoff=f32 ? 1.5f0u"nm" : 1.5u"nm",
        )
        pairwise_inters = (LennardJones(use_neighbors=true, cutoff=cutoff),)
    end

    coords = CuArray(f32 ? starting_coords_f32 : starting_coords)
    velocities = CuArray(f32 ? starting_velocities_f32 : starting_velocities)
    atoms = CuArray([Atom(charge=f32 ? 0.0f0 : 0.0, mass=atom_mass, σ=f32 ? 0.2f0u"nm" : 0.2u"nm",
                            ϵ=f32 ? 0.2f0u"kJ * mol^-1" : 0.2u"kJ * mol^-1") for i in 1:n_atoms])

    sys = System(
        atoms=atoms,
        coords=coords,
        boundary=boundary,
        velocities=velocities,
        pairwise_inters=pairwise_inters,
        neighbor_finder=neighbor_finder,
    )

    return sys, simulator
end

runs = [
    ("GPU NONL"       , [false, false]),
    ("GPU NONL f32"   , [false,  true]),
    ("GPU NL"    , [true , false]),
    ("GPU NL f32", [true ,  true]),
]

for (name, args) in runs
    println("*************** Run: $name ************************")
    sys, sim = setup_sim(args...)
    n_atoms = length(sys)
    neighbors = find_neighbors(sys)
    nbrs = isnothing(neighbors) ? n_atoms * (n_atoms - 1) ÷ 2 : length(neighbors)
    println("> Total Pairs = $nbrs")
    f = forces(sys, neighbors)
    b = @benchmark CUDA.@sync forces($sys, $neighbors)
    display(b)
end
```

I will also try to set up Nsight Compute, but if you could take a quick look at the profiles for the neighbor list kernel as well, that would be great.

@ejmeitz
Collaborator

ejmeitz commented Jul 24, 2023

Here's the GPU NL f32 output. A lot more kernels were generated than I expected. The first 3 were generated by sim_setup, find_neighbors launched the next ~100 (not sure why so many), and the last 3 kernels were generated by forces (they start with "Z25pairwise_force_kernel_nl_"). I did run all of the kernels, but the output for this one is already a lot to go through lol. If you want them I can send that, but in general they look to have similar issues to this kernel.

If you unzip the file below and double click on one of the kernels, it will dump all sorts of information and things that could be optimized. I looked quickly at the pairwise_force_kernel and it's only using maybe 30% of my GPU's resources. Looks like there are a lot of un-coalesced accesses and warp divergence.

Molly_kernel.txt.zip

@JaydevSR
Contributor Author

Thanks a lot @ejmeitz. I will go through these and try to see if fixing these problems can improve things.

@JaydevSR
Contributor Author

JaydevSR commented Sep 1, 2023

So... out of all the different approaches for the NL kernel, everything has almost the same performance, even the atomic approach. I think this suggests that the computation is not making any difference and the performance is almost entirely dependent on the memory reads and writes. Profiles also show the same thing. I am not really sure how to deal with this memory bottleneck.

Even using shared memory to store the atoms in between the computation does not deal with this issue. I had a look at the generated PTX code, which shows that many intermediate results are stored in local memory instead of in registers, and using shared memory does not get rid of the majority of the memory stalls. Maybe there is some way to deal with this, but I will have to look into it more.

For now, it would be great if the NONL kernel could at least be merged, and I will work on the NL kernel in a separate PR from scratch with my newly gained experience :) According to the benchmarks for different system sizes, the NONL kernel is 20-30 times faster than the atomic approach, so that's quite an improvement, and any further improvement runs into the same issue: the memory reads and writes are so slow that the computation time doesn't hold a candle to them.

@jgreener64
Collaborator

Sounds good, I can try and review this PR next week. You will need to rebase/merge. The no NL kernel is definitely an improvement. One thing that would be nice to add is a comment by the kernel with a description/diagram of the approach used. I don't know how the kernel will play with Enzyme, but I can look into any problems there myself.

Well done for doggedly pursuing the NL kernel, I know how frustrating it can be. Something to consider is starting from the other direction, completely ablating everything: just loop over the neighbouring pairs and return zero forces. I guess that should be fast? Then see what minimal addition causes the slowdown, make a self-contained example and discuss it here, on Slack or on the CUDA.jl issues.
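
A skeleton of that ablation might look like the following; the pairs_i/pairs_j arrays and the per-pair output buffer are illustrative, not Molly's real data layout:

```julia
using CUDA, StaticArrays

# Keep the loop over the neighbour pairs and the memory traffic, but write zero
# forces, so any remaining cost is reads/writes rather than the force maths.
function ablated_nl_kernel!(pair_out, coords, pairs_i, pairs_j)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(pairs_i)
        ci = coords[pairs_i[idx]]
        cj = coords[pairs_j[idx]]
        pair_out[idx] = 0.0f0 * (ci - cj)   # keeps the reads and one write per pair
    end
    return nothing
end

# n_pairs = length(pairs_i)
# pair_out = CUDA.fill(zero(SVector{3, Float32}), n_pairs)
# @cuda threads=256 blocks=cld(n_pairs, 256) ablated_nl_kernel!(pair_out, coords, pairs_i, pairs_j)
```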

@JaydevSR
Contributor Author

JaydevSR commented Sep 1, 2023

Thanks @jgreener64, I will rebase the incoming branch. Should I also make similar changes to the pairwise potential energy kernel?

@jgreener64
Collaborator

If it's easy to change the potential energy kernel then do. If it will take up anything more than a short time then focus on the NL kernel and I will implement the potential energy kernel later as a way to learn what is going on.

@JaydevSR JaydevSR marked this pull request as ready for review September 3, 2023 16:57
@JaydevSR
Contributor Author

JaydevSR commented Sep 4, 2023

I am not sure why the CI runs are failing. Everything works for me locally.

@jgreener64
Collaborator

This looks good bar the failure when CUDA is not available. That will be due to CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK) or maybe the CUDA.shfl_recurse lines. You could enclose the offending code in if CUDA.functional().
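
For illustration, that guard could look something like this (a sketch only; the fallback value is an assumption):

```julia
using CUDA

# Only query device attributes (or define shuffle overloads) when a working GPU
# is present, so the package still loads on CPU-only machines.
max_threads_per_block = if CUDA.functional()
    CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
else
    256   # conservative default when CUDA is not available
end
```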

Do the GPU tests pass for you? I can run them locally tomorrow too. Long term we should get set up on the Julia GPU CI infrastructure to test this kind of thing.

@JaydevSR
Contributor Author

JaydevSR commented Sep 6, 2023

Yes, the problem seems to lie with CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK). I am not using this anyway, so I have commented it out.

The GPU tests also pass for me locally.

@jgreener64
Collaborator

pairwise_force_kernel! needs to be renamed to pairwise_force_kernel_nl! on line 313 of src/chain_rules.jl; the GPU tests fail otherwise.

I am also seeing an occasional failure on the Monte Carlo anisotropic barostat GPU tests but I'm not sure it is due to this change, do you see that?

> so I have commented it out

May as well remove these lines if they are not needed.

@ejmeitz
Collaborator

ejmeitz commented Sep 6, 2023

The barostat tests fail intermittently on my local machine for CPU. I'm not sure that is related to the changes made in this PR.

@jgreener64 jgreener64 merged commit ebb3fa2 into JuliaMolSim:master Sep 6, 2023
5 checks passed