
Non-atomic pairwise force summation kernels. #133

Merged
36 commits merged on Sep 6, 2023

Conversation

JaydevSR
Contributor

Benchmark results before any changes

Job Properties

  • Time of benchmark: 7 Jun 2023 - 19:30
  • Package commit: cc575d
  • Julia commit: 8e6305
  • Julia command flags: None
  • Environment variables: JULIA_NUM_THREADS => 16

Results

Below is a table of this job's results, obtained by running the benchmarks.
The values listed in the ID column have the structure [parent_group, child_group, ..., key], and can be used to
index into the BaseBenchmarks suite to retrieve the corresponding benchmarks.
The percentages accompanying time and memory values in the below table are noise tolerances. The "true"
time/memory value for a given benchmark is expected to fall within this percentage of the reported value.
An empty cell means that the value was zero.

| ID | time | GC time | memory | allocations |
| --- | --- | --- | --- | --- |
| ["interactions", "Coulomb energy"] | 8.585 ns (5%) | | | |
| ["interactions", "Coulomb force"] | 10.210 ns (5%) | | | |
| ["interactions", "HarmonicBond energy"] | 13.589 ns (5%) | | | |
| ["interactions", "HarmonicBond force"] | 20.126 ns (5%) | | | |
| ["interactions", "LennardJones energy"] | 14.764 ns (5%) | | | |
| ["interactions", "LennardJones force"] | 16.534 ns (5%) | | | |
| ["protein", "CPU parallel NL"] | 2.109 s (5%) | 26.741 ms | 504.39 MiB (1%) | 7430 |
| ["simulation", "CPU NL"] | 136.302 ms (5%) | | 34.51 MiB (1%) | 11749 |
| ["simulation", "CPU f32 NL"] | 131.337 ms (5%) | | 22.23 MiB (1%) | 11748 |
| ["simulation", "CPU f32"] | 723.609 ms (5%) | | 18.42 MiB (1%) | 11057 |
| ["simulation", "CPU parallel NL"] | 148.578 ms (5%) | | 99.92 MiB (1%) | 41661 |
| ["simulation", "CPU parallel f32 NL"] | 87.299 ms (5%) | | 59.08 MiB (1%) | 41654 |
| ["simulation", "CPU parallel f32"] | 150.689 ms (5%) | | 49.74 MiB (1%) | 36257 |
| ["simulation", "CPU parallel"] | 171.384 ms (5%) | | 90.58 MiB (1%) | 36267 |
| ["simulation", "CPU"] | 655.085 ms (5%) | | 30.69 MiB (1%) | 11058 |
| ["simulation", "GPU NL"] | 93.107 ms (5%) | | 13.66 MiB (1%) | 86877 |
| ["simulation", "GPU f32 NL"] | 87.481 ms (5%) | | 10.43 MiB (1%) | 81447 |
| ["simulation", "GPU f32"] | 90.161 ms (5%) | | 9.78 MiB (1%) | 70666 |
| ["simulation", "GPU"] | 93.585 ms (5%) | | 12.99 MiB (1%) | 76096 |
| ["spatial", "vector"] | 11.215 ns (5%) | | | |
| ["spatial", "vector_1D"] | 8.266 ns (5%) | | | |

Benchmark Group List

Here's a list of all the benchmark groups executed by this job:

  • ["interactions"]
  • ["protein"]
  • ["simulation"]
  • ["spatial"]

Julia versioninfo

Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 20.04.6 LTS
  uname: Linux 5.4.0-146-generic #163-Ubuntu SMP Fri Mar 17 18:26:02 UTC 2023 x86_64 x86_64
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz: 
                 speed         user         nice          sys         idle          irq
       #1-12  1200 MHz   41107039 s     172497 s    5408181 s  518383963 s          0 s
  Memory: 377.7989311218262 GB (363221.26953125 MB free)
  Uptime: 4.72816548e6 sec
  Load Avg:  4.25  3.26  2.61
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, broadwell)
  Threads: 16 on 12 virtual cores

@JaydevSR
Contributor Author

JaydevSR commented Jun 16, 2023

I have now changed the implementation so that it uses two kernels: one for the neighbor list case and one without it.

  1. For the kernel without a neighbor list, I was initially working on something that would tile the N^2 iterations into warp-sized tiles and then evaluate the tiles diagonally to avoid access conflicts (as OpenMM does). This also allows data reuse, since the atom data can be shuffled between the threads in a warp.
    • Problem 1 (easy): shuffles are not supported for how atoms and coords are currently stored; the shuffled data has to be floating-point numbers or arrays. This requires processing the input data before passing it to the kernel.
    • Problem 2 (unsure): I am facing some issues, most probably with synchronization in the kernel, which give wrong results for the forces. I still have to figure this out.
    • As of now I have written a simple kernel for this where each thread sums the forces for one atom using a for loop (a minimal sketch of this idea is included after this list). This is not very performant: it is about 1.5 times slower than the current atomic implementation (although I have only tried it on small systems, where there should not be many conflicts during atomic access to begin with). I think it can be improved by doing a binary reduction over the rows using multiple threads.
  2. For the kernel with a neighbor list, I see two possible ways:
    • Use an adjacency bit matrix. The problem with this is that, because of sparsity, many threads in a block will be idle if you simply broadcast, and I am not sure how we can manage this. So I probably won't go ahead with it.
    • Use a sparse adjacency matrix in CSC/CSR format. Here I have an idea similar to the non-neighbor-list implementation, where each row is treated as a separate block and we do a combination of a loop and a binary reduction within the block. If the threads per block are fewer than the number of non-zero interactions in the row, that can be handled with a summation loop.
    • The main issue lies with constructing such a sparse matrix on the GPU. As of now I have no idea :/ I have implemented such a sparse matrix from scratch, but it will probably be discarded later, as the best way may be to use something like https://github.com/j-fu/ExtendableSparse.jl instead?
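
For concreteness, a minimal self-contained sketch of the simple kernel described in point 1 (one thread per atom, plain loop over j, no atomics) could look like the following. This is not Molly's actual code: toy_force is a stand-in for the real pairwise interaction, and units and boundary handling are omitted.

```julia
using CUDA, StaticArrays

# Toy placeholder interaction, standing in for the real pairwise force.
@inline toy_force(ci, cj) = (dr = ci - cj; dr / (sum(abs2, dr) + 1f-6))

# One thread per atom i, looping over every j and accumulating the force in a
# register before a single write. No atomics are needed because only this
# thread ever writes forces[i].
function simple_nonl_kernel!(forces, coords, n_atoms)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n_atoms
        f = zero(SVector{3, Float32})
        for j in 1:n_atoms
            if j != i
                f += toy_force(coords[i], coords[j])
            end
        end
        forces[i] = f
    end
    return nothing
end

# Example launch (illustrative):
# n_atoms = 3200
# coords = CuArray(rand(SVector{3, Float32}, n_atoms))
# forces = CUDA.fill(zero(SVector{3, Float32}), n_atoms)
# @cuda threads=256 blocks=cld(n_atoms, 256) simple_nonl_kernel!(forces, coords, n_atoms)
```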

@codecov

codecov bot commented Jun 16, 2023

Codecov Report

Patch coverage: 7.31% and project coverage change: -0.87% ⚠️

Comparison is base (17f20ed) 73.12% compared to head (841d291) 72.25%.

❗ Current head 841d291 differs from pull request most recent head b349ac1. Consider uploading reports for the commit b349ac1 to get more accurate results

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
- Coverage   73.12%   72.25%   -0.87%     
==========================================
  Files          35       35              
  Lines        5142     5212      +70     
==========================================
+ Hits         3760     3766       +6     
- Misses       1382     1446      +64     
```
| Files Changed | Coverage | Δ |
| --- | --- | --- |
| src/types.jl | 72.45% <ø> | (ø) |
| src/cuda.jl | 1.39% <5.00%> | (+1.39%) ⬆️ |
| src/setup.jl | 92.27% <100.00%> | (+0.01%) ⬆️ |


@JaydevSR JaydevSR changed the title Non-atomic force summation kernels Non-atomic pairwise force summation kernels. Jun 16, 2023
@jgreener64
Collaborator

Looks like a good start. With regards to benchmarking I would use a system of 1000-5000 atoms and use @benchmark on the force function for now.

> I am facing some issues, most probably with synchronization in the kernel, which give wrong results for the forces.

This stuff can be hard to debug. I would use @cushow and @cuprintln on a small system to print out relevant data and check it is correct and gets reduced correctly. Also, you could try a sync_threads() after setting up and zeroing the shared memory, or before the final reduction.
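
As a tiny illustration of that debugging pattern (not Molly code), something along these lines can be used to check each step of a reduction on a small input:

```julia
using CUDA

# Zero the shared memory, synchronize, then print intermediate values from a
# small system so each step of the reduction can be checked by eye.
function debug_reduce_kernel!(out, vals)
    tid = threadIdx().x
    shared = CuStaticSharedArray(Float32, 32)
    shared[tid] = 0.0f0
    sync_threads()              # make the zeroing visible to all threads

    shared[tid] = vals[tid]
    sync_threads()

    @cuprintln("thread $tid holds $(shared[tid])")  # inspect per-thread values

    if tid == 1                 # serial sum so the printed values can be checked
        s = 0.0f0
        for k in 1:blockDim().x
            s += shared[k]
        end
        out[1] = s
    end
    return nothing
end

# vals = CUDA.rand(Float32, 8); out = CUDA.zeros(Float32, 1)
# @cuda threads=8 debug_reduce_kernel!(out, vals)
```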

It might be worth asking on the #gpu Slack channel about sparse matrices. CUDA has sparse matrices but I guess the issue is that these don't work in kernels?

I would also be interested to see the speed of the OpenMM atom block strategy even with the dense adjacency matrix. Another idea is to use a dense adjacency matrix, but create a sparse data structure with the first thread inside the kernel itself. You could have two shared memory int vectors for the interacting indices. Threads would then loop over those pairs.
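
As a rough sketch (not Molly code), that "compact inside the kernel" idea could look something like the following. This variation uses one block per atom i, so only one shared index vector plus a count is needed; MAX_NEIGHBORS, THREADS_PER_BLOCK and toy_force are assumptions for illustration.

```julia
using CUDA, StaticArrays

const MAX_NEIGHBORS = 512          # assumed upper bound, illustration only
const THREADS_PER_BLOCK = 128

@inline toy_force(ci, cj) = (dr = ci - cj; dr / (sum(abs2, dr) + 1f-6))

# One block per atom i. Thread 1 scans row i of the dense Boolean `neighbors`
# matrix and compacts the interacting j indices into shared memory; after a
# sync, all threads loop over that compact list and the partial forces are
# reduced before a single non-atomic write to forces[i].
function compacted_nl_kernel!(forces, coords, neighbors, n_atoms)
    i = blockIdx().x
    tid = threadIdx().x
    js = CuStaticSharedArray(Int32, MAX_NEIGHBORS)
    count = CuStaticSharedArray(Int32, 1)
    partial = CuStaticSharedArray(SVector{3, Float32}, THREADS_PER_BLOCK)

    if tid == 1                             # thread 1 builds the compact index list
        c = Int32(0)
        for j in 1:n_atoms
            if neighbors[i, j] && c < MAX_NEIGHBORS   # real code must handle overflow
                c += Int32(1)
                js[c] = Int32(j)
            end
        end
        count[1] = c
    end
    sync_threads()

    f = zero(SVector{3, Float32})
    for k in tid:blockDim().x:count[1]      # block-stride loop over the compact list
        f += toy_force(coords[i], coords[js[k]])
    end
    partial[tid] = f
    sync_threads()

    if tid == 1                             # serial reduction to keep the sketch short
        total = zero(SVector{3, Float32})
        for t in 1:blockDim().x
            total += partial[t]
        end
        forces[i] = total
    end
    return nothing
end

# @cuda threads=THREADS_PER_BLOCK blocks=n_atoms compacted_nl_kernel!(forces, coords, neighbors, n_atoms)
```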

@ejmeitz
Collaborator

ejmeitz commented Jun 17, 2023

Not the best CUDA programmer, but I thought I'd throw this out there. I'm sure I'm missing some intricacy, as I have literally spent 0 minutes thinking about force kernels on a GPU, but the CUDA part of Molly is something I'd like to understand more. What are the downsides of using a kernel like this, besides not manually choosing the block sizes etc.? Too many kernel invocations?

```julia
function kernel(atom_idx, interacting_atoms)
    return mapreduce(j -> force(atom_idx, j), +, interacting_atoms)
end
```

@JaydevSR
Contributor Author

JaydevSR commented Jun 17, 2023

@ejmeitz Right now the kernel is essentially doing what you have pointed out. Maybe I will try and see how the mapreduce compares to the current implementation. But I think the problem with just using mapreduce is that it is best suited to reductions where the reduced-over indices are mostly independent and no extra optimizations can be made anyway.

But in our case we are dealing with N^2 interactions, where we can certainly benefit from data reuse and operation ordering. This can only happen if the kernel works at the warp level, so that we can shuffle data around while going through the reduction and force calculation at once. I am working on this right now and trying to iron out a few bugs. I also have little experience with CUDA beyond examples for now; as far as I understand, that is the main motivation for beginning with this barebones implementation. I am not sure how much these optimizations will help, but it will surely be worth something 😄 (unless the neighbor data is so sparse that the extra overhead of the loops takes over, but that can be dealt with as another case?)

@JaydevSR
Contributor Author

JaydevSR commented Jun 20, 2023

Benchmark results for the forces function (non-tiled). The implementation loops over all the j values for a particular i using a block stride, and the force summation is done over shared memory using a binary reduction. Each block calculates one row of the interaction pairs, so there are n_atoms blocks (a sketch of this pattern is included after the table below).

Device: Tesla V100-PCIE-16GB

| Simulation | Atoms | Total Pairs | Min (ms) | Median (ms) | Mean (ms) | Mean per 10000 pairs (μs) |
| --- | --- | --- | --- | --- | --- | --- |
| NONL | 3200 | 5118400 | 2.450 | 2.577 | 2.576 | 5.03 |
| NONL f32 | 3200 | 5118400 | 0.854 | 1.098 | 1.093 | 2.13 |
| NL (atomic) | 3200 | 335323 | 0.215 | 0.225 | 0.226 | 6.74 |
| NL f32 (atomic) | 3200 | 334149 | 0.193 | 0.210 | 0.211 | 6.31 |
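
For reference, a hedged sketch of this pattern (one block per atom, block-stride loop over j, binary reduction over shared memory) might look like the code below. It is not the PR's actual kernel: toy_force is a placeholder, and units, boundary conditions and neighbor handling are omitted.

```julia
using CUDA, StaticArrays

@inline toy_force(ci, cj) = (dr = ci - cj; dr / (sum(abs2, dr) + 1f-6))

# One block per atom i. Each thread accumulates part of row i with a
# block-stride loop over j, the partial sums go into shared memory, and a
# binary (tree) reduction leaves the total in shared[1], which thread 1 writes
# to forces[i]. Assumes blockDim().x is a power of two (128 here).
function rowwise_nonl_kernel!(forces, coords, n_atoms)
    i = blockIdx().x
    tid = threadIdx().x
    shared = CuStaticSharedArray(SVector{3, Float32}, 128)

    f = zero(SVector{3, Float32})
    for j in tid:blockDim().x:n_atoms       # block-stride loop over the row
        if j != i
            f += toy_force(coords[i], coords[j])
        end
    end
    shared[tid] = f
    sync_threads()

    stride = blockDim().x ÷ 2
    while stride > 0                        # binary reduction over shared memory
        if tid <= stride
            shared[tid] += shared[tid + stride]
        end
        sync_threads()
        stride ÷= 2
    end

    if tid == 1
        forces[i] = shared[1]
    end
    return nothing
end

# @cuda threads=128 blocks=n_atoms rowwise_nonl_kernel!(forces, coords, n_atoms)
```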

@ejmeitz
Collaborator

ejmeitz commented Jun 20, 2023

Btw I have access to a cluster of A40s and another of 2080s if you want to test on distributed systems or just on a different single GPU.

@JaydevSR

This comment was marked as outdated.

@JaydevSR
Contributor Author

JaydevSR commented Jun 22, 2023

Benchmark results on the forces function using a tiled kernel without a neighbor list

  1. Approach 0 - Atomic addition on Molly.jl/main
  2. Approach 1 - The tiling is done over one whole block and each warp loops over a complete row of tiles horizontally. Each block computes a section of the pairs that is of size (n_threads, n_threads).
  3. Approach 2 - The tiling is done such that each warp only computes one tile. Each block computes a section of the pairs that is of size (WARPSIZE, n_threads). (A small demo of the warp-level shuffle pattern that these tiled approaches rely on is included at the end of this comment.)

| Simulation | Atoms | Number of Pairs | Min time (ms) | Median time (ms) | Mean time (ms) | Mean time per 10000 pairs (μs) |
| --- | --- | --- | --- | --- | --- | --- |
| Approach 0 | 3072 | 4717056 | 5.193 | 5.331 | 5.339 | 11.31 |
| Approach 0 f32 | 3072 | 4717056 | 5.306 | 5.466 | 5.484 | 11.63 |
| Approach 1 | 3072 | 4717056 | 1.402 | 1.572 | 1.568 | 3.32 |
| Approach 1 f32 | 3072 | 4717056 | 0.870 | 1.065 | 1.061 | 2.24 |
| Approach 2 | 3072 | 4717056 | 2.015 | 2.213 | 2.220 | 4.71 |
| Approach 2 f32 | 3072 | 4717056 | 0.502 | 0.512 | 0.513 | 1.08 |

Takeaway: getting rid of the internal loop gives a 2x improvement for f32, but we pay the price with worse f64 performance. I think this could be because approach 2 increases the number of blocks by a factor of n_threads / WARPSIZE, which makes dispatching the f64 calculations to the SMs on the GPU slower? If that is the case, it could be improved, if we want, by having each warp calculate 2 or 4 tiles at a time (without a loop, just multiple lines of the same code).
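
As a standalone illustration of the warp-level data reuse that approaches 1 and 2 rely on (not the PR's kernel), here is a minimal shuffle "rotation" pattern where each lane loads one value and the warp passes it around with shfl_sync:

```julia
using CUDA

# Each lane loads one value; the warp then rotates the values around with
# shfl_sync so every lane sees every other lane's data without any further
# global or shared memory reads. Assumes a single full warp of 32 threads.
function warp_rotate_sum_kernel!(out, xs)
    lane = threadIdx().x
    val = xs[lane]
    acc = 0.0f0
    for k in 0:31
        src = mod1(lane + k, 32)
        acc += shfl_sync(0xffffffff, val, src)   # stand-in for a force contribution
    end
    out[lane] = acc
    return nothing
end

# xs = CUDA.rand(Float32, 32); out = CUDA.zeros(Float32, 32)
# @cuda threads=32 warp_rotate_sum_kernel!(out, xs)
```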

@jgreener64
Collaborator

Cool, I would focus on f32 performance, since force calculation is usually safe to run in f32. I know other software has severe slowdown with f64; it might be unavoidable.

Am I right in interpreting that the f32 median time before this PR was 1.098 ms for no NL / 0.210 ms for NL and now is 0.512 ms for no NL? That's getting somewhere.

I would aim to move to the NL version soon, particularly if there are things you have learned with the no NL case that can be used there. Can approach 2 above be applied in a performant way to the case where interactions are skipped if they are not in a dense neighbour matrix?

@JaydevSR
Contributor Author

JaydevSR commented Jun 22, 2023

In its current state the kernel is quite minimal, and the only major optimization opportunity I can see is to also shuffle the atom and coordinate data, which will certainly require some preprocessing that I haven't yet worked out how to integrate with the whole interface.

The NONL version gives quite some motivation for a similar implementation for the NL case as well, if we can think of a way to construct dense tiles out of the sparse adjacency matrix. I am not quite sure about this, but it would work quite naturally if we were dealing with cell lists!

@jgreener64 No, 1.098 ms is the median time for the kernel in which we simply have a for loop over all j's for a particular i (using a block stride and then a binary reduction over shared memory). For NL I haven't touched anything, so yes, that is the same.

Looking at the benchmarks of the atomic version (before the PR) that I have just included, you can see this kernel is 10x faster than that. If we can achieve similar performance with a NL it would certainly be something to aim for.

EDIT: Also, the current kernel only works if the number of atoms is a multiple of n_threads and n_threads is a multiple of WARPSIZE. This removes the need for if conditions on i and j, but the general case would introduce quite some complexity, which would certainly impact performance. What should be done in this case?

@jgreener64
Collaborator

> this kernel is 10x faster than that

Great, even better.

Yes I think the NL kernel would be based on this new no NL one, the current NL kernel is probably too simple to speed up a lot.

How easy would it be to have the above approach 2 but with another input, neighbors, the dense adjacency matrix, and just have an if neighbors[i, j] check before the force calculation? Does that have any hope of being fast? Might be worth benchmarking if it isn't too hard. You might get use out of @inbounds around indexing expressions too.

> What should be done in this case?

I would try adding the i and j conditions in the simplest way you can and benchmark the performance. It may not be that bad? The alternative is to pad the inputs, but that adds its own complexity.

@JaydevSR
Contributor Author

Okay, I will try both things and benchmark them 👍

@JaydevSR
Contributor Author

Yes, the current benchmarks are just for the forces function, which calls the kernel after doing some setup. I am using this script for the benchmarks:

```julia
using Molly
using BenchmarkTools
using CUDA

function setup_sim(nl::Bool, f32::Bool)
    local n_atoms = 3219
    atom_mass = f32 ? 10.0f0u"u" : 10.0u"u"
    boundary = f32 ? CubicBoundary(6.0f0u"nm") : CubicBoundary(6.0u"nm")
    starting_coords = place_atoms(n_atoms, boundary; min_dist=0.2u"nm")
    starting_velocities = [random_velocity(atom_mass, 1.0u"K") for i in 1:n_atoms]
    starting_coords_f32 = [Float32.(c) for c in starting_coords]
    starting_velocities_f32 = [Float32.(c) for c in starting_velocities]

    simulator = VelocityVerlet(dt=f32 ? 0.02f0u"ps" : 0.02u"ps")

    neighbor_finder = NoNeighborFinder()
    cutoff = DistanceCutoff(f32 ? 1.0f0u"nm" : 1.0u"nm")
    pairwise_inters = (LennardJones(use_neighbors=false, cutoff=cutoff),)
    if nl
        neighbor_finder = DistanceNeighborFinder(
            eligible=CuArray(trues(n_atoms, n_atoms)),
            n_steps=10,
            dist_cutoff=f32 ? 1.5f0u"nm" : 1.5u"nm",
        )
        pairwise_inters = (LennardJones(use_neighbors=true, cutoff=cutoff),)
    end

    coords = CuArray(f32 ? starting_coords_f32 : starting_coords)
    velocities = CuArray(f32 ? starting_velocities_f32 : starting_velocities)
    atoms = CuArray([Atom(charge=f32 ? 0.0f0 : 0.0, mass=atom_mass, σ=f32 ? 0.2f0u"nm" : 0.2u"nm",
                            ϵ=f32 ? 0.2f0u"kJ * mol^-1" : 0.2u"kJ * mol^-1") for i in 1:n_atoms])

    sys = System(
        atoms=atoms,
        coords=coords,
        boundary=boundary,
        velocities=velocities,
        pairwise_inters=pairwise_inters,
        neighbor_finder=neighbor_finder,
    )

    return sys, simulator
end

runs = [
    ("GPU NONL"       , [false, false]),
    ("GPU NONL f32"   , [false,  true]),
    ("GPU NL"    , [true , false]),
    ("GPU NL f32", [true ,  true]),
]

for (name, args) in runs
    println("*************** Run: $name ************************")
    sys, sim = setup_sim(args...)
    n_atoms = length(sys)
    neighbors = find_neighbors(sys)
    nbrs = isnothing(neighbors) ? n_atoms * (n_atoms - 1) ÷ 2 : length(neighbors)
    println("> Total Pairs = $nbrs")
    f = forces(sys, neighbors)
    b = @benchmark CUDA.@sync forces($sys, $neighbors)
    display(b)
end
```

I will also try to set up Nsight Compute, but if you could take a quick look at the profiles for the neighbor list kernel as well, that would be great.

@ejmeitz
Collaborator

ejmeitz commented Jul 24, 2023

Here's the GPU NL f32 output. A lot more kernels were generated than I expected. The first 3 were generated by sim_setup, find_neighbors launched the next ~100 (not sure why so many), and the last 3 kernels were generated by forces (they start with "Z25pairwise_force_kernel_nl_"). I did run all of the kernels, but the output for this one is already a lot to go through lol. If you want them I can send that, but in general they look to have similar issues to this kernel.

If you unzip the file below and double click on one of the kernels, it will dump all sorts of information and things that could be optimized. I looked quickly at the pairwise_force_kernel and it's only using maybe 30% of my GPU's resources. Looks like there are a lot of un-coalesced accesses and warp divergence.

Molly_kernel.txt.zip

@JaydevSR
Contributor Author

Thanks a lot @ejmeitz. I will go through these and try to see if fixing these problems can improve things.

@JaydevSR
Contributor Author

JaydevSR commented Sep 1, 2023

So... out of all the different approaches for the NL kernel, everything has almost the same performance, even the atomic approach. I think this suggests that the computation is not making any difference and the performance is almost entirely dependent on the memory reads and writes. Profiles also show the same thing. I am not really sure how to deal with this memory bottleneck.

Even using shared memory to store the atoms in between the computation does not deal with this issue. I had a look at the generated PTX code, which shows that many intermediate results are stored in local memory instead of in registers, and using shared memory does not get rid of the majority of the memory stalls. Maybe there is some way to deal with this, but I will have to look into it more.

For now, it would be great if the NONL kernel could at least be merged, and I will work on the NL kernel in a separate PR from scratch with my newly gained experience :) According to the benchmarks for different system sizes, the NONL kernel is 20-30 times faster than the atomic approach, so that's quite an improvement, and any further improvement runs into the same issue: the memory reads and writes are so slow that the computation time doesn't hold a candle to them.

@jgreener64
Collaborator

Sounds good, I can try and review this PR next week. You will need to rebase/merge. The no NL kernel is definitely an improvement. One thing that would be nice to add is a comment by the kernel with a description/diagram of the approach used. I don't know how the kernel will play with Enzyme, but I can look into any problems there myself.

Well done for doggedly pursuing the NL kernel, I know how frustrating it can be. Something to consider is starting from the other direction, completely ablating everything: just loop over the neighbouring pairs and return zero forces. I guess that should be fast? Then see what minimal addition causes the slowdown, make a self-contained example and discuss it here, on Slack or on the CUDA.jl issues.
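
A skeleton of that ablation might look like the following; the pairs_i/pairs_j arrays and the per-pair output buffer are illustrative, not Molly's real data layout:

```julia
using CUDA, StaticArrays

# Keep the loop over the neighbour pairs and the memory traffic, but write zero
# forces, so any remaining cost is reads/writes rather than the force maths.
function ablated_nl_kernel!(pair_out, coords, pairs_i, pairs_j)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(pairs_i)
        ci = coords[pairs_i[idx]]
        cj = coords[pairs_j[idx]]
        pair_out[idx] = 0.0f0 * (ci - cj)   # keeps the reads and one write per pair
    end
    return nothing
end

# n_pairs = length(pairs_i)
# pair_out = CUDA.fill(zero(SVector{3, Float32}), n_pairs)
# @cuda threads=256 blocks=cld(n_pairs, 256) ablated_nl_kernel!(pair_out, coords, pairs_i, pairs_j)
```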

@JaydevSR
Contributor Author

JaydevSR commented Sep 1, 2023

Thanks @jgreener64, I will rebase the incoming branch. Should I also make similar changes to the pairwise potential energy kernel?

@jgreener64
Collaborator

If it's easy to change the potential energy kernel then do. If it will take up anything more than a short time then focus on the NL kernel and I will implement the potential energy kernel later as a way to learn what is going on.

@JaydevSR JaydevSR marked this pull request as ready for review September 3, 2023 16:57
@JaydevSR
Contributor Author

JaydevSR commented Sep 4, 2023

I am not sure why the CI runs are failing. Everything works for me locally.

@jgreener64
Collaborator

This looks good bar the failure when CUDA is not available. That will be due to CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK) or maybe the CUDA.shfl_recurse lines. You could enclose the offending code in if CUDA.functional().
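
For illustration, that guard could look something like this (a sketch only; the fallback value is an assumption):

```julia
using CUDA

# Only query device attributes (or define shuffle overloads) when a working GPU
# is present, so the package still loads on CPU-only machines.
max_threads_per_block = if CUDA.functional()
    CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
else
    256   # conservative default when CUDA is not available
end
```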

Do the GPU tests pass for you? I can run them locally tomorrow too. Long term we should get set up on the Julia GPU CI infrastructure to test this kind of thing.

@JaydevSR
Contributor Author

JaydevSR commented Sep 6, 2023

Yes, the problem seems to lie with CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK). I am not using this anyway, so I have commented it out.

The GPU tests also pass for me locally.

@jgreener64
Collaborator

pairwise_force_kernel! needs to be renamed to pairwise_force_kernel_nl! on line 313 of src/chain_rules.jl; the GPU tests fail otherwise.

I am also seeing an occasional failure on the Monte Carlo anisotropic barostat GPU tests but I'm not sure it is due to this change, do you see that?

> so I have commented it out

May as well remove these lines if they are not needed.

@ejmeitz
Collaborator

ejmeitz commented Sep 6, 2023

The barostat tests fail intermittently on my local machine for CPU. I'm not sure that is related to the changes made in this PR.

@jgreener64 jgreener64 merged commit ebb3fa2 into JuliaMolSim:master Sep 6, 2023
5 checks passed