This is one of many reasons you really don't want to use a thread per interaction. In torchmd/torchmd-net#61 (comment) I recommended using a smaller number of threads and having each thread loop over multiple interactions, but that didn't get implemented.
The index here overflows with 128K particles (one thread is launched per possible pair, N*(N-1)/2 in total)
NNPOps/src/pytorch/neighbors/getNeighborPairsCUDA.cu
Lines 43 to 46 in 054d487
I know it sounds like an absurd regime for such a brute-force approach, but it turns out that this kernel remains a competitive option in some situations, depending for instance on the average number of neighbors per particle.
I believe there should be at least a TORCH_CHECK guarding against this.
Switching to int64_t virtually eliminates this problem, but it carries a big performance penalty, and the maximum number of blocks a CUDA kernel can be launched with becomes a problem soon after.
On the other hand, the way the row index is computed produces incorrect results due to floating-point error at ~100k particles, even when switching to int64 and double.
I suspect this kernel will be defeated by this other approach at some point:
https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation/chapter-31-fast-n-body-simulation-cuda
That approach does not take too much time to implement and does not suffer from the aforementioned issues, since it launches only O(N) threads. Maybe it would be worth having both options and deciding on the algorithm at runtime based on some heuristic?
Additionally, this line:
NNPOps/src/pytorch/neighbors/getNeighborPairsCPU.cpp
Line 58 in 054d487
const Tensor indices = arange(0, num_pairs, positions.options().dtype(kInt32));
tries to allocate 8GB of memory when asked for 128k particles.
This can be fixed by using something like fancy iterators; however, I wonder whether there is a reason the CPU implementation is written exclusively with torch operations.