
Batched NNs for TorchANI #13

Merged · raimis merged 24 commits into openmm:master on Oct 1, 2021

Conversation

raimis (Contributor) commented on Oct 15, 2020

A proof of concept to speed up the inference of ANI-like models by using batched matrix operations.

  • Implement TorchANIBatchedNN
  • Benchmarks
  • Tests
  • Documentation

See #11 for details
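
Roughly, the idea is the following (an illustrative sketch only, not the actual TorchANIBatchedNN code): instead of looping over elements and running a separate small MLP per element, a copy of the weights is gathered for every atom and all atoms go through a single batched matrix multiplication.

import torch

# Illustrative sizes; the real ANI-2x networks have several layers per element.
num_elements, in_features, out_features, num_atoms = 2, 8, 16, 5
element_weights = torch.randn(num_elements, out_features, in_features)
element_biases = torch.randn(num_elements, out_features)

species = torch.tensor([0, 1, 1, 0, 1])    # element index of each atom
aev = torch.randn(num_atoms, in_features)  # per-atom features (AEVs)

# Batched variant: replicate the weights per atom, then one bmm covers all atoms.
W = element_weights[species]               # (num_atoms, out_features, in_features)
b = element_biases[species]                # (num_atoms, out_features)
y = torch.bmm(W, aev.unsqueeze(-1)).squeeze(-1) + b

# Reference: loop over elements, as the unbatched TorchANI networks effectively do.
y_ref = torch.empty_like(y)
for e in range(num_elements):
    mask = species == e
    y_ref[mask] = aev[mask] @ element_weights[e].T + element_biases[e]
assert torch.allclose(y, y_ref, atol=1e-5)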

raimis (Contributor, Author) commented on Oct 15, 2020

Benchmarks

Molecule: 46 atoms (pytorch/molecules/2iuz_ligand.mol2)
GPU: GTX 1080 Ti
Script: nn/BenchmarkBatchedNN.py

  • ANI-2x (forward pass): 26 ms
  • ANI-2x (forward & backward pass): 93 ms
  • ANI-2x + BatchedNN (forward pass): 5.0 ms
  • ANI-2x + BatchedNN (forward & backward pass): 15 ms
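
For reference, something along these lines could be used for the timing (a hedged sketch, not the actual nn/BenchmarkBatchedNN.py; the methanol-like molecule below is only a stand-in for 2iuz_ligand.mol2):

import torch
import torchani
from NNPOps.BatchedNN import TorchANIBatchedNN

device = torch.device('cuda')

# Stand-in molecule; atomic numbers restricted to ANI-2x elements.
atomic_numbers = torch.tensor([[6, 8, 1, 1, 1, 1]], device=device)
positions = torch.tensor([[[ 0.00,  0.00,  0.00], [ 1.40,  0.00,  0.00],
                           [-0.36,  1.03,  0.00], [-0.36, -0.51,  0.89],
                           [-0.36, -0.51, -0.89], [ 1.72, -0.85,  0.00]]],
                         dtype=torch.float32, device=device, requires_grad=True)

nnp = torchani.models.ANI2x(periodic_table_index=True).to(device)
nnp.neural_networks = TorchANIBatchedNN(nnp.species_converter, nnp.neural_networks,
                                        atomic_numbers).to(device)

def benchmark(steps=100):
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(steps):
        energy = nnp((atomic_numbers, positions)).energies
        energy.sum().backward()      # forward & backward pass
        positions.grad = None
    end.record()
    torch.cuda.synchronize()
    print(f'{start.elapsed_time(end) / steps:.2f} ms per iteration')

benchmark()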

raimis (Contributor, Author) commented on Oct 15, 2020

Additionally, the timings when using the faster symmetry functions (#5):

  • ANI-2x + #5 + BatchedNN (forward pass): 2.5 ms
  • ANI-2x + #5 + BatchedNN (forward & backward pass): 11 ms

raimis (Contributor, Author) commented on Oct 15, 2020

Profiling of ANI-2x + #5 + BatchedNN (forward pass):
[Profiler timeline screenshot, 2020-10-15]

raimis (Contributor, Author) commented on Oct 15, 2020

  • Under the profiler, the forward pass runs in ~5.0 ms instead of 2.5 ms, so the GPU is busier than the timeline makes it look.
  • Memory allocation for ANISymmetryFunction takes ~0.5 ms. We should get rid of this inefficiency.
  • The ANISymmetryFunction kernels take 0.2 ms.
  • The NN kernels run for 1.5 ms. For small molecules this will eventually become the bottleneck. It would be worth checking whether the auto-tuning TensorRT kernels can perform better.
  • At the end, the EnergyShifter runs a lot of small kernels, which could probably be optimized too.
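
One way to get a similar kernel-level breakdown is PyTorch's built-in autograd profiler (a sketch, assuming nnp, atomic_numbers and positions are set up as in the timing sketch above; the timeline screenshot itself was captured with a different tool):

import torch

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    energy = nnp((atomic_numbers, positions)).energies
    energy.sum().backward()
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=15))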

peastman (Member) commented
Very nice! And it looks from the profile like there's still room for more improvement.

raimis (Contributor, Author) commented on Oct 16, 2020

I have managed to speed up the backward pass further:

  • ANI-2x + BatchedNN (forward pass): 5.2 ms
  • ANI-2x + BatchedNN (forward & backward passes): 9.2 ms
  • ANI-2x + PyTorch wrapper #5 + BatchedNN (forward pass): 2.5 ms
  • ANI-2x + PyTorch wrapper #5 + BatchedNN (forward & backward passes): 5.1 ms

raimis (Contributor, Author) commented on Oct 29, 2020

For some inexplicable reason, the backward pass performance drops when converting to TorchScript:

nnp_ts = torch.jit.script(nnp)

raimis (Contributor, Author) commented on Oct 29, 2020

With tracing, it is slow too:

nnp_tr = torch.jit.trace(nnp, ((species, positions),))

raimis mentioned this pull request on Oct 29, 2020
raimis (Contributor, Author) commented on Nov 4, 2020

I have solved the performance problem regarding TorchScript.

yueyericardo (Contributor) commented
Hi Raimondas,

I found that GPU memory usage increases linearly with molecule size when initializing BatchedNN.
I am not sure whether it will become too expensive; let me know if I did something wrong.

File: 1hz5.pdb, Molecule size: 973
GPU: Tesla V100-PCIE-16GB (16130.5 MB total)

species size | GPU memory used (nvidia-smi)
         100 |  2242.9 MB
         200 |  3820.9 MB
         300 |  6032.9 MB
         400 |  7604.9 MB
         500 | 11542.9 MB
         600 | 13118.9 MB
         700 | 14694.9 MB
         800 |  9478.9 MB
         900 | 10530.9 MB

Script to run it:

import os
import gc
import torch
import torchani
import pynvml
import numpy as np
from ase.io import read
from NNPOps.BatchedNN import TorchANIBatchedNN


def checkgpu(device=None):
    i = device if device else torch.cuda.current_device()
    real_i = int(os.environ['CUDA_VISIBLE_DEVICES'][0]) if 'CUDA_VISIBLE_DEVICES' in os.environ else i
    pynvml.nvmlInit()
    h = pynvml.nvmlDeviceGetHandleByIndex(real_i)
    info = pynvml.nvmlDeviceGetMemoryInfo(h)
    name = pynvml.nvmlDeviceGetName(h)
    print('  GPU Memory Used (nvidia-smi): {:7.1f}MB / {:.1f}MB ({})'.format(info.used / 1024 / 1024, info.total / 1024 / 1024, name.decode()))


file = '1hz5.pdb'
mol = read(file)
device = torch.device('cuda')
species_ = torch.tensor([mol.get_atomic_numbers()], device=device)
positions = torch.tensor([mol.get_positions()], dtype=torch.float32, requires_grad=False, device=device)
print(f'File: {file}, Molecule size: {species_.shape[-1]}\n')

for N in np.arange(100, 1000, 100):
    torch.cuda.empty_cache()
    gc.collect()
    species = species_[:, :N]
    print(f"species size: {species.shape[1]}")
    nnp = torchani.models.ANI2x(periodic_table_index=True, model_index=None).to(device)
    nnp.neural_networks = TorchANIBatchedNN(nnp.species_converter, nnp.neural_networks, species).to(device)
    checkgpu()
    print('-' * 70 + '\n')

PDB file: https://raw.githubusercontent.com/yueyericardo/aev_benchmark/master/molecules/1hz5.pdb

peastman (Member) commented on Dec 3, 2020

That shows a large decrease in memory between 700 and 800? This may not really be measuring what you think it is.

Try modifying the script so N is passed as a command line argument, then run the script repeatedly for different values. That will remove uncertainty about when memory gets released.
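
For example, something like this could replace the loop in the script above (a sketch; the argument name is arbitrary), so each measurement runs in a fresh process:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('N', type=int, help='number of atoms to keep from 1hz5.pdb')
args = parser.parse_args()

# Replaces `for N in np.arange(100, 1000, 100):` in the script above.
species = species_[:, :args.N]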

yueyericardo (Contributor) commented
I know the memory measurement might not be accurate, but a direct run with a species size of 900 gives a similar result:

File: 1hz5.pdb, Molecule size: 973

species size: 900
  GPU Memory Used (nvidia-smi): 10530.9MB / 16130.5MB (Tesla V100-PCIE-16GB)

raimis (Contributor, Author) commented on Dec 4, 2020

@yueyericardo

The NN parameters are replicated for each atom to optimize the memory layout for the batched multiplication, so the memory increases linearly with the system size.

Effectively, this trades memory for speed. For small molecules (~100 atoms), GPUs have more than enough memory. For large molecules, we may need a different algorithm.
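
As a rough back-of-the-envelope (the layer shapes below are only illustrative, not necessarily the exact ANI-2x architecture; ANI-2x is an 8-model ensemble):

def replicated_weight_memory(num_atoms, layer_shapes, num_models=8, bytes_per_param=4):
    # One float32 copy of the weights and biases per atom and per ensemble member.
    params = sum(out * in_ + out for (out, in_) in layer_shapes)
    return num_atoms * num_models * params * bytes_per_param

# Illustrative per-element network: 1008-dimensional AEV input, three hidden layers.
layers = [(256, 1008), (192, 256), (160, 192), (1, 160)]
print(replicated_weight_memory(973, layers) / 1024**2, 'MB')  # on the order of 10 GB for ~1000 atoms

That lands in the same ballpark as the nvidia-smi numbers reported above.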

peastman (Member) commented on Dec 4, 2020

It seems like there ought to be a way to avoid having to duplicate them. We just need to figure out how to get PyTorch to sum/broadcast correctly. Only having one copy of the parameters should help caching efficiency, which would improve performance.

raimis (Contributor, Author) commented on Dec 7, 2020

For small molecules, as far as I tested, this was the fastest algorithm: even though it wastes GPU memory and bandwidth, it needs just one kernel launch.

For larger molecules, if the atoms are sorted by element, the multiplication could be carried out for each element separately. That would not replicate the parameters, but it would require more kernel launches.
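
A rough sketch of that alternative (illustrative only, a single linear layer): group the atoms by element and run one plain matmul per element against a single shared copy of the parameters.

import torch

def per_element_linear(aev, species, element_weights, element_biases):
    # aev: (num_atoms, in_features); species: (num_atoms,) element indices
    # element_weights: (num_elements, out_features, in_features)
    y = aev.new_empty(aev.shape[0], element_weights.shape[1])
    order = torch.argsort(species)          # group atoms of the same element together
    grouped_species = species[order]
    for e in range(element_weights.shape[0]):
        idx = order[grouped_species == e]   # one kernel launch per element (and per layer)
        if idx.numel() > 0:
            y[idx] = aev[idx] @ element_weights[e].T + element_biases[e]
    return y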

jchodera (Member) commented
@raimis: Can this be updated and merged?

raimis self-assigned this on Sep 30, 2021
raimis marked this pull request as ready for review on Sep 30, 2021
raimis requested a review from peastman on Sep 30, 2021
raimis (Contributor, Author) commented on Sep 30, 2021

@peastman this is ready for a review!

peastman (Member) left a review comment

Looks good!

raimis merged commit 44e4282 into openmm:master on Oct 1, 2021