
Batched NNs for TorchANI #13

Merged · raimis merged 24 commits into openmm:master on Oct 1, 2021

Conversation

raimis (Contributor) commented on Oct 15, 2020

A proof of concept to speed up the inference of ANI-like models by using batched matrix operations.

  • Implement TorchANIBatchedNN
  • Benchmarks
  • Tests
  • Documentation

See #11 for details
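
Roughly, the idea is the following (an illustrative sketch only, not the actual TorchANIBatchedNN code): instead of looping over elements and running a separate small MLP per element, a copy of the weights is gathered for every atom and all atoms go through a single batched matrix multiplication.

import torch

# Illustrative sizes; the real ANI-2x networks have several layers per element.
num_elements, in_features, out_features, num_atoms = 2, 8, 16, 5
element_weights = torch.randn(num_elements, out_features, in_features)
element_biases = torch.randn(num_elements, out_features)

species = torch.tensor([0, 1, 1, 0, 1])    # element index of each atom
aev = torch.randn(num_atoms, in_features)  # per-atom features (AEVs)

# Batched variant: replicate the weights per atom, then one bmm covers all atoms.
W = element_weights[species]               # (num_atoms, out_features, in_features)
b = element_biases[species]                # (num_atoms, out_features)
y = torch.bmm(W, aev.unsqueeze(-1)).squeeze(-1) + b

# Reference: loop over elements, as the unbatched TorchANI networks effectively do.
y_ref = torch.empty_like(y)
for e in range(num_elements):
    mask = species == e
    y_ref[mask] = aev[mask] @ element_weights[e].T + element_biases[e]
assert torch.allclose(y, y_ref, atol=1e-5)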

raimis (Contributor, Author) commented on Oct 15, 2020

Benchmarks

Molecule: 46 atoms (pytorch/molecules/2iuz_ligand.mol2)
GPU: GTX 1080 Ti
Script: nn/BenchmarkBatchedNN.py

  • ANI-2x (forward pass): 26 ms
  • ANI-2x (forward & backward pass): 93 ms
  • ANI-2x + BatchedNN (forward pass): 5.0 ms
  • ANI-2x + BatchedNN (forward & backward pass): 15 ms
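
For reference, something along these lines could be used for the timing (a hedged sketch, not the actual nn/BenchmarkBatchedNN.py; the methanol-like molecule below is only a stand-in for 2iuz_ligand.mol2):

import torch
import torchani
from NNPOps.BatchedNN import TorchANIBatchedNN

device = torch.device('cuda')

# Stand-in molecule; atomic numbers restricted to ANI-2x elements.
atomic_numbers = torch.tensor([[6, 8, 1, 1, 1, 1]], device=device)
positions = torch.tensor([[[ 0.00,  0.00,  0.00], [ 1.40,  0.00,  0.00],
                           [-0.36,  1.03,  0.00], [-0.36, -0.51,  0.89],
                           [-0.36, -0.51, -0.89], [ 1.72, -0.85,  0.00]]],
                         dtype=torch.float32, device=device, requires_grad=True)

nnp = torchani.models.ANI2x(periodic_table_index=True).to(device)
nnp.neural_networks = TorchANIBatchedNN(nnp.species_converter, nnp.neural_networks,
                                        atomic_numbers).to(device)

def benchmark(steps=100):
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(steps):
        energy = nnp((atomic_numbers, positions)).energies
        energy.sum().backward()      # forward & backward pass
        positions.grad = None
    end.record()
    torch.cuda.synchronize()
    print(f'{start.elapsed_time(end) / steps:.2f} ms per iteration')

benchmark()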

raimis (Contributor, Author) commented on Oct 15, 2020

Additionally, the timings when using the faster symmetry functions (#5):

  • ANI-2x + #5 + BatchedNN (forward pass): 2.5 ms
  • ANI-2x + #5 + BatchedNN (forward & backward pass): 11 ms

raimis (Contributor, Author) commented on Oct 15, 2020

Profiling of ANI-2x + #5 + BatchedNN (forward pass):
[Profiler timeline screenshot, 2020-10-15]

raimis (Contributor, Author) commented on Oct 15, 2020

  • Under the profiler, the forward pass runs in ~5.0 ms instead of 2.5 ms, so the GPU is busier than the timeline makes it look.
  • Memory allocation for ANISymmetryFunction takes ~0.5 ms. We should get rid of this inefficiency.
  • The ANISymmetryFunction kernels take 0.2 ms.
  • The NN kernels run for 1.5 ms. For small molecules this will eventually become the bottleneck. It would be worth checking whether the auto-tuning TensorRT kernels can perform better.
  • At the end, the EnergyShifter runs a lot of small kernels, which could probably be optimized too.
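
One way to get a similar kernel-level breakdown is PyTorch's built-in autograd profiler (a sketch, assuming nnp, atomic_numbers and positions are set up as in the timing sketch above; the timeline screenshot itself was captured with a different tool):

import torch

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    energy = nnp((atomic_numbers, positions)).energies
    energy.sum().backward()
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=15))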

peastman (Member) commented
Very nice! And it looks from the profile like there's still room for more improvement.

raimis (Contributor, Author) commented on Oct 16, 2020

I have managed to speed up the backward pass further:

  • ANI-2x + BatchedNN (forward pass): 5.2 ms
  • ANI-2x + BatchedNN (forward & backward passes): 9.2 ms
  • ANI-2x + PyTorch wrapper #5 + BatchedNN (forward pass): 2.5 ms
  • ANI-2x + PyTorch wrapper #5 + BatchedNN (forward & backward passes): 5.1 ms

raimis (Contributor, Author) commented on Oct 29, 2020

For some inexplicable reason, the backward pass performance drops when converting to TorchScript:

nnp_ts = torch.jit.script(nnp)

raimis (Contributor, Author) commented on Oct 29, 2020

With tracing, it is slow too:

nnp_tr = torch.jit.trace(nnp, ((species, positions),))

raimis mentioned this pull request on Oct 29, 2020
raimis (Contributor, Author) commented on Nov 4, 2020

I have solved the performance problem regarding TorchScript.

yueyericardo (Contributor) commented
Hi Raimondas,

I found that GPU memory usage increases linearly with molecule size when initializing BatchedNN.
I am not sure whether it will become too expensive; let me know if I did something wrong.

File: 1hz5.pdb, Molecule size: 973
GPU: Tesla V100-PCIE-16GB (16130.5 MB total)

species size | GPU memory used (nvidia-smi)
         100 |  2242.9 MB
         200 |  3820.9 MB
         300 |  6032.9 MB
         400 |  7604.9 MB
         500 | 11542.9 MB
         600 | 13118.9 MB
         700 | 14694.9 MB
         800 |  9478.9 MB
         900 | 10530.9 MB

Script to run it:

import os
import gc
import torch
import torchani
import pynvml
import numpy as np
from ase.io import read
from NNPOps.BatchedNN import TorchANIBatchedNN


def checkgpu(device=None):
    i = device if device else torch.cuda.current_device()
    real_i = int(os.environ['CUDA_VISIBLE_DEVICES'][0]) if 'CUDA_VISIBLE_DEVICES' in os.environ else i
    pynvml.nvmlInit()
    h = pynvml.nvmlDeviceGetHandleByIndex(real_i)
    info = pynvml.nvmlDeviceGetMemoryInfo(h)
    name = pynvml.nvmlDeviceGetName(h)
    print('  GPU Memory Used (nvidia-smi): {:7.1f}MB / {:.1f}MB ({})'.format(info.used / 1024 / 1024, info.total / 1024 / 1024, name.decode()))


file = '1hz5.pdb'
mol = read(file)
device = torch.device('cuda')
species_ = torch.tensor([mol.get_atomic_numbers()], device=device)
positions = torch.tensor([mol.get_positions()], dtype=torch.float32, requires_grad=False, device=device)
print(f'File: {file}, Molecule size: {species_.shape[-1]}\n')

for N in np.arange(100, 1000, 100):
    torch.cuda.empty_cache()
    gc.collect()
    species = species_[:, :N]
    print(f"species size: {species.shape[1]}")
    nnp = torchani.models.ANI2x(periodic_table_index=True, model_index=None).to(device)
    nnp.neural_networks = TorchANIBatchedNN(nnp.species_converter, nnp.neural_networks, species).to(device)
    checkgpu()
    print('-' * 70 + '\n')

PDB file: https://raw.githubusercontent.com/yueyericardo/aev_benchmark/master/molecules/1hz5.pdb

peastman (Member) commented on Dec 3, 2020

That shows a large decrease in memory between 700 and 800? This may not really be measuring what you think it is.

Try modifying the script so N is passed as a command line argument, then run the script repeatedly for different values. That will remove uncertainty about when memory gets released.
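
For example, something like this could replace the loop in the script above (a sketch; the argument name is arbitrary), so each measurement runs in a fresh process:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('N', type=int, help='number of atoms to keep from 1hz5.pdb')
args = parser.parse_args()

# Replaces `for N in np.arange(100, 1000, 100):` in the script above.
species = species_[:, :args.N]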

yueyericardo (Contributor) commented
I know the memory measurement might not be accurate, but a direct run with a species size of 900 gives a similar result:

File: 1hz5.pdb, Molecule size: 973

species size: 900
  GPU Memory Used (nvidia-smi): 10530.9MB / 16130.5MB (Tesla V100-PCIE-16GB)

raimis (Contributor, Author) commented on Dec 4, 2020

@yueyericardo

The NN parameters are replicated for each atom to optimize the memory layout for the batched multiplication, so the memory increases linearly with the system size.

Effectively, this trades memory for speed. For small molecules (~100 atoms), GPUs have more than enough memory. For large molecules, we may need a different algorithm.
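
As a rough back-of-the-envelope (the layer shapes below are only illustrative, not necessarily the exact ANI-2x architecture; ANI-2x is an 8-model ensemble):

def replicated_weight_memory(num_atoms, layer_shapes, num_models=8, bytes_per_param=4):
    # One float32 copy of the weights and biases per atom and per ensemble member.
    params = sum(out * in_ + out for (out, in_) in layer_shapes)
    return num_atoms * num_models * params * bytes_per_param

# Illustrative per-element network: 1008-dimensional AEV input, three hidden layers.
layers = [(256, 1008), (192, 256), (160, 192), (1, 160)]
print(replicated_weight_memory(973, layers) / 1024**2, 'MB')  # on the order of 10 GB for ~1000 atoms

That lands in the same ballpark as the nvidia-smi numbers reported above.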

peastman (Member) commented on Dec 4, 2020

It seems like there ought to be a way to avoid having to duplicate them. We just need to figure out how to get PyTorch to sum/broadcast correctly. Only having one copy of the parameters should help caching efficiency, which would improve performance.

raimis (Contributor, Author) commented on Dec 7, 2020

For small molecules, as far as I tested, this was the fastest algorithm: even though it wastes GPU memory and bandwidth, it needs just one kernel launch.

For larger molecules, if the atoms are sorted by element, the multiplication could be carried out for each element separately. That would not replicate the parameters, but it would require more kernel launches.
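
A rough sketch of that alternative (illustrative only, a single linear layer): group the atoms by element and run one plain matmul per element against a single shared copy of the parameters.

import torch

def per_element_linear(aev, species, element_weights, element_biases):
    # aev: (num_atoms, in_features); species: (num_atoms,) element indices
    # element_weights: (num_elements, out_features, in_features)
    y = aev.new_empty(aev.shape[0], element_weights.shape[1])
    order = torch.argsort(species)          # group atoms of the same element together
    grouped_species = species[order]
    for e in range(element_weights.shape[0]):
        idx = order[grouped_species == e]   # one kernel launch per element (and per layer)
        if idx.numel() > 0:
            y[idx] = aev[idx] @ element_weights[e].T + element_biases[e]
    return y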

jchodera (Member) commented
@raimis: Can this be updated and merged?

raimis self-assigned this on Sep 30, 2021
raimis marked this pull request as ready for review on Sep 30, 2021
raimis requested a review from peastman on Sep 30, 2021
raimis (Contributor, Author) commented on Sep 30, 2021

@peastman this is ready for a review!

peastman (Member) left a review comment

Looks good!

raimis merged commit 44e4282 into openmm:master on Oct 1, 2021