
CHGNet scorer implementation #8

Open

kianpu34593 wants to merge 6 commits into main
Conversation

kianpu34593

Hi,

The ZMQ option for connecting to an external scorer has been a pain point for me, as it is slow and unreliable on our supercomputer cluster. I have implemented a new scorer, CHGNetScorer, in _scorer.py. It is designed for the case where two GPUs are available on the same node, and it also works if the scorer is hosted on CPU. The idea is to set up two devices: CrystaLLM on cuda and CHGNetScorer on cuda:1, with the worker CPU handling the output transfer.
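
To give a concrete picture, a scorer of this shape might look roughly like the sketch below. This is not the exact code in _scorer.py; it assumes the CIFScorer interface exposes a single score(cif) method, and the CHGNet loading API (model name and target device arguments) may differ slightly between CHGNet versions.

# Sketch only: a CHGNet-backed scorer pinned to its own device.
from chgnet.model import CHGNet
from pymatgen.core import Structure


class CHGNetScorer:
    def __init__(self, model_name: str = "0.3.0", device: str = "cuda:1"):
        # load the pretrained CHGNet on the scorer's own device,
        # leaving the default CUDA device free for CrystaLLM
        self.model = CHGNet.load(model_name=model_name, use_device=device)

    def score(self, cif: str) -> float:
        # parse the generated CIF and return CHGNet's predicted energy per atom (eV/atom)
        structure = Structure.from_str(cif, fmt="cif")
        prediction = self.model.predict_structure(structure)
        return float(prediction["e"])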

For more details about CHGNet, please see here. According to Matbench Discovery, CHGNet is a better ML model than ALIGNN, and MACE is better still; I implemented CHGNet first because I'm more familiar with it than MACE. That being said, I will implement a MACEScorer shortly.

I tested this feature using a LiFeF3 example. The truncated output is attached below:

--> python bin/mcts.py --config=template_6.yaml
Using configuration:
out_dir: crystallm_v1_large
temperature: 1.0
start: 'data_Li6Fe6F18

  '
seed: 1337
device: cuda
dtype: float32
compile: false
tree_width: 5
max_depth: 1000
c: 1.0
num_simulations: 1000
bond_length_acceptability_cutoff: 1.0
reward_k: 2.0
mcts_out_dir: Li6Fe6F18_mcts_cifs
scorer: CHGNet
scorer_host: localhost
scorer_port: 5555
use_context_sensitive_tree_builder: true
top_child_weight_cutoff: 0.99
selector: puct
n_space_groups: 0
bypass_only_child: false
n_rollouts: 1
scorer_device: cuda
chgnet_model_name: 0.3.0

number of parameters: 201.74M
CrystaLLM using: cuda
Pytorch Scorer using: cuda:1
CHGNET model name: 0.3.0
CHGNet v0.3.0 initialized with 412,525 parameters
CHGNet will run on cuda:1
performing 1000 simulations...
performing simulation 1...
/jet/home/jpu/projects/softwares/envs/crystalllm/lib/python3.10/site-packages/pymatgen/analysis/local_env.py:4148: UserWarning: No oxidation states specified on sites! For better results, set the site oxidation states in the structure.
  warnings.warn(
/jet/home/jpu/projects/softwares/envs/crystalllm/lib/python3.10/site-packages/pymatgen/analysis/local_env.py:3941: UserWarning: CrystalNN: cannot locate an appropriate radius, covalent or atomic radii will be used, this can lead to non-optimal results.
  warnings.warn(
invoking external scorer...
sending reply: -5.952406406402588
external scorer returned score: -5.952406406402588
computed reward: 0.5
CIF not written to file as it already exists: /jet/home/jpu/projects/projects/crystal_llm/mcts_example/Li6Fe6F18_mcts_cifs/generated_1.cif
performing simulation 2...

template_6.yaml

out_dir: ../crystallm_v1_large # path to the folder containing the model checkpoint file
temperature: 1.0  # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
start: "data_Li6Fe6F18\n"  # the prompt; can also specify a file, use as: "FILE:prompt.txt"
seed: 1337
device: cuda  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype: float32  # 'float32' or 'bfloat16' or 'float16'
compile: False  # use PyTorch 2.0 to compile the model to be faster
tree_width: 5  # the tree width
max_depth: 1000  # the maximum depth of the tree
c: 1  # the selector constant: c_puct for PUCT, c for UCT, epsilon for greedy
num_simulations: 1000  # the number of simulations to perform during search
bond_length_acceptability_cutoff: 1.0
reward_k: 2.0  # the reward constant
mcts_out_dir: ../mcts_example/Li6Fe6F18_mcts_cifs  # path to the directory where generated CIF files will be stored
scorer: "CHGNet"  # supported values: 'zmq', 'random', "CHGNet"
scorer_host: localhost   # required if `scorer` is 'zmq'
scorer_port: 5555  # required if `scorer` is 'zmq'
use_context_sensitive_tree_builder: True
top_child_weight_cutoff:  0.99
selector: puct  # valid values: 'puct', 'uct', 'greedy'
n_space_groups: 0
bypass_only_child: False
n_rollouts: 1  # the number of rollouts to perform per simulation
scorer_device: "cuda"
chgnet_model_name: "0.3.0"
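
Roughly, the idea is that bin/mcts.py selects the scorer from the new keys at the bottom (scorer, scorer_device, chgnet_model_name). The snippet below is a sketch only, not the exact diff in this PR, and the constructor arguments shown for ZMQScorer and RandomScorer are assumptions.

# Sketch: dispatching on the `scorer` config value when building the scorer.
if config.scorer == "CHGNet":
    scorer = CHGNetScorer(model_name=config.chgnet_model_name,
                          device=config.scorer_device)
elif config.scorer == "zmq":
    scorer = ZMQScorer(host=config.scorer_host, port=config.scorer_port)
else:
    scorer = RandomScorer()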

Oh, I also updated the code to support the latest PyTorch (v2024.3.1).

Please take a look! I'm open to discussion.

Best,
Kian

@lantunes
Owner

Hi,

Thanks for creating this PR. The time and effort you put into developing the CHGNetScorer is greatly appreciated!

I understand your concerns about the current ZMQ-based approach. However, this project intentionally avoids dependencies on specific scorers such as ALIGNN and CHGNet. The core focus of this repository is the development and enhancement of the CrystaLLM model itself, rather than integrations with particular external models. We also want to keep the project simple and avoid the dependency conflicts that incorporating these models could introduce. Instead, we provide an interface for representing the scorer (CIFScorer), and a ZMQ implementation (ZMQScorer) that illustrates how two different processes (with different, even incompatible, environments) can interoperate. We also provide an example script showing how one might use ALIGNN in a separate process. The intention is that users will determine the integration approach that works best for them, as you have done.
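
For anyone reading along, the external process in that ZMQ arrangement is roughly the following sketch. The message format here is an assumption and evaluate_cif is a placeholder for whatever model the user runs in its own environment; the real ALIGNN example lives in the resources folder as alignn_zmq_example.py.

# Sketch: a standalone scorer process that answers CIF-scoring requests over ZMQ.
import zmq

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")  # must match the scorer_port used by the ZMQScorer client

while True:
    cif = socket.recv_string()      # CIF text sent by the MCTS process
    score = evaluate_cif(cif)       # placeholder: any model, in its own environment
    socket.send_string(str(score))  # mirrors the "sending reply: ..." line in the log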

While we might not merge this PR into the main project for the reasons stated above, we encourage you to maintain your fork with the CHGNetScorer implementation. Additionally, we can link to your fork in the documentation, and include an example script in the resources folder, as we did for ALIGNN (alignn_zmq_example.py). This way, users who need this specific functionality and are in similar hardware environments can benefit from your work.

Best regards,
Luis
