Repository for Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries

cnedwards/text2mol

Text2Mol

This is code for the paper Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries

Installation

The code is written in Python 3. The full package list is in code/packages.txt, but the following should suffice:

pytorch pytorch-geometric transformers scikit-learn numpy

For processing .sdf files, we recommend RDKit.

For ranker_threshold.py:

matplotlib

Files

| File | Description |
| --- | --- |
| main.py | Train Text2Mol. |
| main_parallel.py | A lightly-tested parallel version. |
| ranker.py | Rank output embeddings. |
| ensemble.py | Rank ensemble of output embeddings. |
| test_example.py | Runs a version of the model that you can query with arbitrary inputs for testing. |
| extract_embeddings.py | Extract embeddings or rules from a specific checkpoint. |
| ranker_threshold.py | Rank output embeddings and plot cosine score vs. ranking. |
| models.py | The three model definitions: MLP, GCN, and Attention. |
| losses.py | Losses used for training. |
| dataloaders.py | Code for loading the data. |
| notebooks | Jupyter Notebooks/Google Colab implementations. |

Example commands:

To train the model:

python code/main.py --data data --output_path test_output --model MLP --epochs 40 --batch_size 32

ranker.py can be used to rank embedding outputs. ensemble.py ranks an ensemble of multiple embedding outputs.

python code/ranker.py test_output/embeddings --train --val --test

python code/ensemble.py test_output/embeddings GCN_outputs/embeddings --train --val --test
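For intuition, ranking embeddings and combining several models can be sketched in a few lines of plain Python. This is only an illustration, not the code in ranker.py or ensemble.py. It assumes molecules are ranked by cosine similarity to the text embedding and that an ensemble averages each molecule's rank across models. The metric helpers (mean reciprocal rank, Hits@k), all function names, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks_from_scores(scores, higher_is_better=True):
    """1-based rank of every item; tied items share the best rank."""
    if higher_is_better:
        return [1 + sum(1 for other in scores if other > s) for s in scores]
    return [1 + sum(1 for other in scores if other < s) for s in scores]


def rank_molecules(text_emb, mol_embs):
    """Rank all molecules for one text query by cosine score (assumed metric)."""
    return ranks_from_scores([cosine(text_emb, m) for m in mol_embs])


def mean_reciprocal_rank(target_ranks):
    return sum(1.0 / r for r in target_ranks) / len(target_ranks)


def hits_at_k(target_ranks, k):
    return sum(1 for r in target_ranks if r <= k) / len(target_ranks)


def ensemble_ranks(rank_lists):
    """Average each molecule's rank across models, then re-rank (assumed scheme)."""
    n_models = len(rank_lists)
    averaged = [sum(ranks) / n_models for ranks in zip(*rank_lists)]
    return ranks_from_scores(averaged, higher_is_better=False)


# Toy data: three molecule embeddings and one text query per molecule.
mol_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [0.1, 1.0]]

# Rank of the correct molecule (index i) for query i.
target_ranks = [rank_molecules(t, mol_embs)[i] for i, t in enumerate(text_embs)]
mrr = mean_reciprocal_rank(target_ranks)
hits1 = hits_at_k(target_ranks, 1)

# A second hypothetical model embeds the third query differently.
ranks_model_a = rank_molecules(text_embs[2], mol_embs)
ranks_model_b = rank_molecules([0.9, 0.7], mol_embs)
combined = ensemble_ranks([ranks_model_a, ranks_model_b])

print(target_ranks, mrr, hits1, combined)
```

In this toy case the first model places the correct molecule second for the third query, while the averaged ranking moves it to first, which is the kind of gain an ensemble is meant to provide.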

To run example queries given a model checkpoint for the MLP model:

python code/test_example.py test_output/embeddings/ data/ test_output/CHECKPOINT.pt

To get embeddings from a specific checkpoint:

python code/extract_embeddings.py --data data --output_path embedding_output_dir --checkpoint test_output/CHECKPOINT.pt --model MLP --batch_size 32

To plot cosine score vs ranking:

python code/ranker_threshold.py test_output/embeddings --train --val --test --output_file threshold_image.png

All code except association_rules.ipynb has been rewritten as Python files.

Data: ChEBI-20

Data can be found in "data/". Files directly used in the dataloaders are "training.txt", "val.txt", and "test.txt". These include the CIDs (PubChem compound IDs), mol2vec embeddings, and ChEBI descriptions. SDF (structure-data file) versions are also available.
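A minimal sketch of reading one record from these files follows. The layout is an assumption, not something stated here: one record per line, with three tab-separated columns (CID, space-separated mol2vec values, description). The function name and the sample line are made up for illustration; check dataloaders.py for the actual parsing logic.

```python
def parse_record(line):
    """Split one line into CID, mol2vec vector, and description.

    Assumes tab-separated columns with a space-separated embedding;
    this layout is a guess and is not confirmed by the README.
    """
    cid, mol2vec, description = line.rstrip("\n").split("\t", 2)
    return {
        "cid": cid,
        "mol2vec": [float(x) for x in mol2vec.split()],
        "description": description,
    }


# Hypothetical sample line, not taken from the dataset.
sample = "12345\t0.1 -0.2 0.3\tA made-up description of a molecule.\n"
record = parse_record(sample)
print(record["cid"], len(record["mol2vec"]), record["description"])
```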

Thanks to PubChem and ChEBI for freely providing access to their databases.

Citation

If you found our work useful, please cite:

@inproceedings{edwards2021text2mol,
  title={Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries},
  author={Edwards, Carl and Zhai, ChengXiang and Ji, Heng},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={595--607},
  year={2021},
  url = {https://aclanthology.org/2021.emnlp-main.47/}
}
