This repository contains the code for fine-tuning the BALM model on natively paired antibody sequences.
To set up the environment, run the following commands:
conda env create -f environment.yml
conda activate PBALM
The pre-trained BALM weights should be downloaded from the Google Drive link: pretrained-BALM. Place the downloaded files in the pretrained_BALM folder.
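To check that the weights are in place, you can try loading them. The lines below are a minimal sketch, assuming the downloaded files follow the standard Hugging Face from_pretrained layout (a config plus weight files):

# minimal check that the downloaded BALM weights load correctly; assumes the files
# in pretrained_BALM/ follow the standard Hugging Face from_pretrained layout
from BALM.modeling_balm import BALMForMaskedLM
model = BALMForMaskedLM.from_pretrained("./pretrained_BALM/")
print(model.config)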
To download, cluster, and clean the data from the Paired OAS dataset and to calculate the masking probabilities, simply run:
bash data.sh
This script creates two files in the main directory: data.pkl and mask_probs.pt. data.pkl contains the pickled antibody sequences with their IMGT numbering, and mask_probs.pt contains the masking probability for each IMGT position, calculated as described in the BALM paper.
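If you want to inspect these outputs, a minimal sketch along the lines below should work; it assumes data.pkl is a standard pickle file and mask_probs.pt was saved with torch.save, while the exact internal structure of either file is not documented here:

# minimal sketch for inspecting the files produced by data.sh;
# the internal structure of data.pkl and mask_probs.pt is an assumption
import pickle
import torch

with open("data.pkl", "rb") as f:
    data = pickle.load(f)                 # antibody sequences with their IMGT numbering
print(type(data))

mask_probs = torch.load("mask_probs.pt")  # per-IMGT-position masking probabilities
print(type(mask_probs))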
The weights for the trained Paired-BALM model can be found here. The example below shows how to obtain sequence representations and attention maps with the trained model:
from BALM.modeling_balm import BALMForMaskedLM
from numbering import get_anarci_numbering
from transformers import EsmTokenizer
import torch
# an antibody sequence example
light = "DIQMTQSPSSLSASVGDRVTITCRASQGIRNDLGWYQQKPGKAPKRLIYAASSLQSGVPSRFSGSGSGTEFTLTISSLQPEDFATYYCLQHNSYPRTFGQGTKVEIK"
heavy = "EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDWPFWQWLVRRGERFDYWGQGTLVTVSS"
input_ab = [[light, heavy]]
tokenizer = EsmTokenizer.from_pretrained("BALM/tokenizer/vocab.txt", do_lower_case=False, model_max_length=288)
batch = tokenizer(input_ab, truncation=True, padding="max_length", return_tensors="pt")
# generate position_ids
batch.update(get_anarci_numbering(input_ab[0]))
with torch.no_grad():
    # download the trained Paired-BALM weights from the Google Drive link first
    model = BALMForMaskedLM.from_pretrained("./pretrained_PBALM/")
    # run the forward pass on the CPU
    outputs = model(**batch, return_dict=True, output_hidden_states=True, output_attentions=True)
# final hidden layer representation [batch_sz * max_length * hidden_size]
final_hidden_layer = outputs.hidden_states[-1]
# final hidden layer sequence representation [batch_sz * hidden_size]
final_seq_embedding = final_hidden_layer[:, 0, :]
# final layer attention map [batch_sz * num_head * max_length * max_length]
final_attention_map = outputs.attentions[-1]
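As a further illustration, the masked-language-modeling head can be used to predict masked residues. The sketch below is not part of the original example: it assumes BALMForMaskedLM returns standard Hugging Face MaskedLM logits of shape [batch_sz * max_length * vocab_size] and that the vocabulary defines a mask token; the masked position is chosen arbitrarily.

# mask one token of the already-tokenized input and predict its identity
masked_batch = {k: v.clone() if torch.is_tensor(v) else v for k, v in batch.items()}
masked_position = 10  # arbitrary token index within the padded input, for illustration only
masked_batch["input_ids"][0, masked_position] = tokenizer.mask_token_id

with torch.no_grad():
    masked_outputs = model(**masked_batch, return_dict=True)

# most likely residue at the masked position, decoded back to a token string
predicted_id = masked_outputs.logits[0, masked_position].argmax(-1).item()
print("predicted residue:", tokenizer.convert_ids_to_tokens(predicted_id))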
The architecture and fine-tuning process of Paired-BALM build on BALM and the Hugging Face modeling framework. We greatly appreciate the work of the BALM and Hugging Face teams.
This source code is licensed under the MIT license found in the LICENSE
file in the root directory of this source tree.