We introduce PlasmidGPT, a generative language model pretrained on 153k engineered plasmid sequences from Addgene (https://www.addgene.org/). PlasmidGPT generates de novo sequences that share similar characteristics with engineered plasmids but show low sequence identity to the training data. We demonstrate its ability to generate plasmids in a controlled manner based on an input sequence or specific design constraints. Moreover, our model learns informative embeddings of both engineered and natural plasmids, allowing for efficient prediction of a wide range of sequence-related attributes.
Python package dependencies:
- torch 2.0.1
- transformers 4.37.2
- pandas 2.2.0
- seaborn 0.13.2
We recommend using Conda to install the dependencies. For convenience, we provide a conda environment file with package versions that are compatible with the current version of the program. The conda environment can be set up with the following commands:
- Clone this repository:
  git clone https://github.com/lingxusb/PlasmidGPT.git
  cd PlasmidGPT
- Create and activate the Conda environment:
  conda env create -f env.yml
  conda activate PlasmidGPT
The trained model and tokenizer are available on Hugging Face:
- pretrained_model.pt: the pretrained PlasmidGPT model
- addgene_trained_dna_tokenizer.json: the BPE tokenizer trained on Addgene plasmid sequences
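If you prefer to fetch the files programmatically, the following is a minimal sketch using the huggingface_hub package; the repository ID shown is an assumption, so please replace it with the actual PlasmidGPT repository on Hugging Face. The resulting paths can be used as pt_file_path and tokenizer_file_path in the loading code below.

from huggingface_hub import hf_hub_download

# assumed repository ID; replace with the actual Hugging Face repository for PlasmidGPT
repo_id = "lingxusb/PlasmidGPT"
pt_file_path = hf_hub_download(repo_id=repo_id, filename="pretrained_model.pt")
tokenizer_file_path = hf_hub_download(repo_id=repo_id, filename="addgene_trained_dna_tokenizer.json")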
import torch
from transformers import PreTrainedTokenizerFast, GenerationConfig

# load the model (pt_file_path is the path to pretrained_model.pt)
device = 'cpu'  # use 'cuda' for GPU
model = torch.load(pt_file_path, map_location=device).to(device)
model.eval()

# load the trained BPE tokenizer (tokenizer_file_path is the path to addgene_trained_dna_tokenizer.json)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_file_path)

# tokenize the starting DNA sequence (start_sequence is a plain string)
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)

# model generation
outputs = model.generate(
    input_ids,
    max_length=300,
    num_return_sequences=1,
    temperature=1.0,
    do_sample=True,
    generation_config=GenerationConfig.from_model_config(model.config)
)

# transform tokens back to a DNA nucleotide sequence
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
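To keep the result for downstream tools, the generated sequence can be written out in FASTA format using only the standard library. This is a minimal sketch assuming the generated_sequence variable from the code above; the record name is a placeholder.

# write the generated sequence to a FASTA file ("generated_plasmid_1" is a placeholder record ID)
with open("generated_sequence.fasta", "w") as handle:
    handle.write(">generated_plasmid_1\n")
    # wrap the sequence at 70 characters per line for readability
    for i in range(0, len(generated_sequence), 70):
        handle.write(generated_sequence[i:i + 70] + "\n")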
To generate plasmid sequences using the model, please run the following command:
python generate.py --model_dir ../pretrained_model
The ../pretrained_model folder should contain the model file and the tokenizer.
For a full list of options, please run:
python generate.py -h
Arguments description
| argument | description |
| --- | --- |
| -h, --help | show help message and exit |
| -m, --model_dir | path to the directory containing the pretrained model and tokenizer, required |
| -s, --start_sequence | starting DNA sequence for sequence generation |
| -f, --fasta_file | FASTA file containing the starting sequence |
| -n, --num_sequences | number of sequences to generate, default value: 1 |
| -l, --max_length | maximum length of the tokenized generated sequence, default value: 300 |
| -t, --temperature | temperature for sequence generation (controls randomness), default value: 1.0 |
| -o, --output | output file name for the generated sequences, default value: generated_sequence.fasta |
The model output will be stored in the generated_sequence.fasta file. The script automatically detects whether CUDA (GPU) is available and falls back to CPU otherwise, so it also runs on CPU-only machines.
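As an example, the command below combines several of these options; the start sequence and output file name are placeholders chosen for illustration:

python generate.py --model_dir ../pretrained_model --start_sequence ATGGCTAGC --num_sequences 5 --temperature 0.8 --output my_plasmids.fasta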
Please also check our Jupyter notebook PlasmidGPT_generate.ipynb.
Alternatively, you can use our Colab notebook in the browser. Please make sure to connect to a GPU instance (e.g., T4 GPU). The notebook will automatically download the pretrained model and tokenizer. The plasmid sequence can be generated based on the user's specified start sequence and downloaded in the .fasta file format.
import numpy as np

# calculation of model embeddings
model.config.output_hidden_states = True
embedding = []  # collects one embedding vector per sequence

# inference to obtain hidden states for a tokenized sequence (input_ids, as above)
with torch.no_grad():
    outputs = model(input_ids)

# average the last hidden layer over sequence positions to get a fixed-size embedding
hidden_states = outputs.hidden_states[-1].cpu().numpy()
hidden_states_mean = np.mean(hidden_states, axis=1).reshape(-1)
embedding.append(hidden_states_mean)
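To embed several sequences at once and write the result to a text file, the steps above can be wrapped in a loop. This is a minimal sketch that reuses the model, tokenizer, and device from the loading code above; dna_sequences is a placeholder list of input sequences.

dna_sequences = ["ATGGCTAGC", "GGATCCGT"]  # placeholder DNA sequences

embedding = []
for seq in dna_sequences:
    input_ids = tokenizer.encode(seq, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(input_ids)
    hidden_states = outputs.hidden_states[-1].cpu().numpy()
    embedding.append(np.mean(hidden_states, axis=1).reshape(-1))

# one row per sequence, one column per embedding dimension
np.savetxt("embeddings.txt", np.vstack(embedding))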
To generate plasmid sequence embeddings, please run the following command:
python embeddings.py [-h] -m MODEL_DIR -f FASTA_FILE [-o OUTPUT_FILE]
Arguments description
| argument | description |
| --- | --- |
| -h, --help | show help message and exit |
| -m, --model_dir | path to the directory containing the pretrained model and tokenizer, required |
| -f, --fasta_file | FASTA file containing DNA sequences for the embedding calculation, required |
| -o, --output_file | output file name for saving the embeddings |
The model output will be saved in the embeddings.txt file.
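For example, assuming the input sequences are stored in plasmids.fasta (a placeholder file name):

python embeddings.py -m ../pretrained_model -f plasmids.fasta -o embeddings.txt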
For prediction of attributes, please check our models in the prediction_models folder.
To predict the lab of origin based on an input FASTA file, please run the following command:
python prediction.py [-h] -m MODEL_DIR -i INPUT_FILE [-e] -nn NN_MODEL -l LAB_LIST [-o OUTPUT_FILE] [-n TOP_N]
The neural network model for lab prediction is provided in ./prediction_models/embedding_prediction_labs.pth. The lab labels are provided in ./prediction_models/lab_list.txt.
Arguments description
| argument | description |
| --- | --- |
| -h, --help | show help message and exit |
| -m, --model_dir | path to the directory containing the pretrained model and tokenizer, required |
| -i, --input_file | FASTA file or embeddings file as input, required |
| -e, --embedding_file | indicates if the input is an embedding file |
| -nn, --nn_model | path to the neural network model for lab prediction, required |
| -l, --lab_list | path to the file containing the lab labels, required |
| -o, --output_file | output file name for lab predictions |
| -n, --top_n | number of top predictions to output, default value: 10 |
The top predictions will be stored in the lab_predictions.txt file, where each row corresponds to one input sequence.
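For example, to predict the lab of origin directly from a FASTA file using the provided model and label list (plasmids.fasta is a placeholder input file name):

python prediction.py -m ../pretrained_model -i plasmids.fasta -nn ./prediction_models/embedding_prediction_labs.pth -l ./prediction_models/lab_list.txt -o lab_predictions.txt -n 10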
We have provided the Jupyter notebook PlasmidGPT_predict.ipynb for the prediction of the lab of origin.
The Colab Notebook can be easily used in the browser to predict the lab of origin, species, and vector type for the input sequence. The notebook will automatically download all related models and make predictions based on the user's input plasmid sequence. Please use the drop-down list to select the feature to predict, and the top 10 predictions will be displayed.