megaDNA: a long-context language model for deciphering and generating bacteriophage genomes

Generative pre-trained transformers (GPTs) have revolutionized the field of natural language processing. Inspired by the success of large language models, we developed a long-context generative model for genomes. Our multiscale transformer model was pre-trained on unannotated bacteriophage genomes with byte-level tokenization. We demonstrate the foundational capabilities of our model, including the prediction of essential genes, genetic variant effects, regulatory element activity, and the taxonomy of unannotated sequences. Furthermore, it generates de novo sequences of up to 96K base pairs that contain functional regulatory elements and novel proteins with phage-related functions.

Install

To install megaDNA, run the following commands:

git clone https://github.com/lingxusb/megaDNA.git
cd megaDNA
pip install .

Trained model

Sequence generation

import torch

# load the trained model (model_path points to the downloaded checkpoint)
device = 'cpu'  # use 'cuda' for GPU
model = torch.load(model_path, map_location=torch.device(device))

# token vocabulary used for byte-level tokenization
nucleotides = ['**', 'A', 'T', 'C', 'G', '#']

# new sequences are generated from a tokenized primer sequence
seq_tokenized = model.generate(primer_sequence,
                               seq_len=context_length,
                               temperature=0.95,
                               filter_thres=0.0)

# transform tokens back to a DNA nucleotide sequence
def token2nucleotide(s):
    return nucleotides[s]
generated_sequence = ''.join(map(token2nucleotide, seq_tokenized.squeeze().cpu().int()))
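
The snippet above assumes `primer_sequence` is already a tokenized tensor. A minimal sketch of building one from a DNA string, assuming token ids simply follow the order of the `nucleotides` vocabulary above (`primer_dna` is a hypothetical example, not from the repository):

import torch

# map each nucleotide to its token id, following the vocabulary order above
NT_TO_TOKEN = {nt: i for i, nt in enumerate(['**', 'A', 'T', 'C', 'G', '#'])}

primer_dna = 'ATGACG'  # hypothetical primer sequence
primer_sequence = torch.tensor([NT_TO_TOKEN[nt] for nt in primer_dna],
                               dtype=torch.long).unsqueeze(0).to(device)  # shape (1, L)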

Please check our Jupyter notebook: megaDNA_generate.ipynb. A GPU is recommended.

Alternatively, you can run the Colab notebook in the browser. Please make sure to connect to a GPU instance (e.g. a T4 GPU).

Features for the generated sequences

  • Annotated genes (Fig. 2b)
  • Annotated proteins with diverse functions (Fig. 2i & Fig. S12)
  • Folding of annotated proteins (Fig. 2h & Fig. S11)
  • Virus scores comparable to those of natural phages (Fig. 2c)
  • Marker genes for phage (Fig. 2h)
  • Classified as Caudoviricetes (~37%, Fig. 2d)
  • Predicted hosts (~40%, Fig. S9)
  • Regulatory elements including promoters and RBS (Fig. 2f, 2g, Fig. S10)

Please check our preprint for more details.

Model embeddings and loss

import numpy as np
import torch

# a random input sequence (token ids 1-4 encode A, T, C, G)
encoded_sequence = np.random.choice(np.arange(1, 5), 100)
input_seq = torch.tensor(encoded_sequence).unsqueeze(0).to(device)

# get embeddings; output[0:3] stores embeddings from the three transformer layers
output = model(input_seq, return_value='embedding')

# get model loss
output = model(input_seq, return_value='loss')
print(output)
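
A common downstream use of these embeddings is to mean-pool over sequence positions to obtain one fixed-length vector per sequence, e.g. as input features for taxonomy or essential-gene classifiers. A sketch, under the assumption that each layer's embedding tensor has shape (batch, length, dim):

# mean-pool one layer's token embeddings into a fixed-length sequence vector
# (assumption: the embedding tensor is shaped (batch, length, dim))
embeddings = model(input_seq, return_value='embedding')
layer_embedding = embeddings[0]                # embeddings from the first layer
sequence_vector = layer_embedding.mean(dim=1)  # average over sequence positions
print(sequence_vector.shape)                   # (batch, dim)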

In silico mutagenesis analysis

Please check our Jupyter notebook: megaDNA_mutagenesis.ipynb. The FASTA file and gene annotation for lambda phage can be downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NC_001416.1
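
The notebook contains the full workflow; the underlying idea can be sketched with the loss API shown above, by mutating one position at a time and comparing the model loss against the wild type. This is a minimal illustration, not the notebook's exact code; the position and scoring convention are assumptions:

# score single-nucleotide variants at one position by comparing model loss
# against the wild-type sequence (minimal illustrative sketch)
wild_type_loss = model(input_seq, return_value='loss')

position = 50                      # hypothetical position to mutate
for token in range(1, 5):          # token ids 1-4 encode A, T, C, G
    mutated = input_seq.clone()
    mutated[0, position] = token
    variant_loss = model(mutated, return_value='loss')
    # a larger loss increase suggests a more disruptive variant
    print(nucleotides[token], (variant_loss - wild_type_loss).item())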
