ColabFold

+ 2022/07/13: We have set up a new ColabFold MSA server provided by the Korean Bioinformation Center.
+             It provides accelerated MSA generation. We also updated UniRef30 to 2022_02 and PDB/PDB70 to 220313.

Making Protein folding accessible to all via Google Colab!

Notebooks	monomers	complexes	mmseqs2	jackhmmer	templates
AlphaFold2_mmseqs2	Yes	Yes	Yes	No	Yes
AlphaFold2_batch	Yes	Yes	Yes	No	Yes
RoseTTAFold	Yes	No	Yes	No	No
AlphaFold2 (from Deepmind)	Yes	Yes	No	Yes	No

BETA (in development) notebooks
AlphaFold2_advanced	Yes	Yes	Yes	Yes	No

OLD retired notebooks
AlphaFold2_complexes	No	Yes	No	No	No
AlphaFold2_jackhmmer	Yes	No	Yes	Yes	No
AlphaFold2_noTemplates_noMD
AlphaFold2_noTemplates_yesMD

FAQ

Can I use the models for Molecular Replacement?
- Yes, but be CAREFUL: the B-factor column is populated with pLDDT confidence values (higher = better), whereas Phenix.phaser expects a "real" B-factor (lower = better). See the post from Claudia Millán.
What is the maximum length?
- The limit depends on which free GPU Google Colab assigns you (fingers crossed).
- For GPU: Tesla T4 or Tesla P100 with ~16 GB, the max length is ~1400 residues
- For GPU: Tesla K80 with ~12 GB, the max length is ~1000 residues
- To check what GPU you got, open a new code cell and type !nvidia-smi
Is it okay to use the MMseqs2 MSA server (cf.run_mmseqs2) on a local computer?
- You can access the server from a local computer if your queries are serial and come from a single IP. Please do not use multiple computers to query the server.
Where can I download the databases used by ColabFold?
- The databases are available at colabfold.mmseqs.com
I want to render my own images of the predicted structures, how do I color by pLDDT?
- In pymol for AlphaFold structures: spectrum b, red_yellow_green_cyan_blue, minimum=50, maximum=90
- In pymol for RoseTTAFold structures: spectrum b, red_yellow_green_cyan_blue, minimum=0.5, maximum=0.9
What is the difference between the AlphaFold2_advanced and AlphaFold2_mmseqs2 (_batch) notebook for complex prediction?
- We currently have two different ways to predict protein complexes: (1) using the AlphaFold2 model with a residue index jump and (2) using the AlphaFold2-multimer model. AlphaFold2_advanced supports (1) and AlphaFold2_mmseqs2 (_batch) supports (2).
What is the difference between localcolabfold and the pip installable colabfold_batch?
- localcolabfold is a command line interface for our advanced notebooks. The pip-installable colabfold_batch is a command line version of the AlphaFold2_mmseqs2 and AlphaFold2_batch notebooks.
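
For the molecular replacement question above, a minimal sketch of turning pLDDT values into pseudo B-factors before handing a model to a phasing program. The formula is an assumption, not something ColabFold prescribes: it is an empirical relation that estimates a coordinate error from the confidence score and then converts that error to a B-factor. The function name and the cap of 999 are hypothetical choices for illustration.

```python
import math


def plddt_to_bfactor(plddt: float, max_b: float = 999.0) -> float:
    """Convert a pLDDT score (0-100) to a pseudo B-factor in A^2.

    Assumed empirical relation (not taken from this README):
        rmsd = 1.5 * exp(4 * (0.7 - pLDDT / 100))
        B    = (8 * pi^2 / 3) * rmsd^2
    """
    if not 0.0 <= plddt <= 100.0:
        raise ValueError("pLDDT must be between 0 and 100")
    lddt = plddt / 100.0
    # Estimated coordinate error in Angstrom; grows as confidence drops.
    rmsd = 1.5 * math.exp(4.0 * (0.7 - lddt))
    b_factor = (8.0 * math.pi ** 2 / 3.0) * rmsd ** 2
    # Cap the value so very low-confidence residues stay within a sane range.
    return min(b_factor, max_b)


if __name__ == "__main__":
    for score in (95.0, 70.0, 50.0):
        print(score, round(plddt_to_bfactor(score), 2))
```

High-confidence residues end up with low B-factors and low-confidence residues with high ones, which is the direction a phasing program expects. Check the documentation of your refinement software before relying on any particular conversion.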

Running locally

Note: Check out localcolabfold too.

Install ColabFold using the pip commands below. pip will resolve and install all required dependencies, and ColabFold should be ready to use within a few minutes. Please check the JAX documentation for how to get JAX to work on your GPU or TPU.

pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold"
pip install -q "jax[cuda]>=0.3.8,<0.4" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
# For template-based predictions also install kalign and hhsuite
conda install -c conda-forge -c bioconda kalign2=2.04 hhsuite=3.3.0
# For amber also install openmm and pdbfixer
conda install -c conda-forge openmm=7.5.1 pdbfixer

colabfold_batch <directory_with_fasta_files> <result_dir>

If no GPU or TPU is present, colabfold_batch can be executed (slowly) using only a CPU with the --cpu parameter.
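
If you drive colabfold_batch from a script, the invocation above can be assembled as an argument list. This is a minimal sketch: the helper name build_colabfold_batch_cmd and the example directory names are hypothetical, and only the --cpu flag and the two positional arguments shown above are used.

```python
import shlex
from typing import List


def build_colabfold_batch_cmd(input_dir: str, result_dir: str,
                              use_cpu: bool = False) -> List[str]:
    """Build the argument list for a colabfold_batch run.

    Only options mentioned in this README are used: the two positional
    arguments and the optional --cpu flag.
    """
    cmd = ["colabfold_batch"]
    if use_cpu:
        # Fall back to (slow) CPU-only execution when no GPU/TPU is present.
        cmd.append("--cpu")
    cmd += [input_dir, result_dir]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_colabfold_batch_cmd("fasta_dir", "results", use_cpu=True)))
```

To actually launch the run, pass the returned list to subprocess.run(cmd, check=True) in an environment where colabfold_batch is installed.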

Generating MSAs for large scale structure/complex predictions

First create a directory for the databases on a disk with sufficient storage (940GB (!)). Downloading and setting up the databases will take a couple of hours, depending on where you are:

./setup_databases.sh /path/to/db_folder

Download and unpack mmseqs (Note: The required features aren't in a release yet, so currently, you need to compile the latest version from source yourself or use a static binary). If mmseqs is not in your PATH, replace mmseqs below with the path to your mmseqs:

# This needs a lot of CPU
colabfold_search input_sequences.fasta /path/to/db_folder msas
# This needs a GPU
colabfold_batch msas predictions

This will create an intermediate folder msas that contains all input multiple sequence alignments formatted as a3m files, and a predictions folder with all predicted pdb, json and png files.

Searches against the ColabFoldDB can be done in two different modes:

(1) Batch searches with many sequences against the ColabFoldDB require a machine with approx. 128GB RAM. The search should be performed on the same machine that called setup_databases.sh, since the database index size is adjusted to the main memory size. To search on computers with less main memory, delete the index by removing all .idx files; this will force MMseqs2 to create an index on the fly in memory. MMseqs2 is optimized for large input sequence sets. For batch searches use the --db-load-mode 0 option.

(2) Single-query searches require the full index (the .idx files) to be kept in memory. This can be done with, e.g., vmtouch. Thus, this type of search requires a machine with at least 768GB RAM for the ColabFoldDB. If the index is in memory, use the --db-load-mode 3 parameter in colabfold_search to avoid index loading overhead. If the database is already in memory, use the --db-load-mode 2 option.
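
The choice of --db-load-mode described in the two modes above can be written down as a small lookup. This is a sketch only: the scenario names and the helper build_colabfold_search_cmd are hypothetical, and the mode values simply restate what this section says.

```python
from typing import Dict, List

# Scenario names are hypothetical labels; the values restate this section.
DB_LOAD_MODE: Dict[str, int] = {
    "batch": 0,                         # many sequences, approx. 128GB RAM
    "single_query_index_in_memory": 3,  # index kept in memory (e.g. vmtouch)
    "database_in_memory": 2,            # database already in memory
}


def build_colabfold_search_cmd(query_fasta: str, db_dir: str, msa_dir: str,
                               scenario: str) -> List[str]:
    """Build a colabfold_search argument list for one of the scenarios above."""
    if scenario not in DB_LOAD_MODE:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return [
        "colabfold_search", query_fasta, db_dir, msa_dir,
        "--db-load-mode", str(DB_LOAD_MODE[scenario]),
    ]


if __name__ == "__main__":
    print(" ".join(build_colabfold_search_cmd(
        "input_sequences.fasta", "/path/to/db_folder", "msas", "batch")))
```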

Tutorials & Presentations

ColabFold Tutorial presented at the Boston Protein Design and Modeling Club. [video] [slides].

Projects based on ColabFold or helpers

Run ColabFold on your local computer by Yoshitaka Moriwaki
ColabFold/AlphaFold2 for protein structure predictions for Discoba species by Richard John Wheeler
Cloud-based molecular simulations for everyone by Pablo R. Arantes, Marcelo D. Polêto, Conrado Pedebos and Rodrigo Ligabue-Braun
getmoonbear is a webserver to predict protein structures by Stephanie Zhang and Neil Deshmukh
ColabFold/AlphaFold2 IDR complex prediction by Balint Meszaros
ColabFold/AlphaFold2 (Phenix version) for macromolecular structure determination by Tom Terwilliger
AlphaPickle: making AlphaFold2/ColabFold outputs interpretable by Matt Arnold

Acknowledgments

We would like to thank the RoseTTAFold and AlphaFold teams for doing an excellent job open sourcing the software.
Also credit to David Koes for his awesome py3Dmol plugin, without which these notebooks would be quite boring!
A colab by Sergey Ovchinnikov (@sokrypton), Milot Mirdita (@milot_mirdita) and Martin Steinegger (@thesteinegger).

How do I reference this work?

Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold: Making protein folding accessible to all.
Nature Methods (2022) doi: 10.1038/s41592-022-01488-1
If you’re using AlphaFold, please also cite:
Jumper et al. "Highly accurate protein structure prediction with AlphaFold."
Nature (2021) doi: 10.1038/s41586-021-03819-2
If you’re using AlphaFold-multimer, please also cite:
Evans et al. "Protein complex prediction with AlphaFold-Multimer."
bioRxiv (2021) doi: 10.1101/2021.10.04.463034v1
If you are using RoseTTAFold, please also cite:
Baek et al. "Accurate prediction of protein structures and interactions using a three-track neural network."
Science (2021) doi: 10.1126/science.abj8754

OLD Updates

11Mar2022: We now use AlphaFold-multimer-v2 weights by default for complex modeling.
We also offer the old complex modes "AlphaFold-ptm" and "AlphaFold-multimer-v1".
04Mar2022: ColabFold now uses a much more powerful server for MSAs and searches through the ColabFoldDB instead of BFD/MGnify.
Please let us know if you observe any issues.
26Jan2022: Multimer complex predictions in AlphaFold2_mmseqs2, AlphaFold2_batch and colabfold_batch are
now reranked by default by iptmscore*0.8+ptmscore*0.2 instead of ptmscore.
16Aug2021: WARNING - MMseqs2 API is undergoing upgrade, you may see error messages.
17Aug2021: If you see any errors, please report them.
17Aug2021: We are still debugging the MSA generation procedure...
20Aug2021: WARNING - MMseqs2 API is undergoing upgrade, you may see error messages.
To keep Google Colab from crashing, for large MSAs we used -diff 1000 to get the
1K most diverse sequences. This caused some large MSAs to degrade in quality,
as sequences close to the query were being merged into a single representative.
We are working on updating the server (today) to fix this, by making sure
that both diverse sequences and sequences close to the query are included in the final MSA.
We'll post an update here when the update is complete.
21Aug2021 The MSA issues should now be resolved! Please report any errors you see.
In short, to reduce MSA size we filter (qsc > 0.8, id > 0.95) and take 3K
most diverse sequences at different qid (sequence identity to query) intervals
and merge them. More specifically 3K sequences at qid at (0→0.2),(0.2→0.4),
(0.4→0.6),(0.6→0.8) and (0.8→1). If you submitted your sequence between
16Aug2021 and 20Aug2021, we recommend submitting again for best results!
21Aug2021 The use_templates option in AlphaFold2_mmseqs2 is not working properly. We are
working on fixing this. If you are not using templates, this does not affect
the results. Other notebooks that do not use templates are unaffected.
21Aug2021 The templates issue is resolved!
11Nov2021 [AlphaFold2_mmseqs2] now uses Alphafold-multimer for complex (homo/hetero-oligomer) modeling.
Use [AlphaFold2_advanced] notebook for the old complex prediction logic.
11Nov2021 ColabFold can be installed locally using pip!
14Nov2021 Template-based predictions work again in the AlphaFold2_mmseqs2 notebook.
14Nov2021 WARNING "Single-sequence" mode in AlphaFold2_mmseqs2 and AlphaFold2_batch was broken
starting 11Nov2021. The MMseqs2 MSA was being used regardless of selection.
14Nov2021 "Single-sequence" mode is now fixed.
20Nov2021 WARNING "AMBER" mode in AlphaFold2_mmseqs2 and AlphaFold2_batch was broken
starting 11Nov2021. Unrelaxed proteins were returned instead.
20Nov2021 "AMBER" is fixed thanks to Kevin Pan
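
The multimer reranking rule from the 26Jan2022 entry (iptmscore*0.8+ptmscore*0.2) can be sketched as follows. The function names and the example scores are hypothetical and only illustrate the weighting; they are not ColabFold's actual code or real prediction results.

```python
from typing import Dict, List, Tuple


def multimer_rank_score(iptm: float, ptm: float) -> float:
    """Combined score used for reranking multimer predictions."""
    return 0.8 * iptm + 0.2 * ptm


def rank_models(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return model names sorted from best to worst combined score.

    `scores` maps a model name to an (iptm, ptm) pair.
    """
    return sorted(scores,
                  key=lambda name: multimer_rank_score(*scores[name]),
                  reverse=True)


if __name__ == "__main__":
    # Made-up scores: model_2 has the higher ptm, model_1 the higher iptm.
    example = {"model_1": (0.9, 0.5), "model_2": (0.6, 0.95)}
    print(rank_models(example))
```

In this made-up example, ranking by ptmscore alone would put model_2 first, while the combined score favors model_1 because the interface term carries most of the weight.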
