UF-Symmetry inference code (dptech-corp#44)
* swap case study

* add uf-symmetry inference (dptech-corp#21)

* notebook noqa

* fix arg-passing bug

* fix plotting bug

* fix b factor bug
ZiyaoLi authored and teslacool committed Feb 27, 2023
1 parent aa414b7 commit 660bcc6
Showing 15 changed files with 1,392 additions and 30 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -120,3 +120,4 @@ test/
*.tfevents.*
*.sto
*.a3m
nogit/
52 changes: 52 additions & 0 deletions README.md
@@ -2,13 +2,15 @@

[[bioRxiv](https://www.biorxiv.org/content/10.1101/2022.08.04.502811v2)], [[Uni-Fold Colab](https://colab.research.google.com/github/dptech-corp/Uni-Fold/blob/main/notebooks/unifold.ipynb)], [[Hermite™](https://hermite.dp.tech/)]

[[UF-Symmetry bioRxiv](https://www.biorxiv.org/content/10.1101/2022.08.30.505833)]

We proudly present Uni-Fold as a thoroughly open-source platform for developing protein models beyond [AlphaFold](https://github.com/deepmind/alphafold/). Uni-Fold introduces the following features:

- Reimplemented AlphaFold and AlphaFold-Multimer models in the PyTorch framework. **To our knowledge, this is currently the first open-source repository that supports training AlphaFold-Multimer.**
- Model correctness proved by successful from-scratch training with equivalent accuracy, both monomer and multimer included.
- Highest efficiency among existing AlphaFold implementations (to our knowledge).
- Easy distributed training based on [Uni-Core](https://github.com/dptech-corp/Uni-Core/), as well as other conveniences including half-precision training (`float16/bfloat16`), per-sample gradient clipping, and fused CUDA kernels.
- Fast prediction of large symmetric complexes with UF-Symmetry.


![case](./img/case.png)
@@ -44,6 +46,19 @@ The name Uni-Fold is inherited from our previous repository, [Uni-Fold-JAX](http

---

## NEWEST in Uni-Fold

[2022-09-06] We released the code of Uni-Fold Symmetry (UF-Symmetry), a fast solution to fold large symmetric protein complexes. The details of UF-Symmetry can be found in [bioRxiv: Uni-Fold Symmetry: Harnessing Symmetry in Folding Large Protein Complexes](https://www.biorxiv.org/content/10.1101/2022.08.30.505833). The code of UF-Symmetry is concentrated in the folder [`unifold/symmetry`](./unifold/symmetry/).

![case](./img/uf-symmetry-effect.gif)
<center>
<small>
Figure 4. Prediction of UF-Symmetry. AlphaFold etc. failed due to OOM errors.
</small>
</center>

&nbsp;

## Installation and Preparations

### Installing Uni-Fold
@@ -175,6 +190,29 @@ Besides the notices in the previous section, additionally note that:
1. The model architecture should be correctly specified by the model name.
2. Checkpoints must be in Uni-Fold format (`*.pt`).

## Run UF-Symmetry

To run UF-Symmetry, please first install the newest version of Uni-Fold, and download the parameters of UF-Symmetry:

```bash
wget https://uni-fold.dp.tech/uf_symmetry_params_2022-09-06.tar.gz
tar -zxf uf_symmetry_params_2022-09-06.tar.gz
```

Run

```bash
# Positional arguments, in order:
#   1. target FASTA file (asymmetric unit only)
#   2. desired symmetry group
#   3. output directory
#   4. directory of databases
#   5. use templates released before this date
#   6. model parameters
bash run_uf_symmetry.sh \
    /path/to/the/input.fasta \
    C3 \
    /path/to/the/output/directory/ \
    /path/to/database/directory/ \
    2020-05-01 \
    /path/to/model_parameters.pt
```

to perform inference with UF-Symmetry. **Note that the input FASTA file should contain the sequences of the asymmetric unit (AU) only, and a symmetry group must be specified for the model.**
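
As a rough illustration of what "asymmetric unit plus symmetry group" implies, the sketch below mirrors the check the notebook in this commit applies to the symmetry label (a leading `C` followed by digits) and counts the chains in the resulting assembly. The helper names `parse_cyclic_group` and `assembly_chain_count` are hypothetical and are not part of Uni-Fold; the only assumption beyond the notebook's check is the standard fact that a cyclic group `Cn` contains `n` copies of the AU.

```python
def parse_cyclic_group(symmetry_group: str) -> int:
    """Return n for a cyclic group label 'Cn'; raise ValueError otherwise.

    Hypothetical helper: it reuses the test from the notebook's
    validate_input (startswith('C') and the rest is numeric).
    """
    if symmetry_group.startswith("C") and symmetry_group[1:].isnumeric():
        return int(symmetry_group[1:])
    raise ValueError(
        f"Unsupported symmetry group {symmetry_group!r}; "
        "only cyclic groups (Cn) are handled in this sketch."
    )


def assembly_chain_count(num_au_chains: int, symmetry_group: str) -> int:
    """Chains in the full assembly: the AU chains repeated once per copy."""
    return num_au_chains * parse_cyclic_group(symmetry_group)


# An AU with two chains under C3 symmetry gives a six-chain assembly.
print(assembly_chain_count(2, "C3"))  # 6
```

This is why the FASTA file lists each AU chain once: the copies are implied by the symmetry group rather than written out in the input.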

## Inference on Hermite

We provide a convenient structure prediction service on [Hermite™](https://hermite.dp.tech/), a new-generation drug design platform powered by AI, physics, and computing. Users only need to upload sequences of protein monomers and multimers to obtain the predicted structures from Uni-Fold, accompanied by various analysis tools. [Click here](https://docs.google.com/document/d/1iFdezkKJVuhyqN3WvzsC7-422T-zf18IhP7M9CBj5gs) for more information on how to use Hermite™.
@@ -199,6 +237,20 @@ If you use the code, the model parameters, the web server at [Hermite™](https:
}
```

If you use the related utilities of UF-Symmetry, please cite:

```bibtex
@article {uf-symmetry,
author = {Li, Ziyao and Yang, Shuwen and Liu, Xuyang and Chen, Weijie and Wen, Han and Shen, Fan and Ke, Guolin and Zhang, Linfeng},
title = {Uni-Fold Symmetry: Harnessing Symmetry in Folding Large Protein Complexes},
year = {2022},
doi = {10.1101/2022.08.30.505833},
URL = {https://www.biorxiv.org/content/early/2022/08/30/2022.08.30.505833},
eprint = {https://www.biorxiv.org/content/early/2022/08/30/2022.08.30.505833.full.pdf},
journal = {bioRxiv}
}
```

## Acknowledgements

Our training framework is based on [Uni-Core](https://github.com/dptech-corp/Uni-Core/). Implementation of fused operators referred to [fused_ops](https://github.com/guolinke/fused_ops/) and [OneFlow](https://github.com/Oneflow-Inc/oneflow). We partly referred to an early version of [OpenFold](https://github.com/aqlaboratory/openfold) for some of the PyTorch implementations, while mostly following the original code of [AlphaFold](https://github.com/deepmind/alphafold/). For the data processing part, we followed [AlphaFold](https://github.com/deepmind/alphafold/), and referred to utilities in [Biopython](https://biopython.org/), [HH-suite3](https://github.com/soedinglab/hh-suite/), [HMMER](http://eddylab.org/software/hmmer/), [Kalign](https://msa.sbc.su.se/cgi-bin/msa.cgi), [pandas](https://pandas.pydata.org/), [NumPy](https://numpy.org/), and [SciPy](https://scipy.org/).
Binary file added img/uf-symmetry-effect.gif
122 changes: 93 additions & 29 deletions notebooks/unifold.ipynb
@@ -27,6 +27,7 @@
"* Evans et al. \"[Protein complex prediction with AlphaFold-Multimer.](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1)\" biorxiv (2021)\n",
"* Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. \"[ColabFold: Making protein folding accessible to all.](https://www.nature.com/articles/s41592-022-01488-1)\" Nature Methods (2022) \n",
"* Ziyao Li, Xuyang Liu, Weijie Chen, Fan Shen, Hangrui Bi, Guolin Ke, Linfeng Zhang. \"[Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold.](https://www.biorxiv.org/content/10.1101/2022.08.04.502811v1)\" biorxiv (2022)\n",
"* Ziyao Li, Shuwen Yang, Xuyang Liu, Weijie Chen, Han Wen, Fan Shen, Guolin Ke, Linfeng Zhang. \"[Uni-Fold Symmetry: Harnessing Symmetry in Folding Large Protein Complexes.](https://www.biorxiv.org/content/10.1101/2022.08.30.505833v1)\" bioRxiv (2022)\n",
"\n",
"\n",
"**Acknowledgements**\n",
@@ -78,6 +79,8 @@
"GIT_REPO = 'https://github.com/dptech-corp/Uni-Fold'\n",
"UNICORE_URL = 'https://github.com/dptech-corp/Uni-Core/releases/download/0.0.1/unicore-0.0.1+cu113torch1.12.1-cp37-cp37m-linux_x86_64.whl'\n",
"PARAM_URL = 'https://drive.google.com/uc?id=1A9iXMYCwP0f_U0FgISJ_6BX7FXZtglvV'\n",
"UF_SYMM_PARAM_URL = 'https://uni-fold.dp.tech/uf_symmetry_params_2022-09-06.tar.gz' # TODO: use Google drive.\n",
"\n",
"\n",
"!rm *.whl\n",
"!wget {UNICORE_URL} \n",
@@ -86,7 +89,9 @@
"!git clone -b main {GIT_REPO}\n",
"!pip3 -q install ./Uni-Fold\n",
"!gdown {PARAM_URL}\n",
"!tar -xzf \"unifold_params_2022-08-01.tar.gz\"\n"
"!tar -xzf \"unifold_params_2022-08-01.tar.gz\"\n",
"!wget {UF_SYMM_PARAM_URL}\n",
"!tar -xzf \"uf_symmetry_params_2022-09-06.tar.gz\""
]
},
{
@@ -145,6 +150,7 @@
"\n",
"def validate_input(\n",
" input_sequences: Sequence[str],\n",
" symmetry_group: str,\n",
" min_length: int,\n",
" max_length: int,\n",
" max_multimer_length: int) -> Tuple[Sequence[str], bool]:\n",
@@ -158,10 +164,22 @@
" min_length=min_length,\n",
" max_length=max_length)\n",
" sequences.append(input_sequence)\n",
" \n",
" if symmetry_group != 'C1':\n",
" if symmetry_group.startswith('C') and symmetry_group[1:].isnumeric():\n",
" print(f'Using UF-Symmetry with group {symmetry_group}. If you do not '\n",
" f'want to use UF-Symmetry, please use `C1` and copy the AU '\n",
" f'sequences to the count in the assembly.')\n",
" is_multimer = (len(sequences) > 1)\n",
" return sequences, is_multimer, symmetry_group\n",
" else:\n",
" raise ValueError(f\"UF-Symmetry does not support symmetry group \"\n",
" f\"{symmetry_group} currently. Cyclic groups (Cx) are \"\n",
" f\"supported only.\")\n",
"\n",
" if len(sequences) == 1:\n",
" elif len(sequences) == 1:\n",
" print('Using the single-chain model.')\n",
" return sequences, False\n",
" return sequences, False, None\n",
"\n",
" elif len(sequences) > 1:\n",
" total_multimer_length = sum([len(seq) for seq in sequences])\n",
@@ -171,7 +189,7 @@
" f'{max_multimer_length}. Please use the full AlphaFold '\n",
" f'system for long multimers.')\n",
" print(f'Using the multimer model with {len(sequences)} sequences.')\n",
" return sequences, True\n",
" return sequences, True, None\n",
"\n",
" else:\n",
" raise ValueError('No input amino acid sequence provided, please provide at '\n",
@@ -186,6 +204,8 @@
"sequence_3 = '' #@param {type:\"string\"}\n",
"sequence_4 = '' #@param {type:\"string\"}\n",
"\n",
"symmetry_group = 'C1' #@param {type:\"string\"}\n",
"\n",
"use_templates = True #@param {type:\"boolean\"}\n",
"msa_mode = \"MMseqs2\" #@param [\"MMseqs2\",\"single_sequence\"]\n",
"\n",
@@ -198,8 +218,9 @@
"target_id = add_hash(jobname, basejobname)\n",
"\n",
"# Validate the input.\n",
"sequences, is_multimer = validate_input(\n",
"sequences, is_multimer, symmetry_group = validate_input(\n",
" input_sequences=input_sequences,\n",
" symmetry_group=symmetry_group,\n",
" min_length=MIN_SINGLE_SEQUENCE_LENGTH,\n",
" max_length=MAX_SINGLE_SEQUENCE_LENGTH,\n",
" max_multimer_length=MAX_MULTIMER_LENGTH)\n",
@@ -646,6 +667,7 @@
"source": [
"#@title Uni-Fold prediction\n",
"\n",
"import torch\n",
"import json\n",
"from unifold.config import model_config\n",
@@ -654,6 +676,12 @@
"from unicore.utils import (\n",
" tensor_tree_map,\n",
")\n",
"from unifold.symmetry import (\n",
" UFSymmetry,\n",
" uf_symmetry_config,\n",
" assembly_from_prediction,\n",
" load_and_process_symmetry,\n",
")\n",
"\n",
"def automatic_chunk_size(seq_len):\n",
" if seq_len < 512:\n",
@@ -670,7 +698,7 @@
"\n",
"\n",
"def load_feature_for_one_target(\n",
" config, data_folder, seed=0, is_multimer=False, use_uniprot=False\n",
" config, data_folder, seed=0, is_multimer=False, use_uniprot=False, symmetry_group=None,\n",
"):\n",
" if not is_multimer:\n",
" uniprot_msa_dir = None\n",
@@ -681,21 +709,40 @@
" else:\n",
" uniprot_msa_dir = data_folder\n",
" sequence_ids = open(os.path.join(data_folder, \"chains.txt\")).readline().split()\n",
" batch, _ = load_and_process(\n",
" config=config.data,\n",
" mode=\"predict\",\n",
" seed=seed,\n",
" batch_idx=None,\n",
" data_idx=0,\n",
" is_distillation=False,\n",
" sequence_ids=sequence_ids,\n",
" monomer_feature_dir=data_folder,\n",
" uniprot_msa_dir=uniprot_msa_dir,\n",
" )\n",
" \n",
" if symmetry_group is None:\n",
" batch, _ = load_and_process(\n",
" config=config.data,\n",
" mode=\"predict\",\n",
" seed=seed,\n",
" batch_idx=None,\n",
" data_idx=0,\n",
" is_distillation=False,\n",
" sequence_ids=sequence_ids,\n",
" monomer_feature_dir=data_folder,\n",
" uniprot_msa_dir=uniprot_msa_dir,\n",
" )\n",
" \n",
" else:\n",
" batch, _ = load_and_process_symmetry(\n",
" config=config.data,\n",
" mode=\"predict\",\n",
" seed=seed,\n",
" batch_idx=None,\n",
" data_idx=0,\n",
" is_distillation=False,\n",
" symmetry=symmetry_group,\n",
" sequence_ids=sequence_ids,\n",
" monomer_feature_dir=data_folder,\n",
" uniprot_msa_dir=uniprot_msa_dir,\n",
" )\n",
" batch = UnifoldDataset.collater([batch])\n",
" return batch\n",
"\n",
"if is_multimer:\n",
"if symmetry_group is not None:\n",
" model_name = \"uf_symmetry\"\n",
" param_path = \"./uf_symmetry.pt\"\n",
"elif is_multimer:\n",
" model_name = \"multimer_ft\"\n",
" param_path = \"./multimer.unifold.pt\"\n",
"else:\n",
@@ -707,14 +754,17 @@
"manual_seed = 42 #@param {type:\"integer\"}\n",
"times = 3 #@param {type:\"integer\"}\n",
"\n",
"config = model_config(model_name)\n",
"if symmetry_group is None:\n",
" config = model_config(model_name)\n",
"else:\n",
" config = uf_symmetry_config()\n",
"config.data.common.max_recycling_iters = max_recycling_iters\n",
"config.globals.max_recycling_iters = max_recycling_iters\n",
"config.data.predict.num_ensembles = num_ensembles\n",
"\n",
"# faster prediction with large chunk\n",
"config.globals.chunk_size = 128\n",
"model = AlphaFold(config)\n",
"model = AlphaFold(config) if symmetry_group is None else UFSymmetry(config)\n",
"print(\"start to load params {}\".format(param_path))\n",
"state_dict = torch.load(param_path)[\"ema\"][\"params\"]\n",
"state_dict = {\".\".join(k.split(\".\")[1:]): v for k, v in state_dict.items()}\n",
@@ -742,6 +792,7 @@
" cur_seed,\n",
" is_multimer=is_multimer,\n",
" use_uniprot=is_multimer,\n",
" symmetry_group=symmetry_group,\n",
" )\n",
" seq_len = batch[\"aatype\"].shape[-1]\n",
" model.globals.chunk_size = automatic_chunk_size(seq_len)\n",
@@ -777,19 +828,26 @@
" plddt[..., None], residue_constants.atom_type_num, axis=-1\n",
" )\n",
" # TODO: , may need to reorder chains, based on entity_ids\n",
" cur_protein = protein.from_prediction(\n",
" features=batch, result=out, b_factors=plddt_b_factors\n",
" )\n",
" if symmetry_group is None:\n",
" cur_protein = protein.from_prediction(\n",
" features=batch, result=out, b_factors=plddt_b_factors\n",
" )\n",
" else:\n",
" plddt_b_factors_assembly = np.concatenate(\n",
" [plddt_b_factors for _ in range(batch[\"symmetry_opers\"].shape[0])])\n",
" cur_protein = assembly_from_prediction(\n",
" result=out, b_factors=plddt_b_factors_assembly,\n",
" )\n",
" cur_save_name = (\n",
" f\"{cur_param_path_postfix}_{cur_seed}\"\n",
" )\n",
" plddts[cur_save_name] = str(mean_plddt)\n",
" if is_multimer:\n",
" if is_multimer and symmetry_group is None:\n",
" ptms[cur_save_name] = str(np.mean(out[\"iptm+ptm\"]))\n",
" with open(os.path.join(output_dir, cur_save_name + '.pdb'), \"w\") as f:\n",
" f.write(protein.to_pdb(cur_protein))\n",
"\n",
" if is_multimer:\n",
" if is_multimer and symmetry_group is None:\n",
" mean_ptm = np.mean(out[\"iptm+ptm\"])\n",
" if mean_ptm>best_score:\n",
" best_protein = cur_protein\n",
@@ -870,7 +928,7 @@
" return plt\n",
"\n",
"\n",
"if is_multimer:\n",
"if is_multimer and symmetry_group is None:\n",
" multichain_view = py3Dmol.view(width=800, height=600)\n",
" multichain_view.addModelsAsFrames(to_visualize_pdb)\n",
" multichain_style = {'cartoon': {'colorscheme': 'chain'}}\n",
@@ -901,7 +959,7 @@
"display.display(grid)\n",
"\n",
"# Display pLDDT and predicted aligned error (if output by the model).\n",
"if is_multimer:\n",
"if is_multimer and symmetry_group is None:\n",
" num_plots = 2\n",
"else:\n",
" num_plots = 1\n",
@@ -983,12 +1041,18 @@
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3.8",
"display_name": "Python 3.8.10 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
"name": "python",
"version": "3.8.10"
},
"vscode": {
"interpreter": {
"hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
}
}
},
"nbformat": 4,
31 changes: 31 additions & 0 deletions run_uf_symmetry.sh
@@ -0,0 +1,31 @@
fasta_path=$1
symmetry=$2
output_dir_base=$3
database_dir=$4
max_template_date=$5
param_path=$6

echo "Starting homogeneous searching..."
python unifold/homo_search.py \
--fasta_path=$fasta_path \
--max_template_date=$max_template_date \
--output_dir=$output_dir_base \
--uniref90_database_path=$database_dir/uniref90/uniref90.fasta \
--mgnify_database_path=$database_dir/mgnify/mgy_clusters_2018_12.fa \
--bfd_database_path=$database_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniclust30_database_path=$database_dir/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--uniprot_database_path=$database_dir/uniprot_db/uniprot_220501/uniprot_trembl.fasta \
--pdb_seqres_database_path=$database_dir/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$database_dir/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$database_dir/pdb_mmcif/obsolete.dat \
--use_precomputed_msas=True

echo "Starting prediction..."
fasta_dir=$(dirname "$fasta_path")
target_name=${fasta_dir##*/}
python unifold/inference_symmetry.py \
--symmetry=$symmetry \
--param_path=$param_path \
--data_dir=$output_dir_base \
--target_name=$target_name \
--output_dir=$output_dir_base
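
One detail of the script above is easy to miss: `target_name` is taken from the FASTA file's parent directory (`dirname`, then `${fasta_dir##*/}`), not from the file name. The sketch below reproduces that derivation in Python so it can be checked before launching a run. The function name `target_name_from_fasta` and the example path are hypothetical, and the sketch assumes POSIX-style paths without trailing or doubled slashes.

```python
import posixpath


def target_name_from_fasta(fasta_path: str) -> str:
    """Mimic `dirname $fasta_path` followed by `${fasta_dir##*/}`."""
    # `dirname` prints "." for a bare file name, whereas
    # posixpath.dirname returns "", so fall back to "." explicitly.
    fasta_dir = posixpath.dirname(fasta_path) or "."
    # `${var##*/}` removes the longest prefix that ends in "/".
    return fasta_dir.rsplit("/", 1)[-1]


print(target_name_from_fasta("/data/targets/my_target/input.fasta"))  # my_target
```

With a bare `input.fasta` in the current directory, the derived name is `.`, so placing the FASTA file in a directory named after the target looks like the intended usage.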