Add globularity module and README.manual_run.md #59

Merged: 12 commits, Apr 25, 2023
377 changes: 377 additions & 0 deletions README.manual_run.md
@@ -0,0 +1,377 @@
# Running CATH-AlphaFlow-CLI - Step-by-step Tutorial
## UCL CS HPC (Test on 100,000 structures)

Initial Steps (combined into a single shell session below):

- Load the Python module - `module load python/3.8.5`
- Create and activate a virtual environment - `python3 -m venv venv && . venv/bin/activate`
- Upgrade pip, wheel and pytest - `pip install --upgrade pip wheel pytest`
- Clone the repo - `git clone git@github.com:UCLOrengoGroup/cath-alphaflow.git`
- Create a `.env` file (use prod settings) - see `config.env`
- Export the Oracle Instant Client library path - `export LD_LIBRARY_PATH=~/instantclient_21_7`

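For convenience, the same setup can be run as one shell session (a sketch of the steps above; module versions and paths are specific to the UCL CS HPC environment):

```
# Load Python and create an isolated environment
module load python/3.8.5
python3 -m venv venv && . venv/bin/activate
pip install --upgrade pip wheel pytest

# Get the code and configuration
git clone git@github.com:UCLOrengoGroup/cath-alphaflow.git
# create a .env file based on config.env (prod settings) before running the CLI

# Make the Oracle Instant Client visible to the Oracle driver
export LD_LIBRARY_PATH=~/instantclient_21_7
```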

## Retrieve list of 100k UniProt IDs from database

Time: 7 minutes

Command:

```
cath-af-cli create-dataset-uniprot-ids --uniprot_ids_csv af_uniprot_id_100k_from_g3d.list \
--dbname gene3d_21 \
--max_evalue 1e-50 \
--max_records 100000
```

Output: `af_uniprot_id_100k_from_g3d.list`
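The output is a plain-text list of UniProt accessions, one per line. For illustration only (these accessions are hypothetical placeholders; any header line depends on the CLI's output format):

```
P00520
Q9Y2H6
A0A061ACX3
```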

## Run create-cath-dataset-from-db

Time: 4 minutes

Input: af_uniprot_id_100k_from_g3d.list

Command:

```
cath-af-cli create-cath-dataset-from-db \
--csv_uniprot_ids af_uniprot_id_100k_from_g3d.list \
--dbname gene3d_21 \
--af_cath_annotations af_100k_cath_annotations.tsv \
--af_chainlist_ids af_100k_chainlist_ids.tsv \
--af_domainlist_ids af_100k_domainlist_ids.tsv \
--gene3d_crh_output af_100k.crh \
--csv_uniprot_md5 af_100k_md5.tsv
```

Output:

- af_100k_cath_annotations.tsv
- af_100k_chainlist_ids.tsv
- af_100k_domainlist_ids.tsv
- af_100k.crh
- af_100k_md5.tsv

## Create GCloud Manifest File

Time: less than 1 min

Input: `af_100k_chainlist_ids.tsv`

Output: `af_100k_manifest.txt`

```
grep -v 'chain_id' af_100k_chainlist_ids.tsv | sort | uniq | tr -d '\r' | awk '{print "gs://public-datasets-deepmind-alphafold-v4/"$1".cif"}' > af_100k_manifest.txt
```
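Each manifest line is a Google Cloud Storage URI built from a chain ID. An illustrative line (hypothetical accession, assuming the AFDB v4 file naming convention):

```
gs://public-datasets-deepmind-alphafold-v4/AF-P00520-F1-model_v4.cif
```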

## Download structures from GCloud

Time: 2 hrs 20 mins

Input: af_100k_manifest.txt

Output: `af_cif_raw` folder containing unchopped AlphaFold chains

```
mkdir af_cif_raw
cat af_100k_manifest.txt | (/share/apps/google-cloud-sdk/bin/gsutil -o GSUtil:parallel_process_count=1 -m cp -I af_cif_raw/ || echo "Ignoring non-zero exit code: $?")
```

This step also reports the number of entries that could not be transferred, i.e. UniProt IDs not available in AlphaFoldDB:

```
CommandException: 6532 files/objects could not be transferred. Total downloaded CIF: 62,212
```

## Create new AFChainList after GCloud download

Rationale:

The `cif_to_fasta` module in `cath-af-cli` expects a list of existing files, so we need to feed it just the chains that are in AFDB and are available on the filesystem.

```
cd af_cif_raw
find *.cif > ../af_100k_chainlist_after_gsutil.txt   # list the CIF files that were actually downloaded
cd ..
sed -i 's/\.cif//g' af_100k_chainlist_after_gsutil.txt   # strip the .cif extension to leave bare chain IDs
echo af_chain_id | cat - af_100k_chainlist_after_gsutil.txt > temp && mv temp af_100k_chainlist_after_gsutil.txt   # prepend the expected header
```


## Convert CIF to FASTA

Process: 62,212 CIF files -> 62,212 FASTA files + merged.fasta

Thoughts: too slow on 1 CPU (roughly 30 mins per 10,000 sequences), so the ID list is split and processed in parallel.

Time: 1hr 9 mins

Input:

```
af_cif_raw/
af_100k_chainlist_after_gsutil.txt
```

Output:

- `af_fasta_raw` - folder containing a FASTA file for each chain available in the filesystem
- `merged.fasta` - file containing a multi-FASTA of all processed CIF files

```
mkdir af_fasta_raw
split -l 10000 af_100k_chainlist_after_gsutil.txt splitlist
find splitlista* | xargs -P4 -I XXX sh -c 'cath-af-cli convert-cif-to-fasta --cif_in_dir af_cif_raw/ \
--id_file XXX \
--fasta_out_dir af_fasta_raw/'
```

Careful: the module treats the first line of each ID file as a header, so with the header-less chunks produced by `split` the first structure of each chunk does not get processed; see the workaround sketched below.
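A minimal workaround (a sketch, assuming GNU `sed`) is to prepend the expected header to every chunk that does not already start with it:

```
for f in splitlista*; do
  # prepend the af_chain_id header unless the chunk already starts with it
  head -1 "$f" | grep -q '^af_chain_id$' || sed -i '1i af_chain_id' "$f"
done
```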

## Create MD5 from FASTA

Time: seconds

Input: `merged.fasta`

Output: `af_100k_cif_raw_md5.txt`

```
cath-af-cli create-md5 --fasta merged.fasta \
--uniprot_md5_csv af_100k_cif_raw_md5.txt
```

## Filter CRH

Rationale:

- Extract and further process only the entries whose CIF-derived MD5 matches an MD5 in the chopping data.
- Remove all CRH entries that rely on a missing AFDB entry.

Time: 1 second

Issue: the files generated in step 6 contain carriage-return characters (`^M`) that need removing with `tr -d '\r'`.

Temporary fix: Strip these characters before doing the grep command.

Command:

```
# keep only CRH entries that match an MD5 derived from the downloaded CIF files
grep -v 'af_chain_id' af_100k_cif_raw_md5.txt | awk '{print $3}' | tr -d '\r' | grep -F -f - af_100k.crh > af_100k_crh_after_filtering.crh
# de-duplicate the filtered CRH
sort af_100k_crh_after_filtering.crh | uniq > af_100k_crh_after_filtering_uniq.crh
# keep only the domain IDs whose chain survived the MD5 filtering
awk '{print $1}' af_100k_crh_after_filtering_uniq.crh | grep -F -f - af_100k_cif_raw_md5.txt | awk '{print $1}' | grep -F -f - af_100k_domainlist_ids.tsv > af_100k_domainlist_after_md5_filter.txt
# prepend the header expected by downstream modules
echo af_domain_id | cat - af_100k_domainlist_after_md5_filter.txt > temp && mv temp af_100k_domainlist_after_md5_filter.txt
```

Output: `af_100k_crh_after_filtering.crh`, `af_100k_crh_after_filtering_uniq.crh` and `af_100k_domainlist_after_md5_filter.txt`

## Chop CIF before optimisation

### WARNING: unnecessary step

`optimise-domain-boundaries` acts on chains, not on domains, so we can optimise first and chop only once.

Run via `xargs` with 12 processes, using the de-duplicated CRH file and domain list (120,129 domains).

Time: 1 hr 45 mins

Command:

```
split -l 7500 af_100k_domainlist_after_md5_filter.txt split_crh_uniq_ -a 5 -d
find split_crh_uniq_000* | xargs -P12 -I XXX sh -c 'cath-af-cli chop-cif --cif_in_dir af_cif_raw/ \
--id_file XXX \
--cif_out_dir chopped_cif_before_optimization/'
```

Issues found: the `af_domain_id` header line must be removed from the ID file (or an exception added in the code); the program cannot parse it, and instead of skipping that first entry it crashes because it cannot find a domain called `af_domain_id`. A workaround is sketched below.
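A simple workaround (a sketch; the intermediate file name is arbitrary, and this assumes `chop-cif` treats every line as a domain ID) is to drop the header before splitting:

```
# strip the af_domain_id header line, then split into chunks of real domain IDs only
tail -n +2 af_100k_domainlist_after_md5_filter.txt > af_100k_domainlist_noheader.txt
split -l 7500 af_100k_domainlist_noheader.txt split_crh_uniq_ -a 5 -d
```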

Stats: Chopped 120,129 domains

## Optimise boundaries
On 16 threads, 7500 entries each

Time: 1hr 1 min

Comments/TODO:
- Include the percentage change in length (i.e. how much the boundaries change, e.g. 100 * (original_length - optimised_length) / original_length).
- Remove the `af_domain_id` header line from the input; it crashes the process.
- `optimise-domain-boundaries` works on chains, not on domains.

The initial chopping is therefore redundant; it is kept in this run because it is useful for debugging.

Commands:

```
split -l 7500 af_100k_domainlist_after_md5_filter_uniq.txt split_af_100k -a 5 -d
find split_af_100k000* | xargs -P16 -I XXX sh -c 'cath-af-cli optimise-domain-boundaries --af_domain_list XXX \
--af_chain_mmcif_dir af_cif_raw/ \
--af_domain_list_post_tailchop af_100k_domainlist_post_tailchop_XXX.txt \
--af_domain_mapping_post_tailchop af_100k_domain_mapping_post_tailchop_XXX.tsv \
--status_log optimise_boundaries_status_log_XXX'
```

Afterwards:

- Concatenate logs
- Concatenate af_domain_list_post_tailchop
- Concatenate af_domain_mapping_post_tailchop
- Remove duplicated headers

```
cat af_100k_domain_mapping_post_tailchop_split_af_100k000* | grep -v 'af_domain_id' > af_100k_domain_mapping_post_tailchop.tsv
cat af_100k_domainlist_post_tailchop_split_af_100k000* | grep -v 'af_domain_id' > af_100k_domainlist_post_tailchop.txt
cat optimise_boundaries_status_log_split_af_100k000* > optimise_boundaries_status_log
rm af_100k_domain_mapping_post_tailchop_split_af_100k000*
rm optimise_boundaries_status_log_split_af_100k000*
rm af_100k_domainlist_post_tailchop_split_af_100k000*
rm split_af_100k000*
```

Stats:

- 84,604 have unchanged boundaries
- 35,378 have adjusted boundaries
- 129 have failed due to not having residues above pLDDT 70



## Chop CIF files after boundaries optimisation

Time: 1hr 46 mins on 12 threads

```
split -l 7500 af_100k_domainlist_post_tailchop.txt af_100k_domainlist -a 5 -d
find af_100k_domainlist00* | xargs -P12 -I XXX sh -c 'cath-af-cli chop-cif --cif_in_dir af_cif_raw/ \
--id_file XXX \
--cif_out_dir chopped_cif_after_optimisation/'
```

TODO: Add StatusLog to ALL MODULES

Chopped 120,112 domains with optimised boundaries

## Calculate pLDDT and LUR report

Time: 59 mins on 16 threads

```
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_plddt_ -a 5 -d
find split_af_100k_plddt_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-cif-to-plddt-summary --cif_in_dir af_cif_raw/ \
--id_file XXX \
--plddt_stats_file af_100k_plddt_summary_XXX.tsv'
cat af_100k_plddt_summary_split_af_100k_plddt_*.tsv > af_100k_plddt_summary.tsv
```

## Create DSSP Files

Time: 4 hrs

Requirements: boost library. On CS HPC can be loaded as a module.

```
module load boost/1.71.0
mkdir af_dssp_dir
# split the domain list into chunks (this step is implied; the input file and chunk prefix follow the pattern of earlier steps)
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_dssp_ -a 5 -d
find split_af_100k_dssp_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-cif-to-dssp --cif_in_dir af_cif_raw/ \
--id_file XXX \
--dssp_out_dir af_dssp_dir/'
```

Comments: implement a skip when the output file is already present (see the sketch below); maybe switch to IUPRED or some other sequence-based secondary structure predictor.
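A minimal sketch of the skip-if-present idea, implemented in the wrapper shell rather than in the CLI (the `<id>.dssp` output naming is an assumption):

```
# build a to-do list containing only IDs that do not yet have a DSSP output file
grep -v 'af_domain_id' af_100k_domainlist_post_tailchop.txt | while read -r id; do
  [ -f "af_dssp_dir/${id}.dssp" ] || echo "$id"
done > dssp_todo.txt
```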

## Convert DSSP to SSE summary

Time: 4 mins

Command:
```
find split_af_100k_dssp_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-dssp-to-sse-summary --dssp_dir af_dssp_dir/ \
--id_file XXX \
--sse_out_file af_100k_sse_XXX.tsv'
```

## Convert optimised domains to Foldseek DB

Time: 2 hrs 7 mins on a single thread, RAM: 1 GB

Command:
```
cath-af-cli convert-cif-to-foldseek-db --cif_dir chopped_cif_after_optimisation/ \
--fs_querydb_dir af_foldseek_db/ \
--fs_bin_path /SAN/biosciences/alphafold/foldseek/bin/foldseek
```

Comment:
- Added an option to generate a database from only a subset of structures; this is specified via the `id_list` CLI option.
- Investigate why `m_core` on CS HPC is not seen by Foldseek. Increasing the available RAM speeds up the process significantly.



## Run Foldseek of Query_Domain_Structures against CATH_S95 Representatives

Time: 2 hrs 30 mins

Command:
```
cath-af-cli run-foldseek --fs_querydb af_foldseek_db/af_query_foldseek.db \
--fs_targetdb /SAN/cath/cath_v4_3_0/databases/foldseek/cath_s95/cath_s95_db \
--fs_bin_path /SAN/biosciences/alphafold/foldseek/bin/foldseek
```

## Convert Foldseek Results to Summary

Time: 5 mins

Command:
```
cath-af-cli convert-foldseek-output-to-summary --id_file af_100k_domainlist_post_tailchop.txt \
--fs_input_file fs_query_results.m8 \
--fs_results fs_query_results.txt
```

Starting from 110,250 unique domain IDs in the Foldseek output, 110,218 hits pass the thresholds.

## Globularity prediction

Time: 2hrs 6 mins

Split into 16 processes, processing 120,113 domains.

Command:
```
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_globularity_ -a 5 -d
find split_af_100k_globularity_000* | xargs -P16 -I XXX sh -c 'cath-af-cli measure-globularity --af_domain_list XXX \
--af_chain_mmcif_dir af_cif_raw/ \
--af_domain_globularity af_100k_globularity_XXX.txt'
cat af_100k_globularity_split_af_100k_globularity_000* > af_100k_globularity.txt
rm split_af_100k_globularity_000*
rm af_100k_globularity_split_af_100k_globularity_000*
```

As this was run in chunks, remove the 16 additional occurrences of the header line from af_100k_globularity.txt, keeping only the first; see the one-liner below.
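A one-line sketch that keeps the first line (the header) and drops any later line identical to it, without assuming the header's exact contents:

```
awk 'NR==1 {hdr=$0; print; next} $0 != hdr' af_100k_globularity.txt > tmp && mv tmp af_100k_globularity.txt
```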


## Combine results

TBD


2 changes: 2 additions & 0 deletions cath_alphaflow/cli.py
@@ -16,6 +16,7 @@
from .commands import create_md5
from .commands import convert_cif_to_fasta
from .commands import load_mongo
from .commands import measure_globularity

logging.basicConfig(
level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s"
@@ -65,3 +66,4 @@ def dump_config():
cli.add_command(create_md5.create_md5)
cli.add_command(convert_cif_to_fasta.convert_cif_to_fasta)
cli.add_command(load_mongo.load_af_from_archive)
cli.add_command(measure_globularity.measure_globularity)