Skip to content

Commit

Permalink
Merge pull request #59 from UCLOrengoGroup/add_globularity
Browse files Browse the repository at this point in the history
Add globularity module and README.manual_run.md
  • Loading branch information
sillitoe authored Apr 25, 2023
2 parents 3a816b0 + 6a85dd2 commit e79bd0f
Show file tree
Hide file tree
Showing 6 changed files with 644 additions and 15 deletions.
377 changes: 377 additions & 0 deletions README.manual_run.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,377 @@
# Running CATH-AlphaFlow-CLI - Step-by-step Tutorial
## UCL CS HPC (Test on 100,000 structures)

Initial Steps:

- load python module - `module load python/3.8.5`
- create virtual environment `python3 -m venv venv && . venv/bin/activate`
- Upgrade pip, wheel and pytest `pip install --upgrade pip wheel pytest`
- Install repo - `git clone git@github.com:UCLOrengoGroup/cath-alphaflow.git`
- Create .env file (use prod settings) - see `config.env`
- export instantclient for Oracle `export LD_LIBRARY_PATH=~/instantclient_21_7`


## Retrieve list of 100k UniProt IDs from database

Time: 7 minutes

Command:

```
cath-af-cli create-dataset-uniprot-ids --uniprot_ids_csv af_uniprot_id_100k_from_g3d.list \
--dbname gene3d_21 \
--max_evalue 1e-50 \
--max_records 100000 \
```

Output: `af_uniprot_id_100k_from_g3d.list`

## Run create-cath-dataset-from-db

Time: 4 minutes

Input: af_uniprot_id_100k_from_g3d.list

Command:

```
cath-af-cli create-cath-dataset-from-db \
--csv_uniprot_ids af_uniprot_id_100k_from_g3d.list \
--dbname gene3d_21 \
--af_cath_annotations af_100k_cath_annotations.tsv \
--af_chainlist_ids af_100k_chainlist_ids.tsv \
--af_domainlist_ids af_100k_domainlist_ids.tsv \
--gene3d_crh_output af_100k.crh \
--csv_uniprot_md5 af_100k_md5.tsv \
```

Output:

- af_100k_cath_annotations.tsv
- af_100k_chainlist_ids.tsv
- af_100k_chainlist_ids.tsv
- af_100k.crh
- af_100k_md5.tsv

## Create GCloud Manifest File

Time: less than 1 min

Input: `af_100k_chainlist_ids.tsv`

Output: `af_100k_manifest.txt`

```
grep -v 'chain_id' af_100k_chainlist_ids.tsv | sort | uniq | tr -d '\r' | awk '{print "gs://public-datasets-deepmind-alphafold-v4/"$1".cif"}' > af_100k_manifest.txt
```

## Download structures from GCloud

time: 2hrs and 20mins

Input: af_100k_manifest.txt

Output: `af_raw_cif` folder containing unchopped AlphaFold chains

```
mkdir af_cif_raw
cat af_100k_manifest.txt | (/share/apps/google-cloud-sdk/bin/gsutil -o GSUtil:parallel_process_count=1 -m cp -I af_cif_raw/ || echo "Ignoring non-zero exit code: \$?")
```

The step outputs the number of UniProt IDs not available in AlphaFoldDB.

```
CommandException: 6532 files/objects could not be transferred. Total downloaded CIF: 62,212
```

## Create new AFChainList after GCloud download.

Rationale:

The `cif_to_fasta` module in `cath-af-cli` expects a list of existing files, so we need to feed it just the chains that are in AFDB and are available on the filesystem.

```
cd af_cif_raw
find *.cif > ../af_100k_chainlist_after_gsutil.txt
sed -i 's/\.cif//g' af_100k_chainlist_after_gsutil.txt
echo af_chain_id | cat - af_100k_chainlist_after_gsutil.txt > temp && mv temp af_100k_chainlist_after_gsutil.txt
```


## Convert CIF to FASTA

Process: 62,212 CIF files -> 62,212 FASTA files + merged.fasta

Thoughts: Too slow on 1CPU (i.e. 30mins for 10k sequences, splitting and parallelizing)

Time: 1hr 9 mins

Input:

```
af_cif_raw/
af_100k_chainlist_after_gsutil.txt
```

Output:
`af_fasta_raw` - folder containing a FASTA file for each chain available in the filesystem
`merged.fasta` - file containing a multiFASTA of all processed CIF files.

```
mkdir af_fasta_raw
split -l 10000 af_100k_chainlist_after_gsutil.txt splitlist
find splitlista* | xargs -P4 -I XXX sh -c 'cath-af-cli convert-cif-to-fasta --cif_in_dir af_cif_raw/ \
--id_file XXX \
--fasta_out_dir af_fasta_raw/'
```

Careful: the module expects each chunk to have a header, so the first structure doesn't get processed if we use chunks instead of the first file.

## Create MD5 from FASTA

Time: seconds

Input: `merged.fasta`

Output: `af_100k_cif_raw_md5.txt`

```
cath-af-cli create-md5 --fasta merged.fasta \
--uniprot_md5_csv af_100k_cif_raw_md5.txt
```

## Filter CRH

Rationale:

- Extract and further process only matches between CIF-md5 and chopping MD5
- Remove all CRH entries that rely on a missing AFDB entry.

Time: 1 second

Issue: The files generated in step 6 have some carriage returns (like ^M) that needs removing using tr -d '\r'

Temporary fix: Strip these characters before doing the grep command.

Command:

```
grep -v 'af_chain_id' af_100k_cif_raw_md5.txt | awk '{print $3}' | tr -d '\r' | grep -F -f - af_100k.crh > af_100k_crh_after_filtering.crh
sort af_100k_crh_after_filtering.crh | uniq > af_100k_crh_after_filtering_uniq.crh
awk '{print $1}' af_100k_crh_after_filtering_uniq.crh | grep -F -f - af_100k_cif_raw_md5.txt | awk '{print $1}' | grep -F -f - af_100k_domainlist_ids.txt > af_100k_domainlist_after_md5_filter.txt
echo af_domain_id | cat - af_100k_domainlist_after_md5_filter.txt > temp && mv temp af_100k_domainlist_after_md5_filter.txt
```

Output: `af_100k_crh_after_filtering.crh`

## Chop CIF before optimisation

### WARNING!!

### Unnecessary step as `optimise-domain-boundaries` acts on chains, not on domains, so we can optimise first and chop once.

Running on xargs with 12 processes with a uniq-ed crh file and domain list (120,129 domains)
Time: 1hr 45 minutes

Command:

```
split -l 7500 af_100k_domainlist_after_md5_filter.txt split_crh_uniq_ -a 5 -d
find split_crh_uniq_000* | xargs -P12 -I XXX sh -c 'cath-af-cli chop-cif --cif_in_dir af_cif_raw/ \
--id_file XXX \
--cif_out_dir chopped_cif_before_optimization/'
```

Issues Found:
Remove af_domain_id from header, program can't parse it or add exception, as it skips the first entry but it crashes as it can't find af_domain_id

Stats: Chopped 120,129 domains

## Optimise boundaries
On 16 threads, 7500 entries each

Time: 1hr 1 min

Comments/TODO:
- Include percentage of delta length (i.e. how much the boundaries are changing?)
- Remove af_domain_id from header, it crashes the process.
- Optimise boundaries works on chains, not on domains.

The initial chopping is redundant, keeping it in this instance as it's interesting for debugging.

Commands:

```
split -l 7500 af_100k_domainlist_after_md5_filter_uniq.txt split_af_100k -a 5 -d
find split_af_100k000* | xargs -P16 -I XXX sh -c 'cath-af-cli optimise-domain-boundaries --af_domain_list XXX \
--af_chain_mmcif_dir af_cif_raw/ \
--af_domain_list_post_tailchop af_100k_domainlist_post_tailchop_XXX.txt \
--af_domain_mapping_post_tailchop af_100k_domain_mapping_post_tailchop_XXX.tsv \
--status_log optimise_boundaries_status_log_XXX'
```

Afterwards:

- Concatenate logs
- Concatenate af_domain_list_post_tailchop
- Concatenate af_domain_mapping_post_tailchop
- Remove duplicated headers

```
cat af_100k_domain_mapping_post_tailchop_split_af_100k000* | grep -v 'af_domain_id' > af_100k_domain_mapping_post_tailchop.tsv
cat af_100k_domainlist_post_tailchop_split_af_100k000* grep -v 'af_domain_id' > af_100k_domainlist_post_tailchop.txt
cat optimise_boundaries_status_log_split_af_100k000* > optimise_boundaries_status_log
rm af_100k_domain_mapping_post_tailchop_split_af_100k000*
rm optimise_boundaries_status_log_split_af_100k000*
rm af_100k_domainlist_post_tailchop_split_af_100k000*
rm split_af_100k000*
```

Stats:

- 84,604 have unchanged boundaries
- 35,378 have adjusted boundaries
- 129 have failed due to not having residues above pLDDT 70



## Chop CIF files after boundaries optimisation

Time: 1hr 46 mins on 12 threads

```
split -l 7500 af_100k_domainlist_post_tailchop.txt af_100k_domainlist -a 5 -d
find af_100k_domainlist00* | xargs -P12 -I XXX sh -c 'cath-af-cli chop-cif --cif_in_dir af_cif_raw/ \
--id_file XXX \
--cif_out_dir chopped_cif_after_optimisation/'
```

TODO: Add StatusLog to ALL MODULES

Chopped 120,112 domains with optimised boundaries

## Calculate pLDDT and LUR report

Time: 59 mins on 16 threads

```
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_plddt_ -a 5 -d
find split_af_100k_plddt_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-cif-to-plddt-summary --cif_in_dir af_cif_raw/ \
--id_file XXX \
--plddt_stats_file af_100k_plddt_summary_XXX.tsv'
cat af_100k_plddt_summary_split_af_100k_plddt_*.tsv > af_100k_plddt_summary.tsv
```

## Create DSSP Files

Time: 4 hrs

Requirements: boost library. On CS HPC can be loaded as a module.

```
module load boost/1.71.0
mkdir af_dssp_dir
find split_af_100k_dssp_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-cif-to-dssp --cif_in_dir af_cif_raw/ \
--id_file XXX \
--dssp_out_dir af_dssp_dir/'
```

Comments: Implement a skip if the file is already present, maybe switch to IUPRED? Or some other secondary structure predictors based on sequence.

## Convert DSSP to SSE summary

Time: 4 mins

Command:
```
find split_af_100k_dssp_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-dssp-to-sse-summary --dssp_dir af_dssp_dir/ \
--id_file XXX \
--sse_out_file af_100k_sse_XXX.tsv'
```

## Convert optimised domains to Foldseek DB

Time: 2 hrs 7 mins on a single thread, RAM:1G

Command:
```
cath-af-cli convert-cif-to-foldseek-db --cif_dir chopped_cif_after_optimisation/ \
--fs_querydb_dir af_foldseek_db/ \
--fs_bin_path /SAN/biosciences/alphafold/foldseek/bin/foldseek
```

Comment:
- Added an option to generate a database only from a few structures. Needs to be specified as a CLI option using id_list
- Investigate why m_core on CS is not seen by Foldseek. Increasing RAM available speeds up the process significantly.



## Run Foldseek of Query_Domain_Structures against CATH_S95 Representatives

Time: 2hrs 30

Command:
```
cath-af-cli run-foldseek --fs_querydb af_foldseek_db/af_query_foldseek.db \
--fs_targetdb /SAN/cath/cath_v4_3_0/databases/foldseek/cath_s95/cath_s95_db \
--fs_bin_path /SAN/biosciences/alphafold/foldseek/bin/foldseek
```

## Convert Foldseek Results to Summary

Time: 5 mins

Command:
```
cath-af-cli convert-foldseek-output-to-summary --id_file af_100k_domainlist_post_tailchop.txt \
--fs_input_file fs_query_results.m8 \
--fs_results fs_query_results.txt
```

Starting from 110,250 unique domain ids in foldseek output, resulting into 110,218 hits passing the thresholds.

## Globularity prediction

Time: 2hrs 6 mins

Split into 16 processes. Processing 120,113 domains

Command:
```
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_globularity_ -a 5 -d
find split_af_100k_globularity_000* | xargs -P16 -I XXX sh -c 'cath-af-cli measure-globularity --af_domain_list XXX \
--af_chain_mmcif_dir af_cif_raw/ \
--af_domain_globularity af_100k_globularity_XXX.txt'
cat af_100k_globularity_split_af_100k_globularity_000* > af_100k_globularity.txt
rm split_af_100k_globularity_000*
af_100k_globularity_split_af_100k_globularity_000*
```

As this was done in chunks, remove the 16 additional occurences of the header from af_100k_globularity.txt


## Combine results

TBD






















2 changes: 2 additions & 0 deletions cath_alphaflow/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
from .commands import create_md5
from .commands import convert_cif_to_fasta
from .commands import load_mongo
from .commands import measure_globularity

logging.basicConfig(
level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s"
Expand Down Expand Up @@ -65,3 +66,4 @@ def dump_config():
cli.add_command(create_md5.create_md5)
cli.add_command(convert_cif_to_fasta.convert_cif_to_fasta)
cli.add_command(load_mongo.load_af_from_archive)
cli.add_command(measure_globularity.measure_globularity)
Loading

0 comments on commit e79bd0f

Please sign in to comment.