Add globularity module and README.manual_run.md #59

Merged: 12 commits, Apr 25, 2023
377 changes: 377 additions & 0 deletions README.manual_run.md
@@ -0,0 +1,377 @@
# Running CATH-AlphaFlow-CLI - Step-by-step Tutorial
## UCL CS HPC (Test on 100,000 structures)

Initial Steps (combined into a single shell session below):

- Load the Python module - `module load python/3.8.5`
- Create and activate a virtual environment - `python3 -m venv venv && . venv/bin/activate`
- Upgrade pip, wheel and pytest - `pip install --upgrade pip wheel pytest`
- Clone the repo - `git clone git@github.com:UCLOrengoGroup/cath-alphaflow.git`
- Create a `.env` file (use prod settings) - see `config.env`
- Export the Oracle Instant Client library path - `export LD_LIBRARY_PATH=~/instantclient_21_7`

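For convenience, the same setup can be run as one shell session (a sketch of the steps above; module versions and paths are specific to the UCL CS HPC environment):

```
# Load Python and create an isolated environment
module load python/3.8.5
python3 -m venv venv && . venv/bin/activate
pip install --upgrade pip wheel pytest

# Get the code and configuration
git clone git@github.com:UCLOrengoGroup/cath-alphaflow.git
# create a .env file based on config.env (prod settings) before running the CLI

# Make the Oracle Instant Client visible to the Oracle driver
export LD_LIBRARY_PATH=~/instantclient_21_7
```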

## Retrieve list of 100k UniProt IDs from database

Time: 7 minutes

Command:

```
cath-af-cli create-dataset-uniprot-ids --uniprot_ids_csv af_uniprot_id_100k_from_g3d.list \
--dbname gene3d_21 \
--max_evalue 1e-50 \
--max_records 100000
```

Output: `af_uniprot_id_100k_from_g3d.list`
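The output is a plain-text list of UniProt accessions, one per line. For illustration only (these accessions are hypothetical placeholders; any header line depends on the CLI's output format):

```
P00520
Q9Y2H6
A0A061ACX3
```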

## Run create-cath-dataset-from-db

Time: 4 minutes

Input: af_uniprot_id_100k_from_g3d.list

Command:

```
cath-af-cli create-cath-dataset-from-db \
--csv_uniprot_ids af_uniprot_id_100k_from_g3d.list \
--dbname gene3d_21 \
--af_cath_annotations af_100k_cath_annotations.tsv \
--af_chainlist_ids af_100k_chainlist_ids.tsv \
--af_domainlist_ids af_100k_domainlist_ids.tsv \
--gene3d_crh_output af_100k.crh \
--csv_uniprot_md5 af_100k_md5.tsv
```

Output:

- af_100k_cath_annotations.tsv
- af_100k_chainlist_ids.tsv
- af_100k_domainlist_ids.tsv
- af_100k.crh
- af_100k_md5.tsv

## Create GCloud Manifest File

Time: less than 1 min

Input: `af_100k_chainlist_ids.tsv`

Output: `af_100k_manifest.txt`

```
grep -v 'chain_id' af_100k_chainlist_ids.tsv | sort | uniq | tr -d '\r' | awk '{print "gs://public-datasets-deepmind-alphafold-v4/"$1".cif"}' > af_100k_manifest.txt
```
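Each manifest line is a Google Cloud Storage URI built from a chain ID. An illustrative line (hypothetical accession, assuming the AFDB v4 file naming convention):

```
gs://public-datasets-deepmind-alphafold-v4/AF-P00520-F1-model_v4.cif
```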

## Download structures from GCloud

Time: 2 hrs 20 mins

Input: af_100k_manifest.txt

Output: `af_cif_raw` folder containing unchopped AlphaFold chains

```
mkdir af_cif_raw
cat af_100k_manifest.txt | (/share/apps/google-cloud-sdk/bin/gsutil -o GSUtil:parallel_process_count=1 -m cp -I af_cif_raw/ || echo "Ignoring non-zero exit code: $?")
```

This step also reports the number of entries that could not be transferred, i.e. UniProt IDs not available in AlphaFoldDB:

```
CommandException: 6532 files/objects could not be transferred. Total downloaded CIF: 62,212
```

## Create new AFChainList after GCloud download

Rationale:

The `cif_to_fasta` module in `cath-af-cli` expects a list of existing files, so we need to feed it just the chains that are in AFDB and are available on the filesystem.

```
cd af_cif_raw
find *.cif > ../af_100k_chainlist_after_gsutil.txt   # list the CIF files that were actually downloaded
cd ..
sed -i 's/\.cif//g' af_100k_chainlist_after_gsutil.txt   # strip the .cif extension to leave bare chain IDs
echo af_chain_id | cat - af_100k_chainlist_after_gsutil.txt > temp && mv temp af_100k_chainlist_after_gsutil.txt   # prepend the expected header
```


## Convert CIF to FASTA

Process: 62,212 CIF files -> 62,212 FASTA files + merged.fasta

Thoughts: too slow on 1 CPU (roughly 30 mins per 10,000 sequences), so the ID list is split and processed in parallel.

Time: 1hr 9 mins

Input:

```
af_cif_raw/
af_100k_chainlist_after_gsutil.txt
```

Output:

- `af_fasta_raw` - folder containing a FASTA file for each chain available in the filesystem
- `merged.fasta` - file containing a multi-FASTA of all processed CIF files

```
mkdir af_fasta_raw
split -l 10000 af_100k_chainlist_after_gsutil.txt splitlist
find splitlista* | xargs -P4 -I XXX sh -c 'cath-af-cli convert-cif-to-fasta --cif_in_dir af_cif_raw/ \
--id_file XXX \
--fasta_out_dir af_fasta_raw/'
```

Careful: the module treats the first line of each ID file as a header, so with the header-less chunks produced by `split` the first structure of each chunk does not get processed; see the workaround sketched below.
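A minimal workaround (a sketch, assuming GNU `sed`) is to prepend the expected header to every chunk that does not already start with it:

```
for f in splitlista*; do
  # prepend the af_chain_id header unless the chunk already starts with it
  head -1 "$f" | grep -q '^af_chain_id$' || sed -i '1i af_chain_id' "$f"
done
```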

## Create MD5 from FASTA

Time: seconds

Input: `merged.fasta`

Output: `af_100k_cif_raw_md5.txt`

```
cath-af-cli create-md5 --fasta merged.fasta \
--uniprot_md5_csv af_100k_cif_raw_md5.txt
```

## Filter CRH

Rationale:

- Extract and further process only the entries whose CIF-derived MD5 matches an MD5 in the chopping data.
- Remove all CRH entries that rely on a missing AFDB entry.

Time: 1 second

Issue: the files generated in step 6 contain carriage-return characters (`^M`) that need removing with `tr -d '\r'`.

Temporary fix: Strip these characters before doing the grep command.

Command:

```
# keep only CRH entries that match an MD5 derived from the downloaded CIF files
grep -v 'af_chain_id' af_100k_cif_raw_md5.txt | awk '{print $3}' | tr -d '\r' | grep -F -f - af_100k.crh > af_100k_crh_after_filtering.crh
# de-duplicate the filtered CRH
sort af_100k_crh_after_filtering.crh | uniq > af_100k_crh_after_filtering_uniq.crh
# keep only the domain IDs whose chain survived the MD5 filtering
awk '{print $1}' af_100k_crh_after_filtering_uniq.crh | grep -F -f - af_100k_cif_raw_md5.txt | awk '{print $1}' | grep -F -f - af_100k_domainlist_ids.tsv > af_100k_domainlist_after_md5_filter.txt
# prepend the header expected by downstream modules
echo af_domain_id | cat - af_100k_domainlist_after_md5_filter.txt > temp && mv temp af_100k_domainlist_after_md5_filter.txt
```

Output: `af_100k_crh_after_filtering.crh`, `af_100k_crh_after_filtering_uniq.crh` and `af_100k_domainlist_after_md5_filter.txt`

## Chop CIF before optimisation

### WARNING: unnecessary step

`optimise-domain-boundaries` acts on chains, not on domains, so we can optimise first and chop only once.

Run via `xargs` with 12 processes, using the de-duplicated CRH file and domain list (120,129 domains).

Time: 1 hr 45 mins

Command:

```
split -l 7500 af_100k_domainlist_after_md5_filter.txt split_crh_uniq_ -a 5 -d
find split_crh_uniq_000* | xargs -P12 -I XXX sh -c 'cath-af-cli chop-cif --cif_in_dir af_cif_raw/ \
--id_file XXX \
--cif_out_dir chopped_cif_before_optimization/'
```

Issues found: the `af_domain_id` header line must be removed from the ID file (or an exception added in the code); the program cannot parse it, and instead of skipping that first entry it crashes because it cannot find a domain called `af_domain_id`. A workaround is sketched below.
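A simple workaround (a sketch; the intermediate file name is arbitrary, and this assumes `chop-cif` treats every line as a domain ID) is to drop the header before splitting:

```
# strip the af_domain_id header line, then split into chunks of real domain IDs only
tail -n +2 af_100k_domainlist_after_md5_filter.txt > af_100k_domainlist_noheader.txt
split -l 7500 af_100k_domainlist_noheader.txt split_crh_uniq_ -a 5 -d
```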

Stats: Chopped 120,129 domains

## Optimise boundaries
On 16 threads, 7500 entries each

Time: 1hr 1 min

Comments/TODO:
- Include the percentage change in length (i.e. how much the boundaries change, e.g. 100 * (original_length - optimised_length) / original_length).
- Remove the `af_domain_id` header line from the input; it crashes the process.
- `optimise-domain-boundaries` works on chains, not on domains.

The initial chopping is therefore redundant; it is kept in this run because it is useful for debugging.

Commands:

```
split -l 7500 af_100k_domainlist_after_md5_filter_uniq.txt split_af_100k -a 5 -d
find split_af_100k000* | xargs -P16 -I XXX sh -c 'cath-af-cli optimise-domain-boundaries --af_domain_list XXX \
--af_chain_mmcif_dir af_cif_raw/ \
--af_domain_list_post_tailchop af_100k_domainlist_post_tailchop_XXX.txt \
--af_domain_mapping_post_tailchop af_100k_domain_mapping_post_tailchop_XXX.tsv \
--status_log optimise_boundaries_status_log_XXX'
```

Afterwards:

- Concatenate logs
- Concatenate af_domain_list_post_tailchop
- Concatenate af_domain_mapping_post_tailchop
- Remove duplicated headers

```
cat af_100k_domain_mapping_post_tailchop_split_af_100k000* | grep -v 'af_domain_id' > af_100k_domain_mapping_post_tailchop.tsv
cat af_100k_domainlist_post_tailchop_split_af_100k000* | grep -v 'af_domain_id' > af_100k_domainlist_post_tailchop.txt
cat optimise_boundaries_status_log_split_af_100k000* > optimise_boundaries_status_log
rm af_100k_domain_mapping_post_tailchop_split_af_100k000*
rm optimise_boundaries_status_log_split_af_100k000*
rm af_100k_domainlist_post_tailchop_split_af_100k000*
rm split_af_100k000*
```

Stats:

- 84,604 have unchanged boundaries
- 35,378 have adjusted boundaries
- 129 have failed due to not having residues above pLDDT 70



## Chop CIF files after boundaries optimisation

Time: 1hr 46 mins on 12 threads

```
split -l 7500 af_100k_domainlist_post_tailchop.txt af_100k_domainlist -a 5 -d
find af_100k_domainlist00* | xargs -P12 -I XXX sh -c 'cath-af-cli chop-cif --cif_in_dir af_cif_raw/ \
--id_file XXX \
--cif_out_dir chopped_cif_after_optimisation/'
```

TODO: Add StatusLog to ALL MODULES

Chopped 120,112 domains with optimised boundaries

## Calculate pLDDT and LUR report

Time: 59 mins on 16 threads

```
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_plddt_ -a 5 -d
find split_af_100k_plddt_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-cif-to-plddt-summary --cif_in_dir af_cif_raw/ \
--id_file XXX \
--plddt_stats_file af_100k_plddt_summary_XXX.tsv'
cat af_100k_plddt_summary_split_af_100k_plddt_*.tsv > af_100k_plddt_summary.tsv
```

## Create DSSP Files

Time: 4 hrs

Requirements: boost library. On CS HPC can be loaded as a module.

```
module load boost/1.71.0
mkdir af_dssp_dir
# split the domain list into chunks (this step is implied; the input file and chunk prefix follow the pattern of earlier steps)
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_dssp_ -a 5 -d
find split_af_100k_dssp_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-cif-to-dssp --cif_in_dir af_cif_raw/ \
--id_file XXX \
--dssp_out_dir af_dssp_dir/'
```

Comments: implement a skip when the output file is already present (see the sketch below); maybe switch to IUPRED or some other sequence-based secondary structure predictor.
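A minimal sketch of the skip-if-present idea, implemented in the wrapper shell rather than in the CLI (the `<id>.dssp` output naming is an assumption):

```
# build a to-do list containing only IDs that do not yet have a DSSP output file
grep -v 'af_domain_id' af_100k_domainlist_post_tailchop.txt | while read -r id; do
  [ -f "af_dssp_dir/${id}.dssp" ] || echo "$id"
done > dssp_todo.txt
```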

## Convert DSSP to SSE summary

Time: 4 mins

Command:
```
find split_af_100k_dssp_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-dssp-to-sse-summary --dssp_dir af_dssp_dir/ \
--id_file XXX \
--sse_out_file af_100k_sse_XXX.tsv'
```

## Convert optimised domains to Foldseek DB

Time: 2 hrs 7 mins on a single thread, RAM: 1 GB

Command:
```
cath-af-cli convert-cif-to-foldseek-db --cif_dir chopped_cif_after_optimisation/ \
--fs_querydb_dir af_foldseek_db/ \
--fs_bin_path /SAN/biosciences/alphafold/foldseek/bin/foldseek
```

Comment:
- Added an option to generate a database from only a subset of structures; this is specified via the `id_list` CLI option.
- Investigate why `m_core` on CS HPC is not seen by Foldseek. Increasing the available RAM speeds up the process significantly.



## Run Foldseek of Query_Domain_Structures against CATH_S95 Representatives

Time: 2 hrs 30 mins

Command:
```
cath-af-cli run-foldseek --fs_querydb af_foldseek_db/af_query_foldseek.db \
--fs_targetdb /SAN/cath/cath_v4_3_0/databases/foldseek/cath_s95/cath_s95_db \
--fs_bin_path /SAN/biosciences/alphafold/foldseek/bin/foldseek
```

## Convert Foldseek Results to Summary

Time: 5 mins

Command:
```
cath-af-cli convert-foldseek-output-to-summary --id_file af_100k_domainlist_post_tailchop.txt \
--fs_input_file fs_query_results.m8 \
--fs_results fs_query_results.txt
```

Starting from 110,250 unique domain IDs in the Foldseek output, 110,218 hits pass the thresholds.

## Globularity prediction

Time: 2hrs 6 mins

Split into 16 processes, processing 120,113 domains.

Command:
```
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_globularity_ -a 5 -d
find split_af_100k_globularity_000* | xargs -P16 -I XXX sh -c 'cath-af-cli measure-globularity --af_domain_list XXX \
--af_chain_mmcif_dir af_cif_raw/ \
--af_domain_globularity af_100k_globularity_XXX.txt'
cat af_100k_globularity_split_af_100k_globularity_000* > af_100k_globularity.txt
rm split_af_100k_globularity_000*
rm af_100k_globularity_split_af_100k_globularity_000*
```

As this was run in chunks, remove the 16 additional occurrences of the header line from af_100k_globularity.txt, keeping only the first; see the one-liner below.
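A one-line sketch that keeps the first line (the header) and drops any later line identical to it, without assuming the header's exact contents:

```
awk 'NR==1 {hdr=$0; print; next} $0 != hdr' af_100k_globularity.txt > tmp && mv tmp af_100k_globularity.txt
```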


## Combine results

TBD


2 changes: 2 additions & 0 deletions cath_alphaflow/cli.py
@@ -16,6 +16,7 @@
from .commands import create_md5
from .commands import convert_cif_to_fasta
from .commands import load_mongo
from .commands import measure_globularity

logging.basicConfig(
level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s"
@@ -65,3 +66,4 @@ def dump_config():
cli.add_command(create_md5.create_md5)
cli.add_command(convert_cif_to_fasta.convert_cif_to_fasta)
cli.add_command(load_mongo.load_af_from_archive)
cli.add_command(measure_globularity.measure_globularity)