qverron · Zaf4 · Feb 4, 2024 · Feb 20, 2024 · Feb 20, 2024 · Feb 20, 2024
diff --git a/.DS_Store b/.DS_Store
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,4 @@ dist/
 /__pycache__/
 *.fa
 .DS_Store
+token*
diff --git a/README.md b/README.md
@@ -1,16 +1,18 @@
 # Instructions for probe design
 
 > [!CAUTION]
-> You may want to add python and pip as aliases for python3 and pip3
+> You may want to add `pip` as an alias for `pip3`
 
 On your terminal;
 
 ```shell
-echo "alias python='python3'" >> ~/.bashrc
+
 echo "alias pip='pip3'" >> ~/.bashrc
 source ~/.bashrc
 ```
 
+Or simply use `pip3` instead of `pip`.
+
 ## Installation
 
 - Install **probe_design**  (also installs **ifpd2q**)
@@ -26,11 +28,11 @@ This adds `prb` (short for probe design) as a shell command.
 On your terminal;
 
 ```shell
-git clone http://github.com/ggirelli/oligo-melting ~/oligo-melting
-cd ~/oligo-melting
-pip install .
+pip install git+https://github.com/ggirelli/oligo-melting.git
 ```
 
+
+
 > [!NOTE]
 > nHUSH, HUSH and escafish are private repositories
 
@@ -69,6 +71,8 @@ pip install .
 
 - All the commands below assume you are starting from your project directory
 
+- To create a project directory and change into it:
+
 ```shell
 mkdir <project_name>
 cd <project_name>
@@ -80,30 +84,67 @@ cd <project_name>
 
 1. Preparation
 
-- The probe desin pipeline data is currently intended to be run on a 
-  folder called `data/` contained within the pipeline project folder <project_name>.
+Inside the project directory:
+
+```shell
+prb makedirs
+```  
+
+This creates the `data` directory and its subdirectories `data/rois` and `data/ref`.
 
 - Upon starting the pipeline, the `data/` folder should only contain
   `data/rois/` and `data/ref/` (and possibly `data/blacklist/`, see 6.). If more folders are included, consider making a back-up or simply removing them.
-
+
+2. Input file for regions of interest (ROIs)
+
+> [!CAUTION]
+> 1. Your regions-of-interest file MUST be named `all_regions.tsv`.
+> 2. `all_regions.tsv` MUST follow the [EXAMPLE](probe_design/data/rois/all_regions.tsv) format.
+> 3. `all_regions.tsv` MUST be placed within the `data/rois` folder.
+> 4. 
+
+
 - List your regions of interest and their coordinates in the input file:
   `data/rois/all_regions.tsv`
 
-- Place your reference genome in the `data/ref/` folder. Make
-  sure that the chromosome naming matches with the reference genome
-  name provided in `all_regions.tsv`.
-
-- The reference folder can alternatively be gathered using `prb get_ref_genome`.
-  In that case, adjust the script manually with the correct Ensembl 
-  address for your genome of interest.
 
-2. Generate all required subfolders inside your project directory:
+3. Download the reference genome
+
+For CHM13 T2T:
 
 ```shell
-prb makedirs
-```  
+prb get_T2T
+```
+Options:
+> -p: prefix for the chromosome files; default: CHM13.T2T <br>
+> File names will be `prefix.chromosome.ID.fa`, where ID stands for the chromosome ID, i.e., 1-22, X, Y, M
 
-3. Retrieve your region sequences and extract all k-mers of correct length:
+For GRCh38:
+
+```shell
+prb get_GRC -split
+```
+>usage: prb get_GRC [-h] [-s {homo_sapiens,mus_musculus}] [-b BUILD] [-r RELEASE] [-d DIR] [-f FILENAME] [-k] [-split]
+>
+>download ensemble genome
+>
+>options:<br>
+>  -h, --help<br>
+> ------> show this help message and exit<br>
+>  -s {homo_sapiens,mus_musculus}, --species {homo_sapiens,mus_musculus}<br>
+>  -b BUILD, --build BUILD<br>
+> ------> the build number of the genome<br>
+>  -r RELEASE, --release RELEASE<br>
+> ------> release number of the build<br>
+>  -d DIR, --dir DIR     destination directory<br>
+>  -f FILENAME, --filename FILENAME<br>
+> ------> give a specific name to the downloaded file<br>
+>  -k, --keep<br>
+> ------> whether to keep gzip files<br>
+>  -split                <br>
+> ------> whether to split into chromosomes
+
+4. Retrieve your region sequences and extract all k-mers of correct length:
 
 ```shell
 prb get_oligos DNA|RNA [optional: applyGCfilter 0|1]
@@ -115,12 +156,16 @@ prb get_oligos DNA 1
 > If indicating `RNA`, the module will assume that the transcript / region
 > sequences are already present in the `data/regions` folder. Default: `DNA.
 
-4. Test all k-mers for their homology to other regions in the genome,
+
+5. Test all k-mers for their homology to other regions in the genome,
    using nHUSH. Instead of running the entire k-mers (of length `L`) at
    once, can be sped up by testing shorter sublength oligos (of length
    l).  `-m` number of mismatches to test for (always use 1 when running
    sublength); `-t` number of threads, `-i` comb size
 
+> [!CAUTION]
+> Make sure the length (`-L`) used here matches the length in your `all_regions.tsv` file.
+
 - Full length:
 
 ``` shell
@@ -133,39 +178,47 @@ prb run_nHUSH -d RNA -L 35 -m 5 -t 40 -i 14
 prb run_nHUSH -d DNA -L 40 -l 21 -m 3 -t 40 -i 14
 ```
 
-> [!TIP]
-> ADD -g if this is the first time running with a new reference genome!  
+> prb run_nHUSH -d {DNA|RNA} -L {length} -l (optional){sublength} -m {number of mismatches} -t {threads} -i {comb size}
+
+
+
 
 - In case nHUSH is interrupted before completion, run before continuing:
 
 ``` shell
 prb unfinished_HUSH
 ```
 
-5. Recapitulate nHUSH results as a score 
+6. Recapitulate nHUSH results as a score 
 
 ``` shell
 prb reform_hush_combined DNA|RNA|-RNA length sublength until
 ```
 
+> e.g., prb reform_hush_combined DNA 40 21 3
+
 (`until` denotes the same number as specified after `-m` when running nHUSH). 
 
-6. Calculate the melting temperature of k-mers and the free energy of
+7. Calculate the melting temperature of k-mers and the free energy of
    secondary structure formation:
 
 ``` shell
-prb melt_secs_parallel (optional DNA(ref) / RNA(rev. compl))   
+prb melt_secs_parallel (optional DNA(ref) | RNA(rev. compl))   
 ```
 
+> e.g., prb melt_secs_parallel DNA
+
 7. Generate a black list of abundantly repeated oligos in the reference genome.
 
-``` shell
-prb generate_blacklist -L 40 -c 100
-```
 > [!NOTE]
 > This only needs to be run once per reference genome if not using any 
 > exclusion regions! Just save the blacklist folder between runs.
 
+``` shell
+prb generate_blacklist -L 40 -c 100
+```
+
+
 > L: oligo length <br>
 > c: min abundance to be included in oligo black list
 
@@ -191,7 +244,7 @@ prb build-db_BL -f q_bl -m 32 -i 6 -L 40 -c 100 -d 8 -T 72
 9. Query the database to get candidate probes:
 
 ``` shell
-prb cycling_query -s DNA -L 40 -m 8 -c 100 -t 40 -greedy
+prb cycling_query -s DNA -L 40 -m 8 -c 100 -t 40 -g 500 -greedy
 ```
 
 **[optional: -greedy. Speed > quality]
@@ -210,7 +263,7 @@ If enough oligos cannot be found, design probes with fewer oligos, decreasing wi
 prb summarize_probes_final
 ```
 
-Some visual elements can be obtained using the following notebooks (needs updating!):
+Some visual elements can be obtained using the following notebooks (TODO!):
 
 ``` shell
 prb plot_probe_candidates
@@ -225,26 +278,47 @@ oligos that are specific for the ROI can be included in the final probe.
 
 ### Warning: This approach occupies a lot more hard drive space!
 
-1. Preparation
+1. Generate all required subfolders:
+
+``` shell
+prb makedirs
+```
+
+2. Input file for regions of interest (ROIs)
+
+> [!CAUTION]
+> 1. Your regions-of-interest file MUST be named `all_regions.tsv`.
+> 2. `all_regions.tsv` MUST follow the [EXAMPLE](probe_design/data/rois/all_regions.tsv) format.
+> 3. `all_regions.tsv` MUST be placed within the `data/rois` folder.
+> 4. 
+
+3. Additional preparation
 - Besides `data/rois/` and `data/ref/`, the pipeline requires an additional
   `data/exclude/` folder containing BED files with the coordinates of sections
   to mask out when running HUSH for each ROI. 
-
-2. (UNLESS manually providing exclusion regions)
+
+4. Download the reference genome
+
+For CHM13 T2T (advised for repetitive regions):
+
+```shell
+prb get_T2T
+```
+Options:
+> -p: prefix for the chromosome files; default: CHM13.T2T <br>
+> File names will be `prefix.chromosome.ID.fa`, where ID stands for the chromosome ID, i.e., 1-22, X, Y, M
+
+
+5. (UNLESS manually providing exclusion regions)
 Exclude regions of interest from HUSH scan.
 
 ``` shell
 prb generate_exclude
 ```
 - The same sheet template can be used to manually add further regions to exclude.
 
-2. Generate all required subfolders:
-
-``` shell
-prb makedirs
-```
 
-3. Retrieve your region sequences and extract all k-mers of correct length:
+6. Retrieve your region sequences and extract all k-mers of correct length:
 
 ``` shell
 # (from Pipeline/)
@@ -256,13 +330,13 @@ prb get_oligos DNA
    If indicating `RNA`, the module will assume that the transcript / region
    sequences are already present in the `data/regions` folder. Default: `DNA.
 
-4. Apply the region exclusion mask on the reference genome.
+7. Apply the region exclusion mask on the reference genome.
 
 ``` shell
 prb exclude_region
 ```
 
-5. Generate a black list of abundantly repeated oligos in the reference genome.
+8. Generate a black list of abundantly repeated oligos in the reference genome.
 
 ```shell
 prb generate_blacklist -L 40 -c 100
@@ -272,18 +346,20 @@ Needs to be re-run everytime when using exclusion masks.
 L: oligo length; c: min abundance to be included in oligo black list   
 
 
-6. Test all k-mers for their homology to other regions in the genome,
-   using nHUSH. Instead of running the entire k-mers (of length `L`) at
-   once, can be sped up by testing shorter sublength oligos (of length
-   l).  `-m` number of mismatches to test for (minimum 1 for sublength;
-   more gives better information but takes longer time);
-   `-t` number of threads, `-i` comb size
+9. Test all k-mers for their homology to other regions in the genome,
+using nHUSH. Instead of testing the full-length k-mers (of length `L`) at
+once, the run can be sped up by testing shorter sublength oligos (of length
+`l`). `-m`: number of mismatches to test for (minimum 1 for sublength;
+more gives better information but takes longer);
+`-t`: number of threads; `-i`: comb size.
 
 Sublength:
 
 ```shell
 prb run_nHUSH_excl -d DNA -L 40 -l 21 -m 3 -t 40 -i 14
 ```
+
+> prb run_nHUSH_excl -d {DNA|RNA} -L {length} -l (optional){sublength} -m {number of mismatches} -t {threads} -i {comb size}
 
 Note the `_excl` specific to the exclusion mode.  
 
@@ -293,7 +369,7 @@ In case nHUSH is interrupted before completion, run before continuing:
 prb unfinished_HUSH
 ```
 
-7. Recapitulate nHUSH results as a score
+10. Recapitulate nHUSH results as a score
 
 ```shell
 # Format:
@@ -302,16 +378,18 @@ prb reform_hush_combined DNA|RNA|-RNA length sublength until
 prb reform_hush_combined DNA 40 21 3
 ```
 
+
 (`until` denotes the same number as specified after `-m` when running nHUSH).
 
-8. Calculate the melting temperature of k-mers and the free energy of
+11. Calculate the melting temperature of k-mers and the free energy of
    secondary structure formation:
 
 ```shell
 prb melt_secs_parallel (optional DNA(ref) / RNA(rev. compl))
 ```
+> e.g., prb melt_secs_parallel DNA
 
-9. Create k-mer database, convert to TSV for querying and attribute
+12. Create k-mer database, convert to TSV for querying and attribute
    score to each oligo (based on nHUSH score, GC content, melting
    temperature, homopolymer stretches, secondary structures).
 
@@ -329,7 +407,7 @@ prb build-db_BL -f q_bl -m 32 -i 6 -L 40 -c 100 -d 8 -T 72
 > T: target temperature <br>
 > m: max length of consecutive off-target match <br>
 
-10. Query the database to get candidate probes:
+13. Query the database to get candidate probes:
 
 ``` shell
 prb cycling_query -s DNA -L 40 -m 8 -c 100 -t 40 -g 500 -stepdown 50 -greedy -excl
@@ -345,10 +423,10 @@ Number of oligos to decrease probe size with every iteration that does not find
 Cycling query which generate probe candidates, then checks the resulting oligos using HUSH, removes inacceptable oligos and generate probes again.
 If enough oligos cannot be found, design probes with fewer oligos, decreasing with `stepdown` at each step.
 
-11. Summarize the final probes:
+14. Summarize the final probes:
 
 ``` shell
-prb summarize-probes-final
+prb summarize_probes_final
 ```
 
 ## Generate probes for ordering
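
The preparation step in the README says that, when the pipeline starts, `data/` should contain only `data/rois/` and `data/ref/` (and possibly `data/blacklist/`). A minimal sketch of how that could be checked before a run is shown here. The function name `check_data_dir` is hypothetical and is not part of `prb`; it only inspects non-hidden entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of prb): report entries in data/ other than
# the folders the README expects before the pipeline starts.
check_data_dir() {
    local dir="${1:-data}"
    local entry name status=0
    for entry in "$dir"/*; do
        [ -e "$entry" ] || continue    # glob did not match: nothing to check
        name="$(basename "$entry")"
        case "$name" in
            rois|ref|blacklist) ;;     # folders named in the README
            *) echo "unexpected entry: $entry"; status=1 ;;
        esac
    done
    return "$status"
}
```

Run from the project directory as `check_data_dir data`: it prints nothing and returns 0 when only the expected folders are present, and lists each extra entry otherwise, so those can be backed up or removed as the README suggests.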