Fix hard link creation issue #14

Open
wants to merge 72 commits into
base: main
0eeac7e
fix #4
Zaf4 Feb 4, 2024
053bf5e
fix rename problem
Zaf4 Feb 20, 2024
be4956e
fix rename not enough arguments
Zaf4 Feb 20, 2024
b742961
fix folder melt secs exist error
Zaf4 Feb 20, 2024
278f4ac
fix no file named ./shell/probe-query.sh
Zaf4 Feb 20, 2024
25f86bc
small improvs
Zaf4 Feb 21, 2024
1001e7b
fix line endings
Zaf4 Feb 21, 2024
3023389
revert to rename
Zaf4 Feb 21, 2024
3ae8ba9
fix hardcoded path to a command
Zaf4 Feb 22, 2024
b3893bd
revert back from q for now
Zaf4 Feb 22, 2024
e50730a
update dependencies
Zaf4 Feb 22, 2024
d76eca1
version update
Zaf4 Feb 22, 2024
cef1ffd
revert back to ifpd2q upon the fix
Zaf4 Feb 23, 2024
8d86e39
fix LF
Zaf4 Feb 23, 2024
2d96eb5
Delete .DS_Store
Zaf4 Feb 29, 2024
67c8d5f
fix for missing args
Zaf4 Feb 29, 2024
5b19584
Merge remote-tracking branch 'origin/main' into zk2poetry
Zaf4 Feb 29, 2024
d305032
add pre comnfig version
Zaf4 Mar 4, 2024
6599c9b
add old escafish
Zaf4 Mar 4, 2024
1c5861c
fix wrong naming
Zaf4 Mar 4, 2024
85145eb
stat to np for mean and std
Zaf4 Mar 5, 2024
7b458ef
fix pandas warning
Mar 6, 2024
054bd2e
update version
Mar 6, 2024
0762c0c
old
Zaf4 Mar 8, 2024
092a237
Merge remote-tracking branch 'refs/remotes/origin/zk2poetry' into zk2…
Zaf4 Mar 8, 2024
02e70df
add a t2t splitter
Zaf4 Mar 8, 2024
ccdaf97
add R scripts to prb and reshape the pipeline_exc
Mar 9, 2024
c8f7646
refactor pipeline.sh
Mar 9, 2024
5e0de6b
update version
Mar 9, 2024
2eb5318
R to lowercase
Mar 10, 2024
490b4a9
update version
Mar 10, 2024
41a486a
add -y flag to nHUSHes fix #15
Mar 10, 2024
de98a08
add correction for -y confirmation
Zaf4 Mar 11, 2024
fea2199
version update
Zaf4 Mar 11, 2024
ebf8d2c
add -y flags to db scripts
Zaf4 Mar 11, 2024
533d8d5
old cycling query
Apr 21, 2024
9894cae
-p flag to mkdir to prevent error
Apr 22, 2024
cf583da
return to hard link from soft link of genome
Apr 22, 2024
c3d5541
version update
Apr 22, 2024
eec321d
ln to cp
Zaf4 Apr 23, 2024
dbed38c
revert to hard link
Zaf4 Apr 23, 2024
1b442e3
version update 0.2.23
Zaf4 Apr 23, 2024
a61c5b9
pandas downgrade
Zaf4 May 2, 2024
11d0e72
pandas update and gap default 500
May 8, 2024
661039d
update gitignore
May 8, 2024
86ee13f
add reference to upper
Zaf4 May 9, 2024
f80de53
version update
Zaf4 May 9, 2024
368d4f4
add get_T2T and split_T2T
Zaf4 May 14, 2024
22a1d23
fix get_T2T
Zaf4 May 14, 2024
3dcdd49
add -p flag to get_t2t
Zaf4 May 14, 2024
4158e34
nicer prints
Zaf4 May 14, 2024
ea020e0
add reset
Zaf4 May 14, 2024
76bd28c
update version
Zaf4 May 14, 2024
f828f60
revert to ln -s and add ensemble build finder
Zaf4 May 16, 2024
3b14d9f
create HUSH dir auto
Zaf4 May 16, 2024
86b8ebb
add get grc
Zaf4 May 17, 2024
aa42cfb
rename to genome.fa
Zaf4 May 17, 2024
7c12b15
directory flag
Zaf4 May 17, 2024
154604f
udapte prog name and genome.fa loc
Zaf4 May 17, 2024
1da9185
shutil move will overwrite
Zaf4 May 20, 2024
ff9396c
simplify makedirs
Zaf4 May 20, 2024
367b453
update readme
Zaf4 May 20, 2024
e765a5c
another update
Zaf4 May 20, 2024
42bd291
Update README alt. ordering
Zaf4 May 21, 2024
4bafabd
add summary test
Zaf4 May 21, 2024
10b1365
add color diff colors
Zaf4 May 23, 2024
83d484e
add visual report
Zaf4 May 27, 2024
a1c0e17
update pyroject
Zaf4 May 27, 2024
2cc2bcd
update version
Zaf4 Jun 3, 2024
bdedb23
add r string for regex
Zaf4 Jun 4, 2024
5d423b6
version update
Zaf4 Jun 4, 2024
1229242
fix regular ex
Zaf4 Jun 4, 2024
Binary file removed .DS_Store
Binary file not shown.
1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ dist/
/__pycache__/
*.fa
.DS_Store
token*
184 changes: 131 additions & 53 deletions README.md
@@ -1,16 +1,18 @@
# Instructions for probe design

> [!CAUTION]
> You may want to add python and pip as aliases for python3 and pip3
> You may want to add pip as alias for pip3

On your terminal;

```shell
echo "alias python='python3'" >> ~/.bashrc

echo "alias pip='pip3'" >> ~/.bashrc
source ~/.bashrc
```

Or simply use `pip3` instead of `pip`

## Installation

- Install **probe_design** (also installs **ifpd2q**)
@@ -26,11 +28,11 @@ This adds `prb` (short for probe design) as a shell command.
On your terminal;

```shell
git clone http://github.com/ggirelli/oligo-melting ~/oligo-melting
cd ~/oligo-melting
pip install .
pip install git+https://github.com/ggirelli/oligo-melting.git
```



> [!NOTE]
> nHUSH, HUSH and escafish are private repositories

@@ -69,6 +71,8 @@ pip install .

- All the commands below assume you are starting from your project directory

- To create a project directory and change into it:

```shell
mkdir <project_name>
cd <project_name>
@@ -80,30 +84,67 @@ cd <project_name>

1. Preparation

- The probe design pipeline is currently intended to be run on a
folder called `data/` contained within the pipeline project folder <project_name>.
Inside the project directory:

```shell
prb makedirs
```

This creates the `data` directory and its subdirectories `data/rois` and `data/ref`.

- Upon starting the pipeline, the `data/` folder should only contain
`data/rois/` and `data/ref/` (and possibly `data/blacklist/`, see 6.). If more folders are included, consider making a back-up or simply removing them.
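
As a rough sketch of the layout described above (not the actual implementation of `prb makedirs`), the shell functions below create the two folders and list anything in `data/` other than `rois`, `ref`, and `blacklist`. The names `make_probe_dirs` and `list_unexpected` are hypothetical and exist only for this illustration; in the exclusion mode described later, `data/exclude` would also be expected.

```shell
# Hypothetical helper: reproduces only the folder layout that
# `prb makedirs` is described as creating (data/rois and data/ref).
make_probe_dirs() {
    project_dir="${1:-.}"
    # -p creates missing parents and does not fail if the folders already exist
    mkdir -p "$project_dir/data/rois" "$project_dir/data/ref"
}

# Hypothetical helper: print every entry in data/ that is not one of
# the folders the pipeline expects at start-up.
list_unexpected() {
    project_dir="${1:-.}"
    for entry in "$project_dir"/data/*; do
        # Skip the unexpanded pattern when data/ is empty
        [ -e "$entry" ] || continue
        name="$(basename "$entry")"
        case "$name" in
            rois|ref|blacklist) ;;
            *) echo "$name" ;;
        esac
    done
}
```

Calling `list_unexpected .` from the project directory before a run prints nothing when `data/` is clean; any printed name is a candidate for back-up or removal.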


2. Input file for Regions of Interest (ROIs)

> [!CAUTION]
> 1. Your regions of interest file MUST be named `all_regions.tsv`
> 2. `all_regions.tsv` MUST follow the [EXAMPLE](probe_design/data/rois/all_regions.tsv) format.
> 3. `all_regions.tsv` MUST be placed within `data/rois` folder.
> 4.


- List your regions of interest and their coordinates in the input file:
`data/rois/all_regions.tsv`

- Place your reference genome in the `data/ref/` folder. Make
sure that the chromosome naming matches with the reference genome
name provided in `all_regions.tsv`.

- The reference folder can alternatively be gathered using `prb get_ref_genome`.
In that case, adjust the script manually with the correct Ensembl
address for your genome of interest.

2. Generate all required subfolders inside your project directory:
3. Download Reference genome

For CHM13 T2T

```shell
prb makedirs
```
prb get_T2T
```
Options:
> -p: prefix for the chromosome files; default: CHM13.T2T <br>
> Files are named `prefix.chromosome.ID.fa`, where ID is the chromosome ID, i.e., 1-22, X, Y, M

3. Retrieve your region sequences and extract all k-mers of correct length:
For GRCh38

```shell
prb get_GRC -split
```
>usage: prb get_GRC [-h] [-s {homo_sapiens,mus_musculus}] [-b BUILD] [-r RELEASE] [-d DIR] [-f FILENAME] [-k] [-split]
>
>download ensemble genome
>
>options:<br>
> -h, --help<br>
> ------> show this help message and exit<br>
> -s {homo_sapiens,mus_musculus}, --species {homo_sapiens,mus_musculus}<br>
> -b BUILD, --build BUILD<br>
> ------> the build number of the genome<br>
> -r RELEASE, --release RELEASE<br>
> ------> release number of the build<br>
> -d DIR, --dir DIR destination directory<br>
> -f FILENAME, --filename FILENAME<br>
> ------> give a specific name to the downloaded file<br>
> -k, --keep<br>
> ------> whether to keep gzip files<br>
> -split <br>
> ------> whether to split into chromosomes

4. Retrieve your region sequences and extract all k-mers of correct length:

```shell
prb get_oligos DNA|RNA [optional: applyGCfilter 0|1]
@@ -115,12 +156,16 @@ prb get_oligos DNA 1
> If indicating `RNA`, the module will assume that the transcript / region
> sequences are already present in the `data/regions` folder. Default: `DNA`.

4. Test all k-mers for their homology to other regions in the genome,

5. Test all k-mers for their homology to other regions in the genome,
using nHUSH. Instead of testing the full-length k-mers (of length `L`) at
once, the run can be sped up by testing shorter sublength oligos (of length
`l`). `-m`: number of mismatches to test for (always use 1 when running
sublength); `-t`: number of threads; `-i`: comb size.

> [!CAUTION]
> Make sure the length (`-L`) here matches the length in your `all_regions.tsv` file.

- Full length:

``` shell
@@ -133,39 +178,47 @@ prb run_nHUSH -d RNA -L 35 -m 5 -t 40 -i 14
prb run_nHUSH -d DNA -L 40 -l 21 -m 3 -t 40 -i 14
```

> [!TIP]
> ADD -g if this is the first time running with a new reference genome!
> prb run_nHUSH -d {DNA|RNA} -L {length} -l (optional){sublength} -m {number of mismatches} -t {threads} -i {comb size}




- In case nHUSH is interrupted before completion, run before continuing:

``` shell
prb unfinished_HUSH
```
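
To make the lengths in the nHUSH step concrete: an oligo of length `L` contains `L - l + 1` windows of length `l`, so `-L 40 -l 21` corresponds to 20 windows per oligo. The sketch below only enumerates those windows for a single sequence. How nHUSH itself handles sublength searches is not described in this README, and `sublength_windows` is a hypothetical helper name used for illustration.

```shell
# Hypothetical helper: print every window of length $2 contained in
# the sequence $1, one per line (illustration only, not part of prb).
sublength_windows() {
    awk -v s="$1" -v l="$2" 'BEGIN {
        # awk strings are 1-indexed; stop when the window would run past the end
        for (i = 1; i + l - 1 <= length(s); i++) print substr(s, i, l)
    }'
}
```

For example, `sublength_windows ACGTAC 4` prints `ACGT`, `CGTA`, and `GTAC`.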

5. Recapitulate nHUSH results as a score
6. Recapitulate nHUSH results as a score

``` shell
prb reform_hush_combined DNA|RNA|-RNA length sublength until
```

> e.g., prb reform_hush_combined DNA 40 21 3

(`until` denotes the same number as specified after `-m` when running nHUSH).

6. Calculate the melting temperature of k-mers and the free energy of
7. Calculate the melting temperature of k-mers and the free energy of
secondary structure formation:

``` shell
prb melt_secs_parallel (optional DNA(ref) / RNA(rev. compl))
prb melt_secs_parallel (optional DNA(ref) | RNA(rev. compl))
```

> e.g., prb melt_secs_parallel DNA

7. Generate a black list of abundantly repeated oligos in the reference genome.

``` shell
prb generate_blacklist -L 40 -c 100
```
> [!NOTE]
> This only needs to be run once per reference genome if not using any
> exclusion regions! Just save the blacklist folder between runs.

``` shell
prb generate_blacklist -L 40 -c 100
```


> L: oligo length <br>
> c: min abundance to be included in oligo black list
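
As a toy illustration of what `-L` and `-c` mean here, the sketch below counts every k-mer of length `L` in a single sequence string and prints those that occur at least `c` times. This is an assumption-laden simplification: it is not how `prb generate_blacklist` is implemented, it ignores reverse complements and multi-chromosome input, and `abundant_kmers` is a hypothetical name.

```shell
# Hypothetical helper: abundant_kmers SEQ L C
# Prints, sorted, each k-mer of length L that occurs at least C times in SEQ.
abundant_kmers() {
    awk -v s="$1" -v L="$2" -v c="$3" 'BEGIN {
        # Count every overlapping window of length L
        for (i = 1; i + L - 1 <= length(s); i++) count[substr(s, i, L)]++
        # Keep only k-mers that reach the minimum abundance
        for (k in count) if (count[k] >= c) print k
    }' | sort
}
```

For example, `abundant_kmers ACGACGACGT 3 3` prints only `ACG`, the one 3-mer that occurs three times in that string.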

@@ -191,7 +244,7 @@ prb build-db_BL -f q_bl -m 32 -i 6 -L 40 -c 100 -d 8 -T 72
9. Query the database to get candidate probes:

``` shell
prb cycling_query -s DNA -L 40 -m 8 -c 100 -t 40 -greedy
prb cycling_query -s DNA -L 40 -m 8 -c 100 -t 40 -g 500 -greedy
```

**[optional: -greedy. Speed > quality]
@@ -210,7 +263,7 @@ If enough oligos cannot be found, design probes with fewer oligos, decreasing wi…
prb summarize_probes_final
```

Some visual elements can be obtained using the following notebooks (needs updating!):
Some visual elements can be obtained using the following notebooks (TODO!):

``` shell
prb plot_probe_candidates
@@ -225,26 +278,47 @@ oligos that are specific for the ROI can be included in the final probe.

### Warning: This approach occupies a lot more hard drive space!

1. Preparation
1. Generate all required subfolders:

``` shell
prb makedirs
```

2. Input file for Regions of Interest (ROIs)

> [!CAUTION]
> 1. Your regions of interest file MUST be named `all_regions.tsv`
> 2. `all_regions.tsv` MUST follow the [EXAMPLE](probe_design/data/rois/all_regions.tsv) format.
> 3. `all_regions.tsv` MUST be placed within `data/rois` folder.
> 4.

3. Additional Preparation
- Besides `data/rois/` and `data/ref/`, the pipeline requires an additional
`data/exclude/` folder containing BED files with the coordinates of sections
to mask out when running HUSH for each ROI.

2. (UNLESS manually providing exclusion regions)

4. Download Reference genome

For CHM13 T2T (advised for repetitive regions)

```shell
prb get_T2T
```
Options:
> -p: prefix for the chromosome files; default: CHM13.T2T <br>
> Files are named `prefix.chromosome.ID.fa`, where ID is the chromosome ID, i.e., 1-22, X, Y, M
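
The naming scheme above can be spelled out with a short sketch. It assumes that `chromosome` appears literally in each file name and that the files are named exactly as the option text suggests; neither is confirmed beyond that text, and `t2t_filenames` is a hypothetical helper, not a `prb` subcommand.

```shell
# Hypothetical helper: list the file names expected after `prb get_T2T`,
# assuming the pattern prefix.chromosome.ID.fa with ID in 1-22, X, Y, M.
t2t_filenames() {
    prefix="${1:-CHM13.T2T}"   # default prefix taken from the -p option text
    for id in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M; do
        printf '%s.chromosome.%s.fa\n' "$prefix" "$id"
    done
}
```

Under these assumptions the default prefix yields 25 names, from `CHM13.T2T.chromosome.1.fa` to `CHM13.T2T.chromosome.M.fa`, which can be compared against the contents of `data/ref/`.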


5. (UNLESS manually providing exclusion regions)
Exclude regions of interest from HUSH scan.

``` shell
prb generate_exclude
```
- The same sheet template can be used to manually add further regions to exclude.

2. Generate all required subfolders:

``` shell
prb makedirs
```

3. Retrieve your region sequences and extract all k-mers of correct length:
6. Retrieve your region sequences and extract all k-mers of correct length:

``` shell
# (from Pipeline/)
@@ -256,13 +330,13 @@ prb get_oligos DNA
If indicating `RNA`, the module will assume that the transcript / region
sequences are already present in the `data/regions` folder. Default: `DNA`.

4. Apply the region exclusion mask on the reference genome.
7. Apply the region exclusion mask on the reference genome.

``` shell
prb exclude_region
```

5. Generate a black list of abundantly repeated oligos in the reference genome.
8. Generate a black list of abundantly repeated oligos in the reference genome.

```shell
prb generate_blacklist -L 40 -c 100
@@ -272,18 +346,20 @@ Needs to be re-run everytime when using exclusion masks.
L: oligo length; c: min abundance to be included in oligo black list


6. Test all k-mers for their homology to other regions in the genome,
using nHUSH. Instead of running the entire k-mers (of length `L`) at
once, can be sped up by testing shorter sublength oligos (of length
l). `-m` number of mismatches to test for (minimum 1 for sublength;
more gives better information but takes longer time);
`-t` number of threads, `-i` comb size
9. Test all k-mers for their homology to other regions in the genome,
using nHUSH. Instead of running the entire k-mers (of length `L`) at
once, can be sped up by testing shorter sublength oligos (of length
l). `-m` number of mismatches to test for (minimum 1 for sublength;
more gives better information but takes longer time);
`-t` number of threads, `-i` comb size

Sublength:

```shell
prb run_nHUSH_excl -d DNA -L 40 -l 21 -m 3 -t 40 -i 14
```

> prb run_nHUSH_excl -d {DNA|RNA} -L {length} -l (optional){sublength} -m {number of mismatches} -t {threads} -i {comb size}

Note the `_excl` specific to the exclusion mode.

@@ -293,7 +369,7 @@ In case nHUSH is interrupted before completion, run before continuing:
prb unfinished_HUSH
```

7. Recapitulate nHUSH results as a score
10. Recapitulate nHUSH results as a score

```shell
# Format:
@@ -302,16 +378,18 @@ prb reform_hush_combined DNA|RNA|-RNA length sublength until
prb reform_hush_combined DNA 40 21 3
```


(`until` denotes the same number as specified after `-m` when running nHUSH).

8. Calculate the melting temperature of k-mers and the free energy of
11. Calculate the melting temperature of k-mers and the free energy of
secondary structure formation:

```shell
prb melt_secs_parallel (optional DNA(ref) / RNA(rev. compl))
```
> e.g., prb melt_secs_parallel DNA

9. Create k-mer database, convert to TSV for querying and attribute
12. Create k-mer database, convert to TSV for querying and attribute
score to each oligo (based on nHUSH score, GC content, melting
temperature, homopolymer stretches, secondary structures).

@@ -329,7 +407,7 @@ prb build-db_BL -f q_bl -m 32 -i 6 -L 40 -c 100 -d 8 -T 72
> T: target temperature <br>
> m: max length of consecutive off-target match <br>

10. Query the database to get candidate probes:
13. Query the database to get candidate probes:

``` shell
prb cycling_query -s DNA -L 40 -m 8 -c 100 -t 40 -g 500 -stepdown 50 -greedy -excl
@@ -345,10 +423,10 @@ Number of oligos to decrease probe size with every iteration that does not find…
The cycling query generates probe candidates, checks the resulting oligos using HUSH, removes unacceptable oligos, and generates probes again.
If enough oligos cannot be found, design probes with fewer oligos, decreasing with `stepdown` at each step.

11. Summarize the final probes:
14. Summarize the final probes:

``` shell
prb summarize-probes-final
prb summarize_probes_final
```

## Generate probes for ordering