Dev #20

Merged
merged 68 commits into from
Oct 10, 2023
Commits
185dfad
add minimum pyslow5 version for readgroups api
Psy-Fer Sep 26, 2023
d6876ac
add --seq_sum for sequencing_summary.txt file
Psy-Fer Sep 26, 2023
a430b75
add barcode demultiplexing for fastq output
Psy-Fer Oct 3, 2023
f01dafe
snakeworkflow added, awaiting failure
hasindu2008 Oct 4, 2023
637037f
update test scripts
hasindu2008 Oct 4, 2023
36c0fa7
score2
hasindu2008 Oct 4, 2023
bc469ff
update test
hasindu2008 Oct 4, 2023
c8a08b1
update test
hasindu2008 Oct 4, 2023
48d6416
remora test
hasindu2008 Oct 4, 2023
ee32c63
rectify the remoratest
hasindu2008 Oct 4, 2023
c98d53b
sesum compare
hasindu2008 Oct 4, 2023
36a8249
update
hasindu2008 Oct 4, 2023
8035331
split tests and add scoring
Psy-Fer Oct 5, 2023
5a41708
add sam demuxing and fix seq sum
Psy-Fer Oct 5, 2023
7de35b0
add citation
Psy-Fer Oct 5, 2023
0d67685
add more seqsum tests
hasindu2008 Oct 6, 2023
f543607
demux script
hasindu2008 Oct 6, 2023
451b652
demux script
hasindu2008 Oct 6, 2023
db9bb13
demux pass
hasindu2008 Oct 6, 2023
188c924
dash to dot in seq sum, round qscore output
Psy-Fer Oct 7, 2023
2d6d88a
mkae output dir if not exist
Psy-Fer Oct 7, 2023
ee6107a
more tests
hasindu2008 Oct 7, 2023
7e04665
dash to dot for passes filtering
Psy-Fer Oct 7, 2023
f523c00
add parent_read_id to barcode summary
Psy-Fer Oct 7, 2023
02e9725
Create formats.md
hasindu2008 Oct 7, 2023
0745138
update tests
hasindu2008 Oct 7, 2023
84b823f
Create param.md
hasindu2008 Oct 7, 2023
a466fd4
Update formats.md
hasindu2008 Oct 8, 2023
2a08bd5
Update formats.md
hasindu2008 Oct 8, 2023
817bf5b
Update README.md
hasindu2008 Oct 8, 2023
8e1353f
a new test
hasindu2008 Oct 8, 2023
67f0f86
dorado test
hasindu2008 Oct 8, 2023
639039e
more dorado tests - no exit for now
hasindu2008 Oct 8, 2023
511d0ab
polish test scripts
hasindu2008 Oct 8, 2023
7d44638
updating seqsum
Psy-Fer Oct 8, 2023
79d1950
update the format spec
Psy-Fer Oct 9, 2023
26176fe
add parent read id sam tag pi:Z: on read splitting
Psy-Fer Oct 9, 2023
fc93a2c
switch read_id and parent_read_id in barcode sum
Psy-Fer Oct 9, 2023
6620b65
Check input and output extentions are correct
Psy-Fer Oct 9, 2023
beee36b
remove input check, as it excludes dir reading
Psy-Fer Oct 9, 2023
905fccd
update help to be more readable
Psy-Fer Oct 9, 2023
aecafe0
update usage with more readable version
Psy-Fer Oct 9, 2023
b1a860d
initial documentation for split_qscore.py
Psy-Fer Oct 9, 2023
4aebfa6
remove pore_type
Psy-Fer Oct 9, 2023
1557ce3
add pi sam tag
Psy-Fer Oct 9, 2023
4cdff61
remove pore_type
Psy-Fer Oct 9, 2023
7301416
update readme
hasindu2008 Oct 9, 2023
fa37041
fix formats
hasindu2008 Oct 9, 2023
0cc7141
Update formats.md
hasindu2008 Oct 9, 2023
1a03aed
Update formats.md
hasindu2008 Oct 9, 2023
7bff954
update formats
hasindu2008 Oct 9, 2023
f721e59
read splitting pi tag on meth calls
Psy-Fer Oct 9, 2023
4816b73
Merge branch 'dev' of github.com:Psy-Fer/buttery-eel into dev
Psy-Fer Oct 9, 2023
ef3b139
updated dorado scripts
hasindu2008 Oct 9, 2023
7bf51e0
Merge branch 'dev' of github.com:Psy-Fer/buttery-eel into dev
hasindu2008 Oct 9, 2023
f323db5
debugging
Psy-Fer Oct 9, 2023
50f7dea
debug
Psy-Fer Oct 9, 2023
d3ede6a
--beam_width set to 40 for dorado server
Psy-Fer Oct 9, 2023
7b8d1dd
undo debug changes
Psy-Fer Oct 9, 2023
9d91485
dorado scripts update not to exit
hasindu2008 Oct 9, 2023
a2f6f9e
add notice about dorado basecalls being different
Psy-Fer Oct 9, 2023
2c229fe
seq_sum start_time value and header update
Psy-Fer Oct 9, 2023
4a1e94a
format spec updated
Psy-Fer Oct 9, 2023
9c6b56c
fix dorado scripts
hasindu2008 Oct 10, 2023
b660f4f
convert duration to seconds
Psy-Fer Oct 10, 2023
1836abd
Merge branch 'dev' of github.com:Psy-Fer/buttery-eel into dev
Psy-Fer Oct 10, 2023
4568574
Update formats.md
hasindu2008 Oct 10, 2023
4d50493
update scripts
hasindu2008 Oct 10, 2023
36 changes: 36 additions & 0 deletions .github/workflows/snake.yml
@@ -0,0 +1,36 @@
name: snake CI

on:
  push:
    branches: [ '**' ]
  pull_request:
    branches: [ '**' ]

jobs:
  ubuntu_20:
    name: ubuntu-20.04
    runs-on: ubuntu-20.04
    steps:
    - uses: actions/checkout@v2
      with:
        submodules: recursive
    - name: install packages
      run: sudo apt-get update && sudo apt-get install -y zlib1g-dev gcc python3 python3-pip && pip3 install setuptools cython numpy
    - name: install
      run: pip3 install --upgrade pip && pip3 install .
    - name: test
      run: buttery-eel --help
  ubuntu_22:
    name: ubuntu-22.04
    runs-on: ubuntu-22.04
    steps:
    - uses: actions/checkout@v2
      with:
        submodules: recursive
    - name: install packages
      run: sudo apt-get update && sudo apt-get install -y zlib1g-dev gcc python3 python3-pip && pip3 install setuptools cython numpy
    - name: install
      run: pip3 install --upgrade pip && pip3 install .
    - name: test
      run: buttery-eel --help

163 changes: 109 additions & 54 deletions README.md
@@ -8,29 +8,32 @@



## The buttery eel - A slow5 guppy basecaller wrapper
## The buttery eel - A slow5 guppy/dorado basecaller wrapper

`buttery-eel` is a wrapper for `guppy`. It allows us to read [`SLOW5` files](https://github.com/hasindu2008/slow5tools), and send that data to [`guppy`](https://community.nanoporetech.com/downloads) to basecall. It requires matching versions of [`guppy`](https://community.nanoporetech.com/downloads) and [`ont-pyguppy-client-lib`](https://pypi.org/project/ont-pyguppy-client-lib/) to work.
`buttery-eel` is a wrapper for `guppy` and `dorado`. It allows us to read [`SLOW5` files](https://github.com/hasindu2008/slow5tools), and send that data to a [`guppy`](https://community.nanoporetech.com/downloads) or `dorado` server to basecall. It requires matching versions of [`guppy`](https://community.nanoporetech.com/downloads) and [`ont-pyguppy-client-lib`](https://pypi.org/project/ont-pyguppy-client-lib/) to work.

You can download guppy here: https://community.nanoporetech.com/downloads. An ONT login is required to access that page, sorry no easy way around that one without legal headaches.
You can download guppy or dorado server here: https://community.nanoporetech.com/downloads. An ONT login is required to access that page, sorry no easy way around that one without legal headaches.

The main branch is a simple single-process version (one process to communicate to/from the Guppy client) that works well for HAC and SUP models. If you want performance scaling for multi-GPU setups, especially for FAST basecalling or shorter reads, please use the multi-process version (parallel processes to communicate to/from Guppy client) under the `multiproc` branch.
- Currently, the main branch is the multi-process version (parallel processes to communicate to/from Guppy client) that enables performance scaling for multi-GPU setups, especially for FAST basecalling or shorter reads. A simple single-process version (one process to communicate to/from the Guppy client) that works well for HAC and SUP models is available in the `singleproc` branch for learning purposes.
- Before v0.3.3, the main branch was the single-process version (`singleproc` branch) and the multi-process version was under the `multiproc` branch.

# Quick start
## Dorado basecalls not matching

Using python3, preferably python3.7 to 3.9. Python 3.10 and higher does not yet have any pip wheel builds available for v6.3.8 and lower of guppy
Currently, if you basecall the same data with `dorado==0.3.4`, `dorado_basecall_server`/`ont_basecall_client==7.4.1`, and `ont-pyguppy-client-lib==7.4.1`, you will get three different answers.
We are following up with ONT on why this is the case. The outputs are very close, but the base calls are not identical.

Install a version of `guppy` (something higher than 4) where `GUPPY_VERSION` is the version, for example, `6.3.8`
There is no such issue with the latest guppy build, `6.5.7`.

Download: https://community.nanoporetech.com/downloads

The `guppy` and `ont-pyguppy-client-lib` versions need to match
# Quickstart

# if GUPPY_VERSION=6.3.8
# modify requirements.txt to have:
# ont-pyguppy-client-lib==6.3.8
Using python3, preferably python3.7 to 3.9. Python 3.10 and higher does not yet have any pip wheel builds available for v6.3.8 and lower of guppy

Install a version of `guppy` (something higher than 4) where `GUPPY_VERSION` is the version, for example, `6.3.8`. Alternatively, you can install a version of `dorado server` too.
Download: https://community.nanoporetech.com/downloads

The `guppy` and `ont-pyguppy-client-lib` versions need to match
```
git clone https://github.com/Psy-Fer/buttery-eel.git
cd buttery-eel
python3 -m venv venv3
@@ -43,49 +46,102 @@ The `guppy` and `ont-pyguppy-client-lib` versions need to match
# set this first to ensure pyslow5 installs with zstd:
# export PYSLOW5_ZSTD=1

# if GUPPY_VERSION=6.3.8
# modify requirements.txt to have:
# ont-pyguppy-client-lib==6.3.8
# if using DORADO_SERVER_VERSION=7.1.4
# ont-pyguppy-client-lib==7.1.4

python setup.py install

buttery-eel --help


Usage:

usage: buttery-eel [-h] -i INPUT -o OUTPUT -g GUPPY_BIN --config CONFIG [--guppy_batchsize GUPPY_BATCHSIZE] [--call_mods] [-q QSCORE] [--slow5_threads SLOW5_THREADS] [--slow5_batchsize SLOW5_BATCHSIZE]
[--quiet] [--moves_out] [--do_read_splitting] [--min_score_read_splitting MIN_SCORE_READ_SPLITTING] [--log LOG] [--seq_sum] [-v]

buttery-eel - wrapping guppy for SLOW5 basecalling

optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
input blow5 file for basecalling (default: None)
-o OUTPUT, --output OUTPUT
output .fastq or unaligned .sam file to write (default: None)
-g GUPPY_BIN, --guppy_bin GUPPY_BIN
path to ont_guppy/bin folder (default: None)
--config CONFIG basecalling model config (default: dna_r9.4.1_450bps_fast.cfg)
--guppy_batchsize GUPPY_BATCHSIZE
number of reads to send to guppy at a time. (default: 4000)
--call_mods output MM/ML tags for methylation - will output sam - use with appropriate mod config (default: False)
-q QSCORE, --qscore QSCORE
A mean q-score to split fastq/sam files into pass/fail output (default: None)
--slow5_threads SLOW5_THREADS
Number of threads to use reading slow5 file (default: 4)
--slow5_batchsize SLOW5_BATCHSIZE
Number of reads to process at a time reading slow5 (default: 4000)
--quiet Don't print progress (default: False)
--moves_out output move table (sam format only) (default: False)
--do_read_splitting Perform read splitting based on mid-strand adapter detection (default: False)
--min_score_read_splitting MIN_SCORE_READ_SPLITTING
Minimum mid-strand adapter score for reads to be split (default: 50.0)
--log LOG guppy log folder path (default: buttery_guppy_logs)
--seq_sum [Experimental] - Write out sequencing_summary.tsv file (default: False)
-v, --version Prints version





```
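
The install comments above pin `ont-pyguppy-client-lib` to the same version as the basecaller. A minimal sketch of a pre-flight check for that requirement follows. The function names and the hard-coded server version are hypothetical and are not part of buttery-eel; `importlib.metadata` is only available from Python 3.8, so on 3.7 the lookup simply returns `None`.

```python
from typing import Optional

try:
    # Standard library from Python 3.8 onwards.
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python 3.7: no importlib.metadata
    version = None
    PackageNotFoundError = Exception


def installed_client_version(package: str = "ont-pyguppy-client-lib") -> Optional[str]:
    """Return the installed version of the client library, or None if it cannot be found."""
    if version is None:
        return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def versions_match(server_version: str, client_version: Optional[str]) -> bool:
    """True only when both version strings are present and identical."""
    if client_version is None:
        return False
    return server_version.strip() == client_version.strip()


# Assumption: you know which guppy/dorado server version you downloaded.
server = "6.3.8"
client = installed_client_version()
if not versions_match(server, client):
    print(f"warning: server {server} and client library {client} do not match")
```

Exact string equality is used here because the README asks for matching versions; whether a looser comparison would also work is not stated.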

Suppose the name of the virtual environment you created is venv3 and resides directly in the root of the cloned buttery-eel git repository. In that case, you can use the wrapper script available under `/path/to/repository/scripts/eel` for conveniently executing buttery-eel. This script will automatically source the virtual environment, find a free port, execute the buttery-eel with the parameters you specified and finally deactivate the virtual environment. If you add the path of `/path/to/repository/scripts/` to your PATH environment variable, you can simply use buttery-eel as:
```
eel -g /path/to/ont-guppy/bin/ --config dna_r10.4.1_e8.2_400bps_hac_prom.cfg --device cuda:all -i reads.blow5 -o reads.reads # and any other parameters
```

Alternatively, you can manually execute buttery-eel if you have sourced the virtual environment. You must provide `--port PORT --use_tcp` parameters manually in this case. Example:
```
buttery-eel -g /path/to/ont-guppy/bin/ --config dna_r10.4.1_e8.2_400bps_hac_prom.cfg --device cuda:all -i reads.blow5 -o reads.reads.fastq --port 5000 --use_tcp # and any other parameters
```
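
When running buttery-eel manually, you need a free port for `--port`. The sketch below shows one common way to ask the operating system for one. It is an illustration only and is not taken from the `eel` wrapper script, whose actual port-finding method is not shown here; `find_free_port` is a hypothetical helper name.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


port = find_free_port()
# Flags as used in the manual example above.
print(f"--port {port} --use_tcp")
```

The port is released as soon as the socket closes, so another process could claim it before the basecall server starts. For an interactive run that small window is usually acceptable.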

# Usage

```
usage: buttery-eel [-h] -i INPUT -o OUTPUT -g GUPPY_BIN --config CONFIG [--guppy_batchsize GUPPY_BATCHSIZE] [--call_mods] [-q QSCORE] [--slow5_threads SLOW5_THREADS]
[--procs PROCS] [--slow5_batchsize SLOW5_BATCHSIZE] [--quiet] [--max_read_queue_size MAX_READ_QUEUE_SIZE] [--log LOG] [--moves_out]
[--do_read_splitting] [--min_score_read_splitting MIN_SCORE_READ_SPLITTING] [--detect_adapter] [--min_score_adapter MIN_SCORE_ADAPTER]
[--trim_adapters] [--detect_mid_strand_adapter] [--seq_sum] [--barcode_kits BARCODE_KITS] [--enable_trim_barcodes] [--require_barcodes_both_ends]
[--detect_mid_strand_barcodes] [--min_score_barcode_front MIN_SCORE_BARCODE_FRONT] [--min_score_barcode_rear MIN_SCORE_BARCODE_REAR]
[--min_score_barcode_mid MIN_SCORE_BARCODE_MID] [--profile] [-v]

buttery-eel - wrapping guppy/dorado for SLOW5 basecalling

optional arguments:
-h, --help show this help message and exit
--profile run cProfile on all processes - for debugging benchmarking (default: False)
-v, --version Prints version

Run Options:
-i INPUT, --input INPUT
input blow5 file or directory for basecalling (default: None)
-o OUTPUT, --output OUTPUT
output .fastq or unaligned .sam file to write (default: None)
-g GUPPY_BIN, --guppy_bin GUPPY_BIN
path to ont_guppy/bin or ont-dorado-server/bin folder (default: None)
--config CONFIG basecalling model config (default: dna_r9.4.1_450bps_fast.cfg)
--guppy_batchsize GUPPY_BATCHSIZE
number of reads to send to guppy/dorado at a time. (default: 4000)
--call_mods output MM/ML tags for methylation - will output sam - use with appropriate mod config (default: False)
-q QSCORE, --qscore QSCORE
A mean q-score to split fastq/sam files into pass/fail output (default: None)
--slow5_threads SLOW5_THREADS
Number of threads to use reading slow5 file (default: 4)
--procs PROCS Number of worker processes to use processing reads (default: 4)
--slow5_batchsize SLOW5_BATCHSIZE
Number of reads to process at a time reading slow5 (default: 4000)
--quiet Don't print progress (default: False)
--max_read_queue_size MAX_READ_QUEUE_SIZE
Number of reads to process at a time reading slow5 (default: 20000)
--log LOG guppy/dorado log folder path (default: buttery_basecaller_logs)
--moves_out output move table (sam format only) (default: False)

Sequencing summary Options:
--seq_sum Write out sequencing_summary.txt file (default: False)

Read splitting Options:
--do_read_splitting Perform read splitting based on mid-strand adapter detection (default: False)
--min_score_read_splitting MIN_SCORE_READ_SPLITTING
Minimum mid-strand adapter score for reads to be split (default: 50.0)

Adapter trimming Options:
--detect_adapter Enable detection of adapters at the front and rear of the sequence (default: False)
--min_score_adapter MIN_SCORE_ADAPTER
Minimum score for a front or rear adapter to be classified. Default is 60. (default: 60.0)
--trim_adapters Flag indicating that adapters should be trimmed. Default is False. (default: False)
--detect_mid_strand_adapter
Flag indicating that read will be marked as unclassified if the adapter sequence appears within the strand itself. Default is False. (default:
False)

Barcode demultiplexing Options:
--barcode_kits BARCODE_KITS
Strings naming each barcode kit to use. Default is to not do barcoding. (default: None)
--enable_trim_barcodes
Flag indicating that barcodes should be trimmed. (default: False)
--require_barcodes_both_ends
Flag indicating that barcodes must be at both ends. (default: False)
--detect_mid_strand_barcodes
Flag indicating that read will be marked as unclassified if barcodes appear within the strand itself. (default: False)
--min_score_barcode_front MIN_SCORE_BARCODE_FRONT
Minimum score for a front barcode to be classified (default: 60.0)
--min_score_barcode_rear MIN_SCORE_BARCODE_REAR
Minimum score for a rear barcode to be classified (default: 60.0)
--min_score_barcode_mid MIN_SCORE_BARCODE_MID
Minimum score for mid barcodes to be detected (default: 60.0)
```
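
The `-q/--qscore` option splits output into pass/fail by mean q-score. The sketch below shows one widely used definition of a read's mean q-score: average the per-base error probabilities, then convert back to the Phred scale. Whether buttery-eel computes it exactly this way, and whether the threshold comparison is inclusive, are assumptions here; the function names are hypothetical.

```python
import math


def mean_qscore(quality: str, offset: int = 33) -> float:
    """Mean q-score of a FASTQ quality string, averaged in probability space."""
    if not quality:
        raise ValueError("quality string is empty")
    # Convert each Phred+33 character to an error probability.
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    mean_prob = sum(probs) / len(probs)
    return -10 * math.log10(mean_prob)


def passes(quality: str, threshold: float) -> bool:
    """Assumption: a read passes when its mean q-score is at least the threshold."""
    return mean_qscore(quality) >= threshold


print(round(mean_qscore("IIII"), 2))  # four bases at Q40
print(passes("+5", 9.0))              # bases at Q10 and Q20
```

Averaging probabilities rather than the q-scores themselves means a few low-quality bases pull the mean down sharply, which is why a read with Q10 and Q20 bases scores about 12.6 rather than 15.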

Set up flags needed and run (`--use_tcp` is needed but not forced in these early versions):

@@ -97,15 +153,14 @@ You must use guppy 6.3.0 or higher for mod calling

buttery-eel -g ont-guppy-6.3.8/bin --use_tcp -x "cuda:all" --config dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_fast.cfg --call_mods --port 5558 -i PAF25452_pass_bfdfd1d8_11.blow5 -o test.mod.sam


The `--config` file can be found by running `guppy_basecaller --print_workflows` and looking up the appropriate kit and flowcell type. Specify it with the `.cfg` extension, for example `--config dna_r9.4.1_450bps_fast.cfg`.


## Aligning uSAM output and getting sorted bam using -y in minimap2

samtools fastq -TMM,ML test.mod.sam | minimap2 -ax map-ont -y ref.fa - | samtools view -Sb - | samtools sort - > test.aln.mod.bam


If you also wish to keep the quality scores in the unofficial qs tags or if mapping a regular unmapped sam the -T argument can be used in conjuntion with minimap2 -y for example: `-TMM,ML,qs` or `-Tqs`
If you also wish to keep the quality scores in the unofficial qs tags or if mapping a regular unmapped sam the -T argument can be used in conjunction with minimap2 -y for example: `-TMM,ML,qs` or `-Tqs`
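
To switch between tag sets such as `MM,ML` and `MM,ML,qs` without retyping the pipeline, the command line can be assembled programmatically. This is a minimal sketch using only the tools and flags shown above; `build_alignment_pipeline` is a hypothetical helper and the file names are placeholders.

```python
import shlex


def build_alignment_pipeline(usam: str, ref: str, out_bam: str, tags: str = "MM,ML") -> str:
    """Build the samtools/minimap2 shell pipeline as a single command string."""
    stages = [
        ["samtools", "fastq", f"-T{tags}", usam],        # carry SAM tags into FASTQ headers
        ["minimap2", "-ax", "map-ont", "-y", ref, "-"],  # -y copies the header tags to the output
        ["samtools", "view", "-Sb", "-"],
        ["samtools", "sort", "-"],
    ]
    # Quote every argument so paths with spaces survive the shell.
    pipeline = " | ".join(" ".join(shlex.quote(arg) for arg in stage) for stage in stages)
    return f"{pipeline} > {shlex.quote(out_bam)}"


print(build_alignment_pipeline("test.mod.sam", "ref.fa", "test.aln.mod.bam", tags="MM,ML,qs"))
```

The resulting string can be pasted into a shell or passed to `subprocess.run(..., shell=True)`.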


# Shutting down server