Different number of reads for same pod5 with different models #409

Open
KunFang93 opened this issue Mar 12, 2025 · 0 comments
@KunFang93

Hi,

I tried an iterative strategy to fine-tune the model further. I trained the models with the following scripts.
round0

#!/bin/bash

MODEL_PATH="dna_r10.4.1_e8.2_400bps_hac@v5.0.0"
REFERENCE="/mnt/data/kfang/reference/sacCer3_mm/sacCer3.mmi"
MIN_ACC=0.95

# Create necessary directories if they do not exist
#mkdir -p round0/test

# Loop over parts from 00 to 12
#for i in $(seq -w 0 12)
#do
#  # Create the directory for each part if it doesn't exist
#  mkdir -p "round0/part${i}"
#  bonito basecaller "$MODEL_PATH" \
#    "data_pod5_lig" \
#    --reference "$REFERENCE" \
#    --recursive \
#    --save-ctc \
#    --min-accuracy-save-ctc "$MIN_ACC" \
#    --read-ids "all_part${i}" \
#    > "round0/part${i}/basecalls_${i}.bam"
#done

# Create a comma-separated list of part directories for combining
#combine_list=$(seq -w 0 12 | sed 's/^/round0\/part/' | paste -sd "," -)
#python bonito_basecall_parts_combine_large.py -s "$combine_list" -o ./round0_combine

bonito train --epochs 1 --lr 5e-4 --directory ./round0_combine/ ./round0_combine/fine-tuned-model --pretrain "$MODEL_PATH"
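
Before training, it can help to confirm that the combined directory really contains the chunk files that bonito train expects. A minimal sanity check, assuming the combine script writes the standard chunks.npy, references.npy and reference_lengths.npy arrays (the contents of bonito_basecall_parts_combine_large.py are not shown here):

# hypothetical check of the combined training data; directory name taken from the script above
ls -lh round0_combine/
# expected (assumption): chunks.npy, references.npy, reference_lengths.npy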

round1

#!/bin/bash

MODEL_PATH="./round0_combine/fine-tuned-model/"
REFERENCE="/mnt/data/kfang/reference/sacCer3_mm/sacCer3.mmi"
MIN_ACC=0.95

# Create necessary directories if they do not exist
#mkdir -p round1/test

# Loop over parts from 00 to 12
#for i in $(seq -w 0 12)
#do
#  # Create the directory for each part if it doesn't exist
#  mkdir -p "round1/part${i}"
#  bonito basecaller "$MODEL_PATH" \
#    "data_pod5_lig" \
#    --reference "$REFERENCE" \
#    --recursive \
#    --save-ctc \
#    --min-accuracy-save-ctc "$MIN_ACC" \
#    --read-ids "all_part${i}" \
#    > "round1/part${i}/basecalls_${i}.bam"
#done

# Create a comma-separated list of part directories for combining
#combine_list=$(seq -w 0 12 | sed 's/^/round1\/part/' | paste -sd "," -)
#python bonito_basecall_parts_combine_large.py -s "$combine_list" -o ./round1_combine

bonito train --epochs 1 --lr 5e-4 --directory ./round1_combine/ ./round1_combine/fine-tuned-model --pretrain "$MODEL_PATH"

I then exported round0_combine/fine-tuned-model and round1_combine/fine-tuned-model with bonito export:

bonito export --output dna_bonito_model_r0 round0_combine/fine-tuned-model/
bonito export --output dna_bonito_model_r1 round1_combine/fine-tuned-model/
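
Before committing to a full run with dorado, a quick way to confirm that an exported model loads correctly is to basecall a small input first. A minimal sketch; subset.pod5 is a hypothetical small pod5 file, not one of the files used above:

# quick check that dorado accepts the exported model directory and emits reads
dorado basecaller ./dna_bonito_model_r0 subset.pod5 > quick_check.bam
samtools flagstat quick_check.bam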

When I checked the performance of the different models, I found that the total number of reads differs between models:
Official dorado sup model: 2297227 in total

dorado basecaller sup,5mC_5hmC,6mA --kit-name SQK-RBK114-96 -r --output-dir ./ --reference sacCer3.mmi --mm2-opts "-k 15 -w 10" pod5/

samtools flagstat -@ 40 ../calls_2025-02-22_T21-36-58.bam
2297227 + 0 in total (QC-passed reads + QC-failed reads)
1894467 + 0 primary
327805 + 0 secondary
74955 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1717024 + 0 mapped (74.74% : N/A)
1314264 + 0 primary mapped (69.37% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

round0 model: 1880743 in total

dorado basecaller /dna_bonito_model_r0 --kit-name SQK-RBK114-96 -r --output-dir ./ --reference sacCer3.mmi --mm2-opts "-k 15 -w 10" pod5/

samtools flagstat -@ 40 calls_2025-03-12_T15-08-54.bam
1880743 + 0 in total (QC-passed reads + QC-failed reads)
1554593 + 0 primary
266328 + 0 secondary
59822 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1589597 + 0 mapped (84.52% : N/A)
1263447 + 0 primary mapped (81.27% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

and round1 model: 1753039 in total

dorado basecaller /dna_bonito_model_r1 --kit-name SQK-RBK114-96 -r --output-dir ./ --reference sacCer3.mmi --mm2-opts "-k 15 -w 10" pod5/

samtools flagstat -@ 40 bonito_r1/calls_2025-03-12_T15-45-24.bam
1753039 + 0 in total (QC-passed reads + QC-failed reads)
1420365 + 0 primary
265031 + 0 secondary
67643 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1595665 + 0 mapped (91.02% : N/A)
1262991 + 0 primary mapped (88.92% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
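
For what it's worth, the flagstat totals above include secondary and supplementary alignments, which minimap2 emits in different numbers for each model, so counting only primary records gives a more like-for-like comparison. A minimal sketch using the BAM paths from above (the round1 path is an assumption):

# count primary records only; flag 0x900 excludes secondary (0x100) and supplementary (0x800) alignments
samtools view -c -F 0x900 ../calls_2025-02-22_T21-36-58.bam
samtools view -c -F 0x900 calls_2025-03-12_T15-08-54.bam
samtools view -c -F 0x900 bonito_r1/calls_2025-03-12_T15-45-24.bam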

I was wondering:

  1. Are these differences expected?
  2. What is the potential cause of them?
  3. How could I mitigate the reduction in the total number of reads?
  4. Does the training procedure look reasonable?

Apologies for so many questions. Thank you so much for your help!

Best,
Kun
