Different number of reads for same pod5 with different models #409

Open
KunFang93 opened this issue Mar 12, 2025 · 0 comments
@KunFang93

Hi,

I tried an iterative strategy to fine-tune the model further. I trained the models with the following scripts.
round0

#!/bin/bash

MODEL_PATH="dna_r10.4.1_e8.2_400bps_hac@v5.0.0"
REFERENCE="/mnt/data/kfang/reference/sacCer3_mm/sacCer3.mmi"
MIN_ACC=0.95

# Create necessary directories if they do not exist
#mkdir -p round0/test

# Loop over parts from 00 to 12
#for i in $(seq -w 0 12)
#do
#  # Create the directory for each part if it doesn't exist
#  mkdir -p "round0/part${i}"
#  bonito basecaller "$MODEL_PATH" \
#    "data_pod5_lig" \
#    --reference "$REFERENCE" \
#    --recursive \
#    --save-ctc \
#    --min-accuracy-save-ctc "$MIN_ACC" \
#    --read-ids "all_part${i}" \
#    > "round0/part${i}/basecalls_${i}.bam"
#done

# Create a comma-separated list of part directories for combining
#combine_list=$(seq -w 0 12 | sed 's/^/round0\/part/' | paste -sd "," -)
#python bonito_basecall_parts_combine_large.py -s "$combine_list" -o ./round0_combine

bonito train --epochs 1 --lr 5e-4 --directory ./round0_combine/ ./round0_combine/fine-tuned-model --pretrain "$MODEL_PATH"
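
Before training, it can help to confirm that the combined directory really contains the chunk files that bonito train expects. A minimal sanity check, assuming the combine script writes the standard chunks.npy, references.npy and reference_lengths.npy arrays (the contents of bonito_basecall_parts_combine_large.py are not shown here):

# hypothetical check of the combined training data; directory name taken from the script above
ls -lh round0_combine/
# expected (assumption): chunks.npy, references.npy, reference_lengths.npy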

round1

#!/bin/bash

MODEL_PATH="./round0_combine/fine-tuned-model/"
REFERENCE="/mnt/data/kfang/reference/sacCer3_mm/sacCer3.mmi"
MIN_ACC=0.95

# Create necessary directories if they do not exist
#mkdir -p round1/test

# Loop over parts from 00 to 12
#for i in $(seq -w 0 12)
#do
#  # Create the directory for each part if it doesn't exist
#  mkdir -p "round1/part${i}"
#  bonito basecaller "$MODEL_PATH" \
#    "data_pod5_lig" \
#    --reference "$REFERENCE" \
#    --recursive \
#    --save-ctc \
#    --min-accuracy-save-ctc "$MIN_ACC" \
#    --read-ids "all_part${i}" \
#    > "round1/part${i}/basecalls_${i}.bam"
#done

# Create a comma-separated list of part directories for combining
#combine_list=$(seq -w 0 12 | sed 's/^/round1\/part/' | paste -sd "," -)
#python bonito_basecall_parts_combine_large.py -s "$combine_list" -o ./round1_combine

bonito train --epochs 1 --lr 5e-4 --directory ./round1_combine/ ./round1_combine/fine-tuned-model --pretrain "$MODEL_PATH"

I then exported round0_combine/fine-tuned-model and round1_combine/fine-tuned-model with bonito export:

bonito export --output dna_bonito_model_r0 round0_combine/fine-tuned-model/
bonito export --output dna_bonito_model_r1 round1_combine/fine-tuned-model/
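
Before committing to a full run with dorado, a quick way to confirm that an exported model loads correctly is to basecall a small input first. A minimal sketch; subset.pod5 is a hypothetical small pod5 file, not one of the files used above:

# quick check that dorado accepts the exported model directory and emits reads
dorado basecaller ./dna_bonito_model_r0 subset.pod5 > quick_check.bam
samtools flagstat quick_check.bam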

When I checked the performance of the different models, I found that the total number of reads differs between models:
Official dorado sup model: 2297227 in total

dorado basecaller sup,5mC_5hmC,6mA --kit-name SQK-RBK114-96 -r --output-dir ./ --reference sacCer3.mmi --mm2-opts "-k 15 -w 10" pod5/

samtools flagstat -@ 40 ../calls_2025-02-22_T21-36-58.bam
2297227 + 0 in total (QC-passed reads + QC-failed reads)
1894467 + 0 primary
327805 + 0 secondary
74955 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1717024 + 0 mapped (74.74% : N/A)
1314264 + 0 primary mapped (69.37% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

round0 model: 1880743 in total

dorado basecaller /dna_bonito_model_r0 --kit-name SQK-RBK114-96 -r --output-dir ./ --reference sacCer3.mmi --mm2-opts "-k 15 -w 10" pod5/

samtools flagstat -@ 40 calls_2025-03-12_T15-08-54.bam
1880743 + 0 in total (QC-passed reads + QC-failed reads)
1554593 + 0 primary
266328 + 0 secondary
59822 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1589597 + 0 mapped (84.52% : N/A)
1263447 + 0 primary mapped (81.27% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

and round1 model: 1753039 in total

dorado basecaller /dna_bonito_model_r1 --kit-name SQK-RBK114-96 -r --output-dir ./ --reference sacCer3.mmi --mm2-opts "-k 15 -w 10" pod5/

samtools flagstat -@ 40 bonito_r1/calls_2025-03-12_T15-45-24.bam
1753039 + 0 in total (QC-passed reads + QC-failed reads)
1420365 + 0 primary
265031 + 0 secondary
67643 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1595665 + 0 mapped (91.02% : N/A)
1262991 + 0 primary mapped (88.92% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
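
For what it's worth, the flagstat totals above include secondary and supplementary alignments, which minimap2 emits in different numbers for each model, so counting only primary records gives a more like-for-like comparison. A minimal sketch using the BAM paths from above (the round1 path is an assumption):

# count primary records only; flag 0x900 excludes secondary (0x100) and supplementary (0x800) alignments
samtools view -c -F 0x900 ../calls_2025-02-22_T21-36-58.bam
samtools view -c -F 0x900 calls_2025-03-12_T15-08-54.bam
samtools view -c -F 0x900 bonito_r1/calls_2025-03-12_T15-45-24.bam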

I was wondering:

  1. Are these differences expected?
  2. What is the potential cause of them?
  3. How could I mitigate the reduction in the total number of reads?
  4. Does the training procedure look reasonable?

Apologies for so many questions. Thank you so much for your help!

Best,
Kun
