
Inconsistency in read numbers between Bonito input and output #407

Open
KunFang93 opened this issue Mar 11, 2025 · 2 comments

KunFang93 commented Mar 11, 2025

Dear bonito team,

Thank you for providing this powerful tool! I’m a new user and recently encountered an issue regarding inconsistent read numbers between my input and output after basecalling with Bonito.

Initially, I extracted read IDs from POD5 files using:

pod5 view -r -I -o ./all_reads.txt -t 40 data_pod5_all

This resulted in all_reads.txt containing 1,837,954 reads.

Due to limited RAM on our workstation, I performed chunk-wise basecalling by splitting the read IDs:

split -d -l 150000  all_reads.txt all_part

 wc -l all_part*
  149999 all_part00
  150000 all_part01
  150000 all_part02
  150000 all_part03
  150000 all_part04
  150000 all_part05
  150000 all_part06
  150000 all_part07
  150000 all_part08
  150000 all_part09
  150000 all_part10
  150000 all_part11
   37955 all_part12
 1837954 total

and then I ran the following Bonito basecalling command:

MODEL_PATH="./round0_combine/fine-tuned-model/"
REFERENCE="/mnt/data/kfang/reference/sacCer3_mm/sacCer3.mmi"
MIN_ACC=0.95

# Create necessary directories if they do not exist
mkdir -p round1/test

# Loop over parts from 00 to 12
for i in $(seq -w 0 12)
do
  # Create the directory for each part if it doesn't exist
  mkdir -p "round1/part${i}"
  bonito basecaller "$MODEL_PATH" \
    "data_pod5_lig" \
    --reference "$REFERENCE" \
    --recursive \
    --save-ctc \
    --min-accuracy-save-ctc "$MIN_ACC" \
    --read-ids "all_part${i}" \
    > "round1/part${i}/basecalls_${i}.bam"
done

However, when examining the outputs (references.npy), I found discrepancies in the total number of processed reads:

import numpy as np

for i in ['00','01','02','03','04','05','06','07','08','09','10','11','12']:
    print(i, len(np.load(f'part{i}/references.npy')))

00 152404
01 154351
02 149025
03 159176
04 153722
05 165322
06 141393
07 93054
08 154408
09 144537
10 135260
11 152000
12 17400

The total number of reads after basecalling is 1,772,052, differing significantly from the original 1,837,954 reads.
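For reference, here is a minimal sketch of how the read IDs missing from a part's output could be identified by comparing the input ID list against the read_id column of the corresponding *_summary.tsv (stand-in data below; the real inputs would be all_part00 and round1/part00's summary file):

```python
import io

# Stand-ins for all_part00 and the part's *_summary.tsv
part_file = io.StringIO("read-a\nread-b\nread-c\n")
summary = io.StringIO("read_id\talignment_accuracy\nread-a\t0.97\nread-c\t0.0\n")

# Read IDs we asked Bonito to basecall
wanted = {line.strip() for line in part_file if line.strip()}

# Read IDs that actually appear in the output summary
header = summary.readline().rstrip("\n").split("\t")
col = header.index("read_id")
seen = {line.rstrip("\n").split("\t")[col] for line in summary}

missing = wanted - seen
print(sorted(missing))  # ['read-b']
```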

The training step shows even fewer reads: 1,718,890.

bonito_train$ bash bonito_round1.sh
[loading model]
[using pretrained model ./round0_combine/fine-tuned-model/]
[loading data]
[validation set not found: splitting training set]
[826176/1718890]:  48%|##########################4                            | [39:34, loss=0.0645]^

Could you please help me understand the potential reasons for this inconsistency?

Thank you for your assistance!

Best regards,
Kun

davidnewman02 (Collaborator) commented

When you prepare data with --save-ctc you are extracting fixed-length "chunks" of signal from variable-length total signals only for the parts which align to the reference sequence.

There are two things that cause the difference in the number of input reads compared to output chunks:

  1. Some reads are filtered out because they do not align, have low accuracy, or yield alignments shorter than the chunk length
    Although you have 1.84M reads in your input data, not every read will align to the reference (check your *_summary.tsv file to see which), and reads which don't align are excluded. There is also an accuracy filter (--min-accuracy-save-ctc) which only keeps reads with high alignment accuracy. Finally, if the section of signal which aligns to the reference is shorter than the chunk length, it is excluded. The full code is here: https://github.com/nanoporetech/bonito/blob/master/bonito/io.py#L550

  2. Some reads will have >1 chunk per read
    If the section of read which aligns to the reference is long enough it is possible to extract >1 chunk from a single read.
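The two effects above can be sketched as follows (hypothetical helper name and chunk length, not Bonito's actual implementation; only the --min-accuracy-save-ctc value comes from the thread):

```python
# Sketch of the filtering/chunking described above. CHUNK_LEN and
# chunks_for_read are illustrative assumptions, not Bonito internals.
CHUNK_LEN = 3600          # fixed chunk length in signal samples (assumed)
MIN_ACCURACY = 0.95       # --min-accuracy-save-ctc

def chunks_for_read(aligned_signal_len, accuracy):
    """How many training chunks a single read contributes."""
    if accuracy < MIN_ACCURACY:          # filtered: low accuracy / unaligned
        return 0
    if aligned_signal_len < CHUNK_LEN:   # filtered: aligned section too short
        return 0
    return aligned_signal_len // CHUNK_LEN  # long reads yield >1 chunk

# Unaligned, short-aligned, single-chunk, and multi-chunk reads:
reads = [(0, 0.0), (2000, 0.99), (4000, 0.99), (11000, 0.97)]
total = sum(chunks_for_read(n, a) for n, a in reads)
print(total)  # 0 + 0 + 1 + 3 = 4
```

This is why the chunk count per part can land either below or above the 150,000 input reads: filtering pushes it down, multi-chunk reads push it up.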

KunFang93 (Author) commented

Thank you so much for the reply!

Please correct me if I've misunderstood: the variation in the size of chunks.npy results from either
a. reads being filtered out, or
b. reads being long enough to split into multiple chunks.

Based on this, I have some follow-up questions:

  1. I have set '--min-accuracy-save-ctc 0.95', but I noticed there are many reads with alignment_accuracy below 0.95 (and a lot of them are 0). Does '--min-accuracy-save-ctc 0.95' control the alignment_accuracy column in the summary.tsv file?
  2. Is it possible to skip basecalling of short reads with a certain cutoff, e.g., discard reads shorter than 500 bp?
  3. What causes reads like the following one?
filename                              FBA81105_4a3cebc4_3a9bd518_0.pod5
read_id                            2079bd15-37fb-4bbf-b7de-6b48b848c77c
run_id                         3a9bd5180507d78ad25a325cdb3cbb42e73439b0
channel                                                             256
mux                                                                   4
start_time                                                      106.694
duration                                                         1.0098
template_start                                                  106.696
template_duration                                                1.0078
sequence_length_template                                              1
mean_qscore_template                                                1.0
alignment_genome                                                      *
alignment_genome_start                                               -1
alignment_genome_end                                                 -1
alignment_strand_start                                               -1
alignment_strand_end                                                 -1
alignment_direction                                                   *
alignment_length                                                      0
alignment_num_aligned                                                 0
alignment_num_correct                                                 0
alignment_num_insertions                                              0
alignment_num_deletions                                               0
alignment_num_substitutions                                           0
alignment_mapq                                                        0
alignment_strand_coverage                                           0.0
alignment_identity                                                  0.0
alignment_accuracy                                                  0.0
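To quantify how many reads look like this, the summary.tsv can be tallied directly. A sketch using stand-in rows (column names taken from the record above; real input would be the actual *_summary.tsv file):

```python
import csv
import io

# Tiny stand-in for a Bonito *_summary.tsv (real files have more columns)
tsv = io.StringIO(
    "read_id\tsequence_length_template\talignment_genome\talignment_accuracy\n"
    "r1\t1\t*\t0.0\n"        # unaligned 1-base read, like the one above
    "r2\t5200\tchrI\t0.91\n" # aligned but below the 0.95 cutoff
    "r3\t8400\tchrI\t0.97\n" # aligned and above the cutoff
)
rows = list(csv.DictReader(tsv, delimiter="\t"))

# Reads that never aligned ('*' in alignment_genome)
unaligned = sum(r["alignment_genome"] == "*" for r in rows)

# Aligned reads below the --min-accuracy-save-ctc cutoff
low_acc = sum(r["alignment_genome"] != "*"
              and float(r["alignment_accuracy"]) < 0.95 for r in rows)

# Very short basecalls, like the 1-base read shown above
short = sum(int(r["sequence_length_template"]) < 500 for r in rows)

print(unaligned, low_acc, short)  # 1 1 1
```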

Thanks again!

Best,
Kun
