
Inconsistency in read numbers between Bonito input and output #407

Open
KunFang93 opened this issue Mar 11, 2025 · 2 comments

KunFang93 commented Mar 11, 2025

Dear bonito team,

Thank you for providing this powerful tool! I’m a new user and recently encountered an issue regarding inconsistent read numbers between my input and output after basecalling with Bonito.

Initially, I extracted read IDs from POD5 files using:

pod5 view -r -I -o ./all_reads.txt -t 40 data_pod5_all

This resulted in all_reads.txt containing 1,837,954 reads.

Due to limited RAM on our workstation, I performed chunk-wise basecalling by splitting the read IDs:

split -d -l 150000  all_reads.txt all_part

 wc -l all_part*
  149999 all_part00
  150000 all_part01
  150000 all_part02
  150000 all_part03
  150000 all_part04
  150000 all_part05
  150000 all_part06
  150000 all_part07
  150000 all_part08
  150000 all_part09
  150000 all_part10
  150000 all_part11
   37955 all_part12
 1837954 total

and then I ran the following Bonito basecalling command:

MODEL_PATH="./round0_combine/fine-tuned-model/"
REFERENCE="/mnt/data/kfang/reference/sacCer3_mm/sacCer3.mmi"
MIN_ACC=0.95

# Create necessary directories if they do not exist
mkdir -p round1/test

# Loop over parts from 00 to 12
for i in $(seq -w 0 12)
do
  # Create the directory for each part if it doesn't exist
  mkdir -p "round1/part${i}"
  bonito basecaller "$MODEL_PATH" \
    "data_pod5_lig" \
    --reference "$REFERENCE" \
    --recursive \
    --save-ctc \
    --min-accuracy-save-ctc "$MIN_ACC" \
    --read-ids "all_part${i}" \
    > "round1/part${i}/basecalls_${i}.bam"
done

However, when examining the outputs (references.npy), I found discrepancies in the total number of processed reads:

import numpy as np

for i in ['00','01','02','03','04','05','06','07','08','09','10','11','12']:
    print(i, len(np.load(f'part{i}/references.npy')))

00 152404
01 154351
02 149025
03 159176
04 153722
05 165322
06 141393
07 93054
08 154408
09 144537
10 135260
11 152000
12 17400

The total number of reads after basecalling is 1,772,052, differing significantly from the original 1,837,954 reads.
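For reference, here is a minimal sketch of how the read IDs missing from a part's output could be identified by comparing the input ID list against the read_id column of the corresponding *_summary.tsv (stand-in data below; the real inputs would be all_part00 and round1/part00's summary file):

```python
import io

# Stand-ins for all_part00 and the part's *_summary.tsv
part_file = io.StringIO("read-a\nread-b\nread-c\n")
summary = io.StringIO("read_id\talignment_accuracy\nread-a\t0.97\nread-c\t0.0\n")

# Read IDs we asked Bonito to basecall
wanted = {line.strip() for line in part_file if line.strip()}

# Read IDs that actually appear in the output summary
header = summary.readline().rstrip("\n").split("\t")
col = header.index("read_id")
seen = {line.rstrip("\n").split("\t")[col] for line in summary}

missing = wanted - seen
print(sorted(missing))  # ['read-b']
```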

The training step shows even fewer reads: 1,718,890.

bonito_train$ bash bonito_round1.sh
[loading model]
[using pretrained model ./round0_combine/fine-tuned-model/]
[loading data]
[validation set not found: splitting training set]
[826176/1718890]:  48%|##########################4                            | [39:34, loss=0.0645]^

Could you please help me understand the potential reasons for this inconsistency?

Thank you for your assistance!

Best regards,
Kun

davidnewman02 (Collaborator) commented

When you prepare data with --save-ctc you are extracting fixed-length "chunks" of signal from variable-length total signals only for the parts which align to the reference sequence.

There are two things that cause the difference in the number of input reads compared to output chunks:

  1. Some reads are filtered out because they do not align, have low accuracy, or yield alignments shorter than the chunk length
    Although you have 1.84M reads in your input data, not every read will align to the reference (check your *_summary.tsv file to see which), and reads which don't align are excluded. There is also an accuracy filter (--min-accuracy-save-ctc) which only keeps reads with high alignment accuracy. Finally, if the section of signal which aligns to the reference is shorter than the chunk length, it is excluded. The full code is here: https://github.com/nanoporetech/bonito/blob/master/bonito/io.py#L550

  2. Some reads will have >1 chunk per read
    If the section of read which aligns to the reference is long enough it is possible to extract >1 chunk from a single read.
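The two effects above can be sketched as follows (hypothetical helper name and chunk length, not Bonito's actual implementation; only the --min-accuracy-save-ctc value comes from the thread):

```python
# Sketch of the filtering/chunking described above. CHUNK_LEN and
# chunks_for_read are illustrative assumptions, not Bonito internals.
CHUNK_LEN = 3600          # fixed chunk length in signal samples (assumed)
MIN_ACCURACY = 0.95       # --min-accuracy-save-ctc

def chunks_for_read(aligned_signal_len, accuracy):
    """How many training chunks a single read contributes."""
    if accuracy < MIN_ACCURACY:          # filtered: low accuracy / unaligned
        return 0
    if aligned_signal_len < CHUNK_LEN:   # filtered: aligned section too short
        return 0
    return aligned_signal_len // CHUNK_LEN  # long reads yield >1 chunk

# Unaligned, short-aligned, single-chunk, and multi-chunk reads:
reads = [(0, 0.0), (2000, 0.99), (4000, 0.99), (11000, 0.97)]
total = sum(chunks_for_read(n, a) for n, a in reads)
print(total)  # 0 + 0 + 1 + 3 = 4
```

This is why the chunk count per part can land either below or above the 150,000 input reads: filtering pushes it down, multi-chunk reads push it up.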

KunFang93 (Author) commented

Thank you so much for the reply!

Please correct me if I've misunderstood: the variation in the size of chunks.npy results from either
a. reads being filtered out, or
b. reads being long enough to split into multiple chunks.

Based on this, I have some follow-up questions:

  1. I have set '--min-accuracy-save-ctc 0.95', but I noticed there are many reads with alignment_accuracy below 0.95 (and a lot of them are 0). Does '--min-accuracy-save-ctc 0.95' control the alignment_accuracy column in the summary.tsv file?
  2. Is it possible to skip basecalling of short reads with a certain cutoff, e.g., discard reads shorter than 500 bp?
  3. What causes reads like the following one?
filename                              FBA81105_4a3cebc4_3a9bd518_0.pod5
read_id                            2079bd15-37fb-4bbf-b7de-6b48b848c77c
run_id                         3a9bd5180507d78ad25a325cdb3cbb42e73439b0
channel                                                             256
mux                                                                   4
start_time                                                      106.694
duration                                                         1.0098
template_start                                                  106.696
template_duration                                                1.0078
sequence_length_template                                              1
mean_qscore_template                                                1.0
alignment_genome                                                      *
alignment_genome_start                                               -1
alignment_genome_end                                                 -1
alignment_strand_start                                               -1
alignment_strand_end                                                 -1
alignment_direction                                                   *
alignment_length                                                      0
alignment_num_aligned                                                 0
alignment_num_correct                                                 0
alignment_num_insertions                                              0
alignment_num_deletions                                               0
alignment_num_substitutions                                           0
alignment_mapq                                                        0
alignment_strand_coverage                                           0.0
alignment_identity                                                  0.0
alignment_accuracy                                                  0.0
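To quantify how many reads look like this, the summary.tsv can be tallied directly. A sketch using stand-in rows (column names taken from the record above; real input would be the actual *_summary.tsv file):

```python
import csv
import io

# Tiny stand-in for a Bonito *_summary.tsv (real files have more columns)
tsv = io.StringIO(
    "read_id\tsequence_length_template\talignment_genome\talignment_accuracy\n"
    "r1\t1\t*\t0.0\n"        # unaligned 1-base read, like the one above
    "r2\t5200\tchrI\t0.91\n" # aligned but below the 0.95 cutoff
    "r3\t8400\tchrI\t0.97\n" # aligned and above the cutoff
)
rows = list(csv.DictReader(tsv, delimiter="\t"))

# Reads that never aligned ('*' in alignment_genome)
unaligned = sum(r["alignment_genome"] == "*" for r in rows)

# Aligned reads below the --min-accuracy-save-ctc cutoff
low_acc = sum(r["alignment_genome"] != "*"
              and float(r["alignment_accuracy"]) < 0.95 for r in rows)

# Very short basecalls, like the 1-base read shown above
short = sum(int(r["sequence_length_template"]) < 500 for r in rows)

print(unaligned, low_acc, short)  # 1 1 1
```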

Thanks again!

Best,
Kun
