Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why are novel transcripts with most reads assigned not reported? #127

Closed
SarahNadeau opened this issue Dec 5, 2023 · 8 comments
Closed
Labels
weird results Something looks odd in the resulting files

Comments

@SarahNadeau
Copy link

SarahNadeau commented Dec 5, 2023

Hello,

I am a bit confused by my isoquant results. I analyzed three BAM files with combined 109932 mappings (109931 primary, 1 supplementary). The log shows that all 109931 reads were classified as intergenic and have polyAs (this is what I expect, we added polyAs to the alignment).

Questions

  1. I was suprised there are only 107234 reads in transcript_model_reads.tsv. Why would some reads not be in this output?

  2. I was also surprised that a pivot table of OUT.transcript_model_reads.tsv shows reads are assigned to many different transcripts, but only one is reported in OUT.transcript_model_grouped_counts.tsv. The reported transcript this is not even the one with the most read support (see screenshots below).

Supporting files

isoquant_transcript_model_reads_counts.zip

Pivot table of the 107234 reads in transcript_model_reads.tsv:
Screenshot 2023-12-14 at 10 33 15

OUT.transcript_model_grouped_counts.tsv:
Screenshot 2023-12-14 at 10 34 09

Command line (run as part of a snakemake workflow):

isoquant.py \
    --reference {input.reference} \
    --genedb {input.annotations} \
    --data_type nanopore \
    --transcript_quantification unique_only \
    --model_construction_strategy all \
    --stranded forward \
    --count_exons \
    --bam {input.bam} \
    --data_type nanopore \
    --threads {threads} \
    -o $WORKDIR
@andrewprzh
Copy link
Collaborator

Dear @SarahNadeau

I was suprised there are only 107234 reads in transcript_model_reads.tsv. Why would some reads not be in this output?

In some cases there is not enough support for the transcript to be reported as reliable. E.g. if a novel transcript is supported only by a few, it will be discarded, and these reads will not be used.

I was also surprised that a pivot table of OUT.transcript_model_reads.tsv shows reads are assigned to many different transcripts, but only one is reported in OUT.transcript_model_grouped_counts.tsv. The reported transcript this is not even the one with the most read support (see screenshots below).

This is indeed surprising, could you share non-grouped counts?

Best
Andrey

@andrewprzh andrewprzh added the weird results Something looks odd in the resulting files label Dec 13, 2023
@SarahNadeau
Copy link
Author

Hi @andrewprzh,

Thanks for the reply! The non-grouped counts also only include the one reported transcript. In case it helps, I've also included a copy of the log file.

isoquant_log_transcript_model_counts.zip

@SarahNadeau
Copy link
Author

Just fyi I redacted some of the paths etc. in the log file, I hope that's not a problem.

@SarahNadeau
Copy link
Author

Dear @andrewprzh,

I think I found the reason for these weird results. I played around with the source code, commenting out the lines that filter out similar and not-well-supported novel transcripts and uncommenting the lines that log why they would be filtered out.

I can now see that there is an unreported transcript (number 9) that is discarded for being very similar to the transcript with the highest read support. For this reason, the unique reads are quite low for the transcript with the highest read support and thus the count values are also low compared to the transcript with the third highest number of reads supporting it.

Based on these observations, would it make sense to re-calculate the unique reads assigned to a transcript after filtering out similar and not-well-supported transcripts? That way the counts are representative of the read support.

Attached are some supporting materials so you can see what I mean. I apologize that I don't have a small reproducible example and that also these transcript and read/count numbers differ from my original example because I started using only a subset of the reads in order to iterate faster.

Command line:

./IsoQuant/isoquant.py \
   --bam results/align_reads/OUT/aux/OUT_barcode_01.filtered.bam \
         results/align_reads/OUT/aux/OUT_barcode_02.filtered.bam \
         results/align_reads/OUT/aux/OUT_barcode_03.filtered.bam \
   -r data/references/ref.fa.gz \
   -g data/references/ref_coords_corrected.gtf \
   -d nanopore \
   --transcript_quantification unique_only \
   --debug \
   -o results/isoquant

Selected lines from isoquant.log:

2023-12-21 14:58:38,116 - DEBUG - Novel model transcript9.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic
2023-12-21 14:58:38,119 - DEBUG - Novel model transcript11.NC_000072.7.nnic has coverage 480 < 534.46, component cov = 26723
2023-12-21 14:58:38,144 - DEBUG - Novel model transcript25.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic
2023-12-21 14:58:38,151 - DEBUG - Novel model transcript38.NC_000072.7.nnic has a similar isoform transcript23.NC_000072.7.nnic
2023-12-21 14:58:38,153 - DEBUG - Novel model transcript44.NC_000072.7.nnic has coverage 9 < 534.46, component cov = 26723
2023-12-21 14:58:38,155 - DEBUG - Novel model transcript49.NC_000072.7.nnic has coverage 5 < 534.46, component cov = 26723

Screenshot 2023-12-21 at 15 15 04

transcript_model_counts_and_reads.zip

@andrewprzh
Copy link
Collaborator

Dear @SarahNadeau

Wow, thanks a lot for sharing those insights! It's really cool that you wend down to looks at the code, especially taking into account that transcript discovery part is not that well structured.

I totally agree about recalculating the counts! Could you provide the exact lines you were modifying? I probably have a clue which ones are you talking about, but there a few filtering steps so I'd like to be sure :)

Best
Andrey

@SarahNadeau
Copy link
Author

Hi @andrewprzh,

The changes are in the filter_transcripts function in src/graph_based_model_construction.py: https://github.com/SarahNadeau/IsoQuant/blob/master/src/graph_based_model_construction.py#L201.

Here is the git diff:

diff --git a/src/graph_based_model_construction.py b/src/graph_based_model_construction.py
index 8b7b382..0b9f3ec 100644
--- a/src/graph_based_model_construction.py
+++ b/src/graph_based_model_construction.py
@@ -220,24 +220,24 @@ class GraphBasedModelConstructor:
                                            self.params.min_novel_count_rel * component_coverage)
 
             if model.transcript_id in to_substitute:
-                #logger.debug("Novel model %s has a similar isoform %s" % (model.transcript_id, to_substitute[model.transcript_id]))
+                logger.debug("Novel model %s has a similar isoform %s" % (model.transcript_id, to_substitute[model.transcript_id]))
                 self.transcript_read_ids[to_substitute[model.transcript_id]] += self.transcript_read_ids[model.transcript_id]
-                del self.transcript_read_ids[model.transcript_id]
+                # del self.transcript_read_ids[model.transcript_id]
                 continue
 
             if self.internal_counter[model.transcript_id] < novel_isoform_cutoff:
-                #logger.debug("Novel model %s has coverage %d < %.2f, component cov = %d" % (model.transcript_id,
-                #                                                        self.internal_counter[model.transcript_id],
-                #                                                        novel_isoform_cutoff, component_coverage))
-                del self.transcript_read_ids[model.transcript_id]
+                logger.debug("Novel model %s has coverage %d < %.2f, component cov = %d" % (model.transcript_id,
+                                                                       self.internal_counter[model.transcript_id],
+                                                                       novel_isoform_cutoff, component_coverage))
+                # del self.transcript_read_ids[model.transcript_id]
                 continue
 
             if len(model.exon_blocks) <= 2:
                 mapq = self.mapping_quality(model)
-                #logger.debug("Novel model %s has quality %.2f" % (model.transcript_id, mapq))
+                logger.debug("Novel model %s has quality %.2f" % (model.transcript_id, mapq))
                 if mapq < self.params.simple_models_mapq_cutoff:
-                    #logger.debug("Novel model %s has poor quality" % model.transcript_id)
-                    del self.transcript_read_ids[model.transcript_id]
+                    logger.debug("Novel model %s has poor quality" % model.transcript_id)
+                    # del self.transcript_read_ids[model.transcript_id]
                     continue
 
             # TODO: correct ends for known

@andrewprzh
Copy link
Collaborator

Dear @SarahNadeau

Thanks a lot! Will get my hands on it!

Best
Andrey

@andrewprzh
Copy link
Collaborator

Should be now fixed in IsoQuant 3.4.
Read assignment is completely reworked now. Thank you for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
weird results Something looks odd in the resulting files
Projects
None yet
Development

No branches or pull requests

2 participants