Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transcript numbers don't match between OUT.transcript_model_grouped_tpm.tsv and OUT.transcript_models.gtf #195

Closed
dudududu12138 opened this issue May 29, 2024 · 11 comments · Fixed by #216
Labels
bug Something isn't working fixed in dev Issue resolved but not released yet fixed in release Issue resolved and the fix is released, waiting for approval

Comments

@dudududu12138
Copy link

dudududu12138 commented May 29, 2024

Hi, I recently used isoquant(version 3.4.1) to detect isoforms from multiple samples. But the results are a little odd. Below are my quaetions:

  1. There are 167413 transcripts in OUT.transcript_models.gtf while 166857 in OUT.transcript_model_tpm.tsv.
  2. I didn't see the <combine.*> files. There are only <grouped> files. So what is the result file that report the expression levels of multisamples?
  3. Below is all my outputs:
aux                          OUT.gene_grouped_counts.tsv  OUT.transcript_counts.tsv          OUT.transcript_model_grouped_counts.tsv  OUT.transcript_model_tpm.tsv
OUT.corrected_reads.bed.gz   OUT.gene_grouped_tpm.tsv     OUT.transcript_grouped_counts.tsv  OUT.transcript_model_grouped_tpm.tsv     OUT.transcript_tpm.tsv
OUT.extended_annotation.gtf  OUT.gene_tpm.tsv             OUT.transcript_grouped_tpm.tsv     OUT.transcript_model_reads.tsv.gz
OUT.gene_counts.tsv          OUT.read_assignments.tsv.gz  OUT.transcript_model_counts.tsv    OUT.transcript_models.gtf
  1. Below is my codes:
samples=`ls ../minimap2/*sorted.bam`

isoquant.py -t 30 \
        --reference $ref \
        --genedb $anno --complete_genedb \
        --bam $samples \
        --data_type pacbio_ccs \
        -o $out/

Thank you very much!

@andrewprzh
Copy link
Collaborator

Dear @dudududu12138

  1. Could you send me the files? This looks unexpected.
  2. Currently, if you provide several BAM files, they will treated as distinct samples within the same experiment. Thus, a single output folder with a single transcript_models.gtf files will be created. "Per-file" count/TPM tables are stored in _grouped files.
    Combined tables are now created only if you provide 2 or more experiments via YAML dataset file.

Best
Andrey

@dudududu12138
Copy link
Author

dudududu12138 commented Jun 24, 2024

Dear @dudududu12138

  1. Could you send me the files? This looks unexpected.
  2. Currently, if you provide several BAM files, they will treated as distinct samples within the same experiment. Thus, a single output folder with a single transcript_models.gtf files will be created. "Per-file" count/TPM tables are stored in _grouped files.
    Combined tables are now created only if you provide 2 or more experiments via YAML dataset file.

Best Andrey

Sorry for the late reply. Below are the OUT.transcript_model_counts.tsv and OUT.transcript_models.gtf files. There are more transcripts in the gtf file. That is ,the transcripts in *counts.tsv are subset of transcripts in *models.gtf. It is odd.
OUT.transcript_models.gtf.gz
OUT.transcript_model_counts.tsv.gz

Moreover, I noticed that in my result, OUT.transcript_model_reads.tsv.gz and OUT.transcript_models.gtf have the same number of transcripts. While OUT.transcript_model_tpm.tsv and OUT.transcript_model_counts.tsv have the same number of transcripts and they are the subset of OUT.transcript_models.gtf

@dudududu12138
Copy link
Author

dudududu12138 commented Jun 24, 2024

By the way, I checked myself. I extracted an example of a transcript that is reported in the OUT.transcript_models.gtf file but not in the OUT.transcript_model_counts.tsv file.
Below is the transcript id and read id that support the transcript that extracted from OUT.transcript_model_reads.tsv.gz.
1719198912153
Then I extracted the these reads from OUT.read_assignments.tsv.gz and saved them into file(example.read.assignment.txt.)
It is odd that some reads that support the novel transcript are full splice match and uniquely assigned to the reference transcript. So I am very confused.
Is the OUT.transcript_model_counts.tsv just counts the reads that uniquely mapped to the transcript?

@andrewprzh
Copy link
Collaborator

@dudududu12138

Yes, this is really odd.
Could you also send me the full log file?

@andrewprzh andrewprzh added the bug Something isn't working label Jun 28, 2024
@andrewprzh
Copy link
Collaborator

@dudududu12138

Here is what I found so far. Output GTF and supporting reads are likely correct. However, expression estimation algorithm is independent from transcript discovery and for some reason assigns 0 to some of the transcripts.
Transcripts with 0 counts/TPMs are not printed to transcript_model_counts.tsv. Hence the difference.

I will now check why expression estimation algorithm assigns 0 to these transcripts despite the fact they are supported by reads.

Best
Andrey

@dudududu12138
Copy link
Author

dudududu12138 commented Jul 4, 2024

@dudududu12138

Here is what I found so far. Output GTF and supporting reads are likely correct. However, expression estimation algorithm is independent from transcript discovery and for some reason assigns 0 to some of the transcripts. Transcripts with 0 counts/TPMs are not printed to transcript_model_counts.tsv. Hence the difference.

Yes, I also found this reason.

I will now check why expression estimation algorithm assigns 0 to these transcripts despite the fact they are supported by reads.

Thanks, looking forward to your checking results!

@dudududu12138
Copy link
Author

@dudududu12138

Yes, this is really odd. Could you also send me the full log file?

This is the log file.
isoquant.log

@dudududu12138
Copy link
Author

@dudududu12138

Yes, this is really odd. Could you also send me the full log file?

@andrewprzh
Copy link
Collaborator

I think I found the problem. IsoQuant outputs a few novel isoforms that are very similar to the reference ones, and that's where read-to-isoform assignment goes wrong and the novel isoform gets 0 counts because all assignments are ambiguous. So, these similar isoforms should not be there in the first place.

I'll make a bug-fix release soon.

@andrewprzh andrewprzh added the fixed in dev Issue resolved but not released yet label Jul 14, 2024
@andrewprzh
Copy link
Collaborator

andrewprzh commented Jul 14, 2024

Will be fixed in 3.5.0.

@andrewprzh andrewprzh added the fixed in release Issue resolved and the fix is released, waiting for approval label Aug 3, 2024
@andrewprzh
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fixed in dev Issue resolved but not released yet fixed in release Issue resolved and the fix is released, waiting for approval
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants