kallisto bustools with reference transcriptome #45

Open
MartaBenegas opened this issue Sep 26, 2022 · 5 comments

Comments

@MartaBenegas

Dear team,

I'm a little bit confused about the build index step. The manual says that it builds a transcriptome index but needs as input a genomic fasta and a gff. I would like to create the count table using a reference transcriptome. Is this possible with kallisto + bustools?

Thank you,
Marta.

@Yenaled

Yenaled commented Sep 26, 2022

kb ref makes a reference transcriptome from a genome fasta and gtf.

If you already have a transcriptome, there's no need to use kb ref. Simply use kallisto index -i index.idx reference_transcriptome.fasta to create your index (index.idx).

@MartaBenegas
Author

Thanks for the explanation!

@MartaBenegas
Author

Dear Delaney, sorry for reopening the issue.

In order to use kb count, I also need the transcript-to-gene mapping file. What kind of file is it? Is it a tab-separated file with the transcript in one column and the gene name in another?

Moreover, is there another option to perform the counting without using this file? I would like to use a de novo assembled transcriptome so I don't have this piece of information.

Thanks!

MartaBenegas reopened this Sep 28, 2022
@Yenaled

Yenaled commented Sep 29, 2022

It's just a tab-separated file with the transcript in the first column and the gene name in the second column.

You need this file to perform the counting, but if you want, you can pretend that each transcript is its own gene (i.e. put the transcript name in both columns).
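The workaround of putting the transcript name in both columns can be sketched as below. This is only an illustration, not part of kb-python: the function names `identity_t2g` and `write_identity_t2g` are made up here, and the sketch assumes that the transcript name is the first whitespace-delimited token of each FASTA header, which should be checked against the names in your own index.

```python
def identity_t2g(fasta_lines):
    """Yield 'transcript<TAB>transcript' rows, one per FASTA record.

    Assumption: the transcript name is the first whitespace-delimited
    token after '>' in each header line.
    """
    seen = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split()
        if not parts:
            continue  # header with no name, skip it
        tid = parts[0]
        if tid in seen:
            continue  # keep only one row per transcript name
        seen.add(tid)
        # Same name in both columns: each transcript acts as its own gene.
        yield f"{tid}\t{tid}"


def write_identity_t2g(fasta_path, t2g_path):
    """Write the identity mapping for a FASTA file to a two-column TSV."""
    with open(fasta_path) as fasta, open(t2g_path, "w") as out:
        for row in identity_t2g(fasta):
            out.write(row + "\n")
```

Calling `write_identity_t2g("reference_transcriptome.fasta", "t2g.txt")` would then produce a file that can be passed as the transcript-to-gene mapping (the output file name is arbitrary).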

The main issue is that kb count will discard all multimappers (i.e. if a UMI maps to more than 1 gene, that UMI will not be counted). Thus, multimapping might be a big issue if you pretend each transcript belongs to a different gene.

There are ways around this (e.g. if you use the --tcc option in kb count, an EM algorithm will try to probabilistically figure out what to do with the multimappers). It basically boils down to: if you have a UMI associated with transcripts A, B, and C but no gene-level information, how do you want to count that UMI?
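To make the multimapper problem concrete, here is a toy model of the counting rule described above. It is not how kb count or bustools is implemented; the function `count_unique_gene_umis` and the example data are hypothetical and only mirror the rule "a UMI that maps to more than one gene is not counted".

```python
from collections import Counter


def count_unique_gene_umis(umi_to_transcripts, t2g):
    """Count UMIs per gene, discarding UMIs that map to more than one gene.

    umi_to_transcripts: dict mapping a UMI to the set of transcripts it hits.
    t2g: dict mapping each transcript name to a gene name.
    Returns (per-gene counts, number of discarded UMIs).
    """
    counts = Counter()
    discarded = 0
    for transcripts in umi_to_transcripts.values():
        genes = {t2g[t] for t in transcripts}
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
        else:
            discarded += 1  # multimapper at the gene level
    return counts, discarded


# One UMI hits transcripts A, B and C; another hits only A.
umis = {"umi1": {"A", "B", "C"}, "umi2": {"A"}}

# With real gene-level information (all three transcripts in gene G1),
# both UMIs are counted.
with_genes = count_unique_gene_umis(umis, {"A": "G1", "B": "G1", "C": "G1"})

# With each transcript treated as its own gene, umi1 becomes a
# multimapper and is dropped.
identity = count_unique_gene_umis(umis, {t: t for t in "ABC"})
```

In this toy case the identity mapping loses half of the UMIs, which is why the choice of how to handle such UMIs matters for a de novo transcriptome.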

@MartaBenegas
Author

Hi Delaney, thank you very much for your explanation!
Now I see that multimappers are really an issue. I hadn't taken this into account, so thank you for pointing that out!

Is there a way to keep multimappers instead of discarding them, for example by assigning the count to the transcript with the most reliable alignment or something similar?

To explain my context a little bit, I'm working with a non-model organism and I've obtained my own curated reference transcriptome. Now I would like to use it for single-cell analysis, so I was searching for a counting algorithm that works with a reference transcriptome. For the time being, I think I'll use your workaround to see how it behaves, and maybe perform sequence clustering on my transcriptome prior to counting. I know it's not the perfect procedure, but I'll let you know how it goes :)


2 participants