How to get isoform counts for single cell RNA-seq data? #48

Open
biopzhang opened this issue Nov 2, 2022 · 3 comments


biopzhang commented Nov 2, 2022

Great tool that integrates lots of functions!

I was wondering if there is a way to get the isoform counts. I was trying to get the isoform counts following your Nature paper (specifically https://github.com/pachterlab/BYVSTZP_2020).

You mentioned that for the 10xv3 data, "gene-count matrices were made by using the -genecounts flag and TCC matrices were made by omitting it". It works great for the gene-count part with the following command:

$ kb count --h5ad -i index.idx -g t2g.txt -x 10xv3 -o XXX -m 64G --workflow standard --filter bustools -t 32

I got the cells x genes matrix in both mtx and h5ad formats.

My question is: how can I get a cells x transcripts matrix? Simply adding "--tcc" to the above command does not seem to work. I get a cells x tcc mtx, but not a cells x transcripts mtx. I also don't know how to apply or omit the "--genecounts" flag.

Thank you so much!
P.


Yenaled commented Nov 3, 2022

Currently, kb count only does transcript quantification for bulk/smart-seq data (where each sample or cell is in a separate FASTQ file).

For 10X type data, kb count stops at the cells x tcc mtx. However, you can run "kallisto quant-tcc" on the cells x tcc mtx to try to get transcript quantification.
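As a rough sketch of what that second step could look like, the snippet below only assembles a `kallisto quant-tcc` command line and prints it; it does not run anything. The flag names are written from memory of the quant-tcc usage text, and the file locations inside the kb count output folder (`counts_unfiltered/cells_x_tcc.mtx`, `counts_unfiltered/cells_x_tcc.ec.txt`) and the `quant_tcc` output folder are assumptions, so check both against your own kallisto version and output directory first.

```python
import shlex
from pathlib import Path


def quant_tcc_command(out_dir, index, tcc_mtx, ec_file, threads=8):
    """Build (but do not run) a `kallisto quant-tcc` command line.

    The flag names are assumptions based on the quant-tcc usage text;
    confirm them with your installed kallisto before running.
    """
    return [
        "kallisto", "quant-tcc",
        "--output-dir", str(out_dir),
        "--index", str(index),
        "--ec-file", str(ec_file),
        "--threads", str(threads),
        str(tcc_mtx),  # positional argument: the cells x TCC matrix
    ]


kb_out = Path("XXX")  # the -o directory from the kb count command above
cmd = quant_tcc_command(
    out_dir=kb_out / "quant_tcc",  # hypothetical output folder
    index="index.idx",
    tcc_mtx=kb_out / "counts_unfiltered" / "cells_x_tcc.mtx",    # assumed path
    ec_file=kb_out / "counts_unfiltered" / "cells_x_tcc.ec.txt",  # assumed path
)
print(shlex.join(cmd))
```

Printing the command instead of executing it makes it easy to review the paths before committing to a long run.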

biopzhang (author) commented

Thank you for your quick reply, Yenaled!

I was testing this on the forebrain glutamatergic neuronal lineage data in the KBtools tutorial. The kb count tcc matrix (394,494 x 6,238,208) is huge for the kallisto quant-tcc step. It has been running for 12 hours on an HPC cluster node (64 cores, ~ TB memory) and is still not finished. I should probably keep only the cells retained in other studies, such as the RNA velocity study (only about 1,800 cells are kept). Could you please comment on this?


Yenaled commented Nov 4, 2022

Oh, with such a large matrix, it's computationally intractable. You will definitely need to filter cells.
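One way to do that filtering, sketched here under the assumption that the TCC matrix is a coordinate-format Matrix Market file with cells as rows and a matching barcode list in row order: stream the file, keep only the entries whose cell barcode is in a chosen set, and renumber the rows. The function name `filter_mtx_rows` and the toy barcodes are hypothetical; this is plain Python rather than anything kb or kallisto provides.

```python
import io


def filter_mtx_rows(mtx_in, barcodes, keep, mtx_out):
    """Copy a coordinate-format Matrix Market file, keeping only the rows
    (assumed to be cells) whose barcode is in `keep`.

    mtx_in and mtx_out are text file objects. Returns the kept barcodes
    in the row order of the output matrix.
    """
    # Map old 1-based row index -> new 1-based row index for kept cells.
    new_row, kept = {}, []
    for old, barcode in enumerate(barcodes, start=1):
        if barcode in keep:
            kept.append(barcode)
            new_row[old] = len(kept)

    header, entries, n_cols = [], [], None
    for line in mtx_in:
        if not line.strip():
            continue
        if line.startswith("%"):
            header.append(line)            # banner and comment lines
        elif n_cols is None:
            n_cols = int(line.split()[1])  # size line: rows cols nnz
        else:
            row, col, val = line.split()
            if int(row) in new_row:
                entries.append(f"{new_row[int(row)]} {col} {val}\n")

    # Kept entries are buffered so the new nnz count can be written first.
    mtx_out.writelines(header)
    mtx_out.write(f"{len(kept)} {n_cols} {len(entries)}\n")
    mtx_out.writelines(entries)
    return kept


# Toy example: 3 cells x 4 equivalence classes, keep two of the cells.
demo = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 4 4\n"
    "1 1 5\n"
    "2 3 2\n"
    "3 2 7\n"
    "3 4 1\n"
)
out = io.StringIO()
kept = filter_mtx_rows(demo, ["AAA", "CCC", "GGG"], {"AAA", "GGG"}, out)
print(out.getvalue())
```

On real data, the input and output would be opened files, and the returned barcode list would be written next to the filtered matrix so the rows stay labeled. The equivalence-class columns are left untouched, so the original ec file still applies.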

The EM algorithm (which gives you transcript counts) in quant-tcc only takes a few seconds to run, but if you multiply a few seconds by hundreds of thousands of cells, well, you do the math of how long it'll take to run.
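To put numbers on that, here is a back-of-the-envelope estimate. It assumes cells are processed one after another at a constant cost, and uses 3 seconds per cell as a hypothetical stand-in for "a few seconds"; neither is a measured value.

```python
def em_runtime_hours(n_cells, seconds_per_cell):
    """Rough serial wall-clock estimate, assuming a constant per-cell EM time."""
    return n_cells * seconds_per_cell / 3600


SECONDS_PER_CELL = 3  # hypothetical stand-in for "a few seconds"

for n_cells in (394_494, 1_800):
    hours = em_runtime_hours(n_cells, SECONDS_PER_CELL)
    print(f"{n_cells:>7,} cells: ~{hours:,.1f} h (~{hours / 24:.1f} days)")
```

Under those assumptions, the unfiltered 394,494 cells come to roughly 329 hours (about two weeks), while about 1,800 filtered cells come to about 1.5 hours.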
