Processing TAR-scRNA-seq outputs #8

Open
dkeitley opened this issue Jun 10, 2021 · 4 comments

@dkeitley

Hi Michael,

Sorry for all the questions. This might be a bit naive.

I was just wondering how you processed the matrices output by the pipeline to construct a counts matrix. Did you use standard functions to aggregate the matrices across the different directories (e.g. something like read10xCounts from DropletUtils)?

@fw262
Owner

fw262 commented Jun 10, 2021

Hi Dan,

More questions are never a problem!

The pipeline generates count matrices in text form (from_fastq branch) and in 10X output form (from_cellranger branch) for each individual sample. The count matrices are not combined across samples. If you want to combine count matrices across samples, you would need to do that in Seurat or Scanpy through different integration techniques such as Harmony, Scanorama, SCTransform, etc. Integration is currently NOT available in the TAR-scRNA-seq workflow.

Hope that helps!
Michael

@dkeitley
Author

OK, that makes sense. I thought there might be a standard way to read in lots of .txt.gz files and aggregate them, but maybe just calling fread and cbind in a loop will work fine.
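A minimal sketch of such a read loop in base R, assuming each sample directory holds one tab-separated `*_TAR_expression_matrix_withDir.txt.gz` file with a `GENE` column followed by one column per cell barcode. The toy files, the sample names and the `read_tar_matrix` helper are hypothetical and only make the example self-contained; with data.table installed, `fread` could take the place of `read.delim`.

```r
# Toy stand-ins for two per-sample pipeline outputs (hypothetical data).
out_dir <- tempfile("tar_demo_")
dir.create(out_dir)
toy <- list(
  sampleA = data.frame(GENE = c("tar1", "tar2"), AAAC = c(0, 2), AAAG = c(1, 0)),
  sampleB = data.frame(GENE = c("tar1", "tar3"), CCCA = c(3, 0), CCCT = c(0, 4))
)
for (s in names(toy)) {
  dir.create(file.path(out_dir, s))
  con <- gzfile(file.path(out_dir, s,
                          paste0(s, "_TAR_expression_matrix_withDir.txt.gz")), "w")
  write.table(toy[[s]], con, sep = "\t", quote = FALSE, row.names = FALSE)
  close(con)
}

# Hypothetical helper: read one gzipped matrix, leaving barcode names untouched.
read_tar_matrix <- function(path) {
  read.delim(gzfile(path), check.names = FALSE, stringsAsFactors = FALSE)
}

# Find every per-sample matrix below out_dir and read each into a list.
files <- list.files(out_dir,
                    pattern = "_TAR_expression_matrix_withDir\\.txt\\.gz$",
                    recursive = TRUE, full.names = TRUE)
mats <- lapply(files, read_tar_matrix)
names(mats) <- basename(dirname(files))  # name each matrix after its sample directory

sapply(mats, dim)  # per sample: number of features, then GENE column plus cells
```

Keeping the matrices in a named list, rather than binding them inside the loop, leaves room to compare their feature sets before deciding how to combine them.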

Now that I think about it, though, I've also realised that the features in the TAR count matrices aren't the same across my different samples.

e.g.

> mat1 <- fread("SIGAA12_S45_L001_TAR_expression_matrix_withDir.txt.gz")
> mat2 <- fread("../SIGAB12_S49_L001/SIGAB12_S49_L001_TAR_expression_matrix_withDir.txt.gz")

> dim(mat1)
[1] 41553 15001

> dim(mat2)
[1] 88374 15001

> mat1[1:5,1:3]
                                       GENE TCCACGTGTTGACGGA CACTGTCCACACCGCA
1: 10_10019099_10066299_-_28895_C7orf31_-_1                0                0
2:            10_10021149_10060449_+_6602_0                0                0
3:            10_10074599_10079749_+_1017_0                0                0
4:             10_10106549_10110499_+_505_0                0                0
5:            10_10117999_10129149_-_1177_0                0                0

> mat2[1:5,1:3]
                                       GENE CACGTGGAGCCGATCC TCCACCATCGACGCGT
1: 10_10019099_10066299_-_28895_C7orf31_-_1                0                0
2:            10_10021149_10060449_+_6602_0                0                0
3:            10_10074599_10079749_+_1017_0                0                0
4:             10_10106399_10110949_-_567_0                0                0
5:            10_10117999_10129149_-_1177_0                0                0


Maybe I've misunderstood or run the pipeline incorrectly, but I was expecting the features to be consistent across the different samples (ignoring, perhaps, the coverage values in the feature names, which differ). That way I could combine the count matrices and then, as you say, integrate them to get a multi-sample dataset that includes TAR features.

Am I getting confused? In the chicken dataset for example, is it possible to combine the day 4 and day 7 samples with TAR features?
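One way to quantify the mismatch is to compare the feature names with set operations. The sketch below uses only the five feature names printed above for each matrix; on the real data, `genes1` and `genes2` would presumably be `mat1$GENE` and `mat2$GENE` (an assumption about how the tables are held in memory).

```r
# Feature names copied from the first five rows of mat1 and mat2 shown above.
genes1 <- c("10_10019099_10066299_-_28895_C7orf31_-_1",
            "10_10021149_10060449_+_6602_0",
            "10_10074599_10079749_+_1017_0",
            "10_10106549_10110499_+_505_0",
            "10_10117999_10129149_-_1177_0")
genes2 <- c("10_10019099_10066299_-_28895_C7orf31_-_1",
            "10_10021149_10060449_+_6602_0",
            "10_10074599_10079749_+_1017_0",
            "10_10106399_10110949_-_567_0",
            "10_10117999_10129149_-_1177_0")

shared    <- intersect(genes1, genes2)  # features present in both samples
only_mat1 <- setdiff(genes1, genes2)    # features seen only in mat1
only_mat2 <- setdiff(genes2, genes1)    # features seen only in mat2

length(shared)  # 4
only_mat1       # "10_10106549_10110499_+_505_0"
only_mat2       # "10_10106399_10110949_-_567_0"
```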

@fw262
Owner

fw262 commented Jun 21, 2021

Hi Dan,

Sorry for getting back to you so late.

Please note that there is in fact a common set of TAR features across all samples, but each sample will not have expression in all of the features. In your example, mat1 and mat2 have many features in common (i.e. rows 1, 2, 3 and 5), but mat1 has expression in the feature named "10_10106549_10110499_+_505_0" while mat2 has expression in the feature named "10_10106399_10110949_-_567_0". You can simply merge (with rbind, not just combine) your mat1 and mat2 dataframes if you want to merge your samples.

This can occur in scRNA-seq with standard annotations as well where some samples have expression of genes unique to that particular sample.

I hope this clears up the confusion.

Best,
Michael
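A rough sketch of one way to carry out such a merge, shown on toy data. It is an illustration under stated assumptions, not the pipeline's own method: because the two tables have different cell barcodes as columns and partly different features as rows, it uses base R's `merge()` as a full outer join on the `GENE` column and treats a feature missing from a sample as zero counts. The toy values, the feature names `tarA` to `tarD` and the `s1_`/`s2_` barcode prefixes are all hypothetical.

```r
# Toy stand-ins for two per-sample TAR matrices (hypothetical values).
mat1 <- data.frame(GENE = c("tarA", "tarB", "tarC"),
                   TCCA = c(0, 1, 2), CACT = c(3, 0, 0))
mat2 <- data.frame(GENE = c("tarA", "tarB", "tarD"),
                   CACG = c(5, 0, 1), TCCC = c(0, 0, 7))

# Prefix barcodes with a sample label so identical barcodes in
# different samples cannot collide after merging.
names(mat1)[-1] <- paste0("s1_", names(mat1)[-1])
names(mat2)[-1] <- paste0("s2_", names(mat2)[-1])

# Full outer join on the feature name: keeps the union of features.
merged <- merge(mat1, mat2, by = "GENE", all = TRUE)

# Convert to a features x cells matrix; a feature absent from a sample
# comes back as NA and is treated here as zero counts.
counts <- as.matrix(merged[, -1])
rownames(counts) <- merged$GENE
counts[is.na(counts)] <- 0

dim(counts)  # 4 features x 4 cells
```

With matrices of the size shown in this thread, a dense join like this may use a lot of memory, so a sparse representation could be worth considering before integration.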
