Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pseudobulk_by apparently dropping columns #23

Open
malcook opened this issue May 23, 2021 · 1 comment
Open

pseudobulk_by apparently dropping columns #23

malcook opened this issue May 23, 2021 · 1 comment

Comments

@malcook
Copy link

malcook commented May 23, 2021

Hi, I am successfully using your package with scRNA-Seq data, but finding a possible bug when trying out the pseudobulk_by option. Below I show it working fine without pseudobulk_by and then failing when I add it in:

s2<-as.SingleCellExperiment(s)
design<- model.matrix(~ 0 + experimentGroup,s2@colData)
glmGamPoi.fit <- glm_gp(s2
                       ,design = design
                       ,size_factors = "deconvolution"
                       ,on_disk = FALSE
                        )

This worked fine. My grouping factor and design matrix looks like this:

 unique(s2@colData$experimentGroup)
 [1] LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 Levels: LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600

 head(design)
                        LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 homeo_AAACCTGAGAAGCCCA        1        0        0        0        0        0        0
 homeo_AAACCTGAGGAGCGTT        1        0        0        0        0        0        0
 homeo_AAACCTGCAACTGCGC        1        0        0        0        0        0        0
 homeo_AAACCTGCAAGTAGTA        1        0        0        0        0        0        0
 homeo_AAACCTGCATATGAGA        1        0        0        0        0        0        0
 homeo_AAACCTGGTGCAACTT        1        0        0        0        0        0        0

Testing for DGE with the following contrast works fine:

contrast=c('LL_N_000-LL_H_000', 'LL_N_030-LL_H_000', 'LL_N_060-LL_H_000', 'LL_N_180-LL_H_000', 'LL_N_300-LL_H_000','LL_N_600-LL_H_000')
contrast<-limma::makeContrasts(levels=l,contrasts=contrast)

glmGamPoi.test <- test_de(glmGamPoi.fit
                        , contrast = contrast
                         ##,pseudobulk_by='experimentGroup'
                          ) 

... the problem comes when I try to test using pseudobulk_by, as

glmGamPoi.bulk.test <- test_de(glmGamPoi.fit
                        , contrast = contrast
                         ,pseudobulk_by='experimentGroup'
                          ) %>% setDT

Error in handle_design_parameter(design, data, col_data, reference_level) :
  The model_matrix has more columns (7) than the there are samples in the data matrix (7 columns).
Too few replicates / too many coefficients to fit model.
The head of the design matrix:
         LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
LL_H_000        1        0        0        0        0        0        0
LL_N_000        0        1        0        0        0        0        0
LL_N_030        0        0        1        0        0        0        0
The head of the data:
      LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180
rpl24    21728    14048    13748    19649    14109
cep97       76       18       21        9       55
  eed      270       86       96       88      257

However by my calculations the aggregated data columns should d be there and are somehow erroneously being dropped, , viz::

 library(DelayedArray)
 grs<-colsum(s2@assays@data$counts,s2@colData$experimentGroup)
 head(grs)
         LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 rpl24      21728    14048    13748    19649    14109    11293    11562
 cep97         76       18       21        9       55       32       35
 eed          270       86       96       88      257      176      101
 hikeshi      459       98      120      137      458      380      144
 tmem39a      296      134       68       82      132       63       96
 ildr1a       746      471      620      460      435      261      342

I could potentially share the s2 object via ftp if it would help you sleuth this.

Or perhaps you recognize some error on my part?

@const-ae
Copy link
Owner

I think the problem is that you use the experimentGroup to form the pseudobulk, but also want to use it as the design argument.

The manual way of forming the pseudo-bulk would look like:

splitter <- split(seq_len(ncol(s2)), s2$experimentGroup)
pseudobulk_mat <- do.call(cbind, lapply(splitter, function(idx){
    matrixStats::rowSums2(assay(s2), cols = idx)
}))
pseudo_design_mat <- do.call(rbind, lapply(splitter, function(idx){
    matrixStats::colMeans2(glmGamPoi.fit$model_matrix, rows = idx)
 }))

If you look at pseudobulk_mat, you will see that it has 7 columns. At the same time, you want to fit a model which also has 7 different parameters, which fails with the error message: Too few replicates / too many coefficients to fit model..

Best,
Constantin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants