
BUGFIX for sparse expression matrices #20

Merged Feb 8, 2021 (1 commit)

Conversation

@redst4r (Contributor) commented Dec 10, 2019

Hi,

arboreto nominally supports sparse matrices (according to the docs/docstrings),
but I ran into a couple of errors:

  • len() and np.delete() don't work on sparse matrices
  • sklearn.ensemble.GradientBoostingRegressor can handle sparse matrices for X,
    but y has to be dense! (a sketch of the workarounds follows this list)
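For reference, a minimal sketch of the kind of changes involved (illustrative only, not the literal diff; the variable names are made up):

import numpy as np
from scipy.sparse import csc_matrix
from sklearn.ensemble import GradientBoostingRegressor

X = csc_matrix(np.random.rand(100, 20))  # sparse expression matrix (cells x genes)

# len(X) raises a TypeError on sparse matrices; use .shape instead
n_cells = X.shape[0]

# np.delete() also fails on sparse input; drop a column with a boolean mask instead
keep = np.ones(X.shape[1], dtype=bool)
keep[3] = False
X_predictors = X[:, keep]

# GradientBoostingRegressor accepts a sparse X, but y must be a dense 1-D array
y = X[:, 3].toarray().ravel()
GradientBoostingRegressor().fit(X_predictors, y)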

Aside from the potential memory saving, this unfortunately doesn't speed things up much. I'm not an expert in boosting, but I don't see how to exploit sparsity for speed there, so I guess that's expected.

Actually, sparse matrices slow down GradientBoostingRegressor quite a bit.
This seems to be due to some internal conversions between sparse formats (csc vs csr):

  • for arboreto, csc makes the most sense (since we're pulling out columns all the time)
  • GradientBoostingRegressor wants csr
  • but even if one provides a csr matrix to GradientBoostingRegressor, it does some odd csr->csc conversion internally. Not sure what's going on there. (A small demo of the csc/csr slicing difference follows this list.)
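To make the column-slicing point concrete (a quick illustration, not part of the patch):

from scipy.sparse import random as sparse_random

m_csr = sparse_random(10000, 5000, density=0.05, format='csr', random_state=0)
m_csc = m_csr.tocsc()

# CSC stores data column-major, so pulling out a column is a cheap,
# contiguous slice; on CSR the same operation has to scan every row.
col_cheap = m_csc[:, 42]
col_costly = m_csr[:, 42]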

So maybe it makes the most sense, in the future, to keep the original expression matrix sparse, pull out the transcription factor matrix and the target gene vector, and cast them into dense arrays (the TF matrix shouldn't be too large anyway, around 2000 columns). Roughly like the sketch below:
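A sketch only (fit_target and the index arguments are made-up names):

from sklearn.ensemble import GradientBoostingRegressor

def fit_target(expr_csc, tf_idx, target_idx):
    # expr_csc: full (cells x genes) expression matrix, kept sparse in CSC format
    X = expr_csc[:, tf_idx].toarray()              # dense TF matrix (~2000 columns)
    y = expr_csc[:, target_idx].toarray().ravel()  # dense target gene vector
    return GradientBoostingRegressor().fit(X, y)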

Let me know what you guys think!

- len() doesn't work on sparse matrices
- sklearn.GradientBoostingRegressor can handle sparse matrices for X,
but y has to be dense!
@cflerin (Contributor) commented Jan 16, 2020

Hey @redst4r , this looks like a nice fix! I've been running into some memory issues and wanted to use sparse matrices as a way around them. I ran some tests on this branch and thought it would be worth pointing out that (slightly) different results are returned depending on whether the input matrix is sparse or dense:

Here's how I tested:

# get test data:
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/allTFs_hg38.txt
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/expr_mat.loom

pip install --force-reinstall git+https://github.com/redst4r/arboreto@sparse_fix

Then, in python:

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
import loompy as lp
import pandas as pd

tf_names = load_tf_names('allTFs_hg38.txt')

##################################################
# using a sparse matrix as input:
lf = lp.connect('expr_mat.loom', mode='r', validate=False)
ex_matrix = lf.layers[""].sparse().T.tocsc()
gene_names = lf.ra.Gene.tolist()
lf.close()

network = grnboost2(expression_data=ex_matrix,
                    gene_names=gene_names,
                    tf_names=tf_names,
                    seed=777)

>>> network.head()
        TF   target  importance
665   SPI1   TYROBP   61.497067
27   RPS4X   EEF1A1   61.275101
735    PKM  HLA-DRA   60.627965
692  RPL35    RPS18   58.127846
27   RPS4X    RPL30   57.864056

##################################################
# using a dense matrix as input:
lf = lp.connect('expr_mat.loom', mode='r', validate=False)
ex_matrix = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
lf.close()

network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names,
                    seed=777)

>>> network.head()
        TF  target  importance
27   RPS4X   RPL30   57.537719
665   SPI1    CSTA   55.521603
27   RPS4X  EEF1A1   54.931686
27   RPS4X   RPS14   53.646867
692  RPL35    RPL3   52.932191

I think this is actually due to the GradientBoostingRegressor code itself, and not something wrong with the code here. But it's worth documenting that the results are slightly different with sparse vs dense inputs. A quick way to sanity-check where the drift comes from is sketched below.
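This is just a sketch (synthetic data standing in for the loom matrix): fit the regressor directly on the same data in both formats with a fixed seed and compare the feature importances.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(777)
X = rng.rand(200, 50)
y = rng.rand(200)

dense_fit = GradientBoostingRegressor(random_state=777).fit(X, y)
sparse_fit = GradientBoostingRegressor(random_state=777).fit(csr_matrix(X), y)

# Any nonzero difference here comes from sklearn's sparse code path,
# not from the changes in this PR
print(np.abs(dense_fit.feature_importances_ - sparse_fit.feature_importances_).max())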
