
BUGFIX for sparse expression matrices #20

Merged Feb 8, 2021 (1 commit)

Conversation

@redst4r (Contributor) commented Dec 10, 2019

Hi,

arboreto nominally supports sparse matrices (according to the docs/docstrings),
but I ran into a couple of errors:

  • len() and np.delete() don't work on sparse matrices
  • sklearn.ensemble.GradientBoostingRegressor can handle sparse matrices for X,
    but y has to be dense! (a sketch of the workarounds follows this list)
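For reference, a minimal sketch of the kind of changes involved (illustrative only, not the literal diff; the variable names are made up):

import numpy as np
from scipy.sparse import csc_matrix
from sklearn.ensemble import GradientBoostingRegressor

X = csc_matrix(np.random.rand(100, 20))  # sparse expression matrix (cells x genes)

# len(X) raises a TypeError on sparse matrices; use .shape instead
n_cells = X.shape[0]

# np.delete() also fails on sparse input; drop a column with a boolean mask instead
keep = np.ones(X.shape[1], dtype=bool)
keep[3] = False
X_predictors = X[:, keep]

# GradientBoostingRegressor accepts a sparse X, but y must be a dense 1-D array
y = X[:, 3].toarray().ravel()
GradientBoostingRegressor().fit(X_predictors, y)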

Aside from the potential memory saving, this unfortunately doesn't speed things up much. I'm not an expert in boosting, but I don't see how to exploit sparsity for speed there, so I guess that's expected.

Actually, sparse matrices slow down GradientBoostingRegressor quite a bit.
This seems to be due to some internal conversions between sparse formats (csc vs csr):

  • for arboreto, csc makes the most sense (since we're pulling out columns all the time)
  • GradientBoostingRegressor wants csr
  • but even if one provides a csr matrix to GradientBoostingRegressor, it does some odd csr->csc conversion internally. Not sure what's going on there. (A small demo of the csc/csr slicing difference follows this list.)
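To make the column-slicing point concrete (a quick illustration, not part of the patch):

from scipy.sparse import random as sparse_random

m_csr = sparse_random(10000, 5000, density=0.05, format='csr', random_state=0)
m_csc = m_csr.tocsc()

# CSC stores data column-major, so pulling out a column is a cheap,
# contiguous slice; on CSR the same operation has to scan every row.
col_cheap = m_csc[:, 42]
col_costly = m_csr[:, 42]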

So maybe it makes the most sense, in the future, to keep the original expression matrix sparse, pull out the transcription factor matrix and the target gene vector, and cast them into dense arrays (the TF matrix shouldn't be too large anyway, around 2000 columns). Roughly like the sketch below:
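A sketch only (fit_target and the index arguments are made-up names):

from sklearn.ensemble import GradientBoostingRegressor

def fit_target(expr_csc, tf_idx, target_idx):
    # expr_csc: full (cells x genes) expression matrix, kept sparse in CSC format
    X = expr_csc[:, tf_idx].toarray()              # dense TF matrix (~2000 columns)
    y = expr_csc[:, target_idx].toarray().ravel()  # dense target gene vector
    return GradientBoostingRegressor().fit(X, y)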

Let me know what you guys think!

- len() doesn't work on sparse matrices
- sklearn.GradientBoostingRegressor can handle sparse matrices for X,
but y has to be dense!
@cflerin (Contributor) commented Jan 16, 2020

Hey @redst4r , this looks like a nice fix! I've been running into some memory issues and wanted to use sparse matrices as a way around them. I ran some tests on this branch and thought it would be worth pointing out that (slightly) different results are returned depending on whether the input matrix is sparse or dense:

Here's how I tested:

# get test data:
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/allTFs_hg38.txt
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/expr_mat.loom

pip install --force-reinstall git+https://github.com/redst4r/arboreto@sparse_fix

Then, in python:

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
import loompy as lp
import pandas as pd

tf_names = load_tf_names('allTFs_hg38.txt')

##################################################
# using a sparse matrix as input:
lf = lp.connect('expr_mat.loom', mode='r', validate=False)
ex_matrix = lf.layers[""].sparse().T.tocsc()
gene_names = lf.ra.Gene.tolist()
lf.close()

network = grnboost2(expression_data=ex_matrix,
                    gene_names=gene_names,
                    tf_names=tf_names,
                    seed=777)

>>> network.head()
        TF   target  importance
665   SPI1   TYROBP   61.497067
27   RPS4X   EEF1A1   61.275101
735    PKM  HLA-DRA   60.627965
692  RPL35    RPS18   58.127846
27   RPS4X    RPL30   57.864056

##################################################
# using a dense matrix as input:
lf = lp.connect('expr_mat.loom', mode='r', validate=False)
ex_matrix = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
lf.close()

network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names,
                    seed=777)

>>> network.head()
        TF  target  importance
27   RPS4X   RPL30   57.537719
665   SPI1    CSTA   55.521603
27   RPS4X  EEF1A1   54.931686
27   RPS4X   RPS14   53.646867
692  RPL35    RPL3   52.932191

I think this is actually due to the GradientBoostingRegressor code itself, and not something wrong with the code here. But it's worth documenting that the results are slightly different with sparse vs dense inputs. A quick way to sanity-check where the drift comes from is sketched below.
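This is just a sketch (synthetic data standing in for the loom matrix): fit the regressor directly on the same data in both formats with a fixed seed and compare the feature importances.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(777)
X = rng.rand(200, 50)
y = rng.rand(200)

dense_fit = GradientBoostingRegressor(random_state=777).fit(X, y)
sparse_fit = GradientBoostingRegressor(random_state=777).fit(csr_matrix(X), y)

# Any nonzero difference here comes from sklearn's sparse code path,
# not from the changes in this PR
print(np.abs(dense_fit.feature_importances_ - sparse_fit.feature_importances_).max())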
