[ERROR] modules_from_adjacencies: exprMat error #347

cliftonlewis · 2021-12-09T09:42:11Z

Describe the bug
A clear and concise description of what the bug is.

Steps to reproduce the behavior

Command run when the error occurred:

from pyscenic.utils import modules_from_adjacencies
modules = list(modules_from_adjacencies(adjacencies, exprMat))

Error encountered:

AssertionError                            Traceback (most recent call last)
<ipython-input-39-235636e9fdb2> in <module>
      1 from pyscenic.utils import modules_from_adjacencies
----> 2 modules = list(modules_from_adjacencies(adjacencies, exprMat))

/home/shared/anaconda3/envs/scenic3/lib/python3.6/site-packages/pyscenic/utils.py in modules_from_adjacencies(adjacencies, ex_mtx, thresholds, top_n_targets, top_n_regulators, min_genes, absolute_thresholds, rho_dichotomize, keep_only_activating, rho_threshold, rho_mask_dropouts)
    267             # test for genes present in the adjacencies but not present in the expression matrix:
    268             unique_adj_genes = set(adjacencies[COLUMN_NAME_TF]).union(set(adjacencies[COLUMN_NAME_TARGET])) - set(ex_mtx.columns)
--> 269             assert len(unique_adj_genes)==0, f"Found {len(unique_adj_genes)} genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?"
    270             LOGGER.warn(f"Note on correlation calculation: the default behaviour for calculating the correlations has changed after pySCENIC verion 0.9.16. Previously, the default was to calculate the correlation between a TF and target gene using only cells with non-zero expression values (mask_dropouts=True). The current default is now to use all cells to match the behavior of the R verision of SCENIC. The original settings can be retained by setting 'rho_mask_dropouts=True' in the modules_from_adjacencies function, or '--mask_dropouts' from the CLI.\n\tDropout masking is currently set to [{rho_mask_dropouts}].")
    271             adjacencies = add_correlation(adjacencies, ex_mtx,

AssertionError: Found 1 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?

Expected behavior
A clear and concise description of what you expected to happen.
I have run the code as per the previous pipeline so I don't really understand the error to well as it was using the same expression matrix. I have seen a previous issue #103 which discusses the same problem and it seemed at that time it was an issue with arboreto which was then fixed so I'm not too sure about the source of my error. Thank you in advance for all your help.
Please complete the following information:

pySCENIC version: 0.11.1
Installation method: Conda
Run environment: Jupyter notebook
OS: Linux

cflerin · 2021-12-09T20:20:56Z

@cliftonlewis,

This error means that there was a gene found in the network output (adjacencies file) that is not present in your expression matrix. You must be using a different expression matrix for this step vs. the network inference step -- there's no other explanation (ok, there could a weird bug in this check step).

You have two options: 1) re-run the network inference on your new expression matrix, or 2) remove the extra gene(s) from your adjacencies file (take all genes in the TF and Target columns of your adjacencies and compare with your expression matrix genes, find the one that shouldn't be there, and remove it from the network output file). I would strongly suggest option 1 since there's no way to know what else changed with the expression matrix.

cliftonlewis · 2021-12-17T10:18:18Z

Thank you very much @cflerin

I have spent the last week trying every which to redo the network inference and I'm still not able to get this working. I tried the other option of just removing the gene of issue and it does fix it but as you said that's not the ideal method. I'm still not sure why this would be the case and I double-checked that the expression matrix is the right one etc. I may just have to forgo this last step for visualisation

hyjforesight · 2022-01-12T04:54:53Z

Hello @cflerin @cliftonlewis ,
Same issue here.
All procedures start from ADJACENCIES_FNAME and I exported the final result into loom file LOOM_FNAME.

export2loom(ex_mtx=adata.to_df(), regulons=regulons, out_fname=LOOM_FNAME,
            cell_annotations=adata.obs['pca_leiden'].to_dict(), tree_structure=(),
            title='{}_{}'.format(TCGA_CODE, DATASET_ID), nomenclature="mm10", num_workers=16,
            embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
            auc_mtx = auc_mtx, auc_thresholds=thresholds,
            compress=True)

Then I reload the ADJACENCIES_FNAME into adjacencies. exprMat is also loaded from LOOM_FNAME, but running modules_from_adjacencies creates errors.

adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()

AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?

I also submit this issue here.
#355

SeppeDeWinter · 2022-01-12T07:36:33Z

@hyjforesight could you show part of the data contained in the adjacencies and exprMat variable, for example using the head command.

Using the following command you can get the genes which are only present in the adjacencies but not in the expression matrix. This might help with troubleshooting as well.

unique_adj_genes = set(adjacencies["TF"]).union(set(adjacencies["target"])) - set(ex_mtx.columns)

hyjforesight · 2022-01-19T17:12:18Z

Hello @SeppeDeWinter ,
Thanks for the recommendations. Sorry for the late response, I was busy with some wet experimental things.

By calling adjacencies and exprMat, I realize that adjacencies was generated from the whole genes, whereas exprMat was generated from highly variable genes. I did a view of my adata and use this view for generating UMAP and then embed with the AUCell. Please see below.

Because it's a common way to only use the highly variable genes for generating a UMAP, but GRNBoost2 requires the whole genes, the gene numbers of adjacencies and exprMat will not be matched. In this case, how do we solve the AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix error?
Thanks!
Best,
YJ

f_loom_path_scenic includes the whole genes for GRNBoost2, generating adjacencies.

# preprocessing by eliminating some unwanted cells, all genes are included
sc.pl.highest_expr_genes(adata, n_top=20)
sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=25)
adata.var['mt'] = adata.var_names.str.startswith('mt-')
adata.var['rpl'] = adata.var_names.str.startswith('Rpl')
adata.var['rps'] = adata.var_names.str.startswith('Rps')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','rpl','rps'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rpl')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rps')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 6000, :]
adata = adata[adata.obs.pct_counts_mt < 50, :]
adata = adata[adata.obs.pct_counts_rpl < 50, :]
adata = adata[adata.obs.pct_counts_rps < 50, :]

# write the preprocessed data into f_loom_path_scenic
row_attrs = {"Gene": np.array(adata.var_names)}
col_attrs = {"CellID": np.array(adata.obs_names), "nGene": np.array(np.sum(adata.X.transpose()>0, axis=0)).flatten(),
             "nUMI": np.array(np.sum(adata.X.transpose(), axis=0)).flatten()}
lp.create(f_loom_path_scenic, adata.X.transpose(), row_attrs, col_attrs)

# use f_loom_path_scenic for GRNBoost2
!pyscenic grn {f_loom_path_scenic} {Mouse_TFS_FNAME} -o {ADJACENCIES_FNAME} --num_workers 16

LOOM_FNAME only includes the highly_variable_genes, generating exprMat.

# do a view of adata by highly_variable_genes and generate UMAP
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
sc.pl.highly_variable_genes(adata)
print(sum(adata.var.highly_variable))
adata.raw=adata
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, keys=['pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_pcs=50, knn=True)
sc.tl.leiden(adata, resolution=0.4)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'], legend_loc='on data', frameon=False, title='', use_raw=False)

# after doing all SCENIC steps, export into loom, but this loom is embedded with UMAP drawn by highly_variable_genes
export2loom(ex_mtx=adata.to_df(), regulons=regulons, out_fname=LOOM_FNAME,
            cell_annotations=adata.obs['leiden'].to_dict(), tree_structure=(),
            title='{}_{}'.format(TCGA_CODE, DATASET_ID), nomenclature="mm10", num_workers=16,
            embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
            auc_mtx = auc_mtx, auc_thresholds=thresholds,
            compress=True)

# exprMat was generated by this loom file
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()

SeppeDeWinter · 2022-01-20T10:31:14Z

Hi @hyjforesight

Indeed it won't work with the matrix which only includes highly variable genes.

If f_loom_path_scenic contains the matrix with all genes you should use this one.

i.e.

lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))

hyjforesight · 2022-01-21T05:55:10Z

Hello @SeppeDeWinter
It's so kind of you!
As you recommend, f_loom_path_scenic works. However, it generates TypeError: list indices must be integers or slices, not str.

# Create the modules
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()

# pick out modules for Atoh1
tf = 'Atoh1'
tf_mods = [ x for x in modules if x.transcription_factor==tf ]

for i,mod in enumerate( tf_mods ):
    print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
print( f'{tf} regulon: {len(regulons[tf])} genes' )

Atoh1 module 0: 1390 genes
Atoh1 module 1: 772 genes
Atoh1 module 2: 51 genes
Atoh1 module 3: 218 genes
Atoh1 module 4: 329 genes
Atoh1 module 5: 1027 genes
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21964/1801393588.py in <module>
      5 for i,mod in enumerate( tf_mods ):
      6     print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
----> 7 print( f'{tf} regulon: {len(regulons[tf])} genes' )

TypeError: list indices must be integers or slices, not str

Here is the output of regulons

[Regulon(name='Ahr', gene2weight=frozendict.frozendict({'4933434E20Rik': 1.2266120420120072, 'Tsc22d2': 2.493458558861111, 'Relb': 1.6059775174853306, 'Wdr81': 2.7312633394303125, 'N4bp2': 2.517364206954634, 'Gm15417': 2.9100744069765594, 'Mrm3': 1.8340456841873831, 'Slc43a2': 3.2958363969906035, 'Ahr': 1.0, 'Pwp2': 1.4280092164572429, 'Snapc1': 1.375446330644375, 'Rrp8': 1.1964840576238376}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ahr', context=frozenset({'cisbp__M5217.png', 'activating'}), score=1.1945444591398073, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Arntl', gene2weight=frozendict.frozendict({'Gtf2h1': 1.620336555726875, 'Slc35a5': 1.91533670349294, 'Stxbp2': 1.8737801640073328, 'Slc20a1': 2.6734488198151336, 'Mms19': 1.5346505049314778, 'Zfp750': 3.07825326637646, 'Trpv3': 1.3858996748844017, 'Lmo7': 1.4820359038217124, 'Ppfia4': 2.2181361459582285, 'Vps11': 3.3435825773521883, 'Per2': 5.276266659342296, 'Net1': 1.6967910742666223, 'Miga2': 2.6992598513071764, 'Bhlhe40': 1.048126353591973, 'Cnnm1': 0.2176312578590142, 'Zfp296': 0.5270087719933082}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Arntl', context=frozenset({'transfac_pro__M08868.png', 'activating'}), score=3.6772977258300505, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Atf1', gene2weight=frozendict.frozendict({'Haus5': 2.696609854293601, 'Ccna2': 3.966379215412084, 'Adnp2': 3.1412411963973, 'Knl1': 2.978216740685239, 'Zfp574': 2.6709956233650742, 'Gstcd': 4.495503046617849, 'Akt2': 7.703748410076394, 'Wdr33': 4.470218914364348, 'Osbpl8': 4.128940542758419, 'Dmtf1': 2.9518457104833984, 'Ecd': 4.228484233174275, 'Tufm': 3.241183423254145, 'Atf1': 1.0, 'Tmem79': 3.039202512848713, 'Txndc12': 2.7651144726669266, 'H2-T23': 4.091599231055049, 'Sars2': 4.88815935652636, 'Hist2h2ac': 6.218816342974511, 'Espl1': 5.891555428730409, 'Trim59': 5.5231863844652365, 'Repin1': 7.575504985608425, 'Cdc6': 4.835485993983268, 'Atp5g2': 6.370272443998833, 'Mycbpap': 1.786616119915864, 'Zranb3': 1.884314316529295}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Atf1', context=frozenset({'jaspar__MA0604.1.png', 'activating'}), score=3.7760690161205934, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
 Regulon(name='Atf2', gene2weight=frozendict.frozendict({'Ube2h': 0.897783367787763, 'Atxn1l': 1.0011890012392768, 'Spag4': 0.0504887189046225, 'Tsc22d2': 1.446490814474661, 'Fam53c': 1.0264002590265495, 'Celf3': 0.8921666004374457, 'Baz2b': 1.7387011414932263, 'Sstr2': 0.1305991991641035, 'Ppp1r15a': 2.255760431335529, 'Ccdc86': 1.4541805544558348, 'Per3': 0.2268716251155102, 'Tmem86a': 1.3049516716894871, 'Fdxacb1': 1.2384143650749797, 'Ift57': 0.1866952283639881, 'Dnajb4': 2.4881353785712856, 'Ift81': 0.6129619623801797, 'Nedd1': 2.330723795868671, 'Tfcp2': 0.9635833030927822, 'Atg2b': 0.6862142465533663, 'Pum3': 1.0244760901333978, 'A130010J15Rik': 0.6386723714882772, 'Cdc37': 1.8154579196928788, 'Topors': 1.6841496409710848, 'Gskip': 1.033190551910205, 'Ldb1': 1.0867077410261163, 'Arf4': 0.8804604227117785, 'Ppard': 1.789885033526931, 'U2af1l4': 0.9695135646637782, 'Cry2': 0.7857601469517936, 'Ppp1r2': 1.9509700635671885, 'Washc2': 2.6742612959268044, 'Hexim2': 0.3199586402604067, 'Tapt1': 1.2681447687682663, 'Tmcc3': 0.9065785116044712, 'Srrt': 1.1198520353424557, 'Fkbp7': 0.5068524674075681, 'Egr3': 0.6450683762001174, 'Mthfr': 1.6912746112244794, 'Slc16a12': 1.1161316547263636, 'Ccdc173': 0.3230739469914306}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Atf2', context=frozenset({'cisbp__M3088.png', 'activating'}), score=3.376461889441437, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
...

SeppeDeWinter · 2022-01-24T10:16:35Z

Hi

You're trying to index a list using a string key, this is not possible.

You should first convert the list of regulons to a dictionary (see tutorial):

# create a dictionary of regulons:
regulons = {}
for i,r in pd.DataFrame(lf.ra.Regulons,index=lf.ra.Gene).iteritems():
    regulons[i] =  list(r[r==1].index.values)

hyjforesight · 2022-01-25T00:15:19Z

Holy crap! You guys @SeppeDeWinter who're doing the coding things are so genius, my friend! You're the real PhD, not me (old guys sitting on the bench and doing pipetting).

Here attaches the coding I learn from @SeppeDeWinter. Hope it helps @cliftonlewis.

# Step 8: Cytoscape-iRegulon exploration of modules
adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
adjacencies

# Create the modules
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()

# create a dictionary of regulons for converting the index from string to values:
lf2 = lp.connect(LOOM_FNAME, mode='r', validate=False)
regulons_dict = {}
for i,r in pd.DataFrame(lf2.ra.Regulons, index=lf2.ra.Gene).iteritems():
    regulons_dict[i] =  list(r[r==1].index.values)
lf2.close()

# pick out modules for Atoh1
tf = 'Atoh1'
tf_mods = [ x for x in modules if x.transcription_factor==tf ]

for i,mod in enumerate( tf_mods ):
    print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
print( f'{tf} regulon: {len(regulons_dict[tf])} genes' )

# write these modules, and the regulon to files
for i,mod in enumerate( tf_mods ):
    with open( tf+'_module_'+str(i)+'.txt', 'w') as f:
        for item in mod.genes:
            f.write("%s\n" % item)
            
with open( tf+'_regulon.txt', 'w') as f:
    for item in regulons_dict[tf]:
        f.write("%s\n" % item)

SeppeDeWinter · 2022-01-25T13:52:32Z

Happy to help! Closing the issue for now, feel free to open again if there are further issues.

Seppe

cliftonlewis added the bug Something isn't working label Dec 9, 2021

cliftonlewis changed the title ~~[ERROR] Hi, I am trying to go through the downstream analysis for pySCENIC and when I run the command~~ [ERROR] modules_from_adjacencies: exprMat error Dec 9, 2021

hyjforesight mentioned this issue Jan 19, 2022

Unable to pick out a specific module for iRegulon #355

Closed

SeppeDeWinter closed this as completed Jan 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ERROR] modules_from_adjacencies: exprMat error #347

[ERROR] modules_from_adjacencies: exprMat error #347

cliftonlewis commented Dec 9, 2021 •

edited

Loading

cflerin commented Dec 9, 2021

cliftonlewis commented Dec 17, 2021

hyjforesight commented Jan 12, 2022 •

edited

Loading

SeppeDeWinter commented Jan 12, 2022

hyjforesight commented Jan 19, 2022 •

edited

Loading

SeppeDeWinter commented Jan 20, 2022

hyjforesight commented Jan 21, 2022 •

edited

Loading

SeppeDeWinter commented Jan 24, 2022

hyjforesight commented Jan 25, 2022 •

edited

Loading

SeppeDeWinter commented Jan 25, 2022

[ERROR] modules_from_adjacencies: exprMat error #347

[ERROR] modules_from_adjacencies: exprMat error #347

Comments

cliftonlewis commented Dec 9, 2021 • edited Loading

cflerin commented Dec 9, 2021

cliftonlewis commented Dec 17, 2021

hyjforesight commented Jan 12, 2022 • edited Loading

SeppeDeWinter commented Jan 12, 2022

hyjforesight commented Jan 19, 2022 • edited Loading

SeppeDeWinter commented Jan 20, 2022

hyjforesight commented Jan 21, 2022 • edited Loading

SeppeDeWinter commented Jan 24, 2022

hyjforesight commented Jan 25, 2022 • edited Loading

SeppeDeWinter commented Jan 25, 2022

cliftonlewis commented Dec 9, 2021 •

edited

Loading

hyjforesight commented Jan 12, 2022 •

edited

Loading

hyjforesight commented Jan 19, 2022 •

edited

Loading

hyjforesight commented Jan 21, 2022 •

edited

Loading

hyjforesight commented Jan 25, 2022 •

edited

Loading