-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ERROR] modules_from_adjacencies: exprMat error #347
Comments
This error means that there was a gene found in the network output (adjacencies file) that is not present in your expression matrix. You must be using a different expression matrix for this step vs. the network inference step -- there's no other explanation (ok, there could a weird bug in this check step). You have two options: 1) re-run the network inference on your new expression matrix, or 2) remove the extra gene(s) from your adjacencies file (take all genes in the TF and Target columns of your adjacencies and compare with your expression matrix genes, find the one that shouldn't be there, and remove it from the network output file). I would strongly suggest option 1 since there's no way to know what else changed with the expression matrix. |
Thank you very much @cflerin I have spent the last week trying every which to redo the network inference and I'm still not able to get this working. I tried the other option of just removing the gene of issue and it does fix it but as you said that's not the ideal method. I'm still not sure why this would be the case and I double-checked that the expression matrix is the right one etc. I may just have to forgo this last step for visualisation |
Hello @cflerin @cliftonlewis , export2loom(ex_mtx=adata.to_df(), regulons=regulons, out_fname=LOOM_FNAME,
cell_annotations=adata.obs['pca_leiden'].to_dict(), tree_structure=(),
title='{}_{}'.format(TCGA_CODE, DATASET_ID), nomenclature="mm10", num_workers=16,
embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
auc_mtx = auc_mtx, auc_thresholds=thresholds,
compress=True) Then I reload the adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()
AssertionError: Found 7350 genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix? I also submit this issue here. |
@hyjforesight could you show part of the data contained in the Using the following command you can get the genes which are only present in the adjacencies but not in the expression matrix. This might help with troubleshooting as well. unique_adj_genes = set(adjacencies["TF"]).union(set(adjacencies["target"])) - set(ex_mtx.columns) |
Hello @SeppeDeWinter , By calling Because it's a common way to only use the highly variable genes for generating a UMAP, but GRNBoost2 requires the whole genes, the gene numbers of
# preprocessing by eliminating some unwanted cells, all genes are included
sc.pl.highest_expr_genes(adata, n_top=20)
sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=25)
adata.var['mt'] = adata.var_names.str.startswith('mt-')
adata.var['rpl'] = adata.var_names.str.startswith('Rpl')
adata.var['rps'] = adata.var_names.str.startswith('Rps')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','rpl','rps'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rpl')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rps')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 6000, :]
adata = adata[adata.obs.pct_counts_mt < 50, :]
adata = adata[adata.obs.pct_counts_rpl < 50, :]
adata = adata[adata.obs.pct_counts_rps < 50, :]
# write the preprocessed data into f_loom_path_scenic
row_attrs = {"Gene": np.array(adata.var_names)}
col_attrs = {"CellID": np.array(adata.obs_names), "nGene": np.array(np.sum(adata.X.transpose()>0, axis=0)).flatten(),
"nUMI": np.array(np.sum(adata.X.transpose(), axis=0)).flatten()}
lp.create(f_loom_path_scenic, adata.X.transpose(), row_attrs, col_attrs)
# use f_loom_path_scenic for GRNBoost2
!pyscenic grn {f_loom_path_scenic} {Mouse_TFS_FNAME} -o {ADJACENCIES_FNAME} --num_workers 16
# do a view of adata by highly_variable_genes and generate UMAP
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
sc.pl.highly_variable_genes(adata)
print(sum(adata.var.highly_variable))
adata.raw=adata
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, keys=['pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_pcs=50, knn=True)
sc.tl.leiden(adata, resolution=0.4)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'], legend_loc='on data', frameon=False, title='', use_raw=False)
# after doing all SCENIC steps, export into loom, but this loom is embedded with UMAP drawn by highly_variable_genes
export2loom(ex_mtx=adata.to_df(), regulons=regulons, out_fname=LOOM_FNAME,
cell_annotations=adata.obs['leiden'].to_dict(), tree_structure=(),
title='{}_{}'.format(TCGA_CODE, DATASET_ID), nomenclature="mm10", num_workers=16,
embeddings=OrderedDict([('AUCell + umap', embedding_aucell_umap), ('PCA + umap', embedding_pca_umap)]),
auc_mtx = auc_mtx, auc_thresholds=thresholds,
compress=True)
# exprMat was generated by this loom file
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(LOOM_FNAME, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close() |
Indeed it won't work with the matrix which only includes highly variable genes. If i.e. lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat)) |
Hello @SeppeDeWinter # Create the modules
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()
# pick out modules for Atoh1
tf = 'Atoh1'
tf_mods = [ x for x in modules if x.transcription_factor==tf ]
for i,mod in enumerate( tf_mods ):
print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
print( f'{tf} regulon: {len(regulons[tf])} genes' )
Atoh1 module 0: 1390 genes
Atoh1 module 1: 772 genes
Atoh1 module 2: 51 genes
Atoh1 module 3: 218 genes
Atoh1 module 4: 329 genes
Atoh1 module 5: 1027 genes
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21964/1801393588.py in <module>
5 for i,mod in enumerate( tf_mods ):
6 print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
----> 7 print( f'{tf} regulon: {len(regulons[tf])} genes' )
TypeError: list indices must be integers or slices, not str Here is the output of [Regulon(name='Ahr', gene2weight=frozendict.frozendict({'4933434E20Rik': 1.2266120420120072, 'Tsc22d2': 2.493458558861111, 'Relb': 1.6059775174853306, 'Wdr81': 2.7312633394303125, 'N4bp2': 2.517364206954634, 'Gm15417': 2.9100744069765594, 'Mrm3': 1.8340456841873831, 'Slc43a2': 3.2958363969906035, 'Ahr': 1.0, 'Pwp2': 1.4280092164572429, 'Snapc1': 1.375446330644375, 'Rrp8': 1.1964840576238376}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Ahr', context=frozenset({'cisbp__M5217.png', 'activating'}), score=1.1945444591398073, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Arntl', gene2weight=frozendict.frozendict({'Gtf2h1': 1.620336555726875, 'Slc35a5': 1.91533670349294, 'Stxbp2': 1.8737801640073328, 'Slc20a1': 2.6734488198151336, 'Mms19': 1.5346505049314778, 'Zfp750': 3.07825326637646, 'Trpv3': 1.3858996748844017, 'Lmo7': 1.4820359038217124, 'Ppfia4': 2.2181361459582285, 'Vps11': 3.3435825773521883, 'Per2': 5.276266659342296, 'Net1': 1.6967910742666223, 'Miga2': 2.6992598513071764, 'Bhlhe40': 1.048126353591973, 'Cnnm1': 0.2176312578590142, 'Zfp296': 0.5270087719933082}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Arntl', context=frozenset({'transfac_pro__M08868.png', 'activating'}), score=3.6772977258300505, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Atf1', gene2weight=frozendict.frozendict({'Haus5': 2.696609854293601, 'Ccna2': 3.966379215412084, 'Adnp2': 3.1412411963973, 'Knl1': 2.978216740685239, 'Zfp574': 2.6709956233650742, 'Gstcd': 4.495503046617849, 'Akt2': 7.703748410076394, 'Wdr33': 4.470218914364348, 'Osbpl8': 4.128940542758419, 'Dmtf1': 2.9518457104833984, 'Ecd': 4.228484233174275, 'Tufm': 3.241183423254145, 'Atf1': 1.0, 'Tmem79': 3.039202512848713, 'Txndc12': 2.7651144726669266, 'H2-T23': 4.091599231055049, 'Sars2': 4.88815935652636, 'Hist2h2ac': 6.218816342974511, 'Espl1': 5.891555428730409, 'Trim59': 5.5231863844652365, 'Repin1': 7.575504985608425, 'Cdc6': 4.835485993983268, 'Atp5g2': 6.370272443998833, 'Mycbpap': 1.786616119915864, 'Zranb3': 1.884314316529295}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Atf1', context=frozenset({'jaspar__MA0604.1.png', 'activating'}), score=3.7760690161205934, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
Regulon(name='Atf2', gene2weight=frozendict.frozendict({'Ube2h': 0.897783367787763, 'Atxn1l': 1.0011890012392768, 'Spag4': 0.0504887189046225, 'Tsc22d2': 1.446490814474661, 'Fam53c': 1.0264002590265495, 'Celf3': 0.8921666004374457, 'Baz2b': 1.7387011414932263, 'Sstr2': 0.1305991991641035, 'Ppp1r15a': 2.255760431335529, 'Ccdc86': 1.4541805544558348, 'Per3': 0.2268716251155102, 'Tmem86a': 1.3049516716894871, 'Fdxacb1': 1.2384143650749797, 'Ift57': 0.1866952283639881, 'Dnajb4': 2.4881353785712856, 'Ift81': 0.6129619623801797, 'Nedd1': 2.330723795868671, 'Tfcp2': 0.9635833030927822, 'Atg2b': 0.6862142465533663, 'Pum3': 1.0244760901333978, 'A130010J15Rik': 0.6386723714882772, 'Cdc37': 1.8154579196928788, 'Topors': 1.6841496409710848, 'Gskip': 1.033190551910205, 'Ldb1': 1.0867077410261163, 'Arf4': 0.8804604227117785, 'Ppard': 1.789885033526931, 'U2af1l4': 0.9695135646637782, 'Cry2': 0.7857601469517936, 'Ppp1r2': 1.9509700635671885, 'Washc2': 2.6742612959268044, 'Hexim2': 0.3199586402604067, 'Tapt1': 1.2681447687682663, 'Tmcc3': 0.9065785116044712, 'Srrt': 1.1198520353424557, 'Fkbp7': 0.5068524674075681, 'Egr3': 0.6450683762001174, 'Mthfr': 1.6912746112244794, 'Slc16a12': 1.1161316547263636, 'Ccdc173': 0.3230739469914306}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Atf2', context=frozenset({'cisbp__M3088.png', 'activating'}), score=3.376461889441437, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation=''),
... |
Hi You're trying to index a list using a string key, this is not possible. You should first convert the list of regulons to a dictionary (see tutorial): # create a dictionary of regulons:
regulons = {}
for i,r in pd.DataFrame(lf.ra.Regulons,index=lf.ra.Gene).iteritems():
regulons[i] = list(r[r==1].index.values) |
Holy crap! You guys @SeppeDeWinter who're doing the coding things are so genius, my friend! You're the real PhD, not me (old guys sitting on the bench and doing pipetting). Here attaches the coding I learn from @SeppeDeWinter. Hope it helps @cliftonlewis. # Step 8: Cytoscape-iRegulon exploration of modules
adjacencies = pd.read_csv(ADJACENCIES_FNAME, index_col=False, sep='\t')
adjacencies
# Create the modules
from pyscenic.utils import modules_from_adjacencies
lf = lp.connect(f_loom_path_scenic, mode='r', validate=False)
exprMat = pd.DataFrame(lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID).T
modules = list(modules_from_adjacencies(adjacencies, exprMat))
lf.close()
# create a dictionary of regulons for converting the index from string to values:
lf2 = lp.connect(LOOM_FNAME, mode='r', validate=False)
regulons_dict = {}
for i,r in pd.DataFrame(lf2.ra.Regulons, index=lf2.ra.Gene).iteritems():
regulons_dict[i] = list(r[r==1].index.values)
lf2.close()
# pick out modules for Atoh1
tf = 'Atoh1'
tf_mods = [ x for x in modules if x.transcription_factor==tf ]
for i,mod in enumerate( tf_mods ):
print( f'{tf} module {str(i)}: {len(mod.genes)} genes' )
print( f'{tf} regulon: {len(regulons_dict[tf])} genes' )
# write these modules, and the regulon to files
for i,mod in enumerate( tf_mods ):
with open( tf+'_module_'+str(i)+'.txt', 'w') as f:
for item in mod.genes:
f.write("%s\n" % item)
with open( tf+'_regulon.txt', 'w') as f:
for item in regulons_dict[tf]:
f.write("%s\n" % item) |
Happy to help! Closing the issue for now, feel free to open again if there are further issues. Seppe |
Describe the bug
A clear and concise description of what the bug is.
Steps to reproduce the behavior
Expected behavior
A clear and concise description of what you expected to happen.
I have run the code as per the previous pipeline so I don't really understand the error to well as it was using the same expression matrix. I have seen a previous issue #103 which discusses the same problem and it seemed at that time it was an issue with arboreto which was then fixed so I'm not too sure about the source of my error. Thank you in advance for all your help.
Please complete the following information:
The text was updated successfully, but these errors were encountered: