
prune2df running for more than 140h #142

Open
JPcerapio opened this issue Feb 26, 2020 · 22 comments
Labels
bug Something isn't working

Comments

@JPcerapio

Hello,
I managed to get to Phase II of your tutorial with your data.

But after it had been running for 145h I stopped the process. I don't know if it is normal for it to run that long.

Thanks for your help.

Jp

Here is some info:

dbs
[FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species.mc9nr"), FeatherRankingDatabase(name="mm9-500bp-upstream-10species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species.mc9nr"), FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr")]

PHASE I
network = grnboost2(expression_data=ex_matrix2,
                    gene_names=gene_names,
                    tf_names=tf_names)  # 6h of running

modules = list(modules_from_adjacencies(network, ex_matrix))

PHASE II

with ProgressBar():
    df = prune2df(dbs, modules, "/home/user/pySCENIC/data_bases/Mm/motifs-v9-nr.mgi-m0.001-o0.0.tbl")

[####################################### ] | 98% Completed | 25min 37.2s
2020-02-12 15:05:46,854 - pyscenic.transform - WARNING - Less than 80% of the genes in Tcf21 could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 98% Completed | 25min 45.6s
2020-02-12 15:05:55,227 - pyscenic.transform - WARNING - Less than 80% of the genes in Mef2d could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 98% Completed | 25min 46.4s
2020-02-12 15:05:56,007 - pyscenic.transform - WARNING - Less than 80% of the genes in Meox2 could be mapped to mm9-tss-centered-5kb-10species.mc9nr. Skipping this module.
[####################################### ] | 99% Completed | 18hr 5min 13.5s
[####################################### ] | 99% Completed | 145hr 0min 28.9s^CProcess ForkPoolWorker-446:

@jk86754

jk86754 commented Mar 1, 2020

I'm having a similar issue. The progress bar climbs relatively quickly to a point and then stalls. No error message, but no output either.

This happened both on Linux and with Anaconda on Windows.

@JPcerapio
Author

Hello @jk86754 , did you let it finish? I had to stop mine; I think 145h is quite a lot for a small set of samples.

Jp

@cflerin
Contributor

cflerin commented Mar 2, 2020

Hi @JPcerapio , @jk86754 ,

This step should definitely not take 145 hours. This seems to be a bug in the pruning step, similar to #104 . Running this step via the CLI seems to have worked for others, could you try this?

@JPcerapio
Author

Hey @cflerin, thanks for your answer. I will try it, but the problem with this option is that we do not get access to the intermediate files and results that we would like to keep.

I don't know if anyone has figured out whether the error comes from a missing dependency or library.

Jp

@cflerin
Contributor

cflerin commented Mar 3, 2020

Hi, @JPcerapio , which intermediate files are you referring to? When you run this step in the CLI, you can still get the motif and regulon information. Although the CLI outputs only one of these, you can convert to the other without re-running, for example: #100

@morganee261

hello,
I am using the pySCENIC CLI and the "Calculating regulons" step has been running for over a week.
2020-04-06 09:15:03,025 - pyscenic.cli.pyscenic - INFO - Calculating regulons.
My dataset is quite big (69,000 cells and 27,000 genes), but I am running on a cluster with 64 cores and 1TB of RAM.
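For scale, a dense float64 matrix of this size is only about 14 GiB, a small fraction of 1TB, so raw matrix size alone should not be the bottleneck. A rough back-of-envelope check:

```python
# Rough size of a dense float64 expression matrix (69,000 cells x 27,000 genes)
cells, genes = 69_000, 27_000
size_gib = cells * genes * 8 / 1024**3   # 8 bytes per float64 value
print(f"{size_gib:.1f} GiB")             # -> 13.9 GiB
```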

thanks for your help,
morgane

@liboxun

liboxun commented May 6, 2020

Hi @morganee261 ,

Have you solved this problem? I'm also running the CLI (pyscenic ctx) and it's taking a long time.

Thanks,
Boxun

@morganee261

Hi @liboxun,

Unfortunately no, I haven't had any luck. It has been running for a month now (and still is), and I have not gotten an answer from the developers of this package.
thanks,

Morgane

@cflerin
Contributor

cflerin commented May 8, 2020

Hi @morganee261 , @liboxun ,

This step should definitely not take this long. If it's been running for a month there's clearly something wrong and I would stop it.

I've seen this issue a few times before, but I haven't been able to reproduce the problem to see where and why this step hangs, so I can't offer you a good solution. A few suggestions:

  • Try the Docker image, which has been working reliably for me recently. This would (hopefully) address any package version conflicts. (See here).
  • Try the CLI version of pyscenic ctx (see here).
  • Restart the process if this step seems to hang. For a dataset of 10k cells and 20k genes, this should run in ~10 minutes using 20 processes and two gene-based feather databases (human).
  • Try running with just a single feather database.
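To make restarts cheaper when this step hangs, the module list can be pruned in slices, with each slice's result checkpointed to disk so that a restart skips finished work. This is not part of pySCENIC itself; it is a minimal stand-alone sketch of the pattern, where `prune_chunk` is a hypothetical placeholder for a call such as `prune2df(dbs, part, motif_annotations)`:

```python
import os
import pickle

def chunks(seq, size):
    """Yield (index, slice) pairs over `seq` in steps of `size`."""
    for i in range(0, len(seq), size):
        yield i // size, seq[i:i + size]

def prune_with_checkpoints(modules, prune_chunk, outdir="prune_parts", size=50):
    """Run `prune_chunk` on each slice of `modules`, skipping slices
    whose results were already written by a previous (interrupted) run."""
    os.makedirs(outdir, exist_ok=True)
    results = []
    for idx, part in chunks(modules, size):
        path = os.path.join(outdir, f"part_{idx}.pkl")
        if os.path.exists(path):               # resume: reuse finished chunk
            with open(path, "rb") as f:
                results.append(pickle.load(f))
            continue
        res = prune_chunk(part)                # e.g. prune2df(dbs, part, motif_tbl)
        with open(path, "wb") as f:            # checkpoint this chunk
            pickle.dump(res, f)
        results.append(res)
    return results
```

If the process stalls on one chunk, killing and restarting it only repeats that chunk rather than the whole pruning run.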

@cflerin cflerin added the bug Something isn't working label May 8, 2020
@liboxun

liboxun commented May 8, 2020

Thanks a lot @cflerin ! Since I'm already running the CLI version, I'll try switching to the Docker image or using just a single feather database.

I'll post an update here once I have results.

@morganee261

Hello @cflerin,

I have been running the pyscenic ctx CLI, and that is what was stuck for over a month. I stopped it and started again with a single feather database.

I am also trying to run the Docker image, but I am not very familiar with it and I ran into an error:

docker run -it --rm \
    -v /home/Morgane/mapping/int:/scenicdata \
    aertslab/pyscenic:[version] pyscenic grn \
    --num_workers 20 \
    -o /scenicdata/expr_mat.adjacencies.tsv \
    /scenicdata/ex_matrix.csv \
    /scenicdata/hgnc_tfs.txt

docker: invalid reference format.
See 'docker run --help'.

Could you please advise?

Thanks for your reply and your help,

Morgane

@liboxun

liboxun commented May 12, 2020

Hi @cflerin ,

I went back and ran the CLI with a single feather database, and it didn't help. It still got stuck indefinitely at:

2020-05-08 15:14:50,014 - pyscenic.utils - INFO - Creating modules.

2020-05-08 15:16:46,513 - pyscenic.cli.pyscenic - INFO - Loading databases.

2020-05-08 15:16:46,515 - pyscenic.cli.pyscenic - INFO - Calculating regulons.
slurmstepd: error: *** JOB 1697596 ON NucleusA007 CANCELLED AT 2020-05-10T15:14:05 DUE TO TIME LIMIT ***

But when I tried the Singularity image of pySCENIC 0.10.0 (since Docker isn't available on our HPC system), it certainly helped. Now I actually got a progress bar, although it failed at 57%:

[###################### ] | 57% Completed | 3hr 9min 9.2s

It failed because it ran out of memory:

2020-05-08 21:23:27,584 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF165 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2020-05-08 21:24:06,771 - pyscenic.transform - WARNING - Less than 80% of the genes in Regulon for ZNF2 could be mapped to hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr. Skipping this module.

2020-05-08 21:47:28,929 - pyscenic.transform - ERROR - Unable to process "Regulon for NFKB1" on database "hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr" because ran out of memory. Stacktrace:

2020-05-08 21:47:31,092 - pyscenic.transform - ERROR - Unable to process "Regulon for ZNF81" on database "hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr" because ran out of memory. Stacktrace:

2020-05-08 21:47:51,126 - pyscenic.transform - ERROR - Traceback (most recent call last):
File "/opt/venv/lib/python3.7/site-packages/pyscenic/transform.py", line 185, in module2df
weighted_recovery=weighted_recovery)
File "/opt/venv/lib/python3.7/site-packages/pyscenic/transform.py", line 159, in module2features_auc1st_impl
avg2stdrcc = avgrcc + 2.0 * rccs.std(axis=0)
File "/opt/venv/lib/python3.7/site-packages/numpy/core/_methods.py", line 217, in _std
keepdims=keepdims)
File "/opt/venv/lib/python3.7/site-packages/numpy/core/_methods.py", line 193, in _var
x = asanyarray(arr - arrmean)
MemoryError: Unable to allocate array with shape (24453, 5000) and data type float64

2020-05-08 21:47:51,441 - pyscenic.transform - ERROR - Traceback (most recent call last):
File "/opt/venv/lib/python3.7/site-packages/pyscenic/transform.py", line 185, in module2df
weighted_recovery=weighted_recovery)
File "/opt/venv/lib/python3.7/site-packages/pyscenic/transform.py", line 159, in module2features_auc1st_impl
avg2stdrcc = avgrcc + 2.0 * rccs.std(axis=0)
File "/opt/venv/lib/python3.7/site-packages/numpy/core/_methods.py", line 217, in _std
keepdims=keepdims)
File "/opt/venv/lib/python3.7/site-packages/numpy/core/_methods.py", line 193, in _var
x = asanyarray(arr - arrmean)
MemoryError: Unable to allocate array with shape (24453, 5000) and data type float64

Bus error

I used a node with 32GB memory, with 32 workers. Is that too little? What would you recommend?
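Running the numbers from the traceback as a rough estimate: each array that failed to allocate is just under 1 GiB, and 32 workers sharing 32GB leaves only about 1GB apiece, so a single recovery-curve array already exhausts a worker's share:

```python
# Size of the array that failed to allocate: shape (24453, 5000), float64
rows, cols = 24_453, 5_000
array_gib = rows * cols * 8 / 1024**3    # 8 bytes per float64 value
print(f"{array_gib:.2f} GiB per array")  # -> 0.91 GiB per array

# 32 workers on a 32 GB node leaves roughly 1 GB of memory per worker
per_worker_gb = 32 / 32
print(f"{per_worker_gb:.1f} GB per worker")
```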

Thanks!
Boxun

@morganee261

Hi @liboxun,

I got it to run in less than 14 min using the docker image. I used 20 cores, so the more the better, I think. Here is my code (note that each command is written on a single line without backslash continuations; the multi-line version in the tutorial did not work for me):

sudo docker pull aertslab/pyscenic:0.10.0

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic grn --num_workers 20 --transpose -o /scenicdata/expr_mat.adjacencies.tsv /scenicdata/ex_matrix.csv /scenicdata/hgnc_tfs.txt

I had to transpose my expression matrix to get it in the right format, but you might not have to.

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic ctx /scenicdata/expr_mat.adjacencies.tsv /scenicdata/hg19-tss-centered-10kb-7species.mc9nr.feather /scenicdata/hg19-500bp-upstream-7species.mc9nr.feather --annotations_fname /scenicdata/motifs-v9-nr.hgnc-m0.001-o0.0.tbl --expression_mtx_fname /scenicdata/ex_matrix.csv --transpose --mode "dask_multiprocessing" --output /scenicdata/regulons.csv --num_workers 20

# this ran in 14 min on a server with 1TB of RAM, using 20 out of 64 cores

sudo docker run -it --rm -v /path/to/data:/scenicdata aertslab/pyscenic:0.10.0 pyscenic aucell /scenicdata/ex_matrix.csv --transpose /scenicdata/regulons.csv -o /scenicdata/auc_mtx.csv --num_workers 20

# this took less than 10 min

hope this helps!

morgane

@liboxun

liboxun commented May 12, 2020

Hi @morganee261 ,

Thanks for that tip! Glad to hear it eventually worked for you.

I also got it to run (~23min) when I bumped the task over to a node with 128GB of memory (using 32 out of 32 cores).

Best,
Boxun

@morganee261

Hi @cflerin

I am trying to import the results of the pySCENIC CLI (3 csv files) into R for further analysis, but I am having a lot of problems.

It seems that having a loom file helps with the import; however, your CLI tutorial exports to csv.

Could you please provide a brief tutorial on how to import these into R, so that I can run the rest of the SCENIC script and look at the data?

thanks for your help,

Morgane

@morganee261

Hi @liboxun

I am having issues with the downstream analysis. I was wondering what platform you are using and whether you have had any luck with it.
I have imported a loom file into R, but the format is very different from the tutorial.

Thanks,
Morgane

@liboxun

liboxun commented May 13, 2020

Hi @morganee261 ,

I use Python. I haven't done any downstream analysis yet. I'll let you know how it goes in the next couple of weeks.

Best of luck,
Boxun

@liboxun

liboxun commented May 21, 2020

Hi @morganee261 ,

I was able to run the example Jupyter notebook successfully on the 10x PBMC dataset:

https://github.com/aertslab/SCENICprotocol/blob/master/notebooks/PBMC10k_downstream-analysis.ipynb

The notebook is written in Python and is meant for analysis downstream of pyscenic grn and pyscenic ctx (i.e. after you generate adj.tsv and regulons.csv).

While there were several issues (some due to wrong versions of dependencies, which thankfully were easy enough to fix myself), I could largely run through the notebook smoothly.

Hopefully this helps! I'm not sure whether there's an equivalent example in R, but I'd assume there is, since the original SCENIC was written in R.

Best,
Boxun

@ureyandy2009


Hi @liboxun,
I met the same problem as you. The progress bar climbs relatively quickly to 97% and then stalls. No error message, but no output either. I noticed my 64GB of RAM was used up and none was released; it seems a bug was eating all the memory. Could you kindly tell me how you finally worked it out? Did you use the Docker image, use only one feather database, or just move to a more powerful computer? Also, could you tell me the versions you used (Python, CLI, Jupyter, and so on)?

Many thanks.

Weijian

@liboxun

liboxun commented Jul 3, 2020

Hi @ureyandy2009 ,

For me, a combination of two changes worked:

  1. I switched from CLI to Singularity image (Docker image should work the same way);
  2. I used a computer with 128GB RAM instead of 32GB.

Hopefully this helps!

Best,
Boxun

@ureyandy2009


Thank you very much.

I think RAM may be the main problem.
In my case (24 processors at 4.2GHz and 64GB of RAM), one feather database used about 40GB of RAM, so the computer shut down when I used two feathers at the same time. The problem was solved when I used only one feather database, which used 40GB of the 64GB, and prune2df then ran in less than 10 min.

Many thanks.

@naila53

naila53 commented Feb 6, 2021

I faced the same issue recently and spent 3 days trying to figure it out. The Singularity build wouldn't run on my institute's HPC; I kept getting this error:
ERROR: You must install squashfs-tools to build images ABORT: Aborting with RETVAL=255

A conda installation of squashfs-tools didn't work, and a system-wide installation would have been a hassle, so I didn't do it.
What worked for me is the following.

My dataset: 14,766 cells × 23,011 genes

1. Requested an interactive session:
srun --time=20:00:00 --partition=upgrade --nodes=1 --ntasks=1 --mem=128G --cpus-per-task=40 --pty /bin/bash -l

2. Activated the conda environment where pySCENIC is installed.

3. Ran this script. Everything is the same as in the tutorial:
https://pyscenic.readthedocs.io/en/latest/tutorial.html

I just added:
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    adata = ad.read_loom('adata.all.pocessed.loom')
    ex_matrix = adata.to_df()

    tf_names = load_tf_names(MM_TFS_FNAME)
    db_fnames = glob.glob(DATABASES_GLOB)

    def name(fname):
        return os.path.splitext(os.path.basename(fname))[0]

    dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]

    adjacencies = pd.read_csv("net2.tsv", index_col=False, sep='\t')
    modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

    # Calculate a list of enriched motifs and the corresponding target genes for all modules.
    with ProgressBar():
        df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME,
                      client_or_address=Client(LocalCluster()))

    # Create regulons from this table of enriched motifs.
    regulons = df2regulons(df)

    # Save the enriched motifs and the discovered regulons to disk.
    df.to_csv(MOTIFS_FNAME)
    with open(REGULONS_FNAME, "wb") as f:
        pickle.dump(regulons, f)

Total consumed time: 50 minutes.
