Matlab crash while glmnetMex.mexmaci64 was running #69

Open
ElenaMerinoTejero opened this issue Aug 10, 2022 · 15 comments
@ElenaMerinoTejero

I am trying to run SINGE_Example.m in MATLAB R2020a on macOS Catalina.

ver -support


MATLAB Version: 9.8.0.1873465 (R2020a) Update 8
MATLAB License Number: 40707400
Operating System: Mac OS X Version: 10.15.7 Build: 19H1824
Java Version: Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode

MATLAB Version 9.8 (R2020a) License 40707400

I get the following Warning message:

Warning: from glmnet Fortran code (error code -5); Convergence for 5th lambda value not reached
after maxit=10000 iterations; solutions for larger lambdas returned

In elnet (line 33)
In glmnet (line 443)
In iLasso_for_SINGE (line 111)
In run_iLasso_row (line 27)
In SINGE_GLG_Test (line 79)
In SINGE (line 20)
In SINGE_Example (line 16)

After several iterations, MATLAB crashes.
According to MathWorks technical support, the crash was detected while the MEX-file glmnetMex.mexmaci64 was running.
Any suggestions for solving this issue?

@agitter
Member

agitter commented Aug 10, 2022

Thanks for letting us know @ElenaMerinoTejero. Can you please tell us which version of glmnet you are using? I see that https://hastie.su.domains/glmnet_matlab/download.html now lists glmnet_matlab.zip as well as glmnet_matlab_new.zip.

@atuldeshpande
Collaborator

atuldeshpande commented Aug 10, 2022

Thanks for bringing this to our notice @ElenaMerinoTejero.

Could you please also share more details about the hyperparameters file you are using to run SINGE, as well as the size of the data matrix in number of genes and cells?
If you have access to Docker, can you also try running SINGE from its Docker implementation?
There is an ongoing issue with a potential memory leak in the glmnetMex code, which causes crashes for larger datasets. We also noticed that this issue is more frequent with the "poisson" distribution option than with the "gaussian" distribution.
The following general strategies may help; with additional information, I could give you more targeted suggestions.

  1. Reducing the number of potential regulators to shrink the glmnet input size. This can be achieved either by retaining only the important genes in the dataset, or by using the regix argument, where you specify the indices of the genes to be tested as regulators (please see https://github.com/gitter-lab/SINGE/blob/master/data1/X_regix_test.mat for an example).
  2. Increasing the values of --prob-zero-removal 0 --prob-remove-sample 0.2 to also reduce the glmnet input size. Especially for sparse datasets, we observed that prob-zero-removal does not impact SINGE performance greatly.
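
As a rough illustration of the first strategy, the sketch below looks up the positions of a few candidate regulators in a gene list. The gene names are hypothetical, and the use of 1-based positions (as MATLAB indexes arrays) is an assumption about what regix expects; saving the result into a .mat file is not shown.

```python
def regulator_indices(gene_list, regulators, one_based=True):
    """Return the sorted positions in gene_list of the genes named in regulators."""
    position = {gene: i for i, gene in enumerate(gene_list)}
    missing = [g for g in regulators if g not in position]
    if missing:
        raise ValueError(f"genes not in gene list: {missing}")
    # Assumption: regix uses 1-based positions, as MATLAB does.
    offset = 1 if one_based else 0
    return sorted(position[g] + offset for g in regulators)


# Hypothetical gene names, for illustration only.
genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
candidate_regulators = ["GeneD", "GeneB"]
regix = regulator_indices(genes, candidate_regulators)
print(regix)  # [2, 4]
```

Restricting the regulators this way shrinks the number of predictor columns passed to glmnet while leaving the set of target genes unchanged.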

@ElenaMerinoTejero
Author

@agitter Thanks for the fast reply! I am using the updated glmnet version, which I believe is glmnet_matlab_new.zip.

@ElenaMerinoTejero
Author

@atuldeshpande Thanks to you too for the fast reply.

For now I am trying to run SINGE_Example.m, which takes its hyperparameters from 'default_hyperparameters.txt', its data from 'data1/X_SCODE_data.mat' (356 cells), and its gene list from 'data1/gene_list.mat' (100 genes). The example uses the "gaussian" distribution.

default_hyperparameters.txt

Nevertheless, I am planning to run SINGE on a larger dataset (33694 genes and 737280 single cells), so I will be able to use your suggestions then. Thanks.

@atuldeshpande
Collaborator

That's actually one of the most stable test cases we have run. Would it be possible for you to test the docker implementation at https://hub.docker.com/r/agitter/singe?
This would remove the OS and the Matlab version as variables and help us better diagnose if the problem still persists.

Regarding the larger dataset: I would strongly advise limiting the genes to a much smaller number, and potentially also subsampling the cells at a much higher rate. I understand you are currently trying SINGE out on a personal computing device, but for the larger datasets you would also want to use a high-throughput computing server to speed up the analysis.

@ElenaMerinoTejero
Author

Thanks for the suggestion! I am now running the Docker implementation; it no longer crashes, and it produces output files. However, the warning message persists, and I am unsure whether it affects the output. Is there a way to compare my output with the expected output for SINGE_Example?

In any case, I will take your advice about reducing the data set and trying it out on a server for higher speed.

@agitter
Member

agitter commented Aug 12, 2022

We have formal test cases you can use to confirm the SINGE_Example output matches the expected output. However, they use a smaller set of hyperparameters so that they run quickly on GitHub Actions. You can change the hyperparameters to tests/example_hyperparameters.txt.

Then, the output files should match those in the directory https://github.com/gitter-lab/SINGE/tree/master/tests/reference/latest. You can start by comparing the SINGE_Gene_Influence.txt and SINGE_Ranked_Edge_List.txt files you generate versus those stored in the repository. If those match, you can trust SINGE_Example.m is running correctly. If you want to test in more detail, I can give you instructions for running our Python code that will compare the entire adjacency matrices.
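
If a plain file comparison reports differences that turn out to be floating-point formatting only, a tolerance-aware check can help. The sketch below is not the project's own comparison code. It assumes the output files are whitespace-delimited text in which some fields are numbers, and the file names and contents in the demo are hypothetical.

```python
import math
import tempfile
from pathlib import Path


def tokens_match(a, b, rel_tol=1e-6):
    """Compare two fields: identical text, or numbers equal within a tolerance."""
    if a == b:
        return True
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol, abs_tol=1e-12)
    except ValueError:
        # At least one field is not numeric and the text differs.
        return False


def files_match(generated, reference, rel_tol=1e-6):
    """Return True if both files have the same rows, field by field."""
    gen_lines = Path(generated).read_text().splitlines()
    ref_lines = Path(reference).read_text().splitlines()
    if len(gen_lines) != len(ref_lines):
        return False
    for gen_line, ref_line in zip(gen_lines, ref_lines):
        gen_fields, ref_fields = gen_line.split(), ref_line.split()
        if len(gen_fields) != len(ref_fields):
            return False
        if not all(tokens_match(a, b, rel_tol) for a, b in zip(gen_fields, ref_fields)):
            return False
    return True


# Demo on two small temporary files with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    generated = Path(tmp, "generated.txt")
    reference = Path(tmp, "reference.txt")
    generated.write_text("GeneA GeneB 0.5000000\n")
    reference.write_text("GeneA\tGeneB\t0.50000001\n")
    demo_result = files_match(generated, reference)
print(demo_result)  # True
```

For a real check, the two paths would point at a generated file such as SINGE_Ranked_Edge_List.txt and its counterpart under tests/reference/latest.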

The most relevant test script, which you don't have to run but may be a useful reference, is https://github.com/gitter-lab/SINGE/blob/master/tests/standalone_test.sh

@ElenaMerinoTejero
Author

I ran SINGE_Example from Docker with the hyperparameters from the tests folder, as @agitter indicated, and the output files do match those in /tests/reference/latest. Furthermore, no warning message appeared this time, so I can trust that SINGE is running correctly. Thanks a lot for the help.

@agitter
Member

agitter commented Aug 17, 2022

That's great! We can keep this issue open if you'd like to discuss strategies for running SINGE in parallel on a cluster as you scale up to your full dataset. That is a larger dataset than any we've tested on previously, so we're happy to help come up with strategies.

@atuldeshpande we should also separately follow up on whether glmnet_matlab_new.zip causes problems with the example dataset.

@ElenaMerinoTejero
Author

Hi @agitter, I reduced the dataset by selecting a particular cell type. The dataset now has 433 single cells and a mean of 157 genes per cell. When I run this dataset with tests/example_hyperparameters.txt, the corresponding adjacency matrices are written, but the ranked edge list and gene influence files are not. Furthermore, SINGE is killed when running run_SINGE_Aggregate.sh:

/usr/local/SINGE/run_SINGE_Aggregate.sh: line 30: 32 Killed "/usr/local/SINGE/SINGE_Aggregate" "GSE142016_RAW/SLE1/X_SCODE_data.mat" "GSE142016_RAW/SLE1/gene_list.mat" "Output"

Any clues as to why this may happen? Could it be a size problem?

BTW: It would also be helpful to discuss how to run SINGE in parallel on a cluster since I would like to run bigger data sets and with default hyperparameters.

@ElenaMerinoTejero
Author

About the reduced data set size: There are 9082 unique genes, thus the size of the resulting X data matrix is 433x9082.

@agitter
Member

agitter commented Sep 7, 2022

Are you still running SINGE from the Docker container? If you were able to generate adjacency matrices successfully, you should now be able to run SINGE.sh in Aggregate mode to generate the edge list and gene influence files. I have not previously seen SINGE fail at this stage with the behavior you described, so we'll have to help you debug this problem.

One idea would be to copy a small number, perhaps 2-4, of the adjacency matrices to a new directory for testing. If those can be aggregated successfully, then it may indicate the dataset size is an issue. If that still fails, you could zip those adjacency matrices and the input .mat files so we could try reproducing the issue in Docker.
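
A minimal sketch of that subset test is below. It assumes the adjacency matrices are .mat files sitting in one output directory; the directory and file names in the demo are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_subset(source_dir, dest_dir, pattern="*.mat", count=2):
    """Copy the first `count` files matching `pattern` (sorted by name) into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    chosen = sorted(Path(source_dir).glob(pattern))[:count]
    for path in chosen:
        shutil.copy2(path, dest / path.name)
    return [path.name for path in chosen]


# Demo with empty placeholder files standing in for adjacency matrices.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "Output")
    source.mkdir()
    for i in range(1, 5):
        Path(source, f"adjacency_{i}.mat").write_bytes(b"")
    copied = copy_subset(source, Path(tmp, "Output_subset"), count=2)
print(copied)  # ['adjacency_1.mat', 'adjacency_2.mat']
```

The aggregate step can then be pointed at the smaller directory to see whether the failure depends on how many matrices are loaded.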

We have an example of how we ran SINGE on a cluster using HTCondor in this directory of our supplemental repository. The basic idea is that instead of creating a single hyperparameters file and running SINGE with all hyperparameter combinations in a single batch, each combination is split into a separate job. Those jobs can be parallelized over different nodes in the cluster. Then, after all jobs complete, the SINGE aggregate step can run. We can work through the details with you depending on your cluster setup and whether you will be using Docker or running MATLAB directly.
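
The splitting idea can be sketched as follows. This is not the script from the supplemental repository; it assumes the hyperparameters file holds one combination per non-empty line, and the output file naming is hypothetical.

```python
import tempfile
from pathlib import Path


def split_hyperparameters(hyperparameter_file, out_dir):
    """Write each non-empty line of the hyperparameters file to its own file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = Path(hyperparameter_file).read_text().splitlines()
    combos = [line.strip() for line in lines if line.strip()]
    written = []
    for i, combo in enumerate(combos, start=1):
        # Hypothetical naming scheme: one numbered file per cluster job.
        path = out / f"hyperparameters_{i:04d}.txt"
        path.write_text(combo + "\n")
        written.append(path)
    return written


# Demo with two illustrative combinations separated by a blank line.
with tempfile.TemporaryDirectory() as tmp:
    combined = Path(tmp, "hyperparameters.txt")
    combined.write_text(
        "--prob-zero-removal 0 --prob-remove-sample 0.2\n"
        "\n"
        "--prob-zero-removal 0.2 --prob-remove-sample 0.2\n"
    )
    job_files = split_hyperparameters(combined, Path(tmp, "jobs"))
    job_contents = [path.read_text() for path in job_files]
print(len(job_files))  # 2
```

Each resulting file can be handed to one job, and the aggregate step runs once after every job has written its adjacency matrix.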

@atuldeshpande
Collaborator

In addition, the MATLAB crashes are usually an issue only for the first part of SINGE, which requires glmnet. Since you have already navigated that part successfully, you can try running the SINGE aggregate step through the MATLAB functions. (I wonder whether the aggregate step, in trying to load 9000x9000 matrices and perform additions on them, may be causing memory issues?)
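
For a sense of scale, here is a back-of-the-envelope estimate using the 9082 genes reported above. It assumes each adjacency matrix is held in memory as a dense array of 8-byte doubles, which may not match how SINGE actually stores them, and it ignores any temporary copies made during aggregation.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """Bytes needed for a dense matrix of 8-byte (double precision) values."""
    return n_rows * n_cols * bytes_per_value


n_genes = 9082  # unique genes in the reduced dataset
n_matrices = 4  # adjacency matrices mentioned later in this thread

per_matrix = dense_matrix_bytes(n_genes, n_genes)
total = per_matrix * n_matrices
print(f"{per_matrix / 1e9:.2f} GB per matrix")               # 0.66 GB per matrix
print(f"{total / 1e9:.2f} GB for {n_matrices} matrices")    # 2.64 GB for 4 matrices
```

Under these assumptions, a few matrices plus working copies could plausibly exceed the memory available to a container, which would be consistent with the process being killed.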

@ElenaMerinoTejero
Author

ElenaMerinoTejero commented Sep 12, 2022

@agitter Yes, I am still running SINGE from Docker. I followed your suggestion and copied 1 of the 4 adjacency matrices to a test output folder to run Aggregate mode. The error persists, and the ranked edge list and gene influence files are still not produced. Attached are the zipped files so you can reproduce the issue.
gene_list.mat.zip
X_SCODE_data.mat.zip
AdjMats.zip

With regard to running SINGE in parallel on a cluster: I will be using Docker and the Sonic HPC cluster, which is described at https://www.ucd.ie/itservices/ourservices/researchit/researchcomputing/sonichpc/.
In short, I will be able to use up to 48 cores, 50GB of file storage and 1.5TB of RAM.
It would be nice to hear suggestions on how to adapt the example in SINGE-supplementary to run on Sonic HPC. Is it possible to run several datasets? Should the input data structure (.mat files) be modified? How should the hyperparameters be specified now? Is there an example wrapper script showing how to run it with Docker?

@ElenaMerinoTejero
Author

@atuldeshpande I just tried running aggregate mode on the 4 adjacency matrices through the MATLAB code, and it did produce the Gene Influence and Ranked Edge List output files. Thanks for the suggestion.
