illegal instruction (core dumped) #22

Open
shuye2009 opened this issue Nov 27, 2024 · 11 comments

@shuye2009

Hi SCimilarity Team:
I am trying to run the tutorial code on a Linux cluster node with 128 GB of RAM, and I got the following error:
model_path = "/cluster/projects/hardinggroup/Shuye/SCimilarity/model_v1.1"

cq = CellQuery(model_path)
Illegal instruction (core dumped)

Any idea?

Here is my base environment info (partial):
Python 3.12.2
pytorch-lightning 2.4.0
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127

@tony-kuo
Collaborator

Hello. May I ask which version of hnswlib you have installed?
We've found that it has a tendency to crash if the version is too old or wasn't installed properly.

@shuye2009
Author

Hi Tony,

Thanks for replying. The version of my hnswlib is 0.8.0.
Would it help if I install SCimilarity in its own environment?

@tony-kuo
Collaborator

Hello. I think a dedicated environment is a good idea; that is what I usually do.
I use something like this conda YAML:

name: scimilarity
channels:
  - nvidia
  - pytorch
  - bioconda
  - conda-forge
dependencies:
  - python=3.10
  - ipykernel
  - ipython
  - ipywidgets
  - leidenalg # for scanpy clustering
  - pip

Then install scimilarity via pip.

I will test a few more Python versions and update the install instructions to recommend some environments.

@shuye2009
Author

Thanks for sharing your YAML file. I am trying it now and will let you know the outcome.

@shuye2009
Author

Thanks again, it worked out.

@shuye2009
Author

Hi Tony,

It is very weird: it worked for a while, then stopped working again at the line cq = CellQuery(model_path) with the same error, Illegal instruction (core dumped). I run my script as a Slurm job on a Linux cluster. Is it possible that some nodes have a CPU architecture that is compatible with the scimilarity package while others do not?

@tony-kuo
Collaborator

That is possible. Hnswlib is compiled on your machine at install time, and it compiles for the SIMD instructions available on that machine. If another node's CPU does not support the same SIMD instructions, it might crash.

There are a couple of things you can test if you want.

  1. If it works on your local install then it is likely due to the cluster node differences.
  2. You can use the CellEmbedding class instead of CellQuery, which loads the model but not the knn index. This will test if it is hnswlib.
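
The two tests above can also be run without losing the parent job to a core dump. The following is a minimal sketch, not part of SCimilarity: the helper name run_isolated is hypothetical, and it assumes a POSIX system where a child process killed by a signal reports a negative return code. Each candidate step runs in its own child interpreter, so a crash is reported as a signal name rather than taking down the script.

```python
import signal
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 600.0) -> str:
    """Run a Python snippet in a child interpreter and describe how it exited.

    Hypothetical helper for illustration; it only relies on the standard library.
    """
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal,
        # e.g. SIGILL for "Illegal instruction" or SIGSEGV for a segfault.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    if proc.returncode != 0:
        # Ordinary Python failure (ImportError, exception, explicit exit code).
        return f"exited with code {proc.returncode}"
    return "ok"


# Example snippets one might try on a suspect node. The import path and the
# model path below are assumptions, not confirmed in this thread:
#   run_isolated("import hnswlib")
#   run_isolated("from scimilarity import CellEmbedding; CellEmbedding('/path/to/model')")
#   run_isolated("from scimilarity import CellQuery; CellQuery('/path/to/model')")
```

If the CellEmbedding snippet returns "ok" while the CellQuery snippet is reported as killed by SIGILL on the same node, that would point at the kNN index loading step, and therefore at hnswlib, rather than at the model itself.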

@tony-kuo
Collaborator

To add to the previous comment, there are only three things used in scimilarity that may crash with something like a segfault: PyTorch (which doesn't usually crash silently), hnswlib, and tiledb. The latter two are C bindings, hence the segfaults. Older versions of tiledb and hnswlib might be incompatible with an index built with newer versions. As an additional complexity, hnswlib is compiled specifically for the CPU architecture.
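
Since version mismatches are one of the suspects, it can help to record which versions of these three packages are installed in the environment the job actually uses. The sketch below is an illustration only; the function name installed_versions is hypothetical, and the distribution names torch, hnswlib, and tiledb are assumed to match how the packages were installed via pip. It reads package metadata without importing the compiled extensions, so it should be safe to run even on a node where the import itself crashes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}.

    Only package metadata is read; the packages themselves are not imported.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to your environment.
    for name, ver in installed_versions(["torch", "hnswlib", "tiledb"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this inside the Slurm job, rather than on the login node, shows the versions seen by the process that crashes.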

@javh
Collaborator

javh commented Nov 30, 2024

If it does turn out to be a CPU architecture or C library issue, you could also check whether your compute cluster admins have grouped different hardware purchases into different Slurm queues (not uncommon). Your default queue might take any node, but there may be other queues restricted to nodes with certain memory, etc. These tend to align with CPU versions because each represents a bulk hardware purchase.

I've seen node-specific crashes with numpy before too (though not with SCimilarity).

@shuye2009
Author

shuye2009 commented Dec 1, 2024 via email

@shuye2009
Author

Hi Tony, we have figured it out. It is a CPU architecture issue: it worked again on a node with a newer CPU. Best, Shuye
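
For readers who hit the same problem, one way to confirm a CPU mismatch is to compare the SIMD flags of the node where hnswlib was installed with those of the node where the job crashed. This is a rough sketch under stated assumptions: it expects x86 Linux, where /proc/cpuinfo has a "flags" line; the function names are hypothetical; and the SIMD_FLAGS set is an illustrative selection, not the list hnswlib actually checks.

```python
from pathlib import Path

# Illustrative subset of x86 SIMD flags; an assumption, not hnswlib's own list.
SIMD_FLAGS = {"sse", "sse2", "sse4_1", "sse4_2", "avx", "avx2", "avx512f"}


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the flags of the first CPU listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_flags(cpuinfo_text: str) -> set:
    """Keep only the SIMD-related flags of interest."""
    return parse_cpu_flags(cpuinfo_text) & SIMD_FLAGS


def missing_on_run_node(build_text: str, run_text: str) -> list:
    """SIMD flags present on the build node but absent on the run node."""
    return sorted(simd_flags(build_text) - simd_flags(run_text))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        # Print this on each node and compare the results.
        print(sorted(simd_flags(cpuinfo.read_text())))
```

A non-empty result from missing_on_run_node would be consistent with the "Illegal instruction" crash: the library was built using instructions that the older CPU does not implement.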
