illegal instruction (core dumped) #22

shuye2009 · 2024-11-27T21:43:46Z

Hi SCimilarity Team:
I am trying to run the code in the tutorial on a Linux cluster node with 128 GB of RAM, and I got the following error:
model_path = "/cluster/projects/hardinggroup/Shuye/SCimilarity/model_v1.1"

cq = CellQuery(model_path)
Illegal instruction (core dumped)

Any idea?

Here is my base environment info (partial)
Python 3.12.2
pytorch-lightning 2.4.0
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127

tony-kuo · 2024-11-28T21:12:05Z

Hello. May I ask which version of hnswlib you have installed?
We've found that it tends to crash if the version is too old or it wasn't installed properly.

shuye2009 · 2024-11-29T14:04:34Z

Hi Tony,

Thanks for replying. The version of my hnswlib is 0.8.0.
Would it help if I install SCimilarity in its own environment?

tony-kuo · 2024-11-29T14:36:05Z

Hello. I think its own environment is a good idea. That is what I usually do.
I usually use something like this conda yaml:

name: scimilarity
channels:
  - nvidia
  - pytorch
  - bioconda
  - conda-forge
dependencies:
  - python=3.10
  - ipykernel
  - ipython
  - ipywidgets
  - leidenalg # for scanpy clustering
  - pip

Then install scimilarity via pip.

I will test a few more python versions and update the install instructions to recommend some environments.

shuye2009 · 2024-11-29T15:52:58Z

Thanks for sharing your yaml script. I am trying with it now, will let you know the outcome.

shuye2009 · 2024-11-29T20:34:57Z

Thanks again, it worked out.

shuye2009 · 2024-11-29T22:45:56Z

Hi Tony,

It is very weird: it worked for a while, then stopped working again at the same line:
cq = CellQuery(model_path)
Illegal instruction (core dumped)
I run my script as a Slurm job on a Linux cluster. Is it possible that some nodes have a CPU architecture that is compatible with the scimilarity package and others don't?

tony-kuo · 2024-11-29T23:22:38Z

That is possible. Hnswlib is compiled on your machine at install time, and it is built for the SIMD instructions available on that CPU. If the job later runs on a node whose CPU lacks some of those instructions, it can crash with an illegal instruction.
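
One way to compare nodes is to list the SIMD flags each CPU advertises. Here is a minimal sketch, assuming a Linux x86 node where /proc/cpuinfo has a "flags" line. The helper names (parse_cpu_flags, simd_report) and the list of flags to check are my own choices for illustration, not part of scimilarity or hnswlib.

```python
from pathlib import Path

# SIMD feature names as they appear in /proc/cpuinfo on x86 Linux.
# Which of these a given hnswlib build actually uses is not checked here.
SIMD_FLAGS = ("sse", "sse2", "sse4_1", "sse4_2", "avx", "avx2", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found (for example on non-x86 systems)


def simd_report(flags):
    """Map each SIMD flag of interest to whether the CPU advertises it."""
    return {name: name in flags for name in SIMD_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for name, present in simd_report(parse_cpu_flags(cpuinfo.read_text())).items():
            print(f"{name}: {'yes' if present else 'no'}")
    else:
        print("/proc/cpuinfo not found; this sketch only covers Linux")
```

Run it on the node where you installed the package and on a node where the job crashes. A flag that is present on the first and missing on the second would point to the compiled extension.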

There are a couple of ways to test this if you want.

If it works on your local install, the crash is likely due to differences between cluster nodes.
You can also use the CellEmbedding class instead of CellQuery. It loads the model but not the kNN index, so if it runs cleanly, hnswlib is the likely culprit.
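
Because an illegal instruction kills the whole Python process, it can help to run each step in its own subprocess and look at how it exits. The following is a rough sketch, not an official scimilarity tool. The model path is a placeholder, the helper names are made up for illustration, and it assumes CellEmbedding and CellQuery can both be imported from the top-level scimilarity package and take the model path as their first argument.

```python
import signal
import subprocess
import sys

# Hypothetical path; replace it with your own model directory.
MODEL_PATH = "/path/to/model_v1.1"

# Each step runs in a fresh interpreter so a crash in one does not hide the others.
# The scimilarity imports below are assumptions about the package layout.
STEPS = {
    "import hnswlib": "import hnswlib",
    "import tiledb": "import tiledb",
    "import torch": "import torch",
    "CellEmbedding (model only)": (
        f"from scimilarity import CellEmbedding; CellEmbedding({MODEL_PATH!r})"
    ),
    "CellQuery (model + kNN index)": (
        f"from scimilarity import CellQuery; CellQuery({MODEL_PATH!r})"
    ),
}


def run_snippet(code):
    """Run a code string in a new Python process and return its exit status."""
    return subprocess.run([sys.executable, "-c", code]).returncode


def describe(returncode):
    """Turn an exit status into a short description (POSIX convention)."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative status means the process was killed by that signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    for label, code in STEPS.items():
        print(f"{label}: {describe(run_snippet(code))}")
```

If CellEmbedding reports ok and CellQuery reports killed by SIGILL, that would be consistent with the crash coming from loading the kNN index rather than from the model.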

tony-kuo · 2024-11-29T23:44:49Z

To add to the previous comment, there are only three things used in scimilarity that may crash at the level of a segfault: PyTorch (which doesn't usually crash silently), hnswlib, and tiledb. The latter two are C bindings, hence the segfaults. Older versions of tiledb and hnswlib might be incompatible with an index built with newer versions. As an additional complication, hnswlib is compiled specifically for the CPU architecture it is installed on.

javh · 2024-11-30T05:18:46Z

If it does turn out to be a CPU architecture or C library issue, you could also check whether your compute cluster admins have grouped different hardware purchases into different Slurm queues (not uncommon). Your default queue might accept any node, but there may be other queues restricted to nodes with certain memory or other specs, and those tend to align with CPU versions because each represents a bulk hardware purchase.

I've seen node specific crashes with numpy before too (though not with SCimilarity).
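
To see whether crashes line up with particular nodes, one option is to print the hostname and CPU model at the very top of the job script, before any scimilarity import, so the line reaches the job log even if the process dies later. This is a small sketch that assumes Linux and /proc/cpuinfo. The function names are only illustrative.

```python
import platform
from pathlib import Path


def cpu_model(cpuinfo_text):
    """Return the first 'model name' entry from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "model name":
            return value.strip()
    return "unknown"


def node_banner(hostname, model):
    """Format a single log line identifying the node and its CPU."""
    return f"node={hostname} cpu={model}"


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    # flush=True so the line is written out before any later crash.
    print(node_banner(platform.node(), cpu_model(text)), flush=True)
```

Collecting these lines from a few successful and failed jobs should show whether the failures are confined to one hardware generation.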

shuye2009 · 2024-12-01T14:34:54Z

Thank you, Tony, for the detailed explanation. I have arranged with our admin to see if it is architecture related. Best, Shuye


shuye2009 · 2024-12-01T16:09:48Z

Hi Tony, we have figured it out. It is a CPU architecture issue: it worked again on a node with a newer CPU. Best, Shuye
