leiden and umap not reproducible on different CPUs #2014
Hey, please report back with containerized environments as discussed on Twitter.
Hi, I updated the pipeline to use this singularity container. The problem persists.
I also tried with […]. [Screenshots comparing the results on different CPUs ("On …" / "On …") omitted.]
Do you have any idea if the issue is with our implementation of these, or with the underlying libraries? E.g. if you just run UMAP directly from the UMAP library, can you get exact reproducibility? I don't think we would be able to promise more reproducibility than what they offer, and irreproducibility across machines is a known issue there: lmcinnes/umap#525.
That's a good point, and it is not: running UMAP directly from the umap-learn library

```python
import umap

reducer = umap.UMAP(min_dist=0.5)
embedding = reducer.fit_transform(adata.obsm["X_scVI"])
adata.obsm["X_umap"] = embedding
```

again produces stable results on only 3/4 CPUs. Ok, let's forget about UMAP. It's only a nice figure to get an overview of the data and I don't use it for downstream stuff. Irreproducible clustering, on the other hand, is quite a deal-breaker, as for instance cell-type annotations depend on it. I mean, why would I even bother releasing the source code of an analysis alongside the paper if it is not reproducible anyway?

I found out a few more things. One of them: rounding the neighbor graph

```python
import numpy as np

adata.obsp["connectivities"] = np.round(adata.obsp["connectivities"], decimals=3)
adata.obsp["distances"] = np.round(adata.obsp["distances"], decimals=3)
```
So should we add a deterministic flag for Leiden clustering which enforces […]?
I have yet to install the latest […].
@grst I don't think […]. My guess is this is going to have to do with the CPU that gives different results being much older and using a different instruction set than the other Intel processors. This could be triggered either by the use of any parallelism at all or by […]. Do you get the same graph out of […]?
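To make that last question concrete, a minimal sketch of comparing the graphs from two runs (file names are hypothetical; assumes each run saved its AnnData object):

```python
import anndata as ad

# Load the results produced on two different machines (hypothetical file names).
a = ad.read_h5ad("run_machine_a.h5ad")
b = ad.read_h5ad("run_machine_b.h5ad")

# Count entries where the two connectivity matrices disagree.
diff = (a.obsp["connectivities"] != b.obsp["connectivities"]).nnz
print(f"{diff} differing entries in the connectivity graph")
```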
💯
I have already been exporting those four env vars before running the analysis. Is there anything else that might be threaded?

```bash
export MKL_NUM_THREADS="1"
export OPENBLAS_NUM_THREADS="1"
export OMP_NUM_THREADS="1"
export NUMBA_NUM_THREADS="1"
```
threadpoolctl does a nice job of accounting for most possible options. But do you see more than a single thread being used?
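For reference, threadpoolctl's `threadpool_info` helper can answer that directly; a minimal check:

```python
from pprint import pprint
from threadpoolctl import threadpool_info

# Lists every detected BLAS/OpenMP thread pool with its current num_threads.
pprint(threadpool_info())
```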
Just to be sure, I additionally included

```python
from threadpoolctl import threadpool_limits

threadpool_limits(1)
```

I also confirmed that only a single thread was actually used. The results didn't change.
Alright, if you would like to achieve reproducibility, the next things to play around with would probably be these CPU feature flag variables for numba. In particular: […]
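Presumably this refers to numba's documented CPU-targeting variables; a hedged sketch of setting them (they must be set before numba JIT-compiles anything):

```python
import os

# NUMBA_CPU_NAME and NUMBA_CPU_FEATURES are documented numba environment
# variables. Targeting a generic CPU model disables machine-specific
# instruction selection during JIT compilation, trading speed for portability.
os.environ["NUMBA_CPU_NAME"] = "generic"   # target a generic CPU model
os.environ["NUMBA_CPU_FEATURES"] = ""      # drop machine-specific features
```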
Another 💯 with […]. In terms of speed I didn't notice a big difference (maybe 10 s slower, but that would require proper benchmarking ofc).
What do you think the right thing to do here is? I don't think we can modify the global environment for our users by setting those numba flags by default. Is this just something we should document?
How about creating a page about reproducibility in the docs, similar to the one by PyTorch? It could gather all information around reproducibility with scanpy, such as […], and also state the limits, e.g. […].

In addition to that, @Zethson, what do you think of creating an mlf-core template for single-cell analyses that sets the right defaults?
Mhm, certainly a cool option, but nothing that I could tackle in the next weeks due to time constraints. I would start with a reproducibility section in the documentation and maybe a "deterministic" switch in the Scanpy settings which sets all required numba flags.
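A minimal sketch of what such a switch might look like (hypothetical API, not an existing scanpy feature):

```python
import os

def set_deterministic(enabled: bool = True) -> None:
    """Hypothetical helper: pin thread counts and numba's CPU target
    so JIT-compiled kernels behave identically across machines."""
    if not enabled:
        return
    # Must run before the first numba-compiled function is executed.
    os.environ["NUMBA_CPU_NAME"] = "generic"
    os.environ["NUMBA_CPU_FEATURES"] = ""
    for var in ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
                "OMP_NUM_THREADS", "NUMBA_NUM_THREADS"):
        os.environ[var] = "1"
```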
@grst, this would be great. Would you be up for writing this at some point? I'm thinking it could either go: […]
Yeah, not sure when I can make it, though.
We could then also consider extending it with rapids single cell, @Intron7. More of a […]
@Zethson […]
I noticed that running the same single-cell analyses on different nodes of our HPC produces different results. Starting from the same anndata object with a precomputed X_scVI latent representation, the UMAP and leiden clustering look different.

[Screenshots of the UMAP on different CPUs ("On …" / "On …") omitted.]
Minimal code sample (that we can copy&paste without having any data)
A git repository with example data, notebook and a nextflow pipeline is available here:
https://github.com/grst/scanpy_reproducibility
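For orientation, a hedged sketch of the core steps such a pipeline presumably runs (parameters are assumptions; the authoritative version is in the repository above):

```python
import scanpy as sc

# Build the kNN graph on the precomputed scVI latent space,
# then cluster and embed — the steps that differ across CPUs.
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata)
sc.tl.umap(adata)
```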
A report of the analysis executed on four different CPU architectures is available here:
https://grst.github.io/scanpy_reproducibility/
Versions

[…]