
UMAP 32bits dispatch mechanism #6314

Merged 2 commits into branch-25.04 on Feb 14, 2025
Conversation

viclafargue
Contributor

@viclafargue viclafargue commented Feb 13, 2025

Answers #6310

@viclafargue viclafargue requested review from a team as code owners February 13, 2025 14:04
@viclafargue viclafargue requested review from dantegd and bdice February 13, 2025 14:04
@github-actions github-actions bot added Cython / Python Cython or Python issue CUDA/C++ labels Feb 13, 2025
@viclafargue viclafargue added bug Something isn't working non-breaking Non-breaking change labels Feb 13, 2025
@github-actions github-actions bot removed the Cython / Python Cython or Python issue label Feb 13, 2025
@viclafargue viclafargue added feature request New feature or request and removed bug Something isn't working labels Feb 13, 2025
Member

@cjnolet cjnolet left a comment


LGTM. Thanks for finishing out this change.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (branch-25.04@9c0166a). Learn more about missing BASE report.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-25.04    #6314   +/-   ##
===============================================
  Coverage                ?   67.07%           
===============================================
  Files                   ?      202           
  Lines                   ?    13076           
  Branches                ?        0           
===============================================
  Hits                    ?     8771           
  Misses                  ?     4305           
  Partials                ?        0           


@cjnolet
Member

cjnolet commented Feb 14, 2025

/merge

@rapids-bot rapids-bot bot merged commit 8a5feaa into branch-25.04 Feb 14, 2025
73 checks passed
@jcrist jcrist deleted the 32bits-umap-dispatch branch February 14, 2025 16:28
@jcrist
Member

jcrist commented Feb 17, 2025

This PR introduced an overflow when `nnz_t` is `int` for large `n_rows`. Running with `n_rows=130_000_000, n_neighbors=15` ran fine before this PR, but now hits an overflow somewhere in `_get_graph`. This causes a runtime error, `Invalid input range, passed negative size`, to be raised; a quick grep shows this comes from overflow detection in the thrust codebase. I think the error is hit in `coo_remove_zeros` here. Reverting to always setting `nnz_t` as `uint64_t` fixes things.

Looking through the code, I'm not sure I understand why we'd want to switch on this type at all. No arrays have `nnz_t` type; it's just some local scalars in each method. (Although I think this is a bug: we probably do want the COO types to have a larger index type for bigger inputs, where `n_rows` wouldn't fit in `int`. Currently this is always `int`. Fixing that is non-trivial, though, whereas fixing the `nnz_t` type is.) At least on CPU I wouldn't expect using `int` instead of `uint64_t` to matter for perf here, and the dispatching is both more complicated and has led to an overflow somewhere.

Reproducer:

import numpy as np
from cuml.manifold import UMAP

import rmm

rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

model = UMAP(
    n_components=2,
    n_neighbors=15,
    build_algo="nn_descent",
    init="random",
    build_kwds={"nnd_n_clusters": 150},
)

# data is 130_000_000 x 512, float32
data = np.load("/path/to/large/data.npy", mmap_mode="r")
embeddings = model.fit_transform(data, data_on_host=True)

I don't see where the overflow is occurring exactly (`n * n_neighbors` still fits in an `int`), but it's definitely happening somewhere. Given our release timeline, unless someone can spot a quick fix, I'd prefer to revert this PR so we use `uint64_t` everywhere for now. From Victor's benchmarks before, it doesn't look like this had a measurable perf effect anyway.
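As a quick sanity check on the arithmetic (pure NumPy, using the reproducer's shape — this snippet is illustrative, not from the cuml codebase):

```python
import numpy as np

n_rows, n_neighbors = 130_000_000, 15
INT32_MAX = np.iinfo(np.int32).max  # 2_147_483_647

# n * n_neighbors does still fit in a signed 32-bit int...
prod = n_rows * n_neighbors  # 1_950_000_000
assert prod <= INT32_MAX

# ...but only barely: any constant factor >= 2 pushes it past the limit,
# and int32 arithmetic then wraps around to a negative value, which is
# what thrust's overflow detection reports as a negative size.
wrapped = np.array([prod], dtype=np.int32) * np.int32(2)
assert wrapped[0] == -394_967_296
```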

@viclafargue
Contributor Author

Thanks for noticing and reporting the issue. The crash happens at a `thrust::max_element` call over here. The reason is that the `nnz` value can reach a theoretical maximum of not `n * n_neighbors` but `2 * n * n_neighbors` (minus removed zeroes), since the COO matrix is symmetrized here. It looks like the tests I ran with `n * n_neighbors > std::numeric_limits<int32_t>::max()` did not run into a crash because some zeroes are removed in the fuzzy simplicial set. The appropriate fix is to modify the condition in the function that decides the dispatching. I opened a PR for this: #6330.
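To make the bound concrete, here is a minimal sketch of the corrected dispatch condition. The helper name is hypothetical (the real fix lives in #6330 and is in C++); the point is only that the post-symmetrization bound, not `n * n_neighbors`, is what must fit in `int32`:

```python
INT32_MAX = 2**31 - 1  # std::numeric_limits<int32_t>::max()

def needs_64bit_nnz(n_rows: int, n_neighbors: int) -> bool:
    """Hypothetical dispatch check: after symmetrization the COO matrix can
    hold up to 2 * n_rows * n_neighbors non-zeros (before zero removal),
    so that product is what must fit in a signed 32-bit int."""
    return 2 * n_rows * n_neighbors > INT32_MAX

# The reproducer's shape sits exactly in the gap the original condition
# missed: n * k fits in int32, but 2 * n * k does not.
assert 130_000_000 * 15 <= INT32_MAX
assert needs_64bit_nnz(130_000_000, 15)
```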

@jcrist
Member

jcrist commented Feb 18, 2025

Thanks for tracking this down (and now that you point it out, I do remember the 2x factor post-symmetrize, but clearly both of us had forgotten it when looking at this dispatch code earlier).

  • I'm still skeptical that the dispatch actually matters for perf here, and it does add complexity. Since it's already in, fixing the condition is the easier fix for now, but in the long run are we sure the complexity is beneficial?
  • How did you track down where the error was being raised? I tried to debug this a bit but was unable to find an easy way to see where the overflow was hit. Looking to pick up any tips/skills you have for debugging cuml here.

@viclafargue
Contributor Author

Agree that the templating and dispatching mechanism might add unnecessary complexity in many places. The truly critical parts are:

  1. The CUDA kernels that perform the optimization, since they run at every epoch and might overuse registers (causing spilling to local memory, if my understanding is correct)
  2. Some of the calls to RAFT utilities, even though COO matrices use a `uint64_t` nnz by default

It might be interesting to do a follow-up PR that simplifies things by defaulting most things to `uint64_t`, or even removes the templating altogether.

Regarding tracking down errors, it is possible to build with debug symbols; otherwise, a simpler approach is to add print statements. It can also be useful to call `RAFT_CUDA_TRY(cudaDeviceSynchronize())` to pinpoint exactly where a CUDA call failed, since kernel launches are asynchronous and errors otherwise surface at a later call.
