
UMAP 32bits dispatch mechanism #6314

Merged 2 commits into branch-25.04 on Feb 14, 2025
Conversation

viclafargue
Contributor

@viclafargue viclafargue commented Feb 13, 2025

Answers #6310

@viclafargue viclafargue requested review from a team as code owners February 13, 2025 14:04
@viclafargue viclafargue requested review from dantegd and bdice February 13, 2025 14:04
@github-actions github-actions bot added Cython / Python Cython or Python issue CUDA/C++ labels Feb 13, 2025
@viclafargue viclafargue added bug Something isn't working non-breaking Non-breaking change labels Feb 13, 2025
@github-actions github-actions bot removed the Cython / Python Cython or Python issue label Feb 13, 2025
@viclafargue viclafargue added feature request New feature or request and removed bug Something isn't working labels Feb 13, 2025
Member

@cjnolet cjnolet left a comment


LGTM. Thanks for finishing out this change.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (branch-25.04@9c0166a). Learn more about missing BASE report.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-25.04    #6314   +/-   ##
===============================================
  Coverage                ?   67.07%           
===============================================
  Files                   ?      202           
  Lines                   ?    13076           
  Branches                ?        0           
===============================================
  Hits                    ?     8771           
  Misses                  ?     4305           
  Partials                ?        0           


@cjnolet
Member

cjnolet commented Feb 14, 2025

/merge

@rapids-bot rapids-bot bot merged commit 8a5feaa into branch-25.04 Feb 14, 2025
73 checks passed
@jcrist jcrist deleted the 32bits-umap-dispatch branch February 14, 2025 16:28
@jcrist
Member

jcrist commented Feb 17, 2025

This PR introduced an overflow when `nnz_t` is `int` for large `n_rows`. Running with `n_rows=130_000_000, n_neighbors=15` ran fine before this PR, but now hits an overflow somewhere in `_get_graph`. This causes a runtime error, `Invalid input range, passed negative size`, to be raised; a quick grep shows this comes from overflow detection in the thrust codebase. I think the error is hit in `coo_remove_zeros` here. Reverting to always setting `nnz_t` as `uint64_t` fixes things.

Looking through the code, I'm not sure I understand why we'd want to switch on this type at all. No arrays have `nnz_t` type; it's just some local scalars in each method. (Although I think this is a bug: we probably do want the COO types to have a larger index type for bigger inputs, where `n_rows` wouldn't fit in `int`. Currently this is always `int`. Fixing that is non-trivial, though, whereas fixing the `nnz_t` type is.) At least on CPU I wouldn't expect using `int` instead of `uint64_t` to matter for perf here, and the dispatching is both more complicated and has led to an overflow somewhere.

Reproducer:

import numpy as np
from cuml.manifold import UMAP

import rmm

rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

model = UMAP(
    n_components=2,
    n_neighbors=15,
    build_algo="nn_descent",
    init="random",
    build_kwds={"nnd_n_clusters": 150},
)

# data is 130_000_000 x 512, float32
data = np.load("/path/to/large/data.npy", mmap_mode="r")
embeddings = model.fit_transform(data, data_on_host=True)

I don't see where the overflow is occurring exactly (`n * n_neighbors` still fits in an `int`), but it's definitely happening somewhere. Given our release timeline, unless someone can spot a quick fix, I'd prefer to revert this PR so we use `uint64_t` everywhere for now. From Victor's benchmarks before, it doesn't look like this had a measurable perf effect anyway.
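As a quick sanity check on the arithmetic (pure NumPy, using the reproducer's shape — this snippet is illustrative, not from the cuml codebase):

```python
import numpy as np

n_rows, n_neighbors = 130_000_000, 15
INT32_MAX = np.iinfo(np.int32).max  # 2_147_483_647

# n * n_neighbors does still fit in a signed 32-bit int...
prod = n_rows * n_neighbors  # 1_950_000_000
assert prod <= INT32_MAX

# ...but only barely: any constant factor >= 2 pushes it past the limit,
# and int32 arithmetic then wraps around to a negative value, which is
# what thrust's overflow detection reports as a negative size.
wrapped = np.array([prod], dtype=np.int32) * np.int32(2)
assert wrapped[0] == -394_967_296
```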

@viclafargue
Contributor Author

Thanks for noticing and reporting the issue. The crash happens at a `thrust::max_element` call over here. The reason is that the `nnz` value can reach a theoretical maximum of not `n * n_neighbors` but `2 * n * n_neighbors` (minus removed zeroes), since the COO matrix is symmetrized here. It looks like the tests I ran with `n * n_neighbors > std::numeric_limits<int32_t>::max()` did not run into a crash because some zeroes are removed in the fuzzy simplicial set. The appropriate fix is to modify the condition in the function that decides the dispatching. I opened a PR for this: #6330.
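To make the bound concrete, here is a minimal sketch of the corrected dispatch condition. The helper name is hypothetical (the real fix lives in #6330 and is in C++); the point is only that the post-symmetrization bound, not `n * n_neighbors`, is what must fit in `int32`:

```python
INT32_MAX = 2**31 - 1  # std::numeric_limits<int32_t>::max()

def needs_64bit_nnz(n_rows: int, n_neighbors: int) -> bool:
    """Hypothetical dispatch check: after symmetrization the COO matrix can
    hold up to 2 * n_rows * n_neighbors non-zeros (before zero removal),
    so that product is what must fit in a signed 32-bit int."""
    return 2 * n_rows * n_neighbors > INT32_MAX

# The reproducer's shape sits exactly in the gap the original condition
# missed: n * k fits in int32, but 2 * n * k does not.
assert 130_000_000 * 15 <= INT32_MAX
assert needs_64bit_nnz(130_000_000, 15)
```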

@jcrist
Member

jcrist commented Feb 18, 2025

Thanks for tracking this down (and now that you point it out, I do remember the 2x factor post-symmetrize, but clearly both of us had forgotten it when looking at this dispatch code earlier).

  • I'm still skeptical that the dispatch actually matters for perf here, and it does add complexity. Since it's already in, fixing the condition is the easier fix for now, but in the long run are we sure the complexity is beneficial?
  • How did you track down where the error was being raised? I tried to debug this a bit but was unable to find an easy way to see where the overflow was hit. Looking to pick up any tips/skills you have for debugging cuml here.

@viclafargue
Contributor Author

Agree that the templating and dispatching mechanism might add unnecessary complexity in many places. The truly critical parts are:

  1. The CUDA kernels that perform the optimization, since they run at every epoch and might overuse registers (causing spilling to local memory, if my understanding is correct)
  2. Some of the calls to RAFT utilities, even though COO matrices use a `uint64_t` nnz by default

It might be interesting to do a follow-up PR that simplifies things by defaulting most things to `uint64_t`, or even removes the templating altogether.

Regarding tracking down errors, it is possible to build with debug symbols; otherwise, a simpler approach is to add print statements. It can also be useful to call `RAFT_CUDA_TRY(cudaDeviceSynchronize())` to pinpoint exactly where a CUDA call failed, since kernel launches are asynchronous and errors otherwise surface at a later call.
