[BUG] T-SNE freezing in benchmarks #2358
I have a minimal reproducible example:

```python
from cuml.common import logger
from cuml.benchmark.runners import SpeedupComparisonRunner
from cuml.benchmark.algorithms import algorithm_by_name

algo = algorithm_by_name("TSNE")
algo.cuml_args["verbose"] = logger.level_trace

runner = SpeedupComparisonRunner(
    bench_rows=[2**x for x in range(14, 17)],
    bench_dims=[32, 256],
    dataset_name="blobs",
    input_type="numpy",
    n_reps=3,
).run(algo, verbose=True, run_cpu=False)
```

At first, it appears as if the algorithm is executing successfully, until it suddenly locks up. Here's an example output:
I'm not even able to gracefully abort the Python shell when this locks up.
Just had another freeze in a different place.
Need checked-in stress tests that run TSNE multiple times back to back.
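Such a stress test could look roughly like the sketch below. This is a hypothetical harness, not the test that was checked in: the `stress_test` helper and `_DummyEmbedder` stand-in are my own names, and the dummy is only there so the loop structure is runnable without a GPU; in practice the factory would be something like `lambda: cuml.manifold.TSNE(n_components=2)`.

```python
import numpy as np

def stress_test(make_estimator, X, n_runs=10):
    """Run fit_transform back to back n_runs times, checking each result.

    Intended (unverified) usage against cuML:
        stress_test(lambda: cuml.manifold.TSNE(n_components=2), X)
    """
    outputs = []
    for i in range(n_runs):
        est = make_estimator()  # fresh estimator each run, like the benchmarks do
        emb = est.fit_transform(X)
        # A hang would stall here; a NaN-producing run is caught explicitly.
        assert np.all(np.isfinite(emb)), f"non-finite embedding on run {i}"
        outputs.append(emb)
    return outputs

# Trivial stand-in estimator so the harness itself runs without cuml:
class _DummyEmbedder:
    def fit_transform(self, X):
        return np.asarray(X, dtype=np.float32)[:, :2]

X = np.random.default_rng(0).standard_normal((100, 8)).astype(np.float32)
runs = stress_test(_DummyEmbedder, X, n_runs=3)
```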
Strangely,
The denominator equation is just the squared Euclidean distance. Instead of computing the norms in one kernel and storing them, we can (a) save nRow * sizeof(float) bytes of memory, (b) save global loads/stores, and (c) eliminate a source of FP error that's causing lockups (see linked issues). Per code comment, this still includes a guard in case there are other sources of NaNs upstream. This compiles to just one `setp.ltu` instruction, so it is essentially free. Ref rapidsai#2358 Ref rapidsai#2565
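The numerical issue the commit describes can be illustrated in NumPy (this is only an illustration of the floating-point behavior, not the actual CUDA kernel; the function names are mine). The precomputed-norms form `||x||² + ||y||² − 2·x·y` suffers catastrophic cancellation for near-identical points and can come out slightly negative, which downstream math can turn into NaNs; the direct form `Σ(xᵢ − yᵢ)²` is non-negative by construction, and a single-comparison guard (the `setp.ltu` mentioned above) clamps any remaining bad values.

```python
import numpy as np

def sq_dist_via_norms(x, y):
    # Precomputed-norms formulation: can go slightly negative for
    # near-identical points due to cancellation in float32.
    return float(np.dot(x, x) + np.dot(y, y) - 2.0 * np.dot(x, y))

def sq_dist_direct(x, y):
    # Direct formulation: a sum of squares, non-negative by construction.
    d = x - y
    return float(np.dot(d, d))

rng = np.random.default_rng(42)
x = rng.standard_normal(32).astype(np.float32)
y = rng.standard_normal(32).astype(np.float32)

a = sq_dist_via_norms(x, y)
b = sq_dist_direct(x, y)

# Guard analogous to the one described in the commit: a single
# comparison clamping negative/invalid values before further use.
a_guarded = a if a >= 0.0 else 0.0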
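The numerical issue the commit describes can be illustrated in NumPy (this is only an illustration of the floating-point behavior, not the actual CUDA kernel; the function names are mine). The precomputed-norms form `||x||² + ||y||² − 2·x·y` suffers catastrophic cancellation for near-identical points and can come out slightly negative, which downstream math can turn into NaNs; the direct form `Σ(xᵢ − yᵢ)²` is non-negative by construction, and a single-comparison guard (the `setp.ltu` mentioned above) clamps any remaining bad values.

```python
import numpy as np

def sq_dist_via_norms(x, y):
    # Precomputed-norms formulation: can go slightly negative for
    # near-identical points due to cancellation in float32.
    return float(np.dot(x, x) + np.dot(y, y) - 2.0 * np.dot(x, y))

def sq_dist_direct(x, y):
    # Direct formulation: a sum of squares, non-negative by construction.
    d = x - y
    return float(np.dot(d, d))

rng = np.random.default_rng(42)
x = rng.standard_normal(32).astype(np.float32)
y = rng.standard_normal(32).astype(np.float32)

a = sq_dist_via_norms(x, y)
b = sq_dist_direct(x, y)

# Guard analogous to the one described in the commit: a single
# comparison clamping negative/invalid values before further use.
a_guarded = a if a >= 0.0 else 0.0
```

For well-separated points the two formulations agree closely; the direct form simply removes the cancellation failure mode for coincident points.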
The T-SNE benchmark is locking up, sometimes after the first and sometimes after the second benchmark. From time to time, it locks up before any of the benchmarks get completed.
I added some debug statements and determined that it's locking up in the C++ layer, but I have yet to pinpoint exactly where. It's locking up in both the 0.14 and 0.15 environments.