
[BUG] T-SNE freezing in benchmarks #2358

Closed
cjnolet opened this issue Jun 1, 2020 · 4 comments · Fixed by #2565

Labels
bug Something isn't working

cjnolet (Member) commented Jun 1, 2020

The T-SNE benchmark is locking up, sometimes after the first benchmark and sometimes after the second. From time to time, it locks up before any of the benchmarks complete.

I added some debug statements and determined that it's locking up in the C++ layer, but I have yet to pinpoint exactly where. The hang occurs in both the 0.14 and 0.15 environments.

cjnolet added the "bug" and "? - Needs Triage" labels on Jun 1, 2020
cjnolet changed the title from "[BUG] T-SNE freezing in benchmarks notebook" to "[BUG] T-SNE freezing in benchmarks" on Jun 10, 2020
cjnolet self-assigned this on Jun 11, 2020
cjnolet removed the "? - Needs Triage" label on Jun 11, 2020
cjnolet (Member, Author) commented Jun 11, 2020

I have a minimal reproducible example using branch-0.15:

from cuml.common import logger
from cuml.benchmark.runners import SpeedupComparisonRunner
from cuml.benchmark.algorithms import algorithm_by_name

# Run the TSNE benchmark with trace-level logging so the C++ layer
# reports how far it gets before the hang
algo = algorithm_by_name("TSNE")
algo.cuml_args["verbose"] = logger.level_trace

# blobs datasets of 2**14..2**16 rows at 32 and 256 dimensions,
# three repetitions each, GPU only (run_cpu=False)
runner = SpeedupComparisonRunner(
    bench_rows=[2**x for x in range(14, 17)],
    bench_dims=[32, 256],
    dataset_name="blobs",
    input_type="numpy",
    n_reps=3,
).run(algo, verbose=True, run_cpu=False)

At first, it appears as if the algorithm is executing successfully until it suddenly locks up. Here's an example output:

[D] [14:03:06.478284] cuml/common/logger.cpp:2563 Learning rate is adaptive. In TSNE paper, it has been shown that as n->inf, Barnes Hut works well if n_neighbors->30, learning_rate->20000, early_exaggeration->24.
[D] [14:03:06.478329] cuml/common/logger.cpp:2563 cuML uses an adpative method.n_neighbors decreases to 30 as n->inf. Likewise for the other params.
[D] [14:03:06.478350] cuml/common/logger.cpp:2563 New n_neighbors = 62, learning_rate = 10922.666666666666, exaggeration = 24.0
[D] [14:03:06.478372] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:57 Data size = (32768, 256) with dim = 2 perplexity = 30.000000
[W] [14:03:06.478379] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[D] [14:03:06.478387] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:73 Getting distances.
[D] [14:03:07.088272] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:86 Now normalizing distances so exp(D) doesn't explode.
[D] [14:03:07.098275] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:94 Searching for optimal perplexity via bisection search.
[D] [14:03:07.123687] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:101 Perplexity sum = 32768.000000
[D] [14:03:07.143066] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:69 N_nodes = 81919 blocks = 80
[D] [14:03:07.152073] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:154 Start gradient updates!
[D] [14:03:13.751098] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:256 SymmetrizeTime = 187 (0)
DistancesTime = 6228 (10)
NormalizeTime = 129 (0)
PerplexityTime = 675 (1)
BoundingBoxKernel_time = 339 (1)
ClearKernel1_time  = 276 (0)
TreeBuildingKernel_time  = 2934 (5)
ClearKernel2_time  = 159 (0)
SummarizationKernel_time  = 1652 (3)
SortKernel_time  = 2379 (4)
RepulsionTime  = 29447 (47)
Reduction_time  = 77 (0)
attractive_time  = 17619 (28)
IntegrationKernel_time = 118 (0)
TOTAL TIME = 62219



I'm not even able to gracefully abort the Python shell when this locks up.

cjnolet (Member, Author) commented Jun 11, 2020

Just had another freeze, in a different place this time:

[D] [14:13:26.644823] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:299 Function Returning
[D] [14:13:26.646983] cuml/common/logger.cpp:2563 Learning rate is adaptive. In TSNE paper, it has been shown that as n->inf, Barnes Hut works well if n_neighbors->30, learning_rate->20000, early_exaggeration->24.
[D] [14:13:26.647013] cuml/common/logger.cpp:2563 cuML uses an adpative method.n_neighbors decreases to 30 as n->inf. Likewise for the other params.
[D] [14:13:26.647034] cuml/common/logger.cpp:2563 New n_neighbors = 62, learning_rate = 10922.666666666666, exaggeration = 24.0
[D] [14:13:26.647054] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:57 Data size = (32768, 32) with dim = 2 perplexity = 30.000000
[W] [14:13:26.647070] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[D] [14:13:26.647078] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:73 Getting distances.
[D] [14:13:27.163649] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:86 Now normalizing distances so exp(D) doesn't explode.
[D] [14:13:27.180716] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:94 Searching for optimal perplexity via bisection search.
[D] [14:13:27.283234] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:101 Perplexity sum = 32768.000000
[D] [14:13:27.311925] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:69 N_nodes = 81919 blocks = 80
[D] [14:13:27.325307] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:154 Start gradient updates!

JohnZed (Contributor) commented Jun 11, 2020

Need checked-in stress tests that run TSNE multiple times back to back.
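
A minimal sketch of what such a check-in could look like, assuming pytest, cuml.manifold.TSNE, and sklearn's make_blobs; the sizes, repetition count, and test name here are illustrative, not from the repo:

import pytest
from sklearn.datasets import make_blobs
from cuml.manifold import TSNE

@pytest.mark.parametrize("n_rows", [2**14, 2**15, 2**16])
def test_tsne_repeated_runs(n_rows):
    X, _ = make_blobs(n_samples=n_rows, n_features=32, centers=10,
                      random_state=42)
    # Fit the same data several times back to back; a hang or a NaN-filled
    # embedding here would reproduce the freeze reported above.
    for _ in range(3):
        embedding = TSNE(n_components=2).fit_transform(X)
        assert embedding.shape == (n_rows, 2)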

cjnolet removed their assignment on Jun 11, 2020
cjnolet (Member, Author) commented Jun 11, 2020

Strangely, cuda-memcheck is also not reporting any memory errors.

zbjornson added a commit to zbjornson/cuml that referenced this issue on Jul 29, 2020:

The denominator equation is just the squared Euclidean distance. Instead of computing the norms in one kernel and storing them, we can (a) save nRow * sizeof(float) bytes of memory, (b) save global loads/stores, and (c) eliminate a source of FP error that's causing lockups (see linked issues).

Per the code comment, this still includes a guard in case there are other sources of NaNs upstream. It compiles to just one `setp.ltu` instruction, so it is essentially free.

Ref rapidsai#2358
Ref rapidsai#2565
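
For illustration, here is a small numpy sketch (not the cuML kernel itself) of the FP hazard that commit describes: expanding the squared distance as ||a||^2 - 2 a·b + ||b||^2 from precomputed norms can come out slightly negative for nearly identical points, and a subsequent sqrt turns that into NaN, whereas the direct form (a - b)·(a - b) is always non-negative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50)).astype(np.float32)

# Pairwise squared distances via precomputed norms:
# D[i, j] = ||x_i||^2 - 2 x_i.x_j + ||x_j||^2
sq_norms = np.einsum('ij,ij->i', X, X)
D = sq_norms[:, None] - 2.0 * (X @ X.T) + sq_norms[None, :]

# Mathematically D >= 0 everywhere, but in float32 the cancellation between
# the three terms typically leaves small negative entries near the diagonal,
# and sqrt of a negative produces NaN (with a RuntimeWarning).
print(D.min())            # typically a small negative number, e.g. ~ -1e-5
print(np.sqrt(D.min()))   # nan

A NaN like this flowing into the attractive-force denominator is the kind of value the guard in the commit protects against.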