
[BUG] T-SNE freezing in benchmarks #2358

Closed
cjnolet opened this issue Jun 1, 2020 · 4 comments · Fixed by #2565

Labels
bug Something isn't working

cjnolet (Member) commented Jun 1, 2020

The T-SNE benchmark is locking up, sometimes after the first benchmark and sometimes after the second. From time to time, it locks up before any of the benchmarks complete.

I added some debug statements and determined that it's locking up in the C++ layer, but I have yet to pinpoint exactly where. The hang occurs in both the 0.14 and 0.15 environments.

cjnolet added the "bug" and "? - Needs Triage" labels on Jun 1, 2020
cjnolet changed the title from "[BUG] T-SNE freezing in benchmarks notebook" to "[BUG] T-SNE freezing in benchmarks" on Jun 10, 2020
cjnolet self-assigned this on Jun 11, 2020
cjnolet removed the "? - Needs Triage" label on Jun 11, 2020
cjnolet (Member, Author) commented Jun 11, 2020

I have a minimal reproducible example using branch-0.15:

from cuml.common import logger
from cuml.benchmark.runners import SpeedupComparisonRunner
from cuml.benchmark.algorithms import algorithm_by_name

# Run the TSNE benchmark with trace-level logging so the C++ layer
# reports how far it gets before the hang
algo = algorithm_by_name("TSNE")
algo.cuml_args["verbose"] = logger.level_trace

# blobs datasets of 2**14..2**16 rows at 32 and 256 dimensions,
# three repetitions each, GPU only (run_cpu=False)
runner = SpeedupComparisonRunner(
    bench_rows=[2**x for x in range(14, 17)],
    bench_dims=[32, 256],
    dataset_name="blobs",
    input_type="numpy",
    n_reps=3,
).run(algo, verbose=True, run_cpu=False)

At first, it appears as if the algorithm is executing successfully until it suddenly locks up. Here's an example output:

[D] [14:03:06.478284] cuml/common/logger.cpp:2563 Learning rate is adaptive. In TSNE paper, it has been shown that as n->inf, Barnes Hut works well if n_neighbors->30, learning_rate->20000, early_exaggeration->24.
[D] [14:03:06.478329] cuml/common/logger.cpp:2563 cuML uses an adpative method.n_neighbors decreases to 30 as n->inf. Likewise for the other params.
[D] [14:03:06.478350] cuml/common/logger.cpp:2563 New n_neighbors = 62, learning_rate = 10922.666666666666, exaggeration = 24.0
[D] [14:03:06.478372] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:57 Data size = (32768, 256) with dim = 2 perplexity = 30.000000
[W] [14:03:06.478379] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[D] [14:03:06.478387] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:73 Getting distances.
[D] [14:03:07.088272] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:86 Now normalizing distances so exp(D) doesn't explode.
[D] [14:03:07.098275] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:94 Searching for optimal perplexity via bisection search.
[D] [14:03:07.123687] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:101 Perplexity sum = 32768.000000
[D] [14:03:07.143066] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:69 N_nodes = 81919 blocks = 80
[D] [14:03:07.152073] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:154 Start gradient updates!
[D] [14:03:13.751098] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:256 SymmetrizeTime = 187 (0)
DistancesTime = 6228 (10)
NormalizeTime = 129 (0)
PerplexityTime = 675 (1)
BoundingBoxKernel_time = 339 (1)
ClearKernel1_time  = 276 (0)
TreeBuildingKernel_time  = 2934 (5)
ClearKernel2_time  = 159 (0)
SummarizationKernel_time  = 1652 (3)
SortKernel_time  = 2379 (4)
RepulsionTime  = 29447 (47)
Reduction_time  = 77 (0)
attractive_time  = 17619 (28)
IntegrationKernel_time = 118 (0)
TOTAL TIME = 62219



I'm not even able to gracefully abort the Python shell when this locks up.

cjnolet (Member, Author) commented Jun 11, 2020

Just had another freeze, in a different place this time:

[D] [14:13:26.644823] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:299 Function Returning
[D] [14:13:26.646983] cuml/common/logger.cpp:2563 Learning rate is adaptive. In TSNE paper, it has been shown that as n->inf, Barnes Hut works well if n_neighbors->30, learning_rate->20000, early_exaggeration->24.
[D] [14:13:26.647013] cuml/common/logger.cpp:2563 cuML uses an adpative method.n_neighbors decreases to 30 as n->inf. Likewise for the other params.
[D] [14:13:26.647034] cuml/common/logger.cpp:2563 New n_neighbors = 62, learning_rate = 10922.666666666666, exaggeration = 24.0
[D] [14:13:26.647054] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:57 Data size = (32768, 32) with dim = 2 perplexity = 30.000000
[W] [14:13:26.647070] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[D] [14:13:26.647078] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:73 Getting distances.
[D] [14:13:27.163649] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:86 Now normalizing distances so exp(D) doesn't explode.
[D] [14:13:27.180716] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:94 Searching for optimal perplexity via bisection search.
[D] [14:13:27.283234] /home/cjnolet/workspace/cuml/cpp/src/tsne/tsne.cu:101 Perplexity sum = 32768.000000
[D] [14:13:27.311925] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:69 N_nodes = 81919 blocks = 80
[D] [14:13:27.325307] /home/cjnolet/workspace/cuml/cpp/src/tsne/barnes_hut.cuh:154 Start gradient updates!

JohnZed (Contributor) commented Jun 11, 2020

Need checked-in stress tests that run TSNE multiple times back to back.
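
A minimal sketch of what such a check-in could look like, assuming pytest, cuml.manifold.TSNE, and sklearn's make_blobs; the sizes, repetition count, and test name here are illustrative, not from the repo:

import pytest
from sklearn.datasets import make_blobs
from cuml.manifold import TSNE

@pytest.mark.parametrize("n_rows", [2**14, 2**15, 2**16])
def test_tsne_repeated_runs(n_rows):
    X, _ = make_blobs(n_samples=n_rows, n_features=32, centers=10,
                      random_state=42)
    # Fit the same data several times back to back; a hang or a NaN-filled
    # embedding here would reproduce the freeze reported above.
    for _ in range(3):
        embedding = TSNE(n_components=2).fit_transform(X)
        assert embedding.shape == (n_rows, 2)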

cjnolet removed their assignment on Jun 11, 2020
cjnolet (Member, Author) commented Jun 11, 2020

Strangely, cuda-memcheck is also not reporting any memory errors.

zbjornson added a commit to zbjornson/cuml that referenced this issue on Jul 29, 2020:

The denominator equation is just the squared Euclidean distance. Instead of computing the norms in one kernel and storing them, we can (a) save nRow * sizeof(float) bytes of memory, (b) save global loads/stores, and (c) eliminate a source of FP error that's causing lockups (see linked issues).

Per the code comment, this still includes a guard in case there are other sources of NaNs upstream. It compiles to just one `setp.ltu` instruction, so it is essentially free.

Ref rapidsai#2358
Ref rapidsai#2565
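
For illustration, here is a small numpy sketch (not the cuML kernel itself) of the FP hazard that commit describes: expanding the squared distance as ||a||^2 - 2 a·b + ||b||^2 from precomputed norms can come out slightly negative for nearly identical points, and a subsequent sqrt turns that into NaN, whereas the direct form (a - b)·(a - b) is always non-negative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50)).astype(np.float32)

# Pairwise squared distances via precomputed norms:
# D[i, j] = ||x_i||^2 - 2 x_i.x_j + ||x_j||^2
sq_norms = np.einsum('ij,ij->i', X, X)
D = sq_norms[:, None] - 2.0 * (X @ X.T) + sq_norms[None, :]

# Mathematically D >= 0 everywhere, but in float32 the cancellation between
# the three terms typically leaves small negative entries near the diagonal,
# and sqrt of a negative produces NaN (with a RuntimeWarning).
print(D.min())            # typically a small negative number, e.g. ~ -1e-5
print(np.sqrt(D.min()))   # nan

A NaN like this flowing into the attractive-force denominator is the kind of value the guard in the commit protects against.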