Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for multicore tsne #16

Closed
mikejiang opened this issue Mar 27, 2017 · 29 comments
Closed

support for multicore tsne #16

mikejiang opened this issue Mar 27, 2017 · 29 comments

Comments

@mikejiang
Copy link

Any plan to add support for parallel tsne? https://github.com/DmitryUlyanov/Multicore-TSNE

@jkrijthe
Copy link
Owner

Thank you for the suggestion. I was not planning on it, but it could be a nice addition. The project you linked to seems to require OpenMP, which I think an R package can not currently assume is present on all platforms, particularly macOS. Do you have any suggestions on a different way to get this to work?

@mikejiang
Copy link
Author

@jkrijthe
Copy link
Owner

I've been trying to implement the changes from https://github.com/DmitryUlyanov/Multicore-TSNE in a separate branch, but so far, I have not been able to get large improvements (only a few per cent improvement), so this is going to take some more testing to figure out whether the performance improvement is worth it.

@mikejiang
Copy link
Author

I implemented my own based on your R wrapper https://github.com/RGLab/Rtsne.multicore
which already sees 50% time saving with iris data

@jkrijthe
Copy link
Owner

I would expect most of those improvements to come from the use of quadtree.cpp, how big are the improvements from using OpenMP?

@mikejiang
Copy link
Author

I am only testing at R wrapper level. But you might be rightsince changing num_threads doesn't seem to make much difference.

@mikejiang
Copy link
Author

@DmitryUlyanov may be able to weight in on this

@mikejiang
Copy link
Author

mikejiang commented Mar 29, 2017

@gfinak, According to its document, it only speeds up the scenario where data has D>>N, yet majority of our use cases are rows >> columns, thus the major bottle neck at step 2 (i.e. Learning embedding) does not gain much from this multicore version

$ ./tsne_cpp 20 0.9
Using no_dims = 2, perplexity = 30.000000, and theta = 0.5
Computing input similarities...
Building tree...
Done in 0.2214 seconds (sparsity = 0.707175)!
Learning embedding...
Iteration 50: error is 45.849951 (50 iterations in 0.217287 seconds)
Iteration 100: error is 44.965177 (50 iterations in 0.160154 seconds)
...
Iteration 999: error is 0.123850 (50 iterations in 0.234315 seconds)
Fitting performed in 3.220189 seconds.
done

@jkrijthe
Copy link
Owner

jkrijthe commented Jul 7, 2017

I was finally able to have a look at this. After some experiments I decided to integrate the suggested changes in https://github.com/rappdw/tsne, to be able to keep most of the functionality of Rtsne intact, while providing a speed up. The main functionality that I lost is the ability for the user to specify an arbitrary output dimensionality at run time, other than those that are defined at compile time (1,2 and 3). While this could be added back later, I hope those three cases cover most of the use cases of tsne.

The changes are in this branch: https://github.com/jkrijthe/Rtsne/tree/openmp.

Benchmarks

Iris (149x4)

On iris I get about a 3x speedup on a single core, and a little help from using multiple threads (I have only 2 physical cores in this machine, so results for more than 2 threads may not be representative, as I am not sure what effect HyperThreading has):

New version:

Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
tsne_out <- Rtsne(mat, num_threads = 4) 268.3007 272.7952 288.3637 293.0806 297.0574 305.2486    10
tsne_out <- Rtsne(mat, num_threads = 3) 261.4605 263.5789 276.8929 272.7442 286.3434 309.8314    10
tsne_out <- Rtsne(mat, num_threads = 2) 242.5431 252.5582 284.0321 290.3844 300.3660 337.2284    10
tsne_out <- Rtsne(mat, num_threads = 1) 308.2431 314.8977 335.4389 323.9315 357.9258 374.1955    10

Old version (always only effectively uses 1 core):

  Unit: milliseconds
expr       min       lq     mean   median       uq      max neval
tsne_out <- Rtsne(mat, num_threads = 4)  999.5290 1032.207 1048.617 1045.712 1077.219 1100.675    10
tsne_out <- Rtsne(mat, num_threads = 3) 1018.1044 1035.353 1062.052 1050.393 1084.681 1130.307    10
tsne_out <- Rtsne(mat, num_threads = 2)  988.6694 1005.776 1043.140 1022.328 1072.930 1135.645    10
tsne_out <- Rtsne(mat, num_threads = 1) 1020.6595 1031.474 1036.125 1037.058 1043.347 1051.902    10

MNIST digits (10000x784)

Old 1 thread : 135.835 s
New 1 thread :  75.803 s
New 2 threads:  58.209 s
New 4 threads:  57.580 s

Before integrating this into the main branch I still need to check to what extend results are reproducible when different numbers of threads are used and to make sure I understand all the requirements to include OpenMP code in a CRAN package.

@rgranit
Copy link

rgranit commented Aug 9, 2018

So this feature is now merged? is there a flag I should one to utilize it?

@jkrijthe
Copy link
Owner

Unfortunately, no, it is still in the openmp branch, as I got caught up in other things during testing. Because of this, it has not been thoroughly tested yet, so I have not merged the changes. Another minor issue is that the change no longer allows for embeddings of dimension other than 1,2 or 3, breaking earlier possible (but never recommended) behaviour.

So to get the speed-up you currently have to install the version in the openmp branch.

I'm currently not sure when I will have time to test/merge into the main branch and send it to CRAN, but hopefully I can get to it soon. In the meantime, any testing/improvements are most welcome.

@LTLA
Copy link
Contributor

LTLA commented Oct 4, 2018

It seems that the changes on master (currently on 2fbc1ca) are not consistent when you change the number of threads, or even across runs. When I execute this code multiple times, I get different results:

library(Rtsne)
iris_unique <- unique(iris) # Remove duplicates
iris_matrix <- as.matrix(iris_unique[,1:4])

set.seed(42) # Setting a seed doesn't help here, if I execute this again.
tsne_out <- Rtsne(iris_matrix, num_threads=2, verbose=TRUE) # Run TSNE
plot(tsne_out$Y,col=iris_unique$Species)

The culprit seems to be TSNE<NDims>::computeGradient; commenting out the #pragma gives the consistent results across runs. I suspect that the reduction in the openMP'd for loop does not guarantee addition to sum_Q in the same order in a multi-threaded application compared to a single-threaded application, which results in differences in rounding-off errors that are amplified across iterations.

I don't know if there is any openMP setting that can solve this - static scheduling will provide consistency across runs but not necessarily the same results as the single-threaded application. The simplest solution would just be to save everything into an array and sum across the array in single-threaded mode.

@jkrijthe
Copy link
Owner

jkrijthe commented Oct 4, 2018

Do you get consistency across different numbers of threads with the solution you propose? And what would be the difference in terms of performance?

@LTLA
Copy link
Contributor

LTLA commented Oct 4, 2018

Some testing suggests that the proposed solution restores consistency to the application across runs and across numbers of threads. I've posted the modified function below (can make a PR explicitly if you want):

// Compute gradient of the t-SNE cost function (using Barnes-Hut algorithm)
template <int NDims>
void TSNE<NDims>::computeGradient(double* P, unsigned int* inp_row_P, unsigned int* inp_col_P, double* inp_val_P, double* Y, int N, int D, double* dC, double theta)
{
    // Construct space-partitioning tree on current map
    SPTree<NDims>* tree = new SPTree<NDims>(Y, N);

    // Compute all terms required for t-SNE gradient
    double* pos_f = (double*) calloc(N * D, sizeof(double));
    double* neg_f = (double*) calloc(N * D, sizeof(double));
    if(pos_f == NULL || neg_f == NULL) { Rcpp::stop("Memory allocation failed!\n"); }
    tree->computeEdgeForces(inp_row_P, inp_col_P, inp_val_P, N, pos_f);

    // Storing the output to sum in single-threaded mode; avoid randomness in rounding errors.
    std::vector<double> output(N);
    #pragma omp parallel for schedule(guided)
    for (int n = 0; n < N; n++) {
      output[n]=tree->computeNonEdgeForces(n, theta, neg_f + n * D);
    }

    double sum_Q = .0;
    for (int n=0; n<N; ++n) {
        sum_Q += output[n];
    }

    // Compute final t-SNE gradient
    for(int i = 0; i < N * D; i++) {
        dC[i] = pos_f[i] - (neg_f[i] / sum_Q);
    }

    free(pos_f);
    free(neg_f);
    delete tree;
}

The performance should not be much different; there's an extra memory allocation per call of computeGradient, which can be avoided by having output as a TSNE member variable. I guess there's also the possibility for cache misses when the computed value in each iteration needs to be stored in output, but this might be offset by the reduced complexity of the loop itself when reduction is no longer required. I'm not sure how these two factors would play off against each other - probably negligible in both cases.

@jkrijthe
Copy link
Owner

jkrijthe commented Oct 4, 2018

Thanks. PR would be nice, but I can also commit it if you prefer that.

@kbarylyuk
Copy link

Hi, I have tried using a value greater than 1 with the num_threads parameter but I don't seem to see any speed up when theta = 0. I do see a decrease in the running time when num_threads is changed from 1 to 2 (no further speed up when using 3 or 4 threads on my 4-core computer) if theta = 0.5. Also, I see only one thread in my Activity Monitor if I use, for example, theta = 0, num_threads = 2, but there are multiple threads when theta = 0.5.
Am I correct in thinking that using multiple threads only works at the Barnes-Hut approximation step?

@jkrijthe
Copy link
Owner

jkrijthe commented Dec 3, 2018

Yes, that is correct: currently only the approximate implementation can use multiple threads, while the exact does not.

@cpenaloza
Copy link

I implemented my own based on your R wrapper https://github.com/RGLab/Rtsne.multicore
which already sees 50% time saving with iris data

I've found that Rtsne.multicore is faster than using multiple threads within Rtsne,

library(microbenchmark)
library(Rtsne)
library(Rtsne.multicore)

microbenchmark(
   tout <- Rtsne.multicore(iris, perplexity = 10, num_threads = 16, check_duplicates = F), 
   tout <- Rtsne(iris, perplexity = 10, num_threads = 16, check_duplicates = F), 
   times = 10)

#>                      min        lq     mean    median      uq    max  neval cld
#> Rtsne.multicore   181.7699 190.2749 193.4733 191.6511 198.0421 206.6607    10  a 
#> Rtsne             213.6623 219.0838 234.7819 229.6211 238.5277 305.4631    10   b

Do you (@jkrijthe @mikejiang) know why this is happening?
Why is Rtsne.multicore deprecated if it is faster?
Which should I be using?

Please excuse the newbie questions and thank you for all your great work!

@LTLA
Copy link
Contributor

LTLA commented Dec 5, 2018

Without looking at the Rtsne.multicore code, I can't say too much. But we did have to modify some of the OpenMP code in Rtsne to avoid stochasticity from race conditions; this may explain the slight slow-down, possibly due to increased memory management in each thread. I am planning to overhaul the C++ back-end to improve upon some of the hacky workarounds that I put in - with @jkrijthe's blessing, of course.

@LTLA
Copy link
Contributor

LTLA commented Dec 8, 2018

An update on my previous post: it seems like what we have now is probably the best we're going to get. Rtsne.multicore uses reduction for the summation, which is fast but leads to irreproducibility issues across multiple runs even when the seed is set - see my Oct 4 comment above.

On a separate note, I tried to speed it up by moving some of the memory allocations out of the loops, but to my surprise, this had no effect at all (compiling on Rdevel with Clang 6.0). This result was quite unexpected, I'd always thought that performing repeated allocations within a loop was a Bad Thing. But... there you go. That was going to be my major speed-up strategy; so much for that.

In any case, the side product of this little misadventure was a more C++-idiomatic version of tsne.cpp (the original bhtsne code was effectively written as C with classes), which also avoids an unnecessary copy of the input data set during the NN search. It uses OpenMP more widely but this doesn't make much of a difference as most time seems to be spent in the serial SPTree construction per training iteration. As such, the modified version runs slightly faster/slower than the current Rtsne, depending on the mood of my laptop. Check out the fork if you want to play around with it.

@jkrijthe
Copy link
Owner

jkrijthe commented Dec 14, 2018 via email

@cpenaloza
Copy link

Thank you @LTLA!
I've switched to Rtsne, I need reproducibility.

However, I'm running into other problems and I'm not sure if I just have to keep scaling up computing power or if there is some "remove memory allocations out of loops" kind of fix.

Original data set is 300+ Million rows (5 dim), but I've been sub-sampling. The largest sample I've successfully run is ~50 K rows. Any higher and I get C stack usage errors. This was the latest one, testing a ~350 K row sub-sample (ulimit is set to "unlimited" and I'm running it on a 32 core, 244 GB RAM, and 240 GB HD instance).

Error: C stack usage 761581163188 is too close to the limit

Any idea how I can make this work?

@LTLA
Copy link
Contributor

LTLA commented Dec 19, 2018

It's probably a stack overflow caused by recursion, either in vptree.h or sptree.h. Is this occuring during the nearest-neighbour search or after it? Try seeing where the error occurs with verbose=TRUE.

I'm a bit surprised, though; the nodes of the tree are created via new so they shouldn't store their data on the stack. Which suggests that the stack frames are the problem... what does ulimit -s say?

@cpenaloza
Copy link

$ ulimit -s
8192
DOH! My bad @LTLA... I should have set ulimit -s unlimited in every call to screen. I'm running larger samples now... successfully so far (360 K+ rows), now running 3 M+ rows.
THANK YOU!

@jkrijthe
Copy link
Owner

Good! Since the main changes are in master now, I'm closing this issue. Specific suggestions for further improvements are, of course, welcome. Thanks all!

@bioturing
Copy link

Hi, forks

I still get the error C stack usage is too close to the limit when doing t-SNE for a large dataset 210,614x30 (the PCA matrix of a single cell dataset. File attached: https://drive.google.com/open?id=180iI8W49rgLPvgt6-GbhvmrwnyOWgk0t). My machine has 90GB of RAM. Below are the specs of the machine:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

I've already set the ulimit -s to unlimited. Any suggestion is appreciated. @LTLA @jkrijthe

@LTLA
Copy link
Contributor

LTLA commented Aug 6, 2019

I'm afraid I don't have much more guidance to give here. If ulimit -s is truly unlimited (and you restarted R in this unlimited state), then there shouldn't have been issues. Unless you are exhausting the "unlimited" stack in terms of its actual physical size, but that seems unlikely.

To give some background, the stack errors are most likely caused by the recursions during the VP tree nearest neighbors search and the SP tree calculation of the forces. For huge datasets, the tree is deep enough so that it blows the stack when it tries to recurse. Refactoring both of these to iterate rather than recurse would be... possible, but a real chore, as recursion is a really natural way to work with trees. I don't know whether anyone would have the appetite for this.

@jkrijthe
Copy link
Owner

jkrijthe commented Aug 7, 2019

I'm afraid I do not have any suggestions here.

@LTLA
Copy link
Contributor

LTLA commented Aug 7, 2019

Having thought about it a bit more, here's a long shot at solving the problem.

If the stack error is occurring during the VP tree search, you should be able to avoid it by supplying your own NN results. Have a look at scater::runTSNE to see how it's done. For example, you can compute NNs using an approximate method (e.g., Annoy, which still uses trees but these should be much shallower), then pass the NN indices and distances to Rtsne_neighbors.

You're out of luck if the stack error occurs in the SP tree. That said, I have just looked at the SP tree code and the recursion is not as bad as I thought - it just occurs at one function, and it may be a relatively straightforward weekend job to refactor this into an iterative algorithm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants