Multi CPU / GPU capabilities? #37
Ran a bench on these params for 500k * 50 data:
Nothing more or less happened for hours. Then I just tried:
And after an hour or so it started using all the processors. |
@snakers4 I don't think there is direct GPU support, but it is listed on the roadmap under no priority. The UMAP source code does make use of numba, which can be used to target the GPU, but you would have to make those changes yourself. The current dataset I use is 120k * 15, but I am running a search algorithm to find the best possible combination of HDBSCAN and UMAP parameters using multiprocessing. If you are using linux, I would open a command line and type
I hope this helps, Ralph |
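For illustration, here is a hypothetical sketch of that kind of multiprocessing parameter search. The grid, the scoring rule (fraction of noise points), and the data shape are assumptions for the example, not the actual script referenced above:

```python
# Hypothetical sketch only: not the script referenced above.
from itertools import product
from multiprocessing import Pool

import hdbscan
import numpy as np
import umap

X = np.random.rand(10_000, 15).astype(np.float32)  # stand-in for the real 120k * 15 data


def score_params(params):
    n_neighbors, min_dist, min_cluster_size = params
    embedding = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist).fit_transform(X)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embedding)
    # Score however makes sense for the task; fraction of noise points is just a placeholder.
    return params, float((labels == -1).mean())


if __name__ == "__main__":
    grid = list(product([15, 50, 100], [0.0, 0.1, 0.5], [25, 100]))
    with Pool(processes=4) as pool:
        results = pool.map(score_params, grid)
    print(sorted(results, key=lambda r: r[1])[:3])
```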
There are a few things to deal with here.

First, for your dataset (500,000 points) you will need code that is currently in a branch not yet merged with master. There are some issues once you get beyond 100,000 points that I have finally resolved but not merged.

Second, it will use vectorized instructions as much as numba will let it -- numba is a library that can take numerical python code and compile it to LLVM, so whatever LLVM and numba can do will be happening. Since I'm still working at the algorithmic level I have not concerned myself with that level of optimization.

Third, if it is using all CPUs and taking a long time, that means the spectral embedding is busy failing since there isn't a large enough eigengap. Try using

Fourth, try to keep

Fifth, I do recommend trying on a sub-sample of your dataset first to make sure things are working before scaling up.

Finally, UMAP ultimately relies on SGD for a portion of the computation. Of course, as these things go it is a bit of a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization, but that is beyond my current knowledge (as it requires careful phrasing of the problem, I suspect). If you would like to look into that we can discuss the possibilities further. |
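To make the sub-sampling suggestion concrete, a minimal sketch (the parameter values here are generic defaults, not the specific settings elided above):

```python
import numpy as np
import umap

X = np.random.rand(500_000, 50).astype(np.float32)  # stand-in for the actual data

# Sanity-check on a random sub-sample before committing hours to the full run.
rng = np.random.default_rng(42)
subset = X[rng.choice(X.shape[0], size=50_000, replace=False)]
embedding_small = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(subset)

# Only once the sub-sample result looks sensible, run on everything.
embedding_full = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
```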
Hi, first of all many thanks for your replies. For this dataset PCA (5000 => 2) + eyeballing the clusters worked fine for some reason. It may also be worth mentioning that a colleague of mine reported running a 2M * 10 dataset some time ago without problems, just using standard params without any tweaking.
I guess I will wait for the code to be merged.
Tried this
Well, your advice helped:
As for the evaluation:
Just visually eyeballing the clusters worked - they are really similar.
I will report the visual composition of the above clusters a bit later.
Well, if you could share some minimum reading (i.e. one paper on the topic this portion of the SGD pipeline is based on) and point to the part of the code with the SGD - if I am up to the task I will think about applying my skills to that. In practice (work / competitions) I have noticed that Adam is better than plain SGD in most cases. Modern high-level frameworks (pytorch) have this implemented + they have GPU support, which may be a nice feature. |
Visually evaluated UMAP embeddings. |
This has been resolved (or should be resolved) as of v0.2.0 (i.e. the branch got merged). With regard to the SGD part of the code you would want |
I have been playing around with testing various combinations of UMAP with multiprocessing in python; feel free to use this script if you want. |
The code looks readable, except for the numba decorator, which is black magic for me. I guess I will just end up comparing your latest release vs. just loading the data to GPU and optimizing using the same loss function. |
Downloaded the latest release via
Used these settings
Ran these benchmarks
I guess there is some stochasticity in this process. Great job, guys! |
Note that those are the gradients of the loss encoded by hand in that SGD. James Melville has a nice piece on UMAP loss here. |
As a side note I would highly recommend looking into datashader as a means to visualise the results rather than relying on KDEs. |
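For reference, a minimal datashader sketch for rendering an embedding without relying on KDEs (the column names, canvas size, and shading choice are arbitrary example values):

```python
import datashader as ds
import datashader.transfer_functions as tf
import numpy as np
import pandas as pd
import umap
from datashader.utils import export_image

X = np.random.rand(100_000, 50).astype(np.float32)  # stand-in for the real data
embedding = umap.UMAP().fit_transform(X)

df = pd.DataFrame(embedding, columns=["x", "y"])
canvas = ds.Canvas(plot_width=800, plot_height=800)
agg = canvas.points(df, "x", "y")                 # per-pixel point counts
img = tf.shade(agg, how="eq_hist")                # histogram equalisation handles dense blobs well
export_image(img, "umap_embedding")               # writes umap_embedding.png
```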
Yup, datashader turned out to be the most inspiring and underrated visualization tool I ever saw. The results are amazing:
Read the paper and the note - looks very interesting. I guess, if so, then porting to GPU may give only a negligible performance boost. |
Of note is the fact that you can embed into higher dimensional spaces than just 2 -- if your goal is clustering rather than visualization then using a higher dimensional embedding can give the process more degrees of freedom to more accurately represent the data. So, for example, embedding to 4 or 8 dimensions may allow for better clustering. It is, at least, worth experimenting with. |
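A small sketch of that suggestion, with the embedding dimension and clustering parameters as purely illustrative choices:

```python
import hdbscan
import numpy as np
import umap

X = np.random.rand(100_000, 50).astype(np.float32)  # stand-in for the real data

# Embed to more than 2 dimensions for clustering (min_dist=0.0 packs clusters tightly)...
clusterable = umap.UMAP(n_components=8, n_neighbors=30, min_dist=0.0).fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=100).fit_predict(clusterable)

# ...and keep a separate 2D embedding purely for visualization.
viewable = umap.UMAP(n_components=2).fit_transform(X)
```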
Ofc, I just presented 2-dimensional results for ease of interpretation |
One nice thing about deep learning libraries is that they're declarative, and will do automatic differentiation. This only requires computing the loss for the input variables, then calling
Out-of-core computation (#62) + deep learning libraries would be easiest with a Dask parameter server (dask/dask-ml#171). |
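As a hedged illustration of that point (not the library's actual optimizer): compute a UMAP-style fuzzy cross-entropy loss on an edge list and let PyTorch's autograd plus Adam do the rest. The edge list, the membership weights, and the a/b curve constants are stand-ins, and the real implementation handles repulsion via negative sampling rather than only through the (1 - w) term on observed edges:

```python
import torch

# Assumed inputs: a precomputed fuzzy graph as an edge list with membership strengths in [0, 1].
n_points, n_edges, dim = 10_000, 200_000, 2
edges = torch.randint(0, n_points, (n_edges, 2))   # stand-in edge list
weights = torch.rand(n_edges)                      # stand-in membership strengths

embedding = torch.randn(n_points, dim, requires_grad=True)
optimizer = torch.optim.Adam([embedding], lr=1e-2)

a, b = 1.577, 0.895  # roughly the curve parameters UMAP fits for its default min_dist

for step in range(200):
    optimizer.zero_grad()
    i, j = edges[:, 0], edges[:, 1]
    d2 = ((embedding[i] - embedding[j]) ** 2).sum(dim=1)
    q = 1.0 / (1.0 + a * d2 ** b)                  # low-dimensional similarity
    # Fuzzy-set cross entropy between graph weights and embedding similarities.
    loss = -(weights * torch.log(q + 1e-8)
             + (1.0 - weights) * torch.log(1.0 - q + 1e-8)).mean()
    loss.backward()                                # autodiff replaces the hand-coded gradients
    optimizer.step()
```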
@snakers4 Any luck porting to pytorch? Would be interested in this. |
So the good news is that Nvidia will have a GPU-accelerated implementation of UMAP in their RAPIDS platform soon, and I believe most of the theoretical hurdles to a multi-core CPU implementation have been resolved; I hope the 0.5 release of UMAP will include multi-core/threaded support. So while we might not quite be there yet, we are certainly getting closer. |
Cool! Are you referencing the implementation at cuML/src/umap@993eb6? It looks like it got merged through rapidsai/cuml#261.
This is through lmcinnes/pynndescent#12, correct? |
Yes, and yes. With the latter I have also come up with a decent multithreaded RP-tree implementation in numba (parallelising over trees, and also within trees, and ideally dask extensible as well). I thought more about the dask extension of lmcinnes/pynndescent#12 and there are a few more hurdles to be overcome to make it support distributed arrays, but I believe Matt Rocklin suggested some potential solutions that can be put in place. The first thing I have to do is get 0.4 out the door so I can start working on these features for 0.5. |
Hi, do you have any plan for a Barnes-Hut approximation using a quad/oct-tree for 2D/3D rather than the RP-tree? It would be great to know. Thank you for the awesome work! |
@khaled-rahman: The RP-trees are in support of NN-descent for (approximate) nearest neighbor computation, for which Barnes-Hut is not applicable. Where t-SNE uses Barnes-Hut is in the optimization step; in contrast, UMAP uses an approximation there via a negative-sampling-based approach, giving an O(N) algorithm instead of the O(N log N) of Barnes-Hut. |
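For intuition, a toy (pure numpy, unoptimized) sketch of that negative-sampling idea: each positive edge gets an attractive update, and repulsion is estimated from a handful of uniformly sampled points rather than aggregated over a Barnes-Hut tree. The gradients follow the low-dimensional similarity curve 1 / (1 + a d^(2b)); the constants and learning-rate handling are simplified relative to the real code:

```python
import numpy as np

a, b = 1.577, 0.895  # roughly UMAP's curve parameters for the default min_dist


def attractive_grad(yi, yj):
    d2 = np.sum((yi - yj) ** 2) + 1e-12
    coeff = -2.0 * a * b * d2 ** (b - 1.0) / (1.0 + a * d2 ** b)
    return coeff * (yi - yj)          # negative coefficient: pulls yi toward yj


def repulsive_grad(yi, yk):
    d2 = np.sum((yi - yk) ** 2)
    coeff = 2.0 * b / ((0.001 + d2) * (1.0 + a * d2 ** b))
    return coeff * (yi - yk)          # positive coefficient: pushes yi away from yk


def sgd_epoch(embedding, edges, lr=1.0, n_negative=5, seed=0):
    """One illustrative epoch over the positive edges of the fuzzy graph."""
    rng = np.random.default_rng(seed)
    n_points = embedding.shape[0]
    for i, j in edges:
        g = attractive_grad(embedding[i], embedding[j])
        embedding[i] += lr * g
        embedding[j] -= lr * g
        # O(n_negative) repulsion per edge => O(N) work per epoch overall.
        for k in rng.integers(0, n_points, size=n_negative):
            embedding[i] += lr * repulsive_grad(embedding[i], embedding[k])
    return embedding
```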
@lmcinnes: Thank you for the quick response. I am studying your tool. What is the running time of the spectral embedding in your code? I see it has a dependency on sklearn.manifold, which says the following: https://scikit-learn.org/stable/modules/manifold.html Though it is done only once, doesn't it still depend on the number of data points? |
For Spark (PySpark) environments, maybe we could rely on the LSH-based Approximate Nearest Neighbor Search (https://spark.apache.org/docs/latest/ml-features#approximate-nearest-neighbor-search), however it only supports Euclidean and Jaccard distances at the moment. @lmcinnes |
@candalfigomoro The NN search is not currently pluggable at the top API level, but it is largely factored out in the code. For now that means you could likely monkey-patch things and define a new |
@lmcinnes Thank you for your reply. Another issue that I see is that, even if we implement a distributed NN search, UMAP currently expects a numpy array as input, so "fitting" out-of-memory datasets (e.g. HDFS distributed datasets with billions of rows) could be a problem. Maybe a workaround (although suboptimal) could be to "fit" UMAP on a random subsample of data and then apply a transform() on the remaining data in a massively parallel way. What do you think about this? |
If you're able to load your data as Dask DataFrames / Arrays, then the "distributed transform" part is pretty well handled by https://ml.dask.org/modules/generated/dask_ml.wrappers.ParallelPostFit.html#dask_ml.wrappers.ParallelPostFit. You'd wrap your `UMAP` estimator in `ParallelPostFit` before calling `fit`. It doesn't affect training at all, just predict, transform, score, etc. That said, distributed training would still be cool, just a bunch more work :) |
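A minimal sketch of that wrapping (assuming dask, dask-ml, and umap-learn are installed; the array sizes and chunking are arbitrary):

```python
import dask.array as da
import numpy as np
import umap
from dask_ml.wrappers import ParallelPostFit

X_train = np.random.rand(50_000, 20)   # a sample that fits in memory for fitting
X_big = da.random.random((10_000_000, 20), chunks=(500_000, 20))  # larger-than-memory data

model = ParallelPostFit(estimator=umap.UMAP(n_neighbors=15))
model.fit(X_train)                     # training is unchanged and runs locally
embedding = model.transform(X_big)     # lazily maps transform over the Dask array's blocks
result = embedding.compute()           # or persist / write out block-wise
```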
@TomAugspurger "Unfortunately", I'm in a Spark (with pyspark) cluster/environment... no Dask for me. My inputs are Spark DataFrames (or RDDs). I can use UDFs to apply transforms, but I don't think there's a straightforward way to support the "fit()" part on out-of-memory datasets. |
@candalfigomoro I am currently using a Spark cluster (pyspark) too. As you do, the fit of UMAP is done locally on the master node. For the transform I use rdd.map() and pass a partial function (functools) that already has the trained UMAP model. So the latter gets shipped to the various nodes in the cluster and the transform is parallelized. Out of curiosity, how did you approach this? Thanks |
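A sketch of that pattern in PySpark. The column names, input path, and 2D output schema are assumptions, a broadcast variable is used here in place of functools.partial, and it assumes the fitted UMAP model pickles cleanly so Spark can ship it to the executors:

```python
import numpy as np
import umap
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fit locally on the driver, on a sample that fits in memory.
sample = np.random.rand(100_000, 50).astype(np.float32)  # stand-in for a collected sample
model = umap.UMAP(n_neighbors=15).fit(sample)
model_bc = spark.sparkContext.broadcast(model)            # shipped once per executor


def transform_partition(rows):
    rows = list(rows)
    if not rows:
        return
    X = np.array([r["features"] for r in rows], dtype=np.float32)  # assumed array column
    embedding = model_bc.value.transform(X)
    for r, e in zip(rows, embedding):
        yield (r["id"], float(e[0]), float(e[1]))                  # assumed id column


df = spark.read.parquet("hdfs:///path/to/data")                    # hypothetical input path
embedded = df.rdd.mapPartitions(transform_partition).toDF(["id", "x", "y"])
```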
@candalfigomoro @mik1904 we just recently added a very similar capability to the GPU-accelerated UMAP in RAPIDS cuML, which uses Dask. https://github.com/rapidsai/cuml/blob/branch-0.14/python/cuml/dask/manifold/umap.py So far, initial evaluations are showing some very positive results, especially as the inference dataset grows very large. We are continuing to work on distributing the training as well, however the embarrassingly parallel inference is showing to be very useful thus far. |
I am having trouble getting UMAP to flood all available CPUs on a Google VM; at various stages it only uses 2-4 of the available 32 cores. Is there anything one can do as of v0.4.x to increase CPU usage? |
Installing the latest version of pynndescent should help some, but even then it will depend, to a degree, on the dataset itself. |
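One knob that may be worth checking, under the assumption that the slow stages are the numba-parallelised ones (the pynndescent nearest-neighbour search in particular), is how many threads numba is allowed to use. This is a hedged suggestion rather than a guaranteed fix, and it requires numba >= 0.49:

```python
import numba
import numpy as np

# Alternatively, set the NUMBA_NUM_THREADS environment variable before starting Python.
numba.set_num_threads(32)

import umap

X = np.random.rand(200_000, 50).astype(np.float32)  # stand-in for the real data
embedding = umap.UMAP().fit_transform(X)
```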
I didn't do it, but I think I'd do something similar to that (maybe I'd use a Pandas UDF instead of rdd.map()). For training on larger-than-memory datasets, maybe we could try something like this:
But it's not trivial. @lmcinnes Do you think it would make sense? |
I think that would make sense. I believe the UMAP implementation in CuML now has some level of support for multi-GPU / distributed computation, so that may also be an option worth investigating. |
Sorry for joining the conversation like this but has anyone here used any of the discussed implementations for multi-million sized feature sets yet? |
I have used this implementation for a (sparse) dataset of size 100,000,000 x 5,500,000 on a very large memory SMP using 64 cores. Obviously a dense dataset of that size would not be possible due to memory constraints, so you would need a distributed implementation to manage that. In terms of the CuML implementation: it has sparse support coming, so at that point it may scale to very large feature sizes. |
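For anyone who wants to try the sparse path at a smaller scale first, a tiny sketch (the shape and density here are deliberately small; UMAP accepts scipy CSR matrices directly):

```python
import numpy as np
import scipy.sparse as sp
import umap

# Tiny stand-in for a very wide, very sparse dataset.
X = sp.random(10_000, 100_000, density=1e-4, format="csr", dtype=np.float32)

embedding = umap.UMAP(n_neighbors=15, metric="cosine").fit_transform(X)
print(embedding.shape)  # (10000, 2)
```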
@lmcinnes
As you may have guessed I have several CPUs and GPUs at hand and I work with high-dimensional data.
Now I am benchmarking a 500k * 5k => 500k * 2 embedding vs. PCA (I need a high-level clustering to filter my data before feeding it further down the pipeline).
So a couple of questions: