
Multi CPU / GPU capabilities? #37

Open
snakers4 opened this issue Jan 26, 2018 · 35 comments

@snakers4
Contributor

snakers4 commented Jan 26, 2018

@lmcinnes
As you may have guessed I have several CPUs and GPUs at hand and I work with high-dimensional data.
Right now I am benchmarking a 500k * 5k => 500k * 2 reduction vs. PCA (I need a high-level clustering to filter my data before feeding it further down the pipeline).

So a couple of questions:

  1. Are there any plans for multi-CPU / GPU support?
  2. Does your implementation utilize vectorized operations? (I am not really sure how embedding methods like t-SNE and UMAP work; I believe they minimize some kind of distance in a high-dimensional space.) If so, can I help?
  3. Did you run benchmarks (as for HDBSCAN) on large and huge datasets? If so, is it feasible to expect 500k * 5k => 500k * 2 to finish in a reasonable time, or should I do PCA => UMAP?
@snakers4
Contributor Author

snakers4 commented Jan 26, 2018

Ran a bench with these params on 500k * 50 data:

# first attempt
reducer = umap.UMAP(
    n_neighbors=200,
    min_dist=0.0,
    n_components=2,
    metric='euclidean'
    )
# second attempt
reducer = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=2,
    metric='euclidean'
    )

Nothing much happened for hours with either setting.

Then I just tried:

reducer = umap.UMAP(
    n_components=2
    )

And after an hour or so it started using all the processors.

@100518832

@snakers4 I don't think there is direct GPU support, but it is listed on the roadmap with no set priority. The UMAP source code does make use of numba, which can be used to target the GPU, but you would have to make those changes yourself.

The current dataset I use is 120k * 15, but I am running a search algorithm to find the best possible combination of HDBSCAN and UMAP parameters using multiprocessing.

If you are using Linux, I would open a terminal and run top, which will show you the top running processes. This will let you know what Python is doing; I am willing to bet that Python will have one process running a CPU at 100% (or close to it) when you use that large a dataset.

I hope this helps,

Ralph

@lmcinnes
Owner

There are a few things to deal with here.

First, for your dataset (500,000) you will need code that is currently in a branch not merged with master. There are some issues once you get beyond 100,000 points that I have finally resolved but not merged.

Second, it will use vectorized instructions as much as numba will let it -- numba is a library that can take numerical Python code and compile it via LLVM, so whatever LLVM and numba can do will be happening. Since I'm still working at the algorithmic level I have not concerned myself with that level of optimization.

Third, if it is using all CPUs and taking a long time ... that means that the spectral embedding is busy failing since there isn't a large enough eigengap. Try using init='random' to avoid this. I'm still trying to figure out the right way to handle that sort of issue.

Fourth, try to keep n_neighbors small; the larger it is, especially for large datasets, the worse the performance. Part of the efficiency of UMAP comes from its ability to make use of much smaller nearest neighbor sets than comparable algorithms.

Fifth, I do recommend trying on a sub-sample of your dataset first to make sure things are working before scaling up.

Finally, UMAP ultimately relies on SGD for a portion of the computation. Of course, as these things go it is a bit of a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization, but that is beyond my current knowledge (as it requires careful phrasing of the problem I suspect). If you would like to look into that we can discuss the possibilities further.

@snakers4
Contributor Author

Hi, first of all many thanks for your replies!

For this dataset PCA (5000 => 2) + eyeballing the clusters worked fine for some reason.
Data science is magic sometimes. I guess the clusters are linearly separable. But this is a good benchmark nevertheless (see my comments below).

It may also be worth mentioning that a colleague of mine reported running a 2M * 10 dataset some time ago without problems, just using the standard params without any tweaking.

First, for your dataset (500,000) you will need code that is currently in a branch not merged with master

I guess I will wait for the code to be merged.

Try using init='random'
try to keep n_neighbors small
trying on a sub-sample of your dataset

Tried this:

import umap

reducer = umap.UMAP(
    n_neighbors=5,
    n_components=2,
    metric='euclidean',
    init='random'
    )

umap_vectors = reducer.fit_transform(pca_vectors[0:50000])

Well, your advice helped:

  • on 50k * 15 sample (500k * 15 => 50k * 15) - 59.0s
  • and on 100k * 15 sample - 1m 55.3s
  • and on 500k * 15 sample - 15m 19s

As for the evaluation:

  1. 500k * 5000 => PCA => 500k * 50 => HDBSCAN performed poorly, producing only 2 bad clusters
  2. PCA produced this:
    https://pics.spark-in.me/upload/b27a448c2f19e8c450dce2c20c57454b.jpg

Just visually eyeballing the clusters worked - they are really similar.

  3. UMAP produced this:
    50k https://pics.spark-in.me/upload/247670b5825c833f3c6964809468758f.png
    100k https://pics.spark-in.me/upload/cb77948b8c17ff262d7f60cd1b124f21.png
    500k https://pics.spark-in.me/upload/0780efa4b681b4290d13017a387ee7b1.png - I guess it broke here

I will report the visual composition of the above clusters a bit later.

Finally, UMAP ultimately relies on SGD for a portion of the computation. Of course, as these things go it is a bit of a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization, but that is beyond my current knowledge (as it requires careful phrasing of the problem I suspect). If you would like to look into that we can discuss the possibilities further.

Well, if you could share some minimal reading (i.e. one paper this portion of the SGD pipeline is based on) and point me to the part of the code with the SGD, then, if I am up to the task, I will think about applying my skills to it.

In practice (work / competitions) I have noticed that Adam is better than plain SGD in most cases. Modern high-level frameworks (PyTorch) have it implemented, plus they have GPU support, which may be a nice feature.

@snakers4
Contributor Author

Visually evaluated the UMAP embeddings.
They are really great. They look like the PCA embeddings, but further refined into a couple of sub-categories.
But as you noticed, 100k+ fails.

@lmcinnes
Owner

This has been resolved (or should be resolved) as of v0.2.0 (i.e. the branch got merged). With regard to the SGD part of the code, you would want the optimize_layout function. The most relevant paper would probably be the LargeVis paper, to which this bears some similarity.

@100518832

@snakers4

import pandas as pd # import pandas
import numpy as np # import numpy

from multiprocessing import Process # import from standard python libs for multi-core threading and processing 
import glob # import to iterate through files in directories 
import os.path # import for creating os path objects
import matplotlib.pyplot as plt # import used to create and save scatter plots
#from mpl_toolkits.mplot3d import Axes3D
from numpy import genfromtxt # numpy import to read txt files as csv's
import umap # the embedding algorithm, UMAP, this is the reduction method (or class) used
#from sklearn.decomposition import PCA 
#from sklearn.preprocessing import MinMaxScaler


def data_grab(path,filename,fraction): # this is the function that grabs the dataset
    
    df = pd.read_csv(str(path)+str(filename)) # reading the dataset from a csv file and making it a pandas dataframe
    df = df.select_dtypes(include=[np.number]) # selecting only the columns that contain floats/ints; columns with NaNs, missing entries and strings are ignored
    df = df.sample(frac=fraction, random_state = 32).reset_index(drop = True) # this takes a sample of our dataset to use for embedding, the random state controls the random seed
    df = df.as_matrix() # this converts the dataframe into matrix form, similar to a numpy ndarray
     
    return df # return our 'fake' ndarray


def UMAP(n_neighbors, min_dist, metric, input_data): # this is the embedding function
    
    global fraction, filename # we need to keep the filename of our dataset for folder creation, same for fraction 
    
    try:
        data = umap.UMAP(n_neighbors = n_neighbors, min_dist = min_dist, metric = metric, n_components = 2).fit_transform(input_data)
    except:
        print('Embedding_Failed')
        return None 
    
    df_reduct = pd.DataFrame(data) # our lower embedded data is saved as a pandas dataframe
    
    return df_reduct.to_csv('%s_embedding_%s/embedding,%s,%s,%s.csv' %(filename,fraction,n_neighbors, min_dist, metric)) # we are saving our embedding to a csv


def savefig(file): # this function creates the scatter plots and saves them; currently the cmap (colouring) does not work with my distro of anaconda (install problem), you can add it if it works for you
    
        global fraction, filename # make sure we maintain our global variables for file saving
    
        fileparams = file.replace('%s_embedding_%s/embedding,'%(filename,fraction),'') # taking the embedding csv filename and replacing the first part of the string
        # explanation
        # since multiprocessing does not share variables between instances, if we try to make a global search dataframe (or any structure) each process will create its own copy
        # of it and append to it; therefore, at the end we would have an empty dataframe because each process works on its own copy. Instead, each process creates a file of the lower embedding
        # and the name of the file carries the parameters used to generate it. There is a way to get around this using multiprocessing queues; however, I do not know the lib well enough
        
        fileparams = fileparams.replace('.csv','')
        fileparams = fileparams.strip()
        fileparams = fileparams.split(',') # this will leave us with a list of 3 strings, which represent the 3 UMAP arguments used
        
        if os.path.isfile('%s_embedding_%s/figures/Embedding-%s-%s-%s.pdf'%(filename, fraction, fileparams[0],fileparams[1],fileparams[2])) == True: # checking to see if the graph already exists, if so, we return nothing
            return None
        
        fig = plt.figure() # this is just creating the embedding data from the file, and saving the graph
        plt.title('Embedding-%s-%s-%s'%(fileparams[0],fileparams[1],fileparams[2]))
        embedding = genfromtxt(file, delimiter=',', skip_header = 1, usecols = (1,2))
        plt.scatter(embedding[:,0], embedding[:,1])
        fig.savefig('%s_embedding_%s/figures/Embedding-%s-%s-%s.pdf'%(filename, fraction, fileparams[0],fileparams[1],fileparams[2])) 
        plt.close()
        
        return None


if __name__ == '__main__': # this is the main execution 
    
    fraction = 0.01 # percentage of original dataset used for UMAP and HDBSCAN algorithms, must be between (0 and 1.0]
    
    path = '/home/USER/anaconda3/envs/ML/scripts/ML_IDS_Dashboard/app1/static/input_files/' # CHECK THIS PATH, THIS NEEDS TO BE THE PATH TO THE DATASET
    
    filename = 'Data.csv' # CHECK THIS FILENAME THIS NEED TO MATCH THE DATASET CSV FILENAME, ALSO TAKE NOTE NO LABELS ARE USED
    
    if os.path.exists('%s_embedding_%s'%(filename, fraction)) == False: # checking and creating folders
        os.makedirs('%s_embedding_%s'%(filename, fraction))
        
    if os.path.exists('%s_embedding_%s/log_files'%(filename, fraction)) == False:
        os.makedirs('%s_embedding_%s/log_files'%(filename, fraction))
        
    if os.path.exists('%s_embedding_%s/figures'%(filename, fraction)) == False:
        os.makedirs('%s_embedding_%s/figures'%(filename, fraction))

    input_data = data_grab(path,filename,fraction) # calling the datagrab function
    
    procs_UMAP = []
    n_neighbors = [15,20,40] # these parameters will be looped through
    min_dist = [0.01,0.05] # these parameters will be looped through
    metric_iso = ['correlation','euclidean','manhattan'] # these parameters will be looped through, feel free to add more if you want to check more

    procs_savefig = []


    for a in n_neighbors: # these loops create the 'processes' - instances of the function call (referred to as targets)
        for b in min_dist:
            for c in metric_iso:
                if os.path.isfile('%s_embedding_%s/embedding,%s,%s,%s.csv' %(filename,fraction,a,b,c)) == True:
                    continue
                proc = Process(target=UMAP, args=(a, b, c, input_data))
                procs_UMAP.append(proc) # appending all of our processes to the list 
                proc.start() # start the process
                
            for proc in procs_UMAP:
                proc.join() # join all the processes currently in the list, this is looped over as it will connect new processes just appended to procs_UMAP to ones already started

    filenames = glob.glob('%s_embedding_%s/*.csv'%(filename, fraction)) # looping through the files in a directory 
    #data = np.random.rand(1260,4)
    
    for d in filenames: # I tried to use multiprocessing on the savefig function; however, matplotlib has some sort of caching that prevents saving multiple figures across processes
        #strfile = str(d)
        #proc = Process(target=savefig, args=(strfile,))
        #procs_savefig.append(proc)
        #proc.start()
        savefig(d)        
        #for proc in procs_savefig:
            #proc.join()

I have been playing around with testing various combinations of UMAP parameters with multiprocessing in Python; feel free to use this script if you want.

@snakers4
Contributor Author

snakers4 commented Feb 6, 2018

The code looks readable, except for the numba decorator, which is black magic to me.

I guess I will just end up comparing your latest release vs. simply loading the data onto the GPU and optimizing with the same loss function.

@snakers4
Contributor Author

snakers4 commented Feb 6, 2018

This has been resolved (or should be resolved) as of v0.2.0 (i.e. the branch got merged)

Downloaded the latest release via

wget https://github.com/lmcinnes/umap/archive/master.zip
unzip master.zip
rm master.zip
cd umap-master
sudo pip install -r requirements.txt
python setup.py install

Used these settings

import umap

reducer = umap.UMAP(
    n_neighbors=5,
    n_components=2,
    metric='euclidean',
    init='random'
    )

umap_vectors = reducer.fit_transform(pca_vectors_processed)

Ran these benchmarks.

I guess there is some stochasticity in this process.
I will try just looking at the blobs a bit later.

Great job, guys!
Looks like even this version of UMAP is really useful on top of PCA.

@lmcinnes
Owner

lmcinnes commented Feb 6, 2018

Note that those are the gradients of the loss encoded by hand in that SGD. James Melville has a nice piece on UMAP loss here.
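
For anyone wanting to dig in: to my understanding, the objective those hand-coded gradients correspond to is the fuzzy set cross-entropy between the high-dimensional membership strengths and the low-dimensional similarities; schematically (notation follows the paper, with a and b the curve parameters fit from min_dist):

C(Y) = \sum_{(i,j)} \left[ \mu_{ij} \log\frac{\mu_{ij}}{\nu_{ij}} + (1 - \mu_{ij}) \log\frac{1 - \mu_{ij}}{1 - \nu_{ij}} \right],
\qquad \nu_{ij} = \left( 1 + a \, \lVert y_i - y_j \rVert_2^{2b} \right)^{-1}

The attractive term is optimized along the positive edges and the repulsive term via negative sampling, which is, as far as I can tell, what optimize_layout does edge by edge.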

@lmcinnes
Owner

lmcinnes commented Feb 7, 2018

As a side note I would highly recommend looking into datashader as a means to visualise the results rather than relying on KDEs.

@snakers4
Contributor Author

snakers4 commented Mar 7, 2018

As a side note I would highly recommend looking into datashader as a means to visualise the results rather than relying on KDEs.

Yup, datashader turned out to be the most inspiring and underrated visualization tool I have ever seen.
In the end I settled on this pipeline: 5000-dimensional CNN vectors => 15-dimensional PCA => 2-dimensional UMAP vectors => HDBSCAN for naive clustering.

The results are amazing:

  • UMAP embedding
  • HDBSCAN + UMAP embedding
  • top1 and top2 most prominent clusters (I am feeding the clustered data into a GAN, so having ~5-7 distinct clusters + noise is exactly what I was hoping for)
  • Also, funnily enough, there is a cluster of toilets and baths

LargeVis paper which this bears some similarity to.
Note that those are the gradients of the loss encoded by hand in that SGD. James Melville has a nice piece on UMAP loss here.

Read the paper and the note - they look very interesting.
It looks like I made a wrong assumption: since the task is unsupervised, autograd will not help, as we have no annotated data.

I guess, if so, then porting to the GPU may give only a negligible performance boost.
I hope to try porting your code to PyTorch directly when I find time.

@lmcinnes
Owner

lmcinnes commented Mar 7, 2018

Of note is the fact that you can embed into higher dimensional spaces than just 2 -- if your goal is clustering rather than visualization then using a higher dimensional embedding can give the process more degrees of freedom to more accurately represent the data. So, for example, embedding to 4 or 8 dimensions may allow for better clustering. It is, at least, worth experimenting with.
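
As a concrete illustration of that suggestion, a minimal sketch (X, the parameter values and min_cluster_size are placeholders, not recommendations):

import umap
import hdbscan

# Embed to more than 2 dimensions when the goal is clustering rather than plotting.
embedding = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=8,   # e.g. 4 or 8 instead of 2
    metric='euclidean'
    ).fit_transform(X)

labels = hdbscan.HDBSCAN(min_cluster_size=100).fit_predict(embedding)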

@snakers4
Contributor Author

snakers4 commented Mar 8, 2018

higher dimensional spaces than just 2

Of course - I just presented the 2-dimensional results for ease of interpretation.

@stsievert

stsievert commented Jul 18, 2018

a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization

One nice thing about deep learning libraries is that they're declarative, and will do automatic differentiation. This only requires computing the loss for the input variables, then calling optimizer.step(). Surprisingly, PyTorch is faster than NumPy when hand-coding the gradients. I think this is because of the graph it defines, so it can be parallelized and the IO is optimized. I walk through benchmarks/more motivation in a blog post.
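
A minimal sketch of that pattern in PyTorch (the loss below is just a placeholder pairwise term, not the actual UMAP objective; sizes and learning rate are made up):

import torch

n_points, n_components = 1000, 2
emb = torch.randn(n_points, n_components, requires_grad=True)
optimizer = torch.optim.Adam([emb], lr=1e-2)

for epoch in range(200):
    optimizer.zero_grad()
    # sample some pairs and pull them together -- a stand-in for the real attractive/repulsive loss
    i = torch.randint(0, n_points, (512,))
    j = torch.randint(0, n_points, (512,))
    loss = ((emb[i] - emb[j]) ** 2).sum(dim=1).mean()
    loss.backward()      # autograd supplies the gradients
    optimizer.step()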

Out-of-core computation (#62) + deep learning libraries would be easiest with a Dask parameter server (dask/dask-ml#171).

@LemonPi

LemonPi commented Mar 19, 2019

@snakers4 Any luck porting to pytorch? Would be interested in this.

@lmcinnes
Owner

So the good news is that Nvidia will have a GPU accelerated implementation of UMAP in their Rapids platform soon, and I believe most of the theoretical hurdles to multi-core CPU implementation have been resolved, and I hope the 0.5 release of UMAP will include multi-core/threaded support. So while we might not quite be there yet, we are certainly getting closer.
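
For anyone who wants to try it once it lands, the cuML estimator mirrors the scikit-learn-style API used here; a rough sketch, assuming a working RAPIDS install (the import path and supported parameters may differ between releases):

import cudf
from cuml.manifold import UMAP as cuUMAP

gdf = cudf.DataFrame.from_pandas(pandas_df)   # pandas_df is a placeholder for your data
embedding = cuUMAP(n_neighbors=15, n_components=2, min_dist=0.0).fit_transform(gdf)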

@stsievert

So the good news is that Nvidia will have a GPU accelerated implementation of UMAP in their Rapids platform soon

Cool! Are you referencing the implementation at cuML/src/umap@993eb6? It looks like it got merged through rapidsai/cuml#261.

will include multi-core/threaded support

This is through lmcinnes/pynndescent#12, correct?

@lmcinnes
Owner

lmcinnes commented Mar 19, 2019 via email

@khaled-rahman

Hi,

Do you have any plans for a Barnes-Hut approximation using a quad/oct-tree for 2D/3D rather than RP-trees? It would be great to know. Thank you for the awesome work!

@lmcinnes
Owner

@khaled-rahman: The RP-trees are in support of NN-Descent for (approximate) nearest neighbor computation, for which Barnes-Hut is not applicable. Where t-SNE uses Barnes-Hut is in the optimization step; in contrast, UMAP uses an approximation via a negative-sampling-based approach to get an O(N) algorithm instead of the O(N log N) Barnes-Hut approach.
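
To make the contrast concrete, here is a schematic of the negative-sampling loop (this is not the actual optimize_layout code; the update expressions are simplified placeholders that only illustrate where the O(N) behaviour comes from):

import numpy as np

def sgd_epoch(head, tail, embedding, n_neg_samples=5, lr=1.0, rng=np.random):
    # one pass over the positive edges (head[k], tail[k]) of the fuzzy graph
    n = embedding.shape[0]
    for i, j in zip(head, tail):
        diff = embedding[i] - embedding[j]
        embedding[i] -= lr * diff          # attractive move along the positive edge
        embedding[j] += lr * diff
        # a handful of random "negative" points stand in for the full repulsive sum,
        # so the work per edge is constant rather than growing with N
        for k in rng.randint(0, n, n_neg_samples):
            diff = embedding[i] - embedding[k]
            embedding[i] += lr * diff / (1.0 + (diff * diff).sum())
    return embedding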

@khaled-rahman

khaled-rahman commented Jun 18, 2019

@lmcinnes : Thank you for the quick response. I am studying your tool. What is the running time of the spectral embedding in your code? I see it has a dependency on sklearn.manifold, which is documented here: https://scikit-learn.org/stable/modules/manifold.html

Though it is done only once, doesn't it depend on the number of data points?
More specifically, if we exclude the OptimizeEmbedding step from Algorithm 1 (i.e. the UMAP algorithm), what is the asymptotic running time of Algorithm 1? Isn't it greater than O(n^2)?

@candalfigomoro
Contributor

For Spark (PySpark) environments, maybe we could rely on the LSH-based Approximate Nearest Neighbor Search (https://spark.apache.org/docs/latest/ml-features#approximate-nearest-neighbor-search); however, it only supports Euclidean and Jaccard distances at the moment.

@lmcinnes
Is there an easy way to swap the NN search algorithm (so we can easily plug in a different algorithm such as the one available in Spark)?

@lmcinnes
Owner

lmcinnes commented Feb 6, 2020

@candalfigomoro The NN search is not currently pluggable at the top API level, but it is largely factored out in the code. For now that means you could likely monkey-patch things and define a new umap.umap_.nearest_neighbors function that does what you want. Down the road I think it is most likely that it will adapt to make use of the KNeighborsTransformer API/approach in scikit-learn. At that point you would need to wrap whatever ANN solution you have in a suitable transformer. All of that is likely to happen in v0.5 when PyNNDescent becomes a required dependency and will be the default ANN search.
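
A rough idea of what such a monkey-patch could look like (the signature is my approximation of the internal function in v0.4, and my_ann_backend is a hypothetical stand-in for whatever ANN search you plug in, e.g. results collected from Spark; check umap/umap_.py for the exact argument list and return values before relying on this):

import numpy as np
import umap
import umap.umap_

def patched_nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular,
                              random_state, **kwargs):
    # my_ann_backend is hypothetical: it should return integer neighbor indices
    # and float distances, each of shape (n_samples, n_neighbors)
    knn_indices, knn_dists = my_ann_backend(X, n_neighbors, metric)
    # the built-in function also returns an RP-forest used by transform(); returning None
    # may be acceptable if you only need fit/fit_transform, but verify for your umap version
    return knn_indices.astype(np.int64), knn_dists.astype(np.float64), None

umap.umap_.nearest_neighbors = patched_nearest_neighbors  # patch before constructing umap.UMAP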

@candalfigomoro
Contributor

@lmcinnes Thank you for your reply. Another issue that I see is that, even if we implement a distributed NN search, UMAP currently expects a numpy array as input, so "fitting" out-of-memory datasets (e.g. HDFS distributed datasets with billions of rows) could be a problem. Maybe a workaround (although suboptimal) could be to "fit" UMAP on a random subsample of data and then apply a transform() on the remaining data in a massively parallel way. What do you think about this?

@TomAugspurger

TomAugspurger commented Feb 14, 2020 via email

@candalfigomoro
Contributor

candalfigomoro commented Feb 14, 2020

@TomAugspurger "Unfortunately", I'm in a Spark (with pyspark) cluster/environment... no Dask for me. My inputs are Spark DataFrames (or RDDs). I can use UDFs to apply transforms, but I don't think there's a straightforward way to support the "fit()" part on out-of-memory datasets.

@mik1904

mik1904 commented Mar 20, 2020

@candalfigomoro I am currently using a Spark cluster (pyspark) too. As you describe, the fit of UMAP is done locally on the master node. For the transform I use rdd.map() and pass a partial function (functools) that already holds the trained UMAP model. The model then gets shipped to the various nodes in the cluster and the transform is parallelized. Out of curiosity, how did you approach this? Thanks
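
For anyone wanting to replicate that pattern, a minimal sketch (the locally collected sample local_sample, the DataFrame df, and the column names "id" / "features" are assumptions; mapPartitions is used instead of map so each executor calls transform() once per partition rather than once per row):

from functools import partial
import numpy as np
import umap

reducer = umap.UMAP(n_neighbors=15, n_components=2).fit(local_sample)  # fit locally on the driver

def transform_partition(rows, model):
    rows = list(rows)
    if not rows:
        return
    # assumes "features" is an array column; a Spark ML Vector would need .toArray()
    X = np.array([r["features"] for r in rows])
    for row, emb in zip(rows, model.transform(X)):
        yield (row["id"], emb.tolist())

embedded = df.rdd.mapPartitions(partial(transform_partition, model=reducer))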

@cjnolet

cjnolet commented Apr 17, 2020

@candalfigomoro @mik1904 we just recently added a very similar capability to the GPU-accelerated UMAP in RAPIDS cuML, which uses Dask.

https://github.com/rapidsai/cuml/blob/branch-0.14/python/cuml/dask/manifold/umap.py

So far, initial evaluations are showing some very positive results, especially as the inference dataset grows very large.

We are continuing to work on distributing the training as well; however, the embarrassingly parallel inference is proving to be very useful thus far.

@ydennisy

I am having trouble getting UMAP to saturate all available CPUs on a Google VM; in the various stages it only uses 2-4 of the available 32 cores.

Is there anything one can do as of v0.4.x to increase CPU usage?

@lmcinnes
Owner

lmcinnes commented Jan 4, 2021

Installing the latest version of pynndescent should help some, but even then it will depend, to a degree, on the dataset itself.
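
For example, with a pip-based install that would be something like:

pip install --upgrade pynndescent umap-learn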

@candalfigomoro
Contributor

@candalfigomoro I am currently using a Spark cluster (pyspark) too. As you describe, the fit of UMAP is done locally on the master node. For the transform I use rdd.map() and pass a partial function (functools) that already holds the trained UMAP model. The model then gets shipped to the various nodes in the cluster and the transform is parallelized. Out of curiosity, how did you approach this? Thanks

I didn't do it, but I think I'd do something similar to that (maybe I'd use a Pandas UDF instead of rdd.map()).

For training on larger-than-memory datasets, maybe we could try something like this:

  1. Use pyspark's LSH-based Approximate Nearest Neighbor Search for NN search, then
  2. Use ParametricUMAP with mini-batches of data

But it's not trivial. @lmcinnes Do you think it would make sense?
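
To make step 1 concrete, a very rough sketch of building an approximate kNN edge list with Spark's LSH (the "id" / "features" columns, k, and the LSH parameters are placeholders, and the distance threshold needs tuning per dataset; step 2 would then stream each point together with its neighbours into ParametricUMAP in mini-batches):

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql import Window, functions as F

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(df)

# self-join, drop self-matches, keep the k closest neighbours per point
pairs = model.approxSimilarityJoin(df, df, threshold=10.0, distCol="dist") \
             .filter(F.col("datasetA.id") != F.col("datasetB.id"))

k = 15
w = Window.partitionBy(F.col("datasetA.id")).orderBy("dist")
knn_edges = pairs.withColumn("rank", F.row_number().over(w)).filter(F.col("rank") <= k)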

@lmcinnes
Owner

I think that would make sense. I believe the UMAP implementation in CuML now has some level of support for multi-GPU / distributed computation, so that may also be an option worth investigating.

@matanox

matanox commented Jan 17, 2021

Sorry for joining the conversation like this, but has anyone here used any of the discussed implementations for multi-million-sized feature sets yet?

@lmcinnes
Owner

I have used this implementation for a (sparse) dataset of size 100,000,000 x 5,500,000 on a very large memory SMP using 64 cores. Obviously a dense dataset of that size would not be possible due to memory constraints, so you would need a distributed implementation to manage that. In terms of the CuML implementation: it has sparse support coming, so at that point it may scale to very large feature sizes.
