Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Research on L1 distance metric #12

Closed
vmarkovtsev opened this issue Mar 2, 2017 · 11 comments
Closed

Research on L1 distance metric #12

vmarkovtsev opened this issue Mar 2, 2017 · 11 comments

Comments

@vmarkovtsev
Copy link

Hi!

This project kicks ass to our https://github.com/src-d/kmcuda to major extent. I think the best way for us to move on is to integrate the missing (for source{d}) parts to Faiss. Those are:

  1. Proper K-means centroid initialization instead of random sampling. This includes some high level API improvement.
  2. arccos over the scalar product aka angular / proper "cosine" distance. CUDA does not notice it in terms of performance.
  3. Python 3 support (well, that should be easy with Swig).

If you agree with these, I will start making PRs. If you don't, I will have to incorporate Faiss inside kmcuda as the second non-free backend. Of course, I would prefer 1.

@mdouze
Copy link
Contributor

mdouze commented Mar 2, 2017

Hi!

Thanks for your interest in Faiss. Could you consider the following points:

  1. The k-means can be initialized with arbitrary centroids, when they are set on input in the Clustering object. Also, in our experiments with kmeans++ initialization, we have not found it to be more useful than random in large dimension and many points. What are the tasks you think it would be useful for?

  2. There is a "spherical k-means" option in the Clustering object that normalizes the centroids at each iteration. Is this what you are referring to?

  3. Python 3 is already supported (although the examples are written in Python 2).

@vmarkovtsev
Copy link
Author

vmarkovtsev commented Mar 2, 2017

  1. Sure, they can be imported. kmeans++ takes some time to finish and can be properly accelerated. We are using K-means for dimensionality reduction with large number of clusters, and kmeans++ shows substantial (up to 30%) faster convergence on our data.

  2. Kind of, but it's not enough to normalize the vectors. The problem is that the angular distance works better (again in our experiments). The naked scalar product (that is, cos \alpha) is not a distance metric by definition, it does not satisfy the triangle inequality. https://en.wikipedia.org/wiki/Cosine_similarity

  3. Swig generates python 2 native *.so by default and in that case I still have to manually edit the makefile. At least there must be py3 rule. I do understand that Facebook production is based on 2 but the rest of the folks have already switched. E.g. Python 3 compatibility? google/grumpy#1
    _swigfaiss.so: undefined symbol: PyCObject_Type and friends.

@jegou
Copy link
Contributor

jegou commented Mar 2, 2017

Regarding 1.: Matthijs and I are the author of the Yael library (http://yael.gforge.inria.fr/, CPU library), which included a k-means++ initialization. Our observation was that, on all our experimental data, the initialization was actually costly and not BLAS3-friendly, even though we put some effort into its parallelization. As Matthijs mentioned, the gain from k-means++ was not worth this cost in our experiments (very small). For a fixed amount of time, we found that it was better (w.r.t. loss) to add a few iterations to the regular k-means, which is why random selection was the default choice in our Matlab interface.
But I understand that this does not have to be generally true, especially if you look at the data at convergence (which we normally don't do considering our typical parameters, i.e., many centroids and input points).

Regarding 2.: as mentioned by Matthijs, Faiss includes a spherical k-means, targeting to maximize \sum_i <q(x_i) | x_i>

  1. Assignment step: On the sphere, cos=inner product. I assume that we agree that the use of arccos has not effect on the assignment of points to centroids because this function is monotonous: the closest centroid w.r.t. L2, or maximum cosine similarity, or maximum inner product are all equivalent on the sphere (with both the input vectors and centroids on the sphere).
  2. Update step: Take the centroid that minimizes the l2 loss, i.e., the mean of all vectors assigned to the centroid, and project it back on the sphere.

So, this is the algorithm described (Figure 1) in
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.111.8125&rep=rep1&type=pdf

We are aware that there are some variants of this algorithm, including one I co-authored that modifies the cost because the objective function is different (see Section 6.5 in https://arxiv.org/pdf/1412.3328.pdf). To which algorithm do you refer?

Thanks for your feedback!

@mdouze
Copy link
Contributor

mdouze commented Mar 3, 2017

Hi @vmarkovtsev,

Indeed the port to Python 3 was not complete. I added instructions to the INSTALL and makefile.inc.Linux on how to compile for py3. Thanks for pointing this out.

There are three possible levels of integration of other kmeans algorithms:

  1. re-use the Clustering and input another initialization
  2. re-use the Index object and the km_update_centroids function that does the M-step
  3. re-use only the Index object for fast assignent.

At which level do you think your changes to the clustering could be intergated?

Thanks

@vmarkovtsev
Copy link
Author

@jegou,

  1. I don't know what to say... As I said, we observed substantial (subjective) achieved quality and convergence improvements. I guess, if K-means iterations are blazingly fast, it can indeed be more profitable to simply add more iterations instead of improving the initialization to get the same metrics. Our implementation is slower, and it made sense to run kmeans++. Do you strictly target only very large uniformly distributed datasets? I could prepare a notebook to show what I mean in some future.

  2. Correct, I am referring to exactly the described algorithm. I didn't read this paper but it is implemented word-to-word in our library, albeit we take arccos. Indeed, since arccos in monotonous, we can swap arccos(a \cdot b) with -(a \cdot b).

@mdouze,
I forgot that I've also got L1 besides L2 and angular, though somewhat not ready for production. As I believe, L1 has nothing to do with inner products. I looked through the code and thought that L1 and L2 mean the other thing there (1-D and 2-D input). Am I right?

I guess my proposed additions in (1) fall in (1).

@jegou
Copy link
Contributor

jegou commented Mar 3, 2017

@vmarkovtsev
We do not target uniformly datasets but real-life features (large ones). At the time we made the test with k-means++, we had mostly experimented with various features extracted from images (SIFT, GIST, etc), such as those provided in the following benchmarks (widely used in indexing for approximate search benchmarking):
http://corpus-texmex.irisa.fr/

If you have done some experiments with k-means++ vs random init (large-scale, we are not much interested by dataset comprising less than 1 million input vectors and less than 10k centroids), I am of course curious of the results.
One technical point: one of the strength of k-means++ is to avoid random clusters in the first iterations. Random init is bad by default for spiky distributions, but this is easily compensated by detecting the case and employing a smart (and costless) splitting strategy, which we do in Faiss. I assume that, in the comparison between k-means++ and random init, that you carefully handle this point, otherwise IMO the comparison would not be conclusive.

@vmarkovtsev
Copy link
Author

I see, thanks. Our datasets are somewhat NLP-ish, https://data.world/vmarkovtsev are the examples.

@wickedfoo
Copy link
Contributor

Any implementation on the GPU aiming for pairwise L1 distance will likely be a lot slower, since there would need to be an equivalent to the GEMM kernel that calculates d_mn = sum_k |x_mk - y_kn| instead of d_mn = sum_k (x_mk * y_kn). The GEMM strategy is efficient because there is high memory reuse in the register file and shared memory via tiling.

However, cuBLAS GEMM itself is quite inefficient for small k (k being the dimension of the vectors). Full arithmetic throughput is only seen for really large k. Small k is also in the regime where GEMM starts to be memory bandwidth / latency bound rather than arithmetically bound.

A kernel could probably be written for L1 using a GEMM-like strategy, but it would take a lot of time to get right or tune appropriately for different architectures.

@jegou
Copy link
Contributor

jegou commented Mar 3, 2017

@vmarkovtsev: L1 is not derivable from an inner product.
However, many other kernels such as the intersection kernel (often good replacement for L1) can be approximated with explicit embeddings, a strategy which is more and more popular and you may be aware of. This can be done by Nystrom-like methods / KPCA. Personally, I am a big fan of the ones proposed by Andrea Vedaldi:
https://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi11efficient.pdf

Employing such embeddings, you can map a vector x->phi(x), such that the inner product phi(x)^T phi(y) approximates your kernel k(x,y). Then you can simply search/cluster based on the vectors phi(x), for instance for a search application such as proposed in
https://hal.inria.fr/hal-00722635v2/document

@vmarkovtsev
Copy link
Author

Thank you everybody!

So, to summarize my initial 3 points.

  1. As soon as I provide comprehensive evidence that kmeans++ improves upon random init on datasets with >1M points and 10k centroids, maintainers are glad to allow the contribution. I doubt that I provide this evidence for 10k centroids, as my own experience was based on 8M samples (OK) and 1-2k centroids (not OK). So I revoke this.

  2. Cosine distance can be calculated without taking arccos and not losing any precision, so initial (2) is revoked. The alternative proposal of L1 is interesting to maintainers but it will take much work. I preserve this in the hope that I find the time to experiment with it (after all, I almost finished the impl in kmcuda). I need time to read these fancy stuff about Nystrom-like methods / KPCA to understand them properly.

  3. Python3 support was added as the result of the discussion - closed.

Thus I am renaming the topic to make it better reflect the intent.

@wickedfoo
Copy link
Contributor

Duplicate of #848

merging into a more general task, might support some of this for GPU over the next couple of months

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants