
Speedup exhaustive_L2sqr_blas for AVX2, ARM NEON and AVX512 #2568

Conversation

alexanderguzhva
Contributor

Summary:
Add a fused kernel to the exhaustive_L2sqr_blas() call that combines the dot-product computation with the search for the nearest centroid. As a result, no temporary dot-product values are written to or read from RAM.

This significantly speeds up the training of PQx[1] indices for low-dimensional PQ sub-vectors (dsub = 1, 2, 4, 8), and the effect grows with the value of [1]. The AVX512 version provides additional overloads for dsub = 12 and 16.

The speedup is also larger for higher values of pq.cp.max_points_per_centroid (256 by default).

Speeds up IVFPQ training as well.

The AVX512 kernel is not enabled by default, but I've measured it training up to 2x faster than the AVX2 version, so feel free to enable AVX512 manually to use it.

Differential Revision: D41166766
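The fusion described in the summary can be sketched in plain scalar code. This is an illustrative model, not the actual Faiss kernel (the real implementation vectorizes the inner loop with AVX2/NEON/AVX512 intrinsics and goes through BLAS for the bulk dot products), but the key point is the same: each dot product is consumed by the running-minimum search immediately instead of being staged in a temporary buffer.

```python
def fused_l2sqr_argmin(xs, ys):
    """For each vector in xs, find the nearest vector in ys by squared L2.

    Uses ||x - y||^2 = ||x||^2 - 2<x, y> + ||y||^2, so the centroid norms
    can be precomputed once and only the dot product varies per pair.
    """
    y_norms = [sum(c * c for c in y) for y in ys]   # precomputed once
    results = []
    for x in xs:
        x_norm = sum(c * c for c in x)
        best_j, best_d = -1, float("inf")
        for j, y in enumerate(ys):
            dot = sum(a * b for a, b in zip(x, y))  # dot product ...
            d = x_norm - 2.0 * dot + y_norms[j]     # ... consumed at once,
            if d < best_d:                          # never written to RAM
                best_j, best_d = j, d
        results.append((best_j, best_d))
    return results
```

The unfused variant would first materialize the full matrix of dot products and then scan it in a second pass; for small dsub that extra memory traffic dominates, which is what this PR eliminates.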

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D41166766

alexanderguzhva pushed a commit to alexanderguzhva/faiss that referenced this pull request Nov 14, 2022
…research#2568)

Summary:
Pull Request resolved: facebookresearch#2568

Add a fused kernel to the exhaustive_L2sqr_blas() call that combines the dot-product computation with the search for the nearest centroid. As a result, no temporary dot-product values are written to or read from RAM.

Speeds up the training of PQx[1] indices for dsub = 1, 2, 4, 8, and the effect is larger for higher values of [1]. The AVX512 version provides additional overloads for dsub = 12 and 16.

The speedup is also beneficial for higher values of pq.cp.max_points_per_centroid (256 by default).

Speeds up IVFPQ training as well.

The AVX512 kernel is not enabled by default, but I've measured it training up to 2x faster than the AVX2 version, so feel free to enable AVX512 manually to use it.

Reviewed By: mdouze

Differential Revision: D41166766

fbshipit-source-id: 9ce681ef360daea11c3aa411fc19c415b6896b3c
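To see why max_points_per_centroid matters here: each sub-quantizer of a PQx[1] index runs k-means with k = 2^[1] centroids, and training is capped at roughly k * max_points_per_centroid sampled vectors, so a larger cap means more nearest-centroid evaluations for the fused kernel to accelerate. A back-of-the-envelope helper (illustrative arithmetic only, not a Faiss API):

```python
def pq_kmeans_budget(nbits, max_points_per_centroid=256):
    """Rough per-sub-quantizer k-means training budget.

    k-means uses k = 2**nbits centroids and at most
    k * max_points_per_centroid sampled training vectors.
    """
    k = 2 ** nbits
    return k, k * max_points_per_centroid

# PQx8: 256 centroids, up to 65536 training vectors per sub-quantizer
```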

facebook-github-bot pushed a commit that referenced this pull request Jun 1, 2023
Summary:
- Use an elementwise operation plus a single reduction instead of two across-vector comparison operations
- Use the supporting functions that are already implemented
- Unify the semantics of `operator==` with those of `simd16uint16`
    - `operator==` for `simd8uint32` and `simd8float32` was introduced in #2568, but it did not follow the same semantics as `simd16uint16` (which was implemented long ago). To get full-vector equality as a `bool`, use the `is_same_as` member function instead.
- Change `is_same_as` to accept any vector type as its argument in `simdlib_neon`
    - `is_same_as` already supports any vector type in `simdlib_avx2` and `simdlib_emulated`
- Remove the unused function `simd16uint16::is_same` from `simdlib_avx2`
    - It may be a typo of `is_same_as`; in any case it appears to be unused

Pull Request resolved: #2885

Reviewed By: mdouze

Differential Revision: D46330666

Pulled By: alexanderguzhva

fbshipit-source-id: 0ea14f8e9a8bda78f24a655219dffe3e07fc110f
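The `operator==` semantics change above can be illustrated with a small scalar model (hypothetical helper names mirroring the simdlib ones): a lane-wise `operator==` yields a mask vector, while `is_same_as` performs one reduction over that mask to a single `bool`.

```python
def lane_eq(a, b):
    # model of simdlib operator==: lane-wise compare producing a mask
    # (an all-ones lane where equal, a zero lane where not)
    return [0xFFFFFFFF if x == y else 0x0 for x, y in zip(a, b)]

def is_same_as(a, b):
    # full-vector equality as a single bool: one reduction over the mask
    return all(lane_eq(a, b))
```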