Speedup exhaustive_L2sqr_blas for AVX2, ARM NEON and AVX512 #2568
Conversation
This pull request was exported from Phabricator. Differential Revision: D41166766
Summary:
- Use an element-wise operation and a single reduction instead of performing an across-vector comparison operation twice.
- Reuse already-implemented supporting functions.
- Unify the semantics of `operator==` with `simd16uint16`: `operator==` for `simd8uint32` and `simd8float32` was introduced in #2568, but it did not have the same semantics as `simd16uint16` (which was implemented long ago). To get whole-vector equality as a `bool`, use the `is_same_as` member function instead.
- Change `is_same_as` in `simdlib_neon` to accept any vector type as its argument; `is_same_as` already supports any vector type in `simdlib_avx2` and `simdlib_emulated`.
- Remove the unused function `simd16uint16::is_same` from `simdlib_avx2`; it appears to be a typo of `is_same_as` and is unlikely to be used.

Pull Request resolved: #2885
Reviewed By: mdouze
Differential Revision: D46330666
Pulled By: alexanderguzhva
fbshipit-source-id: 0ea14f8e9a8bda78f24a655219dffe3e07fc110f
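A minimal sketch of the two comparison semantics described above, using a hypothetical stand-in type rather than Faiss's real simdlib classes (the lane count and mask representation are illustrative): `operator==` compares element-wise and yields a per-lane mask, matching `simd16uint16`, while `is_same_as` collapses whole-vector equality into a single `bool`.

```cpp
#include <array>

// Stand-in for a simdlib-style 8-lane float vector; not the real class.
struct Vec8f {
    std::array<float, 8> lanes{};

    // Element-wise compare: a "true" lane where equal, zero otherwise.
    // Real simdlib masks are integer bit patterns; this is simplified.
    Vec8f operator==(const Vec8f& o) const {
        Vec8f m;
        for (int i = 0; i < 8; i++)
            m.lanes[i] = (lanes[i] == o.lanes[i]) ? 1.0f : 0.0f;
        return m;
    }

    // Whole-vector equality collapsed to a single bool.
    bool is_same_as(const Vec8f& o) const {
        for (int i = 0; i < 8; i++)
            if (lanes[i] != o.lanes[i])
                return false;
        return true;
    }
};
```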
Summary:
Add a fused kernel for the exhaustive_L2sqr_blas() call that combines the computation of dot products with the search for the nearest centroid. As a result, no temporary dot-product values are written to or read from RAM.
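To make the idea concrete, here is a minimal scalar sketch of the fusion, not the actual vectorized AVX2/NEON/AVX512 kernel from this PR; the function name, signature, and the assumption that squared centroid norms are precomputed are all illustrative:

```cpp
#include <cstddef>
#include <limits>

// For each input vector, compute the dot product against every centroid
// and immediately fold it into the running nearest-centroid search, so no
// n-by-k dot-product buffer is ever written to or read from RAM.
void fused_l2sqr_argmin(
        const float* x,        // n input vectors, dimension d
        const float* y,        // k centroids, dimension d
        const float* y_norms,  // precomputed ||y_j||^2, size k
        size_t d, size_t n, size_t k,
        float* out_dis,        // size n: best (partial) distance per vector
        size_t* out_ids) {     // size n: argmin centroid per vector
    for (size_t i = 0; i < n; i++) {
        float best = std::numeric_limits<float>::max();
        size_t best_j = 0;
        for (size_t j = 0; j < k; j++) {
            float dp = 0;
            for (size_t t = 0; t < d; t++)
                dp += x[i * d + t] * y[j * d + t];
            // ||x - y||^2 = ||x||^2 - 2<x,y> + ||y||^2; the ||x||^2 term
            // is constant per row, so it can be dropped from the argmin.
            float dis = y_norms[j] - 2 * dp;
            if (dis < best) {
                best = dis;
                best_j = j;
            }
        }
        out_dis[i] = best;
        out_ids[i] = best_j;
    }
}
```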
Significantly speeds up the training of PQx[1] indices for low-dimensional PQ sub-vectors (dsub = 1, 2, 4, 8), and the effect is larger for higher values of [1]. The AVX512 version provides additional overloads for dsub = 12 and 16.
The speedup is also beneficial for higher values of pq.cp.max_points_per_centroid (which is 256 by default).
Speeds up IVFPQ training as well.
The AVX512 kernel is not enabled by default, but I've seen it speed up training 2x versus the AVX2 version, so please feel free to use it by enabling AVX512 manually.
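For reference, a minimal sketch of the training code path this kernel accelerates, using the standard Faiss C++ API; the dimensions, training-set size, and the raised max_points_per_centroid value are illustrative:

```cpp
#include <faiss/IndexPQ.h>
#include <vector>

int main() {
    int d = 32;        // vector dimension; dsub = d / M = 4 hits a fused overload
    size_t M = 8;      // number of sub-quantizers
    size_t nbits = 8;  // bits per sub-quantizer code
    faiss::IndexPQ index(d, M, nbits);

    // Raising max_points_per_centroid (default 256) increases the k-means
    // training work, which is where the fused kernel helps most.
    index.pq.cp.max_points_per_centroid = 1024;

    // Training data; fill with real vectors instead of zeros.
    std::vector<float> xt(100000 * static_cast<size_t>(d));
    index.train(xt.size() / d, xt.data());
    return 0;
}
```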
Differential Revision: D41166766