Speedup exhaustive_L2sqr_blas for AVX2, ARM NEON and AVX512 #2568
Conversation
This pull request was exported from Phabricator. Differential Revision: D41166766
Summary:
- Use an element-wise operation and a single reduction instead of performing an across-vector comparison operation twice.
- Reuse already-implemented supporting functions.
- Unify the semantics of `operator==` with `simd16uint16`: `operator==` for `simd8uint32` and `simd8float32` was introduced in #2568, but it did not have the same semantics as `simd16uint16` (which was implemented long ago). To get whole-vector equality as a `bool`, use the `is_same_as` member function instead.
- Change `is_same_as` in `simdlib_neon` to accept any vector type as its argument; `is_same_as` already supports any vector type in `simdlib_avx2` and `simdlib_emulated`.
- Remove the unused function `simd16uint16::is_same` from `simdlib_avx2`; it appears to be a typo of `is_same_as` and is unlikely to be used.

Pull Request resolved: #2885
Reviewed By: mdouze
Differential Revision: D46330666
Pulled By: alexanderguzhva
fbshipit-source-id: 0ea14f8e9a8bda78f24a655219dffe3e07fc110f
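A minimal sketch of the two comparison semantics described above, using a hypothetical stand-in type rather than Faiss's real simdlib classes (the lane count and mask representation are illustrative): `operator==` compares element-wise and yields a per-lane mask, matching `simd16uint16`, while `is_same_as` collapses whole-vector equality into a single `bool`.

```cpp
#include <array>

// Stand-in for a simdlib-style 8-lane float vector; not the real class.
struct Vec8f {
    std::array<float, 8> lanes{};

    // Element-wise compare: a "true" lane where equal, zero otherwise.
    // Real simdlib masks are integer bit patterns; this is simplified.
    Vec8f operator==(const Vec8f& o) const {
        Vec8f m;
        for (int i = 0; i < 8; i++)
            m.lanes[i] = (lanes[i] == o.lanes[i]) ? 1.0f : 0.0f;
        return m;
    }

    // Whole-vector equality collapsed to a single bool.
    bool is_same_as(const Vec8f& o) const {
        for (int i = 0; i < 8; i++)
            if (lanes[i] != o.lanes[i])
                return false;
        return true;
    }
};
```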
Summary:
Add a fused kernel for the exhaustive_L2sqr_blas() call that combines the computation of dot products with the search for the nearest centroid. As a result, no temporary dot-product values are written to or read from RAM.
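To make the idea concrete, here is a minimal scalar sketch of the fusion, not the actual vectorized AVX2/NEON/AVX512 kernel from this PR; the function name, signature, and the assumption that squared centroid norms are precomputed are all illustrative:

```cpp
#include <cstddef>
#include <limits>

// For each input vector, compute the dot product against every centroid
// and immediately fold it into the running nearest-centroid search, so no
// n-by-k dot-product buffer is ever written to or read from RAM.
void fused_l2sqr_argmin(
        const float* x,        // n input vectors, dimension d
        const float* y,        // k centroids, dimension d
        const float* y_norms,  // precomputed ||y_j||^2, size k
        size_t d, size_t n, size_t k,
        float* out_dis,        // size n: best (partial) distance per vector
        size_t* out_ids) {     // size n: argmin centroid per vector
    for (size_t i = 0; i < n; i++) {
        float best = std::numeric_limits<float>::max();
        size_t best_j = 0;
        for (size_t j = 0; j < k; j++) {
            float dp = 0;
            for (size_t t = 0; t < d; t++)
                dp += x[i * d + t] * y[j * d + t];
            // ||x - y||^2 = ||x||^2 - 2<x,y> + ||y||^2; the ||x||^2 term
            // is constant per row, so it can be dropped from the argmin.
            float dis = y_norms[j] - 2 * dp;
            if (dis < best) {
                best = dis;
                best_j = j;
            }
        }
        out_dis[i] = best;
        out_ids[i] = best_j;
    }
}
```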
Significantly speeds up the training of PQx[1] indices for low-dimensional PQ sub-vectors (dsub = 1, 2, 4, 8), and the effect is larger for higher values of [1]. The AVX512 version provides additional overloads for dsub = 12 and 16.
The speedup is also beneficial for higher values of pq.cp.max_points_per_centroid (which is 256 by default).
Speeds up IVFPQ training as well.
The AVX512 kernel is not enabled by default, but I've seen it speed up training 2x versus the AVX2 version, so please feel free to use it by enabling AVX512 manually.
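For reference, a minimal sketch of the training code path this kernel accelerates, using the standard Faiss C++ API; the dimensions, training-set size, and the raised max_points_per_centroid value are illustrative:

```cpp
#include <faiss/IndexPQ.h>
#include <vector>

int main() {
    int d = 32;        // vector dimension; dsub = d / M = 4 hits a fused overload
    size_t M = 8;      // number of sub-quantizers
    size_t nbits = 8;  // bits per sub-quantizer code
    faiss::IndexPQ index(d, M, nbits);

    // Raising max_points_per_centroid (default 256) increases the k-means
    // training work, which is where the fused kernel helps most.
    index.pq.cp.max_points_per_centroid = 1024;

    // Training data; fill with real vectors instead of zeros.
    std::vector<float> xt(100000 * static_cast<size_t>(d));
    index.train(xt.size() / d, xt.data());
    return 0;
}
```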
Differential Revision: D41166766