
refactor: faster IVF & PQ #328

Merged · 22 commits merged into tensorchord:main · Feb 21, 2024

Conversation

whateveraname
Contributor

@whateveraname whateveraname commented Jan 31, 2024

Work done

  • add multi-threading support for IVF index building and k-means
  • fix a bug where IVF-PQ is not trained on IVF residuals
  • store codes in a layout where codes belonging to the same cluster are placed contiguously, for better locality
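As a rough illustration of the first bullet, here is a minimal std-only sketch of a parallelized k-means assignment step. The names `assign_parallel` and `squared_l2` are illustrative, not the PR's actual API, and the PR's real implementation is not shown here:

```rust
use std::thread;

fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Assign each vector to its nearest centroid, splitting the work across
// `num_threads` scoped threads; each thread writes only to its own
// disjoint chunk of the output, so no synchronization is needed.
fn assign_parallel(vectors: &[Vec<f32>], centroids: &[Vec<f32>], num_threads: usize) -> Vec<usize> {
    let mut assignments = vec![0usize; vectors.len()];
    let chunk = vectors.len().div_ceil(num_threads);
    thread::scope(|s| {
        for (vs, out) in vectors.chunks(chunk).zip(assignments.chunks_mut(chunk)) {
            s.spawn(move || {
                for (v, slot) in vs.iter().zip(out.iter_mut()) {
                    let mut best = 0;
                    let mut best_d = f32::INFINITY;
                    for (k, c) in centroids.iter().enumerate() {
                        let d = squared_l2(v, c);
                        if d < best_d {
                            best_d = d;
                            best = k;
                        }
                    }
                    *slot = best;
                }
            });
        }
    });
    assignments
}
```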

TODO

  • use a lookup table for distance computation in PQ search
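The TODO can be sketched as follows, with assumed, illustrative names (`build_lut` and `adc_distance` are not the PR's API): for each of the M PQ subspaces, precompute the squared L2 distance from the query's sub-vector to every sub-centroid once per query; scanning a code then costs M table lookups instead of a full d-dimensional distance computation.

```rust
// codebook[m][k] is the k-th sub-centroid of subspace m.
fn build_lut(query: &[f32], codebook: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let m = codebook.len();
    let sub = query.len() / m;
    (0..m)
        .map(|i| {
            let q = &query[i * sub..(i + 1) * sub];
            codebook[i]
                .iter()
                .map(|c| q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum())
                .collect()
        })
        .collect()
}

// Asymmetric distance of a PQ code: M table lookups, one per subspace.
fn adc_distance(lut: &[Vec<f32>], code: &[u8]) -> f32 {
    code.iter().enumerate().map(|(i, &c)| lut[i][c as usize]).sum()
}
```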

@@ -23,4 +23,8 @@ impl<T: ?Sized> SyncUnsafeCell<T> {
    pub fn get_mut(&mut self) -> &mut T {
        self.value.get_mut()
    }

    pub fn get_ref(&self) -> &T {
Collaborator

@usamoi usamoi Jan 31, 2024


Please do not expose this function.

if dis * dis < weight[j] {
    weight[j] = dis * dis;
    unsafe {
        (&mut *lowerbound.get())[(j, i)] = dis;
Collaborator


It's absolutely unsound. Please use a structure like Square<Atomic<F32>>.
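One way to realize the suggested `Square<Atomic<F32>>` without `unsafe` is to store the f32 bit pattern in an `AtomicU32`. This is a minimal sketch with illustrative names, not the crate's actual types:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// An atomic f32: the bits are kept in an AtomicU32 and converted on access.
struct AtomicF32(AtomicU32);

impl AtomicF32 {
    fn new(v: f32) -> Self { Self(AtomicU32::new(v.to_bits())) }
    fn load(&self) -> f32 { f32::from_bits(self.0.load(Ordering::Relaxed)) }
    fn store(&self, v: f32) { self.0.store(v.to_bits(), Ordering::Relaxed); }
}

// A flat row-major "square" of atomics that threads can write through a
// shared reference, replacing the UnsafeCell-based matrix write above.
struct AtomicSquare { data: Vec<AtomicF32>, cols: usize }

impl AtomicSquare {
    fn new(rows: usize, cols: usize) -> Self {
        Self { data: (0..rows * cols).map(|_| AtomicF32::new(0.0)).collect(), cols }
    }
    fn set(&self, r: usize, c: usize, v: f32) { self.data[r * self.cols + c].store(v); }
    fn get(&self, r: usize, c: usize) -> f32 { self.data[r * self.cols + c].load() }
}
```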

if o.compare_exchange(next, i, Release, Relaxed).is_ok() {
    break;
unsafe {
    (&mut *idx.get())[i as usize] = result.1 as usize;
Collaborator


Unsound. Use idx: Vec<AtomicUsize> instead.
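A minimal sketch of the suggested `Vec<AtomicUsize>` pattern (illustrative only, not the PR's code): each worker thread stores into its own slot through a shared borrow, with no `unsafe` and no `UnsafeCell`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Fill an index array concurrently: the AtomicUsize slots make the shared
// writes sound, and scoped threads are joined before the vector is read.
fn fill_idx(n: usize) -> Vec<usize> {
    let idx: Vec<AtomicUsize> = (0..n).map(|_| AtomicUsize::new(0)).collect();
    thread::scope(|s| {
        for (i, slot) in idx.iter().enumerate() {
            s.spawn(move || slot.store(i * i, Ordering::Relaxed));
        }
    });
    idx.into_iter().map(|a| a.into_inner()).collect()
}
```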

@whateveraname
Contributor Author

The multi-threaded parts are guaranteed by the logic to have no data races. Do I still have to remove all these unsafe blocks?

@whateveraname
Contributor Author

whateveraname commented Feb 2, 2024

Performance Benchmark

Dataset: gist-960-euclidean-l2, n = 1,000,000, d = 960

| index | build time (s) | rps | precision |
| --- | --- | --- | --- |
| IVF-naive | 246 | 27.5 | 0.904 |
| IVF-naive-opt | 215 | 35.3 | 0.904 |
| IVF-PQ-x4 | 3641 | 3.1 | 0.23 |
| IVF-PQ-x4-opt | 368 | 3.1 | 0.906 |
| IVF-PQ-x16-opt | 267 | 8.0 | 0.719 |

* Indices marked 'opt' are the optimized indices in this PR; num_threads in the index build stage is set to 96.
** All indices use the default build parameters; the search parameter (nprobe) is selected to reach 0.9 precision.
*** IVF-PQ has low precision (0.23) due to the bug that PQ was not trained on residual vectors.

@whateveraname
Contributor Author

PTAL @usamoi

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

Thanks. Can you fix the lint error in CI? Also what's the PQ ratio in your benchmark?

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized? How did you configure num_threads?

@whateveraname
Contributor Author

> Thanks. Can you fix the lint error in CI? Also what's the PQ ratio in your benchmark?

The CI error is caused by the unused struct SyncUnsafeCell, which was not written by me; I cannot decide whether to remove it. The PQ ratio is x4.

@whateveraname
Contributor Author

> Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized?

It seems to me that a lot of time is spent on I/O, so the computation speed-up cannot contribute much to overall performance.

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

> Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized?

> It seems to me that a lot of time is spent on I/O, so the computation speed-up cannot contribute much to overall performance.

That doesn't make sense: 1M 960-dim vectors should easily fit in memory.

Also, can you try benchmarking with a higher PQ ratio?

@VoVAllen VoVAllen requested a review from usamoi February 2, 2024 07:00
@usamoi
Collaborator

usamoi commented Feb 2, 2024

> Thanks. Can you fix the lint error in CI? Also what's the PQ ratio in your benchmark?

> The CI error is caused by the unused struct SyncUnsafeCell, which was not written by me; I cannot decide whether to remove it. The PQ ratio is x4.

You can just remove unused code.

@whateveraname
Contributor Author

> Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized?

> It seems to me that a lot of time is spent on I/O, so the computation speed-up cannot contribute much to overall performance.

> That doesn't make sense: 1M 960-dim vectors should easily fit in memory. Also, can you try benchmarking with a higher PQ ratio?

The index build time includes the time to read the whole dataset from disk and the time to save the whole index to disk, so it's not about whether the vectors fit in memory.

I will try a higher PQ ratio.

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

> Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized?

> It seems to me that a lot of time is spent on I/O, so the computation speed-up cannot contribute much to overall performance.

> That doesn't make sense: 1M 960-dim vectors should easily fit in memory. Also, can you try benchmarking with a higher PQ ratio?

> The index build time includes the time to read the whole dataset from disk and the time to save the whole index to disk, so it's not about whether the vectors fit in memory. I will try a higher PQ ratio.

It still doesn't make sense. A normal disk can achieve >500 MB/s throughput for sequential reads or writes, and 1M 960-dim floats take less than 4 GB (1,000,000 × 960 × 4 B ≈ 3.84 GB), which accounts for less than 10 s of reading or writing.

Member

@VoVAllen VoVAllen left a comment


Also, can you add some comments in the k-means and IVF parts? Give a simple explanation of each part. And feel free to turn this PR into ready mode, so we can start review.

@whateveraname
Contributor Author

whateveraname commented Feb 2, 2024

Performance Benchmark

Dataset: gist-960-euclidean-l2, n = 1,000,000, d = 960

| index | build time (s) | rps | precision |
| --- | --- | --- | --- |
| IVF-naive | 246 | 27.5 | 0.904 |
| IVF-naive-opt | 215 | 35.3 | 0.904 |
| IVF-PQ-x4 | 3641 | 3.1 | 0.23 |
| IVF-PQ-x4-opt | 368 | 3.1 | 0.906 |
| IVF-PQ-x16-opt | 267 | 8.0 | 0.719 |

* Indices marked 'opt' are the optimized indices in this PR; num_threads in the index build stage is set to 96.
** All indices use the default build parameters; the search parameter (nprobe) is selected to reach 0.9 precision.
*** IVF-PQ has low precision (0.23) due to the bug that PQ was not trained on residual vectors.

Update Benchmark

  • Updated build time for IVF-PQ-x4-opt, run with the latest commit, which parallelizes training over the subquantizers. This fully utilizes computation resources.
  • Added results for IVF-PQ-x16-opt.
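The "parallelizes training over the subquantizers" point can be sketched like this: the M sub-codebooks are independent, so each can be trained on its own thread. This is a hedged illustration with a placeholder `train_one` (a real trainer would run k-means); none of these names come from the PR.

```rust
use std::thread;

// Train each subquantizer's codebook on its own thread. sub_data[m] holds
// the sub-vectors for subspace m; the results are joined in order.
fn train_subquantizers(sub_data: Vec<Vec<Vec<f32>>>) -> Vec<Vec<Vec<f32>>> {
    thread::scope(|s| {
        let handles: Vec<_> = sub_data
            .into_iter()
            .map(|d| s.spawn(move || train_one(d)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

// Placeholder "training": produce a single centroid at the mean.
fn train_one(data: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    let dim = data[0].len();
    let n = data.len() as f32;
    let mut mean = vec![0.0; dim];
    for v in &data {
        for (m, x) in mean.iter_mut().zip(v) {
            *m += x / n;
        }
    }
    vec![mean]
}
```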

@whateveraname whateveraname marked this pull request as ready for review February 2, 2024 09:35
whateveraname and others added 11 commits February 3, 2024 10:33
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
fix PQ training for IVF residuals

Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
fix PQ training for IVF residuals

Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
@VoVAllen
Member

VoVAllen commented Feb 6, 2024

PTAL @usamoi

    let width = self.dims.div_ceil(self.ratio);
    let s = i as usize * width as usize;
    let e = (i + 1) as usize * width as usize;
    &self.codes[s..e]
}

pub fn set_codes(&mut self, codes: MmapArray<u8>) {
Collaborator


Do not expose it.

Contributor Author


fixed, PTAL

@VoVAllen
Member

VoVAllen commented Feb 7, 2024

Please fix the CI check

@whateveraname
Contributor Author

whateveraname commented Feb 7, 2024

IVF-PQ now uses table lookup for distance computation in the search stage. Currently only L2 distance is supported; support for IP distance will be added next. Cosine distance will keep using the original distance computation method. The following table shows the performance benchmark.

| index | build time (s) | rps | precision |
| --- | --- | --- | --- |
| IVF-PQ-x4-opt | 368 | 3.08 | 0.906 |
| IVF-PQ-x4-opt-table | 341 | 5.27 | 0.901 |
| IVF-PQ-x16-opt | 267 | 8.05 | 0.719 |
| IVF-PQ-x16-opt-table | 251 | 19.26 | 0.720 |

During IVF-PQ search with by_residual, we compute

    d = || x - y_C - y_R ||^2

where x is the query vector, y_C the coarse centroid, and y_R the refined PQ centroid. The expression can be decomposed as:

    d = || x - y_C ||^2 + ( || y_R ||^2 + 2 * (y_C | y_R) ) - 2 * (x | y_R)
            term 1                    term 2                     term 3

When using multiprobe, we use this decomposition as follows:

  • term 1 is the distance to the coarse centroid, which is computed during the first-stage search.
  • term 2 can be precomputed, as it does not involve x.
  • term 3 is the classical non-residual distance table. Since y_R is defined by a product quantizer, it is split across subvectors and stored separately for each subvector.

At search time, the tables for term 2 and term 3 are added up. This is faster when the length of the inverted lists is > ksub * M.

ref: faiss
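As a sanity check of the decomposition above, here is a minimal, self-contained Rust sketch (illustrative only, none of the PR's actual types): it verifies numerically that the full residual distance equals term 1 + term 2 - term 3.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Full residual distance: || x - y_C - y_R ||^2.
fn full(x: &[f32], yc: &[f32], yr: &[f32]) -> f32 {
    x.iter()
        .zip(yc)
        .zip(yr)
        .map(|((x, c), r)| (x - c - r) * (x - c - r))
        .sum()
}

// Decomposed form: term 1 + term 2 - term 3, matching the derivation above.
fn decomposed(x: &[f32], yc: &[f32], yr: &[f32]) -> f32 {
    let term1: f32 = x.iter().zip(yc).map(|(x, c)| (x - c) * (x - c)).sum();
    let term2 = dot(yr, yr) + 2.0 * dot(yc, yr);
    let term3 = 2.0 * dot(x, yr);
    term1 + term2 - term3
}
```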

Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
@usamoi
Collaborator

usamoi commented Feb 18, 2024

Is it ready for merging?

@whateveraname
Contributor Author

> Is it ready for merging?

It is ready for merging now.

@usamoi usamoi added this pull request to the merge queue Feb 21, 2024
Merged via the queue into tensorchord:main with commit 3d1621b Feb 21, 2024
8 checks passed
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>