
refactor: faster IVF & PQ #328

Merged · 22 commits merged into tensorchord:main · Feb 21, 2024

Conversation

whateveraname
Contributor

@whateveraname whateveraname commented Jan 31, 2024

Work done

  • add multi-threading support for IVF index building and k-means
  • fix a bug where IVF-PQ is not trained on IVF residuals
  • store codes in a layout where codes belonging to the same cluster are placed contiguously, for better locality
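As a rough illustration of the first bullet, here is a minimal std-only sketch of a parallelized k-means assignment step. The names `assign_parallel` and `squared_l2` are illustrative, not the PR's actual API, and the PR's real implementation is not shown here:

```rust
use std::thread;

fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Assign each vector to its nearest centroid, splitting the work across
// `num_threads` scoped threads; each thread writes only to its own
// disjoint chunk of the output, so no synchronization is needed.
fn assign_parallel(vectors: &[Vec<f32>], centroids: &[Vec<f32>], num_threads: usize) -> Vec<usize> {
    let mut assignments = vec![0usize; vectors.len()];
    let chunk = vectors.len().div_ceil(num_threads);
    thread::scope(|s| {
        for (vs, out) in vectors.chunks(chunk).zip(assignments.chunks_mut(chunk)) {
            s.spawn(move || {
                for (v, slot) in vs.iter().zip(out.iter_mut()) {
                    let mut best = 0;
                    let mut best_d = f32::INFINITY;
                    for (k, c) in centroids.iter().enumerate() {
                        let d = squared_l2(v, c);
                        if d < best_d {
                            best_d = d;
                            best = k;
                        }
                    }
                    *slot = best;
                }
            });
        }
    });
    assignments
}
```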

TODO

  • use a lookup table for distance computation in PQ search
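The TODO can be sketched as follows, with assumed, illustrative names (`build_lut` and `adc_distance` are not the PR's API): for each of the M PQ subspaces, precompute the squared L2 distance from the query's sub-vector to every sub-centroid once per query; scanning a code then costs M table lookups instead of a full d-dimensional distance computation.

```rust
// codebook[m][k] is the k-th sub-centroid of subspace m.
fn build_lut(query: &[f32], codebook: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let m = codebook.len();
    let sub = query.len() / m;
    (0..m)
        .map(|i| {
            let q = &query[i * sub..(i + 1) * sub];
            codebook[i]
                .iter()
                .map(|c| q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum())
                .collect()
        })
        .collect()
}

// Asymmetric distance of a PQ code: M table lookups, one per subspace.
fn adc_distance(lut: &[Vec<f32>], code: &[u8]) -> f32 {
    code.iter().enumerate().map(|(i, &c)| lut[i][c as usize]).sum()
}
```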

@@ -23,4 +23,8 @@ impl<T: ?Sized> SyncUnsafeCell<T> {
    pub fn get_mut(&mut self) -> &mut T {
        self.value.get_mut()
    }

    pub fn get_ref(&self) -> &T {
Collaborator

@usamoi usamoi Jan 31, 2024


Please do not expose this function.

if dis * dis < weight[j] {
    weight[j] = dis * dis;
    unsafe {
        (&mut *lowerbound.get())[(j, i)] = dis;
Collaborator


It's absolutely unsound. Please use a structure like Square<Atomic<F32>>.
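One way to realize the suggested `Square<Atomic<F32>>` without `unsafe` is to store the f32 bit pattern in an `AtomicU32`. This is a minimal sketch with illustrative names, not the crate's actual types:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// An atomic f32: the bits are kept in an AtomicU32 and converted on access.
struct AtomicF32(AtomicU32);

impl AtomicF32 {
    fn new(v: f32) -> Self { Self(AtomicU32::new(v.to_bits())) }
    fn load(&self) -> f32 { f32::from_bits(self.0.load(Ordering::Relaxed)) }
    fn store(&self, v: f32) { self.0.store(v.to_bits(), Ordering::Relaxed); }
}

// A flat row-major "square" of atomics that threads can write through a
// shared reference, replacing the UnsafeCell-based matrix write above.
struct AtomicSquare { data: Vec<AtomicF32>, cols: usize }

impl AtomicSquare {
    fn new(rows: usize, cols: usize) -> Self {
        Self { data: (0..rows * cols).map(|_| AtomicF32::new(0.0)).collect(), cols }
    }
    fn set(&self, r: usize, c: usize, v: f32) { self.data[r * self.cols + c].store(v); }
    fn get(&self, r: usize, c: usize) -> f32 { self.data[r * self.cols + c].load() }
}
```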

if o.compare_exchange(next, i, Release, Relaxed).is_ok() {
    break;
unsafe {
    (&mut *idx.get())[i as usize] = result.1 as usize;
Collaborator


Unsound. Use idx: Vec<AtomicUsize> instead.
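A minimal sketch of the suggested `Vec<AtomicUsize>` pattern (illustrative only, not the PR's code): each worker thread stores into its own slot through a shared borrow, with no `unsafe` and no `UnsafeCell`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Fill an index array concurrently: the AtomicUsize slots make the shared
// writes sound, and scoped threads are joined before the vector is read.
fn fill_idx(n: usize) -> Vec<usize> {
    let idx: Vec<AtomicUsize> = (0..n).map(|_| AtomicUsize::new(0)).collect();
    thread::scope(|s| {
        for (i, slot) in idx.iter().enumerate() {
            s.spawn(move || slot.store(i * i, Ordering::Relaxed));
        }
    });
    idx.into_iter().map(|a| a.into_inner()).collect()
}
```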

@whateveraname
Contributor Author

The multi-threaded parts are guaranteed by the logic to have no data races. Do I still have to remove all these unsafe blocks?

@whateveraname
Contributor Author

whateveraname commented Feb 2, 2024

Performance Benchmark

Dataset: gist-960-euclidean-l2, n = 1,000,000, d = 960

| index | build time (s) | rps | precision |
| --- | --- | --- | --- |
| IVF-naive | 246 | 27.5 | 0.904 |
| IVF-naive-opt | 215 | 35.3 | 0.904 |
| IVF-PQ-x4 | 3641 | 3.1 | 0.23 |
| IVF-PQ-x4-opt | 368 | 3.1 | 0.906 |
| IVF-PQ-x16-opt | 267 | 8.0 | 0.719 |

* Indices marked 'opt' are the optimized indices in this PR; num_threads in the index build stage is set to 96.
** All indices use the default build parameters; the search parameter (nprobe) is selected to reach 0.9 precision.
*** IVF-PQ has low precision (0.23) due to the bug that PQ was not trained on residual vectors.

@whateveraname
Contributor Author

PTAL @usamoi

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

Thanks. Can you fix the lint error in CI? Also what's the PQ ratio in your benchmark?

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized? How did you configure num_threads?

@whateveraname
Contributor Author

> Thanks. Can you fix the lint error in CI? Also what's the PQ ratio in your benchmark?

The CI error is caused by the unused struct SyncUnsafeCell, which was not written by me; I cannot decide whether to remove it. The PQ ratio is x4.

@whateveraname
Contributor Author

> Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized?

It seems to me that a lot of time is spent on I/O, so the computation speed-up cannot contribute much to overall performance.

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

> Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized?

> It seems to me that a lot of time is spent on I/O, so the computation speed-up cannot contribute much to overall performance.

That doesn't make sense: 1M 960-dim vectors should easily fit in memory.

Also, can you try benchmarking with a higher PQ ratio?

@VoVAllen VoVAllen requested a review from usamoi February 2, 2024 07:00
@usamoi
Collaborator

usamoi commented Feb 2, 2024

> Thanks. Can you fix the lint error in CI? Also what's the PQ ratio in your benchmark?

> The CI error is caused by the unused struct SyncUnsafeCell, which was not written by me; I cannot decide whether to remove it. The PQ ratio is x4.

You can just remove unused code.

@whateveraname
Contributor Author

> Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized?

> It seems to me that a lot of time is spent on I/O, so the computation speed-up cannot contribute much to overall performance.

> That doesn't make sense: 1M 960-dim vectors should easily fit in memory. Also, can you try benchmarking with a higher PQ ratio?

The index build time includes the time to read the whole dataset from disk and the time to save the whole index to disk, so it's not about whether the vectors fit in memory.

I will try a higher PQ ratio.

@VoVAllen
Member

VoVAllen commented Feb 2, 2024

> Why does num_threads=96 seem to give little acceleration? Is it because the k-means computation is not parallelized?

> It seems to me that a lot of time is spent on I/O, so the computation speed-up cannot contribute much to overall performance.

> That doesn't make sense: 1M 960-dim vectors should easily fit in memory. Also, can you try benchmarking with a higher PQ ratio?

> The index build time includes the time to read the whole dataset from disk and the time to save the whole index to disk, so it's not about whether the vectors fit in memory. I will try a higher PQ ratio.

It still doesn't make sense. A normal disk can achieve >500 MB/s throughput for sequential reads or writes, and 1M 960-dim floats take less than 4 GB (1,000,000 × 960 × 4 B ≈ 3.84 GB), which accounts for less than 10 s of reading or writing.

Member

@VoVAllen VoVAllen left a comment


Also, can you add some comments in the k-means and IVF parts? Give a simple explanation of each part. And feel free to turn this PR into ready mode, so we can start review.

@whateveraname
Contributor Author

whateveraname commented Feb 2, 2024

Performance Benchmark

Dataset: gist-960-euclidean-l2, n = 1,000,000, d = 960

| index | build time (s) | rps | precision |
| --- | --- | --- | --- |
| IVF-naive | 246 | 27.5 | 0.904 |
| IVF-naive-opt | 215 | 35.3 | 0.904 |
| IVF-PQ-x4 | 3641 | 3.1 | 0.23 |
| IVF-PQ-x4-opt | 368 | 3.1 | 0.906 |
| IVF-PQ-x16-opt | 267 | 8.0 | 0.719 |

* Indices marked 'opt' are the optimized indices in this PR; num_threads in the index build stage is set to 96.
** All indices use the default build parameters; the search parameter (nprobe) is selected to reach 0.9 precision.
*** IVF-PQ has low precision (0.23) due to the bug that PQ was not trained on residual vectors.

Update Benchmark

  • Updated build time for IVF-PQ-x4-opt, run with the latest commit, which parallelizes training over the subquantizers. This fully utilizes computation resources.
  • Added results for IVF-PQ-x16-opt.
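The "parallelizes training over the subquantizers" point can be sketched like this: the M sub-codebooks are independent, so each can be trained on its own thread. This is a hedged illustration with a placeholder `train_one` (a real trainer would run k-means); none of these names come from the PR.

```rust
use std::thread;

// Train each subquantizer's codebook on its own thread. sub_data[m] holds
// the sub-vectors for subspace m; the results are joined in order.
fn train_subquantizers(sub_data: Vec<Vec<Vec<f32>>>) -> Vec<Vec<Vec<f32>>> {
    thread::scope(|s| {
        let handles: Vec<_> = sub_data
            .into_iter()
            .map(|d| s.spawn(move || train_one(d)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

// Placeholder "training": produce a single centroid at the mean.
fn train_one(data: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    let dim = data[0].len();
    let n = data.len() as f32;
    let mut mean = vec![0.0; dim];
    for v in &data {
        for (m, x) in mean.iter_mut().zip(v) {
            *m += x / n;
        }
    }
    vec![mean]
}
```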

@whateveraname whateveraname marked this pull request as ready for review February 2, 2024 09:35
whateveraname and others added 11 commits February 3, 2024 10:33
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
fix PQ training for IVF residuals

Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
fix PQ training for IVF residuals

Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
@VoVAllen
Member

VoVAllen commented Feb 6, 2024

PTAL @usamoi

    let width = self.dims.div_ceil(self.ratio);
    let s = i as usize * width as usize;
    let e = (i + 1) as usize * width as usize;
    &self.codes[s..e]
}

pub fn set_codes(&mut self, codes: MmapArray<u8>) {
Collaborator


Do not expose it.

Contributor Author


fixed, PTAL

@VoVAllen
Member

VoVAllen commented Feb 7, 2024

Please fix the CI check

@whateveraname
Contributor Author

whateveraname commented Feb 7, 2024

IVF-PQ now uses table lookup for distance computation in the search stage. Currently only L2 distance is supported; support for IP distance will be added next. Cosine distance will keep using the original distance computation method. The following table shows the performance benchmark.

| index | build time (s) | rps | precision |
| --- | --- | --- | --- |
| IVF-PQ-x4-opt | 368 | 3.08 | 0.906 |
| IVF-PQ-x4-opt-table | 341 | 5.27 | 0.901 |
| IVF-PQ-x16-opt | 267 | 8.05 | 0.719 |
| IVF-PQ-x16-opt-table | 251 | 19.26 | 0.720 |

During IVF-PQ search with by_residual, we compute

    d = || x - y_C - y_R ||^2

where x is the query vector, y_C the coarse centroid, and y_R the refined PQ centroid. The expression can be decomposed as:

    d = || x - y_C ||^2 + ( || y_R ||^2 + 2 * (y_C | y_R) ) - 2 * (x | y_R)
            term 1                    term 2                     term 3

When using multiprobe, we use this decomposition as follows:

  • term 1 is the distance to the coarse centroid, which is computed during the first-stage search.
  • term 2 can be precomputed, as it does not involve x.
  • term 3 is the classical non-residual distance table. Since y_R is defined by a product quantizer, it is split across subvectors and stored separately for each subvector.

At search time, the tables for term 2 and term 3 are added up. This is faster when the length of the inverted lists is > ksub * M.

ref: faiss
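As a sanity check of the decomposition above, here is a minimal, self-contained Rust sketch (illustrative only, none of the PR's actual types): it verifies numerically that the full residual distance equals term 1 + term 2 - term 3.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Full residual distance: || x - y_C - y_R ||^2.
fn full(x: &[f32], yc: &[f32], yr: &[f32]) -> f32 {
    x.iter()
        .zip(yc)
        .zip(yr)
        .map(|((x, c), r)| (x - c - r) * (x - c - r))
        .sum()
}

// Decomposed form: term 1 + term 2 - term 3, matching the derivation above.
fn decomposed(x: &[f32], yc: &[f32], yr: &[f32]) -> f32 {
    let term1: f32 = x.iter().zip(yc).map(|(x, c)| (x - c) * (x - c)).sum();
    let term2 = dot(yr, yr) + 2.0 * dot(yc, yr);
    let term3 = 2.0 * dot(x, yr);
    term1 + term2 - term3
}
```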

Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
@usamoi
Collaborator

usamoi commented Feb 18, 2024

Is it ready for merging?

@whateveraname
Contributor Author

> Is it ready for merging?

It is ready for merging now.

@usamoi usamoi added this pull request to the merge queue Feb 21, 2024
Merged via the queue into tensorchord:main with commit 3d1621b Feb 21, 2024
8 checks passed
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>
Signed-off-by: whateveraname <12011319@mail.sustech.edu.cn>