
Wrong variable description in wiki Faiss building blocks: clustering, PCA, quantization? #2709

gitgithan opened this issue Feb 15, 2023 · 2 comments

gitgithan commented Feb 15, 2023

I'm referring to the code at https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization#example-pq-encoding--decoding:

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()

The 2nd and 3rd arguments to faiss.ProductQuantizer in this example are cs and 8; however, when checking https://github.com/bitsun/faiss-windows/blob/master/ProductQuantizer.h#L27, the three constructor parameters are d, M, nbits.

Problem

I think the 2nd parameter M is the number of segments (sub-quantizers), not the code size cs as described in the wiki.
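
A quick check of the ProductQuantizer attributes (my own sketch, not from the wiki) seems to support this reading of the constructor arguments:

import faiss

d = 32
M = 4        # number of sub-quantizers (segments); the wiki example calls this cs
nbits = 8    # bits per sub-quantizer index

pq = faiss.ProductQuantizer(d, M, nbits)
print(pq.M)          # 4   -> number of segments
print(pq.dsub)       # 8   -> dimensions per segment, d / M
print(pq.ksub)       # 256 -> centroids per segment, 2**nbits
print(pq.code_size)  # 4   -> bytes per packed code, ceil(M * nbits / 8)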

Other related questions

  1. Why does codes.shape drop from 4 columns to 2 when I change the 3rd argument from 8 to 4? I thought the number of columns in codes is exactly M (the 2nd argument) and should not vary with the 3rd argument.
  2. Is there supposed to be some interaction between M and nbits that determines the number of columns in codes?
  3. Why is codes.max() 15 when nbits (the 3rd parameter) is 1, but 255 when nbits > 2? I expected codes.max() to be at most 2**nbits - 1.

mdouze commented Feb 15, 2023

The codes are packed, see #2285
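
For context, a minimal sketch (assuming the packed code size is ceil(M * nbits / 8); the attribute names are from ProductQuantizer, the rest is my own illustration) of how codes.shape and codes.max() vary with nbits:

import faiss
import numpy as np

d = 32
M = 4
np.random.seed(0)
xt = np.random.rand(10000, d).astype('float32')
x = np.random.rand(100, d).astype('float32')

for nbits in (8, 4, 3, 1):
    pq = faiss.ProductQuantizer(d, M, nbits)
    pq.train(xt)
    codes = pq.compute_codes(x)
    # the number of columns is pq.code_size = ceil(M * nbits / 8),
    # so it depends on nbits as well as M once the sub-codes are packed
    print(nbits, pq.code_size, codes.shape, int(codes.max()))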


gitgithan commented Feb 16, 2023

Thanks, that explains all my shape problems.

code_size vs M

On code_size, I see inconsistent interpretations at https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.
In the PQ row of the indexes table, the Bytes/Vector column gives ceil(M * nbit / 8). This contributed to me raising this issue, thinking that code_size should be called M.
I also agree with the use of code_size in the statement "Flat indexes just encode the vectors into codes of a fixed size and store them in an array of ntotal * code_size bytes."
However, further down that page, in index = faiss.IndexIVFPQ(coarse_quantizer, d, ncentroids, code_size, 8), I believe code_size should be M instead.
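
A sketch of how I would check this, under the assumption that the 4th IndexIVFPQ argument really is M (the number of sub-quantizers) and that the inner pq object exposes the same attributes as a standalone ProductQuantizer:

import faiss

d = 32
ncentroids = 16
M = 4   # what the wiki snippet passes as code_size
coarse_quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(coarse_quantizer, d, ncentroids, M, 8)
print(index.pq.M)          # 4 -> the 4th argument sets the number of sub-quantizers
print(index.pq.code_size)  # 4 -> bytes per vector, ceil(M * 8 / 8)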

I found the clearest disambiguation of code size and M in this statement at https://github.com/facebookresearch/faiss/wiki/FAQ#is-pq-encoding-with-more-or-less-than-8-bits-per-components-supported:
"The code size of the M PQ indices is rounded up to a whole number of bytes, ie. PQ3x4 uses 2 bytes."
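
As a quick worked check of that rounding rule (my own arithmetic, not from the FAQ):

import math

# bytes per vector for a PQ with M sub-quantizers of nbits bits each
def pq_bytes(M, nbits):
    return math.ceil(M * nbits / 8)

print(pq_bytes(3, 4))  # PQ3x4: 12 bits -> 2 bytes
print(pq_bytes(4, 4))  # PQ4x4: 16 bits -> 2 bytes
print(pq_bytes(4, 8))  # PQ4x8: 32 bits -> 4 bytes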

Verifying BitstringReader

To verify what BitstringReader is doing, I tried my own slicing of the bytes: joining the packed codes from faiss into one bit string and splitting it every nbits bits, expecting to see the same numbers produced by BitstringReader and by my code:

def convert_codes(codes, nbits):
    print('First Code: ', codes[0])

    # always printing the first code only as an example
    binary = [f'{c:08b}' for c in codes[0]]
    print('Binary: ', binary)

    # join all bytes MSB-first, then split into nbits-sized chunks
    joined_binary = ''.join(binary)
    split_binary = [joined_binary[i:i + nbits] for i in range(0, len(joined_binary), nbits)]
    print('Split Binary: ', split_binary)

    return [int(b, 2) for b in split_binary]

print(convert_codes(codes, nbits))

# the 2nd argument of BitstringReader is the code size in bytes
bs = faiss.BitstringReader(faiss.swig_ptr(codes[0]), codes.shape[1])
for i in range(cs):  # cs is used as M here, i.e. the number of sub-codes per vector
    print(bs.read(nbits))

Seeded random inputs

import faiss
import numpy as np

d = 32   # data dimension
cs = 4   # number of sub-quantizers M (the wiki example calls this the code size)
nbits = 8

# train set
nt = 10000
np.random.seed(0)
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, nbits)
pq.train(xt)
codes = pq.compute_codes(x)  # the codes unpacked in the snippet above

What's strange is that the result depends on nbits:

nbits = 8: I match the faiss output

First Code:  [161 140   7 114]
Binary:  ['10100001', '10001100', '00000111', '01110010']
Split Binary:  ['10100001', '10001100', '00000111', '01110010']
[161, 140, 7, 114]
161
140
7
114

nbits = 4: I match the values but in the wrong order (I'm not sure what the 2nd parameter of faiss.BitstringReader is doing)

First Code:  [ 89 246]
Binary:  ['01011001', '11110110']
Split Binary:  ['0101', '1001', '1111', '0110']
[5, 9, 15, 6]
9
5
6
15

nbits = 3: I get completely different codes (both length and values)

First Code:  [82 15]
Binary:  ['01010010', '00001111']
Split Binary:  ['010', '100', '100', '000', '111', '1']
[2, 4, 4, 0, 7, 1]
2
2
5
7

What explains the previous 3 scenarios?
I feel it has to do with whether 0s should be prepended or appended for values < 128 that don't take up a full 8 bits, but even with that in mind I still can't explain the patterns above.
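
One interpretation that does reproduce the BitstringReader output for all three cases above (a guess on my part, not something I have confirmed in the faiss docs) is that the sub-codes are packed least-significant-bit first within each byte, rather than MSB-first as in my convert_codes:

def convert_codes_lsb(code, M, nbits):
    # treat the packed bytes as one bit stream, least significant bit of
    # byte 0 first, then read M chunks of nbits bits, where the first bit
    # of each chunk is the least significant bit of the decoded value
    bits = []
    for byte in code:
        bits.extend((byte >> i) & 1 for i in range(8))
    values = []
    for m in range(M):
        chunk = bits[m * nbits:(m + 1) * nbits]
        values.append(sum(b << i for i, b in enumerate(chunk)))
    return values

print(convert_codes_lsb([161, 140, 7, 114], 4, 8))  # [161, 140, 7, 114]
print(convert_codes_lsb([89, 246], 4, 4))           # [9, 5, 6, 15]
print(convert_codes_lsb([82, 15], 4, 3))            # [2, 2, 5, 7]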

Why pq.compute_codes?

Also, in what cases do users use the codes from pq.compute_codes(x)? I have only seen them used in pq.decode(codes) so far. They are unintuitive because the packed numbers don't indicate which centroid is chosen in which segment, so why not directly expose the outputs of BitstringReader? Knowing which centroid is chosen in which segment would help make use of faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub) to relate each database vector to the centroids of every segment.
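
One use I can see (a sketch reusing pq and codes from the seeded nbits = 8 setup above, and assuming that with nbits = 8 each byte of a code is directly the centroid index for one segment) is reconstructing a vector manually from the centroid table:

import faiss
import numpy as np

# (M, ksub, dsub) table of all centroids from all segments
centroids = faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub)

# with nbits = 8, codes[i][m] is the centroid index chosen for segment m of vector i
i = 0
manual = np.concatenate([centroids[m, codes[i][m]] for m in range(pq.M)])
print(np.allclose(manual, pq.decode(codes[i:i + 1])[0]))  # expected: True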

Wiki improvements

Could the wiki have some comments added so that new readers don't run into these same questions?
