
Wrong variable description in wiki Faiss building blocks: clustering, PCA, quantization? #2709

gitgithan opened this issue Feb 15, 2023 · 2 comments

gitgithan commented Feb 15, 2023

I'm referring to the code at https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization#example-pq-encoding--decoding:

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()

The 2nd and 3rd arguments to faiss.ProductQuantizer in this example are cs and 8; however, when checking https://github.com/bitsun/faiss-windows/blob/master/ProductQuantizer.h#L27, the three constructor parameters are d, M, nbits.

Problem

I think the 2nd parameter M is the number of segments (sub-quantizers), not the code size cs as described in the wiki.
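
A quick check of the ProductQuantizer attributes (my own sketch, not from the wiki) seems to support this reading of the constructor arguments:

import faiss

d = 32
M = 4        # number of sub-quantizers (segments); the wiki example calls this cs
nbits = 8    # bits per sub-quantizer index

pq = faiss.ProductQuantizer(d, M, nbits)
print(pq.M)          # 4   -> number of segments
print(pq.dsub)       # 8   -> dimensions per segment, d / M
print(pq.ksub)       # 256 -> centroids per segment, 2**nbits
print(pq.code_size)  # 4   -> bytes per packed code, ceil(M * nbits / 8)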

Other related questions

  1. Why does codes.shape drop from 4 columns to 2 when I change the 3rd argument from 8 to 4? I thought the number of columns in codes is exactly M (the 2nd argument) and should not vary with the 3rd argument.
  2. Is there supposed to be some interaction between M and nbits that determines the number of columns in codes?
  3. Why is codes.max() 15 when nbits (the 3rd parameter) is 1, but 255 when nbits > 2? I expected codes.max() to be at most 2**nbits - 1.

mdouze commented Feb 15, 2023

The codes are packed, see #2285
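
For context, a minimal sketch (assuming the packed code size is ceil(M * nbits / 8); the attribute names are from ProductQuantizer, the rest is my own illustration) of how codes.shape and codes.max() vary with nbits:

import faiss
import numpy as np

d = 32
M = 4
np.random.seed(0)
xt = np.random.rand(10000, d).astype('float32')
x = np.random.rand(100, d).astype('float32')

for nbits in (8, 4, 3, 1):
    pq = faiss.ProductQuantizer(d, M, nbits)
    pq.train(xt)
    codes = pq.compute_codes(x)
    # the number of columns is pq.code_size = ceil(M * nbits / 8),
    # so it depends on nbits as well as M once the sub-codes are packed
    print(nbits, pq.code_size, codes.shape, int(codes.max()))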


gitgithan commented Feb 16, 2023

Thanks, that explains all my shape problems.

code_size vs M

On code_size, I see inconsistent interpretations at https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.
In the PQ row of the indexes table, the Bytes/Vector column gives ceil(M * nbit / 8). This contributed to me raising this issue, thinking that code_size should be called M.
I also agree with the use of code_size in the statement "Flat indexes just encode the vectors into codes of a fixed size and store them in an array of ntotal * code_size bytes."
However, further down that page, in index = faiss.IndexIVFPQ(coarse_quantizer, d, ncentroids, code_size, 8), I believe code_size should be M instead.
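
A sketch of how I would check this, under the assumption that the 4th IndexIVFPQ argument really is M (the number of sub-quantizers) and that the inner pq object exposes the same attributes as a standalone ProductQuantizer:

import faiss

d = 32
ncentroids = 16
M = 4   # what the wiki snippet passes as code_size
coarse_quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(coarse_quantizer, d, ncentroids, M, 8)
print(index.pq.M)          # 4 -> the 4th argument sets the number of sub-quantizers
print(index.pq.code_size)  # 4 -> bytes per vector, ceil(M * 8 / 8)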

I found the clearest disambiguation of code size and M in this statement at https://github.com/facebookresearch/faiss/wiki/FAQ#is-pq-encoding-with-more-or-less-than-8-bits-per-components-supported:
"The code size of the M PQ indices is rounded up to a whole number of bytes, ie. PQ3x4 uses 2 bytes."
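
As a quick worked check of that rounding rule (my own arithmetic, not from the FAQ):

import math

# bytes per vector for a PQ with M sub-quantizers of nbits bits each
def pq_bytes(M, nbits):
    return math.ceil(M * nbits / 8)

print(pq_bytes(3, 4))  # PQ3x4: 12 bits -> 2 bytes
print(pq_bytes(4, 4))  # PQ4x4: 16 bits -> 2 bytes
print(pq_bytes(4, 8))  # PQ4x8: 32 bits -> 4 bytes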

Verifying BitstringReader

To verify what BitstringReader is doing, I tried my own slicing of the bytes: joining the packed codes from faiss into one bit string and splitting it every nbits bits, expecting to see the same numbers produced by BitstringReader and by my code:

def convert_codes(codes, nbits):
    print('First Code: ', codes[0])

    # always printing the first code only as an example
    binary = [f'{c:08b}' for c in codes[0]]
    print('Binary: ', binary)

    # join all bytes MSB-first, then split into nbits-sized chunks
    joined_binary = ''.join(binary)
    split_binary = [joined_binary[i:i + nbits] for i in range(0, len(joined_binary), nbits)]
    print('Split Binary: ', split_binary)

    return [int(b, 2) for b in split_binary]

print(convert_codes(codes, nbits))

# the 2nd argument of BitstringReader is the code size in bytes
bs = faiss.BitstringReader(faiss.swig_ptr(codes[0]), codes.shape[1])
for i in range(cs):  # cs is used as M here, i.e. the number of sub-codes per vector
    print(bs.read(nbits))

Seeded random inputs

import faiss
import numpy as np

d = 32   # data dimension
cs = 4   # number of sub-quantizers M (the wiki example calls this the code size)
nbits = 8

# train set
nt = 10000
np.random.seed(0)
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, nbits)
pq.train(xt)
codes = pq.compute_codes(x)  # the codes unpacked in the snippet above

What's strange is that the result depends on nbits:

nbits = 8: I match the faiss output

First Code:  [161 140   7 114]
Binary:  ['10100001', '10001100', '00000111', '01110010']
Split Binary:  ['10100001', '10001100', '00000111', '01110010']
[161, 140, 7, 114]
161
140
7
114

nbits = 4: I match the values but in the wrong order (I'm not sure what the 2nd parameter of faiss.BitstringReader is doing)

First Code:  [ 89 246]
Binary:  ['01011001', '11110110']
Split Binary:  ['0101', '1001', '1111', '0110']
[5, 9, 15, 6]
9
5
6
15

nbits = 3: I get completely different codes (both length and values)

First Code:  [82 15]
Binary:  ['01010010', '00001111']
Split Binary:  ['010', '100', '100', '000', '111', '1']
[2, 4, 4, 0, 7, 1]
2
2
5
7

What explains the previous 3 scenarios?
I feel it has to do with whether 0s should be prepended or appended for values < 128 that don't take up a full 8 bits, but even with that in mind I still can't explain the patterns above.
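
One interpretation that does reproduce the BitstringReader output for all three cases above (a guess on my part, not something I have confirmed in the faiss docs) is that the sub-codes are packed least-significant-bit first within each byte, rather than MSB-first as in my convert_codes:

def convert_codes_lsb(code, M, nbits):
    # treat the packed bytes as one bit stream, least significant bit of
    # byte 0 first, then read M chunks of nbits bits, where the first bit
    # of each chunk is the least significant bit of the decoded value
    bits = []
    for byte in code:
        bits.extend((byte >> i) & 1 for i in range(8))
    values = []
    for m in range(M):
        chunk = bits[m * nbits:(m + 1) * nbits]
        values.append(sum(b << i for i, b in enumerate(chunk)))
    return values

print(convert_codes_lsb([161, 140, 7, 114], 4, 8))  # [161, 140, 7, 114]
print(convert_codes_lsb([89, 246], 4, 4))           # [9, 5, 6, 15]
print(convert_codes_lsb([82, 15], 4, 3))            # [2, 2, 5, 7]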

Why pq.compute_codes?

Also, in what cases do users use the codes from pq.compute_codes(x)? I have only seen them used in pq.decode(codes) so far. They are unintuitive because the packed numbers don't indicate which centroid is chosen in which segment, so why not directly expose the outputs of BitstringReader? Knowing which centroid is chosen in which segment would help make use of faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub) to relate each database vector to the centroids of every segment.
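
One use I can see (a sketch reusing pq and codes from the seeded nbits = 8 setup above, and assuming that with nbits = 8 each byte of a code is directly the centroid index for one segment) is reconstructing a vector manually from the centroid table:

import faiss
import numpy as np

# (M, ksub, dsub) table of all centroids from all segments
centroids = faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub)

# with nbits = 8, codes[i][m] is the centroid index chosen for segment m of vector i
i = 0
manual = np.concatenate([centroids[m, codes[i][m]] for m in range(pq.M)])
print(np.allclose(manual, pq.decode(codes[i:i + 1])[0]))  # expected: True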

Wiki improvements

Could the wiki have some comments added so that new readers don't run into these same questions?
