-
Hello @xhluca, Thank you very much!
-
Thank you for the kind words and glad to hear you enjoy the library!

Right now, it is possible to get a match score (float) between a query and a document; for D documents, that becomes a D-dimensional numpy vector. To do this, simply use the retriever.get_scores function! Here's the link: bm25s/bm25s/__init__.py, lines 502 to 514 in e1b39e5. I apologize for the lack of docstrings; I haven't had a chance to add them, as I have been focused on documenting retrieve instead.

If you want the feature vector of a given document with respect to all words in the vocab, that's a bit tricky, because bm25s doesn't actually store the sparse matrix, but rather the index pointer arrays (see bm25s/bm25s/__init__.py, lines 747 to 796 in e1b39e5). Fortunately, you can reconstruct a scipy CSC matrix by simply passing this:

# i might have inverted M and N, worth double checking here
M = "<num documents>"   # placeholder: number of indexed documents
N = "<vocab size>"      # placeholder: size of the vocabulary
doc_matrix = scipy.sparse.csc_matrix((data, indices, indptr), shape=(M, N))

Now, you can access the vector for a document with respect to all words by selecting your desired column.
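To tie both parts of the answer together, here is a minimal end-to-end sketch. Treat it as illustrative rather than authoritative: the toy corpus is made up, the input format of get_scores is inferred from the linked lines rather than documented, and the retriever.scores["data"] / ["indices"] / ["indptr"] and retriever.vocab_dict names are assumptions that should be verified against the source linked above.

import bm25s
import scipy.sparse

# Toy corpus, made up for illustration.
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
]
corpus_tokens = bm25s.tokenize(corpus, stopwords="en")

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Part 1: score a single query against all D documents.
# Assumption: get_scores accepts a list of token strings for one query;
# the linked lines 502 to 514 should confirm the expected input format.
query_tokens = "cat purr".split()
scores = retriever.get_scores(query_tokens)  # numpy array of length D

# Part 2: rebuild the sparse score matrix from the stored CSC components.
# Assumption: they are exposed as retriever.scores["data"] / ["indices"] /
# ["indptr"]; verify the exact names against the linked lines 747 to 796.
data = retriever.scores["data"]
indices = retriever.scores["indices"]
indptr = retriever.scores["indptr"]

M = len(corpus)                # num documents
N = len(retriever.vocab_dict)  # vocab size (attribute name is also an assumption)
# As noted above, M and N may need to be swapped: in CSC format,
# len(indptr) == (number of columns) + 1, which tells you which way around.
doc_matrix = scipy.sparse.csc_matrix((data, indices, indptr), shape=(M, N))

# A document's vector over the vocab is then one slice of doc_matrix:
# a column if documents index the columns, a row (doc_matrix[0]) otherwise.
doc_vector = doc_matrix[:, 0].toarray().ravel()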
-
@xhluca Thank you very much! The second part of the answer is what I was after: I was able to get the feature vector of a given document with respect to the words it contains. However, I am trying to understand whether it is possible to compute that for a new document. The answer above is similar to running sklearn's TfidfVectorizer.fit_transform; now I need to be able to run inference on new documents (i.e., transform). Is this currently achievable? Thanks!
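To make the analogy concrete, here is a minimal sklearn sketch of the fit_transform vs. transform distinction being asked about; the document lists are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["a cat is a feline", "a dog is a canine"]  # placeholder indexed corpus
new_docs = ["an unseen document about cats"]             # placeholder unseen data

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # analogous to reconstructing the matrix above
X_new = vectorizer.transform(new_docs)          # the step in question: vectorize new documents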