-
Hello @xhluca, Thank you very much!
-
Thank you for the kind words and glad to hear you enjoy the library!

Right now, it is possible to get a match score (float) between a query and a document; for D documents, that becomes a D-dimensional numpy vector. To do this, simply use the retriever.get_scores function! Here's the link: bm25s/bm25s/__init__.py, lines 502 to 514 in e1b39e5. I apologize for the lack of docstrings; I haven't had a chance to add them, as I have been focused on documenting retrieve instead.

If you want the feature vector of a given document with respect to all words in the vocab, that's a bit tricky, because bm25s doesn't actually store the sparse matrix, but rather the index pointer arrays (see bm25s/bm25s/__init__.py, lines 747 to 796 in e1b39e5). Fortunately, you can reconstruct a scipy CSC matrix by simply passing this:

# i might have inverted M and N, worth double checking here
M = "<num documents>"   # placeholder: number of indexed documents
N = "<vocab size>"      # placeholder: size of the vocabulary
doc_matrix = scipy.sparse.csc_matrix((data, indices, indptr), shape=(M, N))

Now, you can access the vector for a document with respect to all words by selecting your desired column.
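To tie both parts of the answer together, here is a minimal end-to-end sketch. Treat it as illustrative rather than authoritative: the toy corpus is made up, the input format of get_scores is inferred from the linked lines rather than documented, and the retriever.scores["data"] / ["indices"] / ["indptr"] and retriever.vocab_dict names are assumptions that should be verified against the source linked above.

import bm25s
import scipy.sparse

# Toy corpus, made up for illustration.
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
]
corpus_tokens = bm25s.tokenize(corpus, stopwords="en")

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Part 1: score a single query against all D documents.
# Assumption: get_scores accepts a list of token strings for one query;
# the linked lines 502 to 514 should confirm the expected input format.
query_tokens = "cat purr".split()
scores = retriever.get_scores(query_tokens)  # numpy array of length D

# Part 2: rebuild the sparse score matrix from the stored CSC components.
# Assumption: they are exposed as retriever.scores["data"] / ["indices"] /
# ["indptr"]; verify the exact names against the linked lines 747 to 796.
data = retriever.scores["data"]
indices = retriever.scores["indices"]
indptr = retriever.scores["indptr"]

M = len(corpus)                # num documents
N = len(retriever.vocab_dict)  # vocab size (attribute name is also an assumption)
# As noted above, M and N may need to be swapped: in CSC format,
# len(indptr) == (number of columns) + 1, which tells you which way around.
doc_matrix = scipy.sparse.csc_matrix((data, indices, indptr), shape=(M, N))

# A document's vector over the vocab is then one slice of doc_matrix:
# a column if documents index the columns, a row (doc_matrix[0]) otherwise.
doc_vector = doc_matrix[:, 0].toarray().ravel()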
-
@xhluca Thank you very much! The second part of the answer is what I was after: I was able to get the feature vector of a given document with respect to the words it contains. However, I am trying to understand whether it is possible to compute that for a new document. The answer above is similar to running sklearn's TfidfVectorizer.fit_transform; now I need to be able to run inference on new documents (i.e., transform). Is this currently achievable? Thanks!
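To make the analogy concrete, here is a minimal sklearn sketch of the fit_transform vs. transform distinction being asked about; the document lists are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["a cat is a feline", "a dog is a canine"]  # placeholder indexed corpus
new_docs = ["an unseen document about cats"]             # placeholder unseen data

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # analogous to reconstructing the matrix above
X_new = vectorizer.transform(new_docs)          # the step in question: vectorize new documents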