feat: adds term frequency to index #149
Conversation
```ts
const token = trimTokens[i];
const frequency = getTokenFrequency(token, trimTokens);
// @todo: replace `${token}:${frequency}` with a better form
tokensWithFrequency.push(`${token}:${frequency}`);
```
I don't like this; I'm open to suggestions.
To reduce memory usage, are you using a string rather than an object or a pair?
Here I'm using a string to keep the tokenizer backward compatible, but I can always change this to a plain object.
ouch sorry, I missed the point 🙏
Just my two cents, but this does not look good to me, for several reasons: it conflates different responsibilities, it introduces a data structure disguised as a string, and it hides inefficiencies.

First of all, calculating the term frequency should not be a concern of the tokenizer: its responsibility is merely to split a string into tokens, nothing more.

Moreover, as you note, using a string separated by `:` does not sound like a good idea. Your instinct is right: you are representing a data structure as a string. While it may save some memory, it makes your solution more fragile and obfuscated. For the sake of backward compatibility, you are also binding yourself to some arbitrary choices: you won't be able to support tokens that contain `:` (not possible now, but should that be a design choice?).

The `getTokenFrequency` function also introduces an inefficiency: for each token, you go through all tokens to count the frequency, which is a quadratic-complexity operation. That's even repeated for every repetition of the token. Imagine a string composed of the same word 1000 times, separated by spaces: this would repeat the calculation for each repetition, performing 1,000,000 comparisons just to count the same term over and over.

A better architecture is to let the tokenizer tokenize, with no frequency counting, and have a clear pipeline of tokenize -> normalize -> add tokens to index. Downstream of the tokenizer, the part of the pipeline that receives the tokens and adds them to the inverted index is the one responsible for keeping the term frequencies. It would simply be called once for each token, keep a frequency counter for each token in the inverted index, and increment it for each token added. That's a better separation of responsibilities, and a linear-time operation instead of a quadratic one.
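A minimal sketch of this pipeline, with hypothetical names (this is not Lyra's actual API, just the shape of the idea):

```ts
// token -> docId -> term frequency
type InvertedIndex = Map<string, Map<string, number>>;

// The tokenizer only splits and normalizes; no frequency logic here.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Downstream step: called once per token, it keeps and increments
// a frequency counter per (token, document) pair.
function addToIndex(index: InvertedIndex, docId: string, tokens: string[]): void {
  for (const token of tokens) {
    let docs = index.get(token);
    if (!docs) {
      docs = new Map();
      index.set(token, docs);
    }
    docs.set(docId, (docs.get(docId) ?? 0) + 1); // O(1) per token
  }
}

const index: InvertedIndex = new Map();
addToIndex(index, "doc-1", tokenize("hello hello world"));
// index.get("hello")?.get("doc-1") === 2
```

Counting happens in a single pass over the tokens, so the 1000-repetitions example above costs 1000 increments instead of 1,000,000 comparisons.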
@lucaong I totally agree, and your points are the real reason why this PR is currently stuck; I used these structures while prototyping, and it's clear that they're not optimal. I've already solved most of these problems, but I'm not ready to push everything yet (it still needs some fixes, and I'm working on other Lyra stuff right now).
> First of all, calculating the term frequency should not be a concern of the tokenizer: its responsibility is merely to split a string into tokens, nothing more.

100% agree.

> using a string separated by `:` does not sound like a good idea

I knew that from the beginning, as you can see in the first comments!

> The `getTokenFrequency` function also introduces an inefficiency

I already rewrote it; I actually ran into some of the cases you described.

Thanks as always for your comments; they're super useful and really appreciated.
Welcome :)
Just a heads-up: I'll review it tomorrow (I hope).
```diff
 export interface Node {
   id: string;
   key: string;
   word: string;
   parent: Nullable<string>;
   children: Record<string, string>;
-  docs: string[];
+  docs: DocRef[];
 }
```
Instead of an array of objects, what about a `Map<docId, tf>`? It will make all the calls O(1) instead of O(N).
Maps are not serializable to JSON; we would need to convert them into a serializable format and back 😔 Another idea was to convert this array into a linked list and perform operations in O(log n) average time via skip lists.
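For context, the conversion itself is small (a sketch using standard APIs):

```ts
const tf = new Map<string, number>([["doc-1", 3], ["doc-2", 1]]);

// Map -> JSON: go through a plain object first
const json = JSON.stringify(Object.fromEntries(tf));

// JSON -> Map: rebuild from the parsed object's entries
const restored = new Map<string, number>(Object.entries(JSON.parse(json)));
```

but it would be an extra step on every save/load of the index.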
You can use a plain object for that, using the key as the docId. Example:

```json
{
  "1321321-2": 3
}
```

In which step does the serialization happen?
Yes, that's another way I'd like to try! I think I'll run some benchmarks to decide how to proceed. I'm quite sure linked lists will be slower than iterating over object keys, but I want to make sure.
The only real problem with the object approach is that every key can only have one value, and we may want to add more metadata in the future
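One hypothetical shape that would keep the O(1) lookup and still leave room for extra metadata is to map each docId to a small object instead of a bare number:

```ts
interface DocEntry {
  tf: number;
  // future metadata (e.g. positions, weights) could be added here
}

// keyed by docId, still JSON-serializable
type Docs = Record<string, DocEntry>;

const docs: Docs = {
  "1321321-2": { tf: 3 },
};
```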
Serialization happens via https://github.com/lyrasearch/plugin-data-persistence, as it's a runtime-specific procedure (especially when you write on disk)
```diff
@@ -127,7 +127,7 @@ export function removeDocumentByWord(nodes: Nodes, node: Node, word: string, doc
   if (exact && node.end && nodeWord === word) {
     removeDocument(node, docID);
   }

-  if (node.children?.size && docIDs.includes(docID)) {
+  if (node.children?.size && docIDs.findIndex((doc) => doc.id === docID) > -1) {
```
I'd extract the `findIndex...` into a helper, since you use it twice.
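Something like this (the helper name is hypothetical):

```ts
function includesDocID(docs: DocRef[], docID: string): boolean {
  return docs.findIndex((doc) => doc.id === docID) > -1;
}

// then, at both call sites:
// if (node.children?.size && includesDocID(docIDs, docID)) { ... }
```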
Reviewed. Other than what Rafael already said, LGTM!
```diff
@@ -75,7 +75,7 @@ export function insert(nodes: Nodes, node: Node, word: string, docId: string): v
   if (i === wordLength - 1) {
     node.end = true;
-    node.docs.push(docId);
+    node.docs.push({ id: docId, tf: frequency });
```
This looks to me like an associative structure in disguise: each entry is basically a key (the ID) and a value (the frequency). Why not use an object or a map instead? An array also forces you to use `findIndex` below, which is linear, instead of a constant-time lookup in an object.
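For illustration only (a hypothetical shape, not this PR's code), keying the docs by id makes both the update and the lookup constant time:

```ts
// docId -> term frequency
type Docs = Record<string, number>;

const docs: Docs = {};
const docId = "1321321-2";

// insert/update in O(1): no push followed by a linear findIndex
docs[docId] = (docs[docId] ?? 0) + 1;

// membership check in O(1)
const hasDoc = docId in docs;
```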
Introduction
This PR is 100% backward compatible.
Please consider this PR as a blueprint proposal for adding token frequencies (from now on, TF) to the Lyra index.
This is part of a larger enhancement that aims to provide better search results, custom weights, etc. to Lyra.
TF values will be used as part of a TF-IDF feature to replace the https://github.com/LyraSearch/plugin-token-relevance plugin.
Lyra stores its data in a big prefix tree, where each node is shaped as follows:
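```ts
// current Node shape (see the interface in the review diff above)
export interface Node {
  id: string;
  key: string;
  word: string;
  parent: Nullable<string>;
  children: Record<string, string>;
  docs: string[];
}
```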
Please note that `docs` is an array of `string`s, where each `string` references a document containing the `word` property (inverted index structure). This PR changes the `Node` structure above to this:
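```ts
// DocRef is inferred from this PR's insert diff ({ id: docId, tf: frequency })
export interface DocRef {
  id: string;
  tf: number;
}

export interface Node {
  id: string;
  key: string;
  word: string;
  parent: Nullable<string>;
  children: Record<string, string>;
  docs: DocRef[];
}
```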
With that approach, when we traverse the tree to look up a word, we can give higher priority to tokens that appear with a larger frequency (later on, using the TF-IDF classification method).
Things to consider
There are a couple of things to consider before merging this PR:
Index size
Adding several (potentially millions of) new objects to every new node leads to a bigger index size. As an alternative, we could store the data in strings such as `"mytoken:5"`, where `5` is the frequency, but that could potentially mess up custom tokenizers and stemmers. Using plain objects allows us to add new meta properties in the future, which is good to have.