-
-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove by word should consider other doc ids #45
Remove by word should consider other doc ids #45
Conversation
LGTM 💯 Bug #40 seems to be fixed. |
Thank you @lucaong, @thomscoder, and @mateonunez for working on this one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixes the issue, but for the "wrong reason" :) See the comment.
|
||
if (node.end || (exact && node.end && nodeWord === word)) { | ||
node.removeDoc(docID); | ||
|
||
if (node.children?.size) { | ||
if (node.children?.size && docIDs.has(docID)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixes the issue, but for the wrong reason :)
docIDs.has(docID)
will always return false
here, because in the previous line node.removeDoc(docID)
removed the document from the set. As a result, node.end = false
is never executed, so the whole if
block here could be completely removed.
Never setting node.end = false
on one hand fixes the issue, and does not cause any correctness problem. On the other hand, it leaves a term in the trie, even when all documents associated to it were removed. This is fine, but will have performance impacts on large tries if a lot of documents are deleted: even when the set of documents associated to a term is empty, the term is never removed, so it is still considered in prefix and fuzzy search.
My recommendation: change the if
condition to something like node.children?.size && docIDs.size === 0
. That means: if after deleting this one document there are no more documents in the set pointed by this word, remove the word from the trie by making the node non-terminal.
Note that this still leaves one case uncovered, namely when the node is terminal but has no children. In that case, in order to really delete the term from the trie, one has to backtrack to the closest branching, and adjust the nodes to remove the "orphaned" suffix. This could be addressed in another issue, since again it does not affect correctness, but rather only performance in case of a large trie with many deleted terms (that without the proposed refactoring stay in the trie, even though they point to empty document sets).
I am confident that @micheleriva understands what I am pointing at, but otherwise I can offer some more explanation.
@@ -112,12 +112,12 @@ export class Trie { | |||
if (!word) return false; | |||
|
|||
function removeWord(node: TrieNode, _word: string, docID: string): boolean { | |||
const [nodeWord /**_docs*/] = node.getWord(); | |||
const [nodeWord, docIDs] = node.getWord(); | |||
|
|||
if (node.end || (exact && node.end && nodeWord === word)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to this PR (rather to #41), but still worth noting: this condition is completely equivalent to just node.end
: that's because x || (x && y)
is the same as just x
. In the specific case, if node.end
is true, the condition short circuits before the ||
. If it is false, then the part after the ||
is also false, because it's a chain of &&
including node.end
. Therefore, the whole exact
logic can be removed with no effect.
Whether the resulting condition after removing the part after ||
is correct, I am not 100% sure (I would advise having more extensive tests), but for sure it is equivalent to the current one.
This PR would fix #40.
Thanks a lot to @lucaong for supporting me in this bug.
The
removeByWord
now considers if the docID exists in docIDs Set, then set thenode.end
asfalse
.The
exact
param will stay in the function for future implementations.