fix(lyra): prevent NaN values in TF-IDF metric. Close #192 #193
// increase a token counter that may not yet exist
tokenOccurrencies[propName][token] = (tokenOccurrencies[propName][token] ?? 0) + 1;
Consider the following, taken from lyra.ts:
const propName = 'prop';
const token = 'doc-id';
const tokenOccurrencies = { [propName]: { } };
What happens if we increment a non-existent counter as follows?
tokenOccurrencies[propName][token]++;
Two things happen:
- tokenOccurrencies[propName] becomes { 'doc-id': NaN }
- NaN is returned from the postfix increment expression
If we wanted to increment a counter that may not yet exist, we should instead run something like:
tokenOccurrencies[propName][token] = (tokenOccurrencies[propName][token] ?? 0) + 1;
which:
- sets the counter to 1 if it didn't exist
- increments the counter by 1 otherwise
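The two behaviours described above can be reproduced side by side. This is a minimal sketch, not code from lyra.ts: the `Occurrences` type and the helpers `countTokenUnsafe` and `countToken` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical shape of the counter map: property name -> token -> count.
type Occurrences = Record<string, Record<string, number>>;

// Hypothetical helper: postfix ++ on a key that may be missing.
function countTokenUnsafe(occ: Occurrences, propName: string, token: string): number {
  // A missing key reads as undefined, which ++ coerces to NaN.
  // NaN is both stored under the key and returned by the expression.
  return occ[propName][token]++;
}

// Hypothetical helper: the pattern suggested in this review.
function countToken(occ: Occurrences, propName: string, token: string): number {
  // Fall back to 0 when the counter does not exist yet, then add 1.
  occ[propName][token] = (occ[propName][token] ?? 0) + 1;
  return occ[propName][token];
}

const unsafe: Occurrences = { prop: {} };
const returnedUnsafe = countTokenUnsafe(unsafe, "prop", "doc-id"); // NaN

const safe: Occurrences = { prop: {} };
const first = countToken(safe, "prop", "doc-id");  // 1
const second = countToken(safe, "prop", "doc-id"); // 2
```

Using `??` rather than `||` is a matter of intent here: both would treat a missing counter as 0, but `??` only replaces `undefined` and `null`.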
Damn you're right. Thank you so much!
const termOccurrencies = typeof lyraOccurrencies[term] === "number"
  ? (lyraOccurrencies[term] ?? 0)
  : 0;
- anomalies in termOccurrencies' value make idf become NaN or Infinity, with tfIdf resulting in a NaN value in turn
- when termOccurrencies is a string, object, or undefined -> idf becomes NaN -> tfIdf becomes NaN
- when termOccurrencies is 0 -> idf becomes Infinity -> tfIdf becomes NaN
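The cases listed above can be checked with plain arithmetic. This is a sketch under stated assumptions: the unshifted formula `Math.log10(N / termOccurrencies)` and the product `tf * idf` are assumed forms for illustration, `naiveIdf` is a hypothetical helper, and `N` and `tf` are made-up values.

```typescript
// Hypothetical helper using an assumed unshifted formula: idf = log10(N / termOccurrencies).
function naiveIdf(N: number, termOccurrencies: unknown): number {
  // The cast mimics a value whose runtime type differs from its static type.
  return Math.log10(N / (termOccurrencies as number));
}

const N = 10; // made-up value, assumed to stand for the number of indexed documents

const idfUndefined = naiveIdf(N, undefined);  // NaN: 10 / undefined is NaN
const idfObject = naiveIdf(N, { title: 2 });  // NaN: the object coerces to NaN
const idfString = naiveIdf(N, "abc");         // NaN: a non-numeric string coerces to NaN
const idfZero = naiveIdf(N, 0);               // Infinity: 10 / 0 is Infinity

// Assumed scoring step: tfIdf = tf * idf, with tf = 0 for a term that never occurs.
const tf = 0;
const tfIdfZero = tf * idfZero; // NaN: 0 * Infinity is NaN
```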
@@ -648,15 +649,22 @@ export function search<S extends PropertiesSchema>(
for (let j = 0; j < tokensLength; j++) {
  const term = tokens[j];
  const documentIDs = getDocumentIDsFromSearch(lyra, { ...params, index, term, exact });
  const termOccurrencies = lyraOccurrencies[term];

  // lyraOccurrencies[term] can be undefined, 0, string, or { [k: string]: number }
termOccurrencies's type is inferred as number, but it's actually number | string | { [k: string]: number } | undefined.
I recommend that the maintainers investigate why this inferred type is misleading after this PR gets merged.
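One way to make the mismatch explicit is to spell out the wider type and narrow it at the point of use. This is only a sketch: `OccurrenceValue` and `toOccurrenceCount` are hypothetical names, not part of lyra, and the union simply mirrors the one reported above.

```typescript
// Hypothetical alias for the values reported at runtime in this review.
type OccurrenceValue = number | string | { [k: string]: number } | undefined;

// Hypothetical guard: keep real numbers, treat everything else as "never seen".
function toOccurrenceCount(value: OccurrenceValue): number {
  return typeof value === "number" ? value : 0;
}

const fromNumber = toOccurrenceCount(3);           // 3
const fromUndefined = toOccurrenceCount(undefined); // 0
const fromString = toOccurrenceCount("3");          // 0, strings are not parsed
const fromObject = toOccurrenceCount({ title: 2 }); // 0
```

With the union declared, the compiler rejects arithmetic on the raw value, so the narrowing cannot be skipped by accident.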
// idf's denominator is shifted by 1 to avoid division by zero
const idf = Math.log10(N / (1 + termOccurrencies));
This is such a common scenario that even Wikipedia mentions this "unit shift" trick (link):
If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator.
This implies that the TF-IDF scores are now slightly smaller in the snapshot tests.
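A quick comparison shows why the scores shrink. This is a sketch with made-up numbers: `idfShifted` follows the expression in this diff, while `idfUnshifted` is an assumed pre-fix form, and both function names are hypothetical.

```typescript
// Assumed pre-fix form: no shift in the denominator.
function idfUnshifted(N: number, termOccurrencies: number): number {
  return Math.log10(N / termOccurrencies);
}

// Form used in this diff: denominator shifted by 1.
function idfShifted(N: number, termOccurrencies: number): number {
  return Math.log10(N / (1 + termOccurrencies));
}

const totalDocs = 10; // made-up value standing in for N

// A term that occurs 4 times: the shifted idf is smaller (log10(2) vs log10(2.5)).
const seenTerm = {
  before: idfUnshifted(totalDocs, 4),
  after: idfShifted(totalDocs, 4),
};

// A term that never occurs: the shift turns Infinity into a finite value.
const unseenTerm = {
  before: idfUnshifted(totalDocs, 0),
  after: idfShifted(totalDocs, 0),
};
```

Because the denominator grows by 1 for every term, each finite idf drops a little, and any score built from it drops accordingly.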
That's right. Thanks a million for that. We used to add 1 and eventually removed it for testing purposes, but never reintroduced it.
Looks good to me, thank you so much for these fixes @jkomyno!