fix(lyra): prevent NaN values in TF-IDF metric. Close #192 #193

Merged: micheleriva merged 1 commit into main from fix/nan-in-tfidf on Nov 20, 2022

jkomyno
@jkomyno jkomyno commented Nov 20, 2022

No description provided.

@jkomyno jkomyno requested a review from micheleriva November 20, 2022 00:40
Comment on lines +295 to +296
// increase a token counter that may not yet exist
tokenOccurrencies[propName][token] = (tokenOccurrencies[propName][token] ?? 0) + 1;

Contributor Author

Consider the following, taken from lyra.ts:

const propName = 'prop';
const token = 'doc-id';
const tokenOccurrencies = { [propName]: { } };

What happens if we increment a counter that does not exist yet, as follows?

tokenOccurrencies[propName][token]++;

Two things happen:

  • tokenOccurrencies[propName] becomes { 'doc-id': NaN }
  • NaN is returned from the postfix increment expression
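
Both effects can be reproduced in isolation. The sketch below reuses the names from the snippet above; the `Record` type annotation is an assumption added so that it compiles under default TypeScript settings, and it is not taken from lyra.ts.

```typescript
const propName = "prop";
const token = "doc-id";

// Assumed shape: property name -> token -> counter.
const tokenOccurrencies: Record<string, Record<string, number>> = { [propName]: {} };

// The postfix increment reads `undefined`, coerces it to NaN,
// stores NaN + 1 (still NaN) and evaluates to the coerced old value.
const returned = tokenOccurrencies[propName][token]++;

console.log(returned); // NaN
console.log(Number.isNaN(tokenOccurrencies[propName][token])); // true
```

Note that the compiler accepts this code without complaint, since the lookup is typed as `number`.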

If we wanted to increment a counter that may not yet exist, we should instead run something like:

tokenOccurrencies[propName][token] = (tokenOccurrencies[propName][token] ?? 0) + 1;

which:

  • sets the counter to 1 if it didn't exist
  • increments the counter by 1 otherwise
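
The same pattern can be wrapped in a small helper. This is a minimal sketch: `incrementCounter` and the sample tokens are hypothetical and are not part of lyra.

```typescript
// Hypothetical helper: increments a counter that may not exist yet
// and returns the new value.
function incrementCounter(counters: Record<string, number>, key: string): number {
  const next = (counters[key] ?? 0) + 1;
  counters[key] = next;
  return next;
}

const occurrences: Record<string, number> = {};
for (const tok of ["quick", "brown", "quick"]) {
  incrementCounter(occurrences, tok);
}

console.log(occurrences); // { quick: 2, brown: 1 }
```

A `Map<string, number>` or an object created with `Object.create(null)` would also sidestep keys inherited from `Object.prototype`, such as "constructor". That is a design option, not something this PR does.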

Member

Damn you're right. Thank you so much!

Comment on lines +654 to +656
const termOccurrencies = typeof lyraOccurrencies[term] === "number"
  ? (lyraOccurrencies[term] ?? 0)
  : 0;

Contributor Author

  • anomalies in the value of termOccurrencies make idf NaN or Infinity, which in turn makes tfIdf NaN
  • when termOccurrencies is a string, object, or undefined -> idf becomes NaN -> tfIdf becomes NaN
  • when termOccurrencies is 0 -> idf becomes Infinity -> tfIdf becomes NaN
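
The cases listed above can be checked with a short sketch. It assumes the previous formula was `Math.log10(N / termOccurrencies)` (the new one without the shift) and that the term frequency is 0 when no document contains the term. `naiveIdf` and the sample values are illustrative.

```typescript
// Unshifted IDF. The cast mimics a value that is typed as `number`
// but holds something else at runtime.
function naiveIdf(totalDocs: number, termOccurrencies: unknown): number {
  return Math.log10(totalDocs / (termOccurrencies as number));
}

const totalDocs = 10;

console.log(naiveIdf(totalDocs, undefined)); // NaN
console.log(naiveIdf(totalDocs, {}));        // NaN
console.log(naiveIdf(totalDocs, "abc"));     // NaN
console.log(naiveIdf(totalDocs, 0));         // Infinity

// Assumed term frequency of 0: multiplying by Infinity yields NaN.
console.log(0 * naiveIdf(totalDocs, 0));     // NaN
```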

@@ -648,15 +649,22 @@ export function search<S extends PropertiesSchema>(
for (let j = 0; j < tokensLength; j++) {
const term = tokens[j];
const documentIDs = getDocumentIDsFromSearch(lyra, { ...params, index, term, exact });
const termOccurrencies = lyraOccurrencies[term];

// lyraOccurrencies[term] can be undefined, 0, string, or { [k: string]: number }

Contributor Author

  • termOccurrencies's type is inferred as number, but it's actually number | string | { [k: string]: number } | undefined

I recommend that the maintainers investigate why this inferred type is misleading once this PR is merged.
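
One general way a lookup ends up typed as `number` while being `undefined` at runtime is an index signature: unless `noUncheckedIndexedAccess` is enabled, TypeScript reports `number` for any key. Whether that explains lyra's case is an assumption. In the sketch below, the `Occurrences` type and the `toCount` guard are hypothetical, and the guard is stricter than the check in this PR because it also rejects NaN and Infinity.

```typescript
// Hypothetical stand-in for the occurrences index; not lyra's actual type.
type Occurrences = Record<string, number>;

// Narrows an arbitrary runtime value to a usable count, falling back to 0.
function toCount(value: unknown): number {
  return typeof value === "number" && Number.isFinite(value) ? value : 0;
}

const sampleIndex: Occurrences = { quick: 2 };

// Both lookups are typed as `number`, but the second is `undefined` at runtime.
const present = sampleIndex["quick"];
const missing = sampleIndex["slow"];

console.log(toCount(present)); // 2
console.log(toCount(missing)); // 0
console.log(toCount("3"));     // 0
```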

Comment on lines +664 to +665
// idf's denominator is shifted by 1 to avoid division by zero
const idf = Math.log10(N / (1 + termOccurrencies));

Contributor Author

This is such a common scenario that even Wikipedia mentions this "unit shift" trick:

If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator.
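
A minimal sketch of the shifted formula, following the diff line above. The function and parameter names are illustrative and are not lyra's.

```typescript
// Inverse document frequency with the denominator shifted by 1.
function shiftedIdf(totalDocs: number, docsWithTerm: number): number {
  // Adding 1 keeps the denominator non-zero when the term is absent from the corpus.
  return Math.log10(totalDocs / (1 + docsWithTerm));
}

console.log(shiftedIdf(10, 0)); // 1
console.log(shiftedIdf(10, 9)); // 0
```

One side effect of the shift is that the result goes slightly negative when a term occurs in every document, since the denominator then exceeds the document count.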

Contributor Author

This implies that the TF-IDF scores are now slightly smaller in the snapshot tests.
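
The drop follows from the larger denominator. A quick check with made-up numbers (100 documents, 9 of which contain the term) illustrates it; these values do not come from the snapshot tests.

```typescript
// Illustrative numbers only.
const docCount = 100;
const matchingDocs = 9;

const unshifted = Math.log10(docCount / matchingDocs);
const shifted = Math.log10(docCount / (1 + matchingDocs));

console.log(shifted < unshifted); // true
```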

Member

That's right. Thanks a million for that. We used to add 1 and eventually removed it for testing purposes, but never reintroduced it.

@micheleriva micheleriva left a comment

Looks good to me, thank you so much for these fixes @jkomyno!

@micheleriva micheleriva merged commit f036d0d into main Nov 20, 2022
@micheleriva micheleriva deleted the fix/nan-in-tfidf branch November 20, 2022 09:11
@toomuchdesign toomuchdesign mentioned this pull request Jan 4, 2023