
feat: adds term frequency to index #149

Closed · wants to merge 4 commits

Conversation

@micheleriva (Member) commented Oct 7, 2022

Introduction

This PR is 100% backward compatible.

Please consider this PR as a blueprint proposal for adding token frequencies (from now on, TF) to the Lyra index.
This is part of a larger enhancement that aims to provide better search results, custom weights, etc. to Lyra.

TF values will be used as part of a TF-IDF feature to replace the https://github.com/LyraSearch/plugin-token-relevance plugin.

Lyra stores its data in a big prefix tree, where each node is shaped as follows:

export interface Node {
  id: string;
  key: string;
  word: string;
  parent: Nullable<string>;
  children: Record<string, string>;
  docs: string[];
  end: boolean;
}

Please note that docs is an array of strings, where each string references a document containing the word property (inverted index structure).

This PR changes the Node structure above to this:

+ interface DocRef {
+   id: string;
+   tf: number;
+ }

export interface Node {
  id: string;
  key: string;
  word: string;
  parent: Nullable<string>;
  children: Record<string, string>;
-   docs: string[];
+   docs: DocRef[];
  end: boolean;
}

With that approach, when we traverse the tree to look up a word, we can give higher priority to tokens that appear with a greater frequency (later on, using the TF-IDF ranking method).
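
For illustration only, here is a rough sketch (not code from this PR) of how the stored tf values could later feed a TF-IDF score; the helper names and the exact formula are assumptions:

// Hypothetical sketch: score the documents stored on a node for one term.
// `DocRef` mirrors the shape proposed in this PR; everything else is illustrative.
interface DocRef {
  id: string;
  tf: number;
}

// A common IDF formulation: log(total documents / documents containing the term).
function idf(totalDocs: number, docsWithTerm: number): number {
  return Math.log(totalDocs / Math.max(1, docsWithTerm));
}

function scoreNodeDocs(docs: DocRef[], totalDocs: number): Map<string, number> {
  const termIdf = idf(totalDocs, docs.length);
  const scores = new Map<string, number>();
  for (const { id, tf } of docs) {
    scores.set(id, tf * termIdf); // higher tf => higher priority for that document
  }
  return scores;
}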

Things to consider

There are a couple of things to consider before merging this PR:

Index size

Adding new objects (potentially millions of them across the index) to every node leads to a bigger index size. As an alternative, we could store the data as strings such as "mytoken:5", where 5 is the frequency, but that could break custom tokenizers and stemmers.
Using plain objects allows us to add new metadata properties in the future, which is good to have.
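
To make the trade-off concrete, a quick sketch of the two candidate representations (values are illustrative):

// Option A: frequency packed into a string. Compact, but it is a data structure
// in disguise and assumes ":" never appears in a token.
const stringForm = "mytoken:5";
const [token, tf] = stringForm.split(":"); // tf still needs parsing to a number

// Option B: a plain object per document reference. Larger, but extensible
// with future metadata properties.
const objectForm = { id: "doc-1", tf: 5 };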

src/lyra.ts
const token = trimTokens[i];
const frequency = getTokenFrequency(token, trimTokens);
// @todo: replace `${token}:${frequency}` with a better form
tokensWithFrequency.push(`${token}:${frequency}`);
@micheleriva (Member, Author) commented Oct 7, 2022

I don't like this; I'm open to suggestions.

Contributor

To reduce memory usage, are you using a string rather than an object or a pair?

@micheleriva (Member, Author)

Here I'm using a string to keep the tokenizer backward compatible, but I can always change this to a plain object.

Contributor

Ouch, sorry, I missed the point 🙏

@lucaong

Just my two cents, but this does not look good to me for several reasons: it conflates different responsibilities, introduces a data structure in disguise as a string, and hides some inefficiencies.

First of all, calculating the term frequency should not be a concern of the tokenizer: its responsibility is merely to split a string into tokens, nothing more.

Moreover, as you note, using a string separated by : does not sound like a good idea. Your instinct is right: you are representing a data structure as a string. While it may save some memory, it makes your solution more fragile and obfuscated. For the sake of backward compatibility, you are also binding yourself to some arbitrary choices: you won't be able to support tokens that contain : (not possible now, but should that be a design choice?).

The getTokenFrequency function also introduces an inefficiency: for each token, you go through all tokens in order to count the frequency, which is a quadratic-complexity operation. That is even repeated for every repetition of the token. Imagine a string composed of the same word 1000 times, separated by spaces: this would repeat the calculation for each repetition, performing 1,000,000 comparisons just to count the same term over and over.
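
For illustration, the quadratic pattern being described is roughly this (a hypothetical reconstruction, not the actual code in this PR):

// Hypothetical sketch of the pattern under discussion: each call scans the
// whole token list, and it is called once per token.
function getTokenFrequencyNaive(token: string, tokens: string[]): number {
  return tokens.filter((t) => t === token).length; // O(n) per call
}

const tokens: string[] = new Array(1000).fill("lyra");
// 1000 calls, each scanning 1000 tokens: ~1,000,000 comparisons for one repeated word.
const frequencies = tokens.map((t) => getTokenFrequencyNaive(t, tokens));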

A better architecture is to let the tokenizer tokenize, with no frequency counting, and have a clear pipeline of tokenize -> normalize -> add tokens to index. Downstream of the tokenizer, the part of the pipeline that receives the tokens and adds them to the inverted index is the one responsible for keeping the term frequencies. It would simply be called once for each token, keep a frequency counter for each token in the inverted index, and increment it for each token added. That is better separation of responsibilities, and a linear-time operation instead of a quadratic one.
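
A minimal sketch of that pipeline shape, assuming a Map-based inverted index (names and types are illustrative, not from the Lyra codebase):

// tokenize: only splits the input, no frequency counting.
function tokenize(text: string): string[] {
  return text.split(/\s+/).filter(Boolean);
}

// normalize: lowercasing here; stemming or stop-word removal could slot in too.
function normalize(tokens: string[]): string[] {
  return tokens.map((t) => t.toLowerCase());
}

// index step: receives the tokens and keeps the term frequencies,
// incrementing a counter once per token, in a single linear pass.
function addToIndex(
  index: Map<string, Map<string, number>>,
  docId: string,
  tokens: string[]
): void {
  for (const token of tokens) {
    let postings = index.get(token);
    if (!postings) {
      postings = new Map();
      index.set(token, postings);
    }
    postings.set(docId, (postings.get(docId) ?? 0) + 1);
  }
}

// tokenize -> normalize -> add tokens to index
const invertedIndex = new Map<string, Map<string, number>>();
addToIndex(invertedIndex, "doc-1", normalize(tokenize("the quick brown fox jumps over the lazy dog")));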

@micheleriva (Member, Author) commented Oct 27, 2022

@lucaong I totally agree, and your points are the real reason why this PR is currently stuck; I used these structures while prototyping, and it's clear that they're not optimal. I have already solved most of these problems, but I'm not ready to push everything yet (it still needs some fixes, and I am working on other Lyra stuff right now).

First of all, calculating the term frequency should not be a concern of the tokenizer: its responsibility is merely to split a string into tokens, nothing more.

100% agree.

using a string separated by : does not sound like a good idea

I knew that from the beginning, as you can see in the first comments!

The getTokenFrequency function is also introducing an inefficiency

I already rewrote it; I actually ran into some of the cases you described.

Thanks as always for your comments; they're super useful and really appreciated.

@lucaong

Welcome :)

@RafaelGSS (Contributor)

Just a heads-up. I'll review it tomorrow (I hope)

src/lyra.ts
export interface Node {
  id: string;
  key: string;
  word: string;
  parent: Nullable<string>;
  children: Record<string, string>;
-   docs: string[];
+   docs: DocRef[];
Contributor

Instead of an array of objects, what about a Map<docId, tf>? It will make all the calls O(1) instead of O(N)

@micheleriva (Member, Author)

Maps are not serializable to JSON; we would need to convert them into a serializable format and back 😔 Another idea was to convert this array into a linked list and perform operations in O(log n) on average via skip lists.
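
For reference, the conversion being referred to could look like this generic round-trip (a sketch, not Lyra's serialization code):

// A Map serializes to "{}" with JSON.stringify, so convert explicitly.
const tfByDocId = new Map<string, number>([["doc-1", 3], ["doc-2", 1]]);

// Map -> plain object -> JSON string
const serialized = JSON.stringify(Object.fromEntries(tfByDocId));

// JSON string -> plain object -> Map
const restored = new Map<string, number>(Object.entries(JSON.parse(serialized)));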

Contributor

You can use a plain object for that, using the key as the docId. Example:

{
  "1321321-2": 3
}

In which step does the serialization happen?

@micheleriva (Member, Author)

Yes, that's another way I'd like to try! I think I'll run some benchmarks to decide how to proceed. I'm quite sure linked lists will perform slower than iterating over object keys, but I want to make sure.

The only real problem with the object approach is that every key can only have one value, and we may want to add more metadata in the future
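
One way to keep the O(1) lookup and still leave room for metadata (just a sketch, the field names are hypothetical) is to key by docId but store an object as the value:

// Keyed by document ID for constant-time lookup; the value is an object
// that can grow additional metadata properties later.
interface DocMeta {
  tf: number;
  // future metadata could live here, e.g. positions or boost values
}

const docs: Record<string, DocMeta> = {
  "1321321-2": { tf: 3 },
};

docs["1321321-2"].tf += 1; // O(1) update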

@micheleriva (Member, Author)

Serialization happens via https://github.com/lyrasearch/plugin-data-persistence, as it's a runtime-specific procedure (especially when writing to disk).

@@ -127,7 +127,7 @@ export function removeDocumentByWord(nodes: Nodes, node: Node, word: string, doc
if (exact && node.end && nodeWord === word) {
removeDocument(node, docID);

- if (node.children?.size && docIDs.includes(docID)) {
+ if (node.children?.size && docIDs.findIndex((doc) => doc.id === docID) > -1) {
Contributor

I'd extract the findIndex... into a helper since you use it twice.
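
For example, something along these lines (a sketch; DocRef is the type proposed in this PR, the helper name is hypothetical):

interface DocRef {
  id: string;
  tf: number;
}

// Possible shared helper so the membership check is written only once.
function containsDocId(docs: DocRef[], docID: string): boolean {
  return docs.findIndex((doc) => doc.id === docID) > -1;
}

// The condition in the diff above would then read:
// if (node.children?.size && containsDocId(docIDs, docID)) {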

@ShogunPanda (Contributor)

Reviewed. Other than what Rafael already said, LGTM!

@@ -75,7 +75,7 @@ export function insert(nodes: Nodes, node: Node, word: string, docId: string): v

if (i === wordLength - 1) {
node.end = true;
- node.docs.push(docId);
+ node.docs.push({ id: docId, tf: frequency });

This looks to me like an associative structure in disguise: each entry is basically a key (the ID) and a value (the frequency). Why not use an object or a map instead? An array also forces you to use findIndex below, which is linear, instead of a constant-time lookup in an object.
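
For comparison, an object keyed by document ID (a sketch, not this PR's code) would make both the membership check and the frequency update constant-time:

// docs keyed by document ID instead of an array of { id, tf } entries.
type Docs = Record<string, { tf: number }>;

function addOccurrence(docs: Docs, docId: string): void {
  const entry = docs[docId];
  if (entry) {
    entry.tf += 1;           // O(1) update for an existing document
  } else {
    docs[docId] = { tf: 1 }; // O(1) insert for a new document
  }
}

function hasDoc(docs: Docs, docId: string): boolean {
  return docId in docs;      // O(1) membership check, no findIndex needed
}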

@micheleriva micheleriva mentioned this pull request Nov 5, 2022
@micheleriva micheleriva deleted the feat/term-frequency branch January 17, 2023 08:38