feat: adds term frequency to index #149
Conversation
```ts
const token = trimTokens[i];
const frequency = getTokenFrequency(token, trimTokens);
// @todo: replace `${token}:${frequency}` with a better form
tokensWithFrequency.push(`${token}:${frequency}`);
```
I don't like this; I'm open to suggestions.
To reduce memory usage, are you using a string rather than an object or a pair?
Here I'm using a string to keep the tokenizer backward compatible, but I can always change this to a plain object.
ouch sorry, I missed the point 🙏
Just my two cents, but this does not look good to me, for several reasons: it conflates different responsibilities, it introduces a data structure disguised as a string, and it hides inefficiencies.

First of all, calculating the term frequency should not be a concern of the tokenizer: its responsibility is merely to split a string into tokens, nothing more.

Moreover, as you note, using a string separated by `:` does not sound like a good idea. Your instinct is right: you are representing a data structure as a string. While it may save some memory, it makes your solution more fragile and obfuscated. For the sake of backward compatibility, you are also binding yourself to some arbitrary choices: you won't be able to support tokens that contain `:` (not possible now, but should that be a design choice?).

The `getTokenFrequency` function also introduces an inefficiency: for each token, you go through all tokens to count the frequency, which is a quadratic-complexity operation. That's even repeated for every repetition of the token. Imagine a string composed of the same word 1000 times, separated by spaces: this would repeat the calculation for each repetition, performing 1,000,000 comparisons just to count the same term over and over.

A better architecture is to let the tokenizer tokenize, with no frequency counting, and have a clear pipeline of tokenize -> normalize -> add tokens to index. Downstream of the tokenizer, the part of the pipeline that receives the tokens and adds them to the inverted index is the one responsible for keeping the term frequencies. It would simply be called once for each token, keep a frequency counter for each token in the inverted index, and increment it for each token added. That's a better separation of responsibilities, and a linear-time operation instead of a quadratic one.
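A minimal sketch of this pipeline, with hypothetical names (this is not Lyra's actual API, just the shape of the idea):

```ts
// token -> docId -> term frequency
type InvertedIndex = Map<string, Map<string, number>>;

// The tokenizer only splits and normalizes; no frequency logic here.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Downstream step: called once per token, it keeps and increments
// a frequency counter per (token, document) pair.
function addToIndex(index: InvertedIndex, docId: string, tokens: string[]): void {
  for (const token of tokens) {
    let docs = index.get(token);
    if (!docs) {
      docs = new Map();
      index.set(token, docs);
    }
    docs.set(docId, (docs.get(docId) ?? 0) + 1); // O(1) per token
  }
}

const index: InvertedIndex = new Map();
addToIndex(index, "doc-1", tokenize("hello hello world"));
// index.get("hello")?.get("doc-1") === 2
```

Counting happens in a single pass over the tokens, so the 1000-repetitions example above costs 1000 increments instead of 1,000,000 comparisons.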
@lucaong I totally agree, and your points are the real reason why this PR is currently stuck; I used these structures while prototyping, and it's clear that they're not optimal. I've already solved most of these problems, but I'm not ready to push everything yet (it still needs some fixes, and I'm working on other Lyra stuff right now).
> First of all, calculating the term frequency should not be a concern of the tokenizer: its responsibility is merely to split a string into tokens, nothing more.

100% agree.

> using a string separated by `:` does not sound like a good idea

I knew that from the beginning, as you can see in the first comments!

> The `getTokenFrequency` function also introduces an inefficiency

I already rewrote it; I actually ran into some of the cases you described.

Thanks as always for your comments; they're super useful and really appreciated.
Welcome :)
Just a heads-up: I'll review it tomorrow (I hope).
```diff
 export interface Node {
   id: string;
   key: string;
   word: string;
   parent: Nullable<string>;
   children: Record<string, string>;
-  docs: string[];
+  docs: DocRef[];
 }
```
Instead of an array of objects, what about a `Map<docId, tf>`? It will make all the calls O(1) instead of O(N).
Maps are not serializable to JSON; we would need to convert them into a serializable format and back 😔 Another idea was to convert this array into a linked list and perform operations in O(log n) average time via skip lists.
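For context, the conversion itself is small (a sketch using standard APIs):

```ts
const tf = new Map<string, number>([["doc-1", 3], ["doc-2", 1]]);

// Map -> JSON: go through a plain object first
const json = JSON.stringify(Object.fromEntries(tf));

// JSON -> Map: rebuild from the parsed object's entries
const restored = new Map<string, number>(Object.entries(JSON.parse(json)));
```

but it would be an extra step on every save/load of the index.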
You can use a plain object for that, using the key as the docId. Example:

```json
{
  "1321321-2": 3
}
```

In which step does the serialization happen?
Yes, that's another way I'd like to try! I think I'll run some benchmarks to decide how to proceed. I'm quite sure linked lists will be slower than iterating over object keys, but I want to make sure.
The only real problem with the object approach is that every key can only have one value, and we may want to add more metadata in the future
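One hypothetical shape that would keep the O(1) lookup and still leave room for extra metadata is to map each docId to a small object instead of a bare number:

```ts
interface DocEntry {
  tf: number;
  // future metadata (e.g. positions, weights) could be added here
}

// keyed by docId, still JSON-serializable
type Docs = Record<string, DocEntry>;

const docs: Docs = {
  "1321321-2": { tf: 3 },
};
```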
Serialization happens via https://github.com/lyrasearch/plugin-data-persistence, as it's a runtime-specific procedure (especially when you write on disk)
```diff
@@ -127,7 +127,7 @@ export function removeDocumentByWord(nodes: Nodes, node: Node, word: string, doc
   if (exact && node.end && nodeWord === word) {
     removeDocument(node, docID);
   }

-  if (node.children?.size && docIDs.includes(docID)) {
+  if (node.children?.size && docIDs.findIndex((doc) => doc.id === docID) > -1) {
```
I'd extract the `findIndex...` into a helper, since you use it twice.
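Something like this (the helper name is hypothetical):

```ts
function includesDocID(docs: DocRef[], docID: string): boolean {
  return docs.findIndex((doc) => doc.id === docID) > -1;
}

// then, at both call sites:
// if (node.children?.size && includesDocID(docIDs, docID)) { ... }
```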
Reviewed. Other than what Rafael already said, LGTM!
```diff
@@ -75,7 +75,7 @@ export function insert(nodes: Nodes, node: Node, word: string, docId: string): v
   if (i === wordLength - 1) {
     node.end = true;
-    node.docs.push(docId);
+    node.docs.push({ id: docId, tf: frequency });
```
This looks to me like an associative structure in disguise: each entry is basically a key (the ID) and a value (the frequency). Why not use an object or a map instead? An array also forces you to use `findIndex` below, which is linear, instead of a constant-time lookup in an object.
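For illustration only (a hypothetical shape, not this PR's code), keying the docs by id makes both the update and the lookup constant time:

```ts
// docId -> term frequency
type Docs = Record<string, number>;

const docs: Docs = {};
const docId = "1321321-2";

// insert/update in O(1): no push followed by a linear findIndex
docs[docId] = (docs[docId] ?? 0) + 1;

// membership check in O(1)
const hasDoc = docId in docs;
```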
Introduction
This PR is 100% backward compatible.
Please consider this PR as a blueprint proposal for adding token frequencies (from now on, TF) to the Lyra index.
This is part of a larger enhancement that aims to provide better search results, custom weights, etc. to Lyra.
TF values will be used as part of a TF-IDF feature to replace the https://github.com/LyraSearch/plugin-token-relevance plugin.
Lyra stores its data in a big prefix tree, where each node is shaped as follows:
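```ts
// current Node shape (see the interface in the review diff above)
export interface Node {
  id: string;
  key: string;
  word: string;
  parent: Nullable<string>;
  children: Record<string, string>;
  docs: string[];
}
```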
Please note that `docs` is an array of `string`s, where each `string` references a document containing the `word` property (inverted index structure). This PR changes the `Node` structure above to this:
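```ts
// DocRef is inferred from this PR's insert diff ({ id: docId, tf: frequency })
export interface DocRef {
  id: string;
  tf: number;
}

export interface Node {
  id: string;
  key: string;
  word: string;
  parent: Nullable<string>;
  children: Record<string, string>;
  docs: DocRef[];
}
```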
With that approach, when we traverse the tree to look up a word, we can give higher priority to tokens that appear with a larger frequency (later on, using the TF-IDF classification method).
Things to consider
There are a couple of things to consider before merging this PR:
Index size
Adding several (potentially millions of) new objects to every new node leads to a bigger index size. As an alternative, we could store the data in strings such as `"mytoken:5"`, where `5` is the frequency, but that could potentially mess up custom tokenizers and stemmers. Using plain objects allows us to add new meta properties in the future, which is good to have.