Count term occurrencies in document #12

micheleriva · 2022-07-14T09:28:10Z

Is your feature request related to a problem? Please describe.
Right now, we're not considering how many times a term appears in a document.

For instance, given the following strings:

"It's alive! It's alive!"
"I saw them alive"

When searching for "alive", the first string should have priority as the term "alive" appears twice.

The text was updated successfully, but these errors were encountered:

mateonunez · 2022-07-14T15:15:02Z

That sounds great @micheleriva! I think it would be a good idea to implement the sort parameter (and thus the method), within the search function, to sort (using occurrencies just for the moment) the elements found. If sort is not specified, the selection would act as usual showing the data as insert order.

Occurrences are very important data in a search action, could be they selected as default?

micheleriva · 2022-07-14T16:56:41Z

@mateonunez could you write some pseudocode to explain how the sort function would work? I wouldn't make it optional, Lyra should always search by taking occurrences in consideration

mateonunez · 2022-07-14T17:55:06Z

@micheleriva I have prepared a branch with working (and incomplete) code for it.

The sort options accepted are: none (as usual) and relevance (by occurrences). The "relevance" could be accepted as default if you want to implement a search by occurrences natively.

There are 2 methods: getOccurrences(document, index, term) that return the count of the occurrences in the document (uses a Regex for it). The other method is sortResults(results, sort) with a simple switch to manage all accepted sorting algorithms.

micheleriva · 2022-07-15T07:15:36Z

@mateonunez thanks! Any idea on how that affects performances?

micheleriva · 2022-07-15T07:32:25Z

Also, in which scenario would I want to sort the results by "random" order?

mateonunez · 2022-07-15T08:38:11Z

Initially, this implementation affects a lot of the performance of the search action.

Current timing:

{
  elapsed: '114μs',
  hits: [
    {
      id: 'PmLDkPyPWlRPG5qFcmS_7',
      quote: "It's alive! It's alive!",
      author: 'H.P. Lovecraft'
    }
  ],
  count: 1
}

With occurrences implementation:

{
  elapsed: '158μs',
  hits: [
    {
      id: 'qUMSJWBHvxJ0Islfwpfpz',
      quote: "It's alive! It's alive!",
      author: 'H.P. Lovecraft',
      occurrences: 2
    }
  ],
  count: 1
}

That implementation cost ~27.8% more than the search time.

With the sort implementation:

{
  elapsed: '162μs',
  hits: [
    {
      id: 'KGpFwdn5mbXgEpCRXamhd',
      quote: "It's alive! It's alive!",
      author: 'H.P. Lovecraft',
      occurrences: 2
    }
  ],
  count: 1
}

The first occurrencesTime cost about 40-50μs. Then the cost it's about 0-3μs:

    {
      id: 'kdNXamfKo_CCFpTATPOvk',
      quote: 'I saw them alive. Alive man. You got it? Alive.',
      author: 'H.P. Lovecraft',
      occurrences: 3,
      occurrencesTime: '46μs'
    },
    {
      id: 'DQ0mXcVNSnv4hJ8Du34k8',
      quote: "It's alive! It's alive!",
      author: 'H.P. Lovecraft',
      occurrences: 2,
      occurrencesTime: '3μs'
    },
    {
      id: '_E5tQaXAXwuNmNFqSoCea',
      quote: 'I saw them alive. Alive man. You got it? Alive.',
      author: 'H.P. Lovecraft',
      occurrences: 3,
      occurrencesTime: '1μs'
    },
    {
      id: '7HqOPlGMRJtXXtKdh9eXp',
      quote: 'alive in the middle of the night.',
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '1μs'
    },
    {
      id: 'JJ0Mec0ze_ivNFrZO4grh',
      quote: "Oh, I, oh I'm still alive",
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '1μs'
    },
    {
      id: 'gSfPxZwmoY2tcu05aywoS',
      quote: 'I feel alive',
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '0μs'
    },
    {
      id: 'NbYe5iNSLkJBwKNRj97ti',
      quote: 'Alive is the only thing I know',
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '0μs'
    },
    {
      id: 'fXGKFc6oFdVMFk4OBc7Og',
      quote: 'Alive is the only thing I know',
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '0μs'
    }

Regarding the random (or first in first out) order I was imagining to a situation where we need the data as we usually handle queues. So the first document inserted in the database will be the first document selected in the search method.

mateonunez · 2022-07-15T11:42:33Z

I've tried it with 1M of documents. The result is amazing! @micheleriva

Without implementation:

{
  elapsed: '421ms',
  hits: [
    {
      id: 'ZpRIoxvyzosuI6yNW47cx',
      quote: 'I saw them alive. Alive man. Did you get it? Alive.',
      author: 'H.P. Lovecraft'
    }
  ],
  count: 1000000
}

With implementation:

{
  elapsed: '426ms',
  hits: [
    {
      id: '3PEwoThPgs0we8IBlxIoi',
      quote: 'I saw them alive. Alive man. Did you get it? Alive.',
      author: 'H.P. Lovecraft',
      occurrences: 3
    }
  ],
  count: 1000000
}

micheleriva · 2022-07-15T12:16:17Z

This is pretty rad. Would you mind opening a PR?

mateonunez · 2022-07-15T13:07:21Z

I could work on it, but there is just one question to understand: what should Lyra do in case the limit parameter is set? Should it compute all occurrences (sort them) and then truncate the output? This operation could require more stress.

Currently, Lyra limits the results quitting from the cycle that retrieves the documents, then the occurrences are computed.
This means that: if I add 10 consecutive documents (increasing the occurrence by 1 each cycle) and then set the limit to 2, Lyra will only consider the first 2 documents (the first documents inserted).

Simple Search with Relevance

{
  elapsed: '244μs',
  hits: [
    {
      id: 'DhQVMo6RI5IMpJyr8C5xf',
      quote: 'Repeat with me: alive, alive, alive',
      author: 'Oscar Wilde',
      occurrences: 3
    },
    {
      id: 'w6dIlijToS9GOaKEG6fBm',
      quote: 'Repeat with me: alive, alive',
      author: 'Oscar Wilde',
      occurrences: 2
    },
    {
      id: 'ME5jTyPK7slIZK9p0tu6u',
      quote: 'Repeat with me: alive',
      author: 'Oscar Wilde',
      occurrences: 1
    }
  ],
  count: 3
}

Search with Relevance and Limit set to 1:

{
  elapsed: '22μs',
  hits: [
    {
      id: 'ME5jTyPK7slIZK9p0tu6u',
      quote: 'Repeat with me: alive',
      author: 'Oscar Wilde',
      occurrences: 1
    }
  ],
  count: 3
}

What do you think about it?

micheleriva · 2022-07-15T16:16:08Z

@mateonunez linear search is not a good option; I think we might count word occurrences of a given term in a document during the indexing process and use this information to determine where which doc contains the same word the most.

This is also related to #13, as the exact order search is more important than the occurrences of a term in a document.

I think we could start with this issue and then move on with #13

micheleriva · 2022-07-15T16:25:17Z

Going "pseudo-logic":

  private async _insert({ doc, id, language }: QueueDocParams): Promise<void> {
    const index = this.index;
    this.docs.set(id, doc);

    function recursiveTrieInsertion(doc: object) {
      for (const key in doc) {
        if (typeof (doc as any)[key] === "object") {
          throw ERRORS.UNSUPPORTED_NESTED_PROPERTIES();
          // @todo nested properties will be supported with the nexrt versions
          // recursiveTrieInsertion((doc as any)[key]);
        } else if (typeof (doc as any)[key] === "string") {
          const requestedTrie = index.get(key);
          const tokens = tokenize((doc as any)[key], language);
++        const tokenFrequency = getTokenFrequency(tokens); // <-------- new function to count token frequency

          for (const token of tokens) {
--            requestedTrie?.insert(token, id);
++            requestedTrie?.insert(token, [id, frequency]); // <------ just an idea, it doesn't have to be an array at all costs
          }
        }
      }
    }

    recursiveTrieInsertion(doc);
  }

I would evolve around this idea to keep the time complexity as low as possible

mateonunez · 2022-07-15T17:48:14Z

Going "pseudo-logic":

  private async _insert({ doc, id, language }: QueueDocParams): Promise<void> {

    const index = this.index;

    this.docs.set(id, doc);



    function recursiveTrieInsertion(doc: object) {

      for (const key in doc) {

        if (typeof (doc as any)[key] === "object") {

          throw ERRORS.UNSUPPORTED_NESTED_PROPERTIES();

          // @todo nested properties will be supported with the nexrt versions

          // recursiveTrieInsertion((doc as any)[key]);

        } else if (typeof (doc as any)[key] === "string") {

          const requestedTrie = index.get(key);

          const tokens = tokenize((doc as any)[key], language);

++        const tokenFrequency = getTokenFrequency(tokens); // <-------- new function to count token frequency



          for (const token of tokens) {

--            requestedTrie?.insert(token, id);

++            requestedTrie?.insert(token, [id, frequency]); // <------ just an idea, it doesn't have to be an array at all costs

          }

        }

      }

    }



    recursiveTrieInsertion(doc);

  }

I would evolve around this idea to keep the time complexity as low as possible

Totally agree. I had thought that the best solution was to work directly on the composing Tries action.

I would like to play with your pseudo-logic and if the result is interesting I might do a PR, removing the "sort" in favour of #13.

DanieleFedeli · 2022-07-15T18:50:50Z

Cannot be implemented via sort of a score?

tf-idf can be a good candidate.

micheleriva · 2022-07-16T07:30:40Z

@DanieleFedeli yes, tf-idf could be the way to go. When inserting a new document, we do that by creating a queue and indexing one doc at a time. That means that on very large documents, we could spend some more time analyzing the tokens using tf-idf or a similar technique (Jaccard index, etc.)

micheleriva · 2022-07-16T07:51:53Z

FYI I'm working on a prototype with tf-idf, let's see how it goes

micheleriva · 2022-07-30T10:30:41Z

FYI, the work is proceeding on #63

tlahav · 2022-10-19T14:39:34Z

Should this ticket be closed?

micheleriva added the good first issue Good for newcomers label Jul 14, 2022

micheleriva added the v0.1.0 label Jul 16, 2022

micheleriva removed the v0.1.0 label Aug 1, 2022

micheleriva closed this as completed Oct 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Count term occurrencies in document #12

Count term occurrencies in document #12

micheleriva commented Jul 14, 2022

mateonunez commented Jul 14, 2022

micheleriva commented Jul 14, 2022

mateonunez commented Jul 14, 2022 •

edited

Loading

micheleriva commented Jul 15, 2022

micheleriva commented Jul 15, 2022

mateonunez commented Jul 15, 2022

mateonunez commented Jul 15, 2022

micheleriva commented Jul 15, 2022

mateonunez commented Jul 15, 2022 •

edited

Loading

micheleriva commented Jul 15, 2022

micheleriva commented Jul 15, 2022 •

edited

Loading

mateonunez commented Jul 15, 2022

DanieleFedeli commented Jul 15, 2022

micheleriva commented Jul 16, 2022

micheleriva commented Jul 16, 2022

micheleriva commented Jul 30, 2022

tlahav commented Oct 19, 2022

Count term occurrencies in document #12

Count term occurrencies in document #12

Comments

micheleriva commented Jul 14, 2022

mateonunez commented Jul 14, 2022

micheleriva commented Jul 14, 2022

mateonunez commented Jul 14, 2022 • edited Loading

micheleriva commented Jul 15, 2022

micheleriva commented Jul 15, 2022

mateonunez commented Jul 15, 2022

mateonunez commented Jul 15, 2022

micheleriva commented Jul 15, 2022

mateonunez commented Jul 15, 2022 • edited Loading

micheleriva commented Jul 15, 2022

micheleriva commented Jul 15, 2022 • edited Loading

mateonunez commented Jul 15, 2022

DanieleFedeli commented Jul 15, 2022

micheleriva commented Jul 16, 2022

micheleriva commented Jul 16, 2022

micheleriva commented Jul 30, 2022

tlahav commented Oct 19, 2022

mateonunez commented Jul 14, 2022 •

edited

Loading

mateonunez commented Jul 15, 2022 •

edited

Loading

micheleriva commented Jul 15, 2022 •

edited

Loading