Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count term occurrencies in document #12

Closed
micheleriva opened this issue Jul 14, 2022 · 17 comments
Closed

Count term occurrencies in document #12

micheleriva opened this issue Jul 14, 2022 · 17 comments
Labels
good first issue Good for newcomers

Comments

@micheleriva
Copy link
Member

Is your feature request related to a problem? Please describe.
Right now, we're not considering how many times a term appears in a document.

For instance, given the following strings:

  • "It's alive! It's alive!"
  • "I saw them alive"

When searching for "alive", the first string should have priority as the term "alive" appears twice.

@micheleriva micheleriva added the good first issue Good for newcomers label Jul 14, 2022
@mateonunez
Copy link
Collaborator

That sounds great @micheleriva! I think it would be a good idea to implement the sort parameter (and thus the method), within the search function, to sort (using occurrencies just for the moment) the elements found. If sort is not specified, the selection would act as usual showing the data as insert order.

Occurrences are very important data in a search action, could be they selected as default?

@micheleriva
Copy link
Member Author

@mateonunez could you write some pseudocode to explain how the sort function would work? I wouldn't make it optional, Lyra should always search by taking occurrences in consideration

@mateonunez
Copy link
Collaborator

mateonunez commented Jul 14, 2022

@micheleriva I have prepared a branch with working (and incomplete) code for it.

The sort options accepted are: none (as usual) and relevance (by occurrences). The "relevance" could be accepted as default if you want to implement a search by occurrences natively.

There are 2 methods: getOccurrences(document, index, term) that return the count of the occurrences in the document (uses a Regex for it). The other method is sortResults(results, sort) with a simple switch to manage all accepted sorting algorithms.

@micheleriva
Copy link
Member Author

@mateonunez thanks! Any idea on how that affects performances?

@micheleriva
Copy link
Member Author

Also, in which scenario would I want to sort the results by "random" order?

@mateonunez
Copy link
Collaborator

Initially, this implementation affects a lot of the performance of the search action.

Current timing:

{
  elapsed: '114μs',
  hits: [
    {
      id: 'PmLDkPyPWlRPG5qFcmS_7',
      quote: "It's alive! It's alive!",
      author: 'H.P. Lovecraft'
    }
  ],
  count: 1
}

With occurrences implementation:

{
  elapsed: '158μs',
  hits: [
    {
      id: 'qUMSJWBHvxJ0Islfwpfpz',
      quote: "It's alive! It's alive!",
      author: 'H.P. Lovecraft',
      occurrences: 2
    }
  ],
  count: 1
}

That implementation cost ~27.8% more than the search time.

With the sort implementation:

{
  elapsed: '162μs',
  hits: [
    {
      id: 'KGpFwdn5mbXgEpCRXamhd',
      quote: "It's alive! It's alive!",
      author: 'H.P. Lovecraft',
      occurrences: 2
    }
  ],
  count: 1
}

The first occurrencesTime cost about 40-50μs. Then the cost it's about 0-3μs:

    {
      id: 'kdNXamfKo_CCFpTATPOvk',
      quote: 'I saw them alive. Alive man. You got it? Alive.',
      author: 'H.P. Lovecraft',
      occurrences: 3,
      occurrencesTime: '46μs'
    },
    {
      id: 'DQ0mXcVNSnv4hJ8Du34k8',
      quote: "It's alive! It's alive!",
      author: 'H.P. Lovecraft',
      occurrences: 2,
      occurrencesTime: '3μs'
    },
    {
      id: '_E5tQaXAXwuNmNFqSoCea',
      quote: 'I saw them alive. Alive man. You got it? Alive.',
      author: 'H.P. Lovecraft',
      occurrences: 3,
      occurrencesTime: '1μs'
    },
    {
      id: '7HqOPlGMRJtXXtKdh9eXp',
      quote: 'alive in the middle of the night.',
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '1μs'
    },
    {
      id: 'JJ0Mec0ze_ivNFrZO4grh',
      quote: "Oh, I, oh I'm still alive",
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '1μs'
    },
    {
      id: 'gSfPxZwmoY2tcu05aywoS',
      quote: 'I feel alive',
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '0μs'
    },
    {
      id: 'NbYe5iNSLkJBwKNRj97ti',
      quote: 'Alive is the only thing I know',
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '0μs'
    },
    {
      id: 'fXGKFc6oFdVMFk4OBc7Og',
      quote: 'Alive is the only thing I know',
      author: 'H.P. Lovecraft',
      occurrences: 1,
      occurrencesTime: '0μs'
    }

Regarding the random (or first in first out) order I was imagining to a situation where we need the data as we usually handle queues. So the first document inserted in the database will be the first document selected in the search method.

@mateonunez
Copy link
Collaborator

I've tried it with 1M of documents. The result is amazing! @micheleriva

Without implementation:

{
  elapsed: '421ms',
  hits: [
    {
      id: 'ZpRIoxvyzosuI6yNW47cx',
      quote: 'I saw them alive. Alive man. Did you get it? Alive.',
      author: 'H.P. Lovecraft'
    }
  ],
  count: 1000000
}

With implementation:

{
  elapsed: '426ms',
  hits: [
    {
      id: '3PEwoThPgs0we8IBlxIoi',
      quote: 'I saw them alive. Alive man. Did you get it? Alive.',
      author: 'H.P. Lovecraft',
      occurrences: 3
    }
  ],
  count: 1000000
}

@micheleriva
Copy link
Member Author

This is pretty rad. Would you mind opening a PR?

@mateonunez
Copy link
Collaborator

mateonunez commented Jul 15, 2022

I could work on it, but there is just one question to understand: what should Lyra do in case the limit parameter is set? Should it compute all occurrences (sort them) and then truncate the output? This operation could require more stress.

Currently, Lyra limits the results quitting from the cycle that retrieves the documents, then the occurrences are computed.
This means that: if I add 10 consecutive documents (increasing the occurrence by 1 each cycle) and then set the limit to 2, Lyra will only consider the first 2 documents (the first documents inserted).

Simple Search with Relevance

{
  elapsed: '244μs',
  hits: [
    {
      id: 'DhQVMo6RI5IMpJyr8C5xf',
      quote: 'Repeat with me: alive, alive, alive',
      author: 'Oscar Wilde',
      occurrences: 3
    },
    {
      id: 'w6dIlijToS9GOaKEG6fBm',
      quote: 'Repeat with me: alive, alive',
      author: 'Oscar Wilde',
      occurrences: 2
    },
    {
      id: 'ME5jTyPK7slIZK9p0tu6u',
      quote: 'Repeat with me: alive',
      author: 'Oscar Wilde',
      occurrences: 1
    }
  ],
  count: 3
}

Search with Relevance and Limit set to 1:

{
  elapsed: '22μs',
  hits: [
    {
      id: 'ME5jTyPK7slIZK9p0tu6u',
      quote: 'Repeat with me: alive',
      author: 'Oscar Wilde',
      occurrences: 1
    }
  ],
  count: 3
}

What do you think about it?

@micheleriva
Copy link
Member Author

@mateonunez linear search is not a good option; I think we might count word occurrences of a given term in a document during the indexing process and use this information to determine where which doc contains the same word the most.

This is also related to #13, as the exact order search is more important than the occurrences of a term in a document.

I think we could start with this issue and then move on with #13

@micheleriva
Copy link
Member Author

micheleriva commented Jul 15, 2022

Going "pseudo-logic":

  private async _insert({ doc, id, language }: QueueDocParams): Promise<void> {
    const index = this.index;
    this.docs.set(id, doc);

    function recursiveTrieInsertion(doc: object) {
      for (const key in doc) {
        if (typeof (doc as any)[key] === "object") {
          throw ERRORS.UNSUPPORTED_NESTED_PROPERTIES();
          // @todo nested properties will be supported with the nexrt versions
          // recursiveTrieInsertion((doc as any)[key]);
        } else if (typeof (doc as any)[key] === "string") {
          const requestedTrie = index.get(key);
          const tokens = tokenize((doc as any)[key], language);
++        const tokenFrequency = getTokenFrequency(tokens); // <-------- new function to count token frequency

          for (const token of tokens) {
--            requestedTrie?.insert(token, id);
++            requestedTrie?.insert(token, [id, frequency]); // <------ just an idea, it doesn't have to be an array at all costs
          }
        }
      }
    }

    recursiveTrieInsertion(doc);
  }

I would evolve around this idea to keep the time complexity as low as possible

@mateonunez
Copy link
Collaborator

Going "pseudo-logic":

  private async _insert({ doc, id, language }: QueueDocParams): Promise<void> {

    const index = this.index;

    this.docs.set(id, doc);



    function recursiveTrieInsertion(doc: object) {

      for (const key in doc) {

        if (typeof (doc as any)[key] === "object") {

          throw ERRORS.UNSUPPORTED_NESTED_PROPERTIES();

          // @todo nested properties will be supported with the nexrt versions

          // recursiveTrieInsertion((doc as any)[key]);

        } else if (typeof (doc as any)[key] === "string") {

          const requestedTrie = index.get(key);

          const tokens = tokenize((doc as any)[key], language);

++        const tokenFrequency = getTokenFrequency(tokens); // <-------- new function to count token frequency



          for (const token of tokens) {

--            requestedTrie?.insert(token, id);

++            requestedTrie?.insert(token, [id, frequency]); // <------ just an idea, it doesn't have to be an array at all costs

          }

        }

      }

    }



    recursiveTrieInsertion(doc);

  }

I would evolve around this idea to keep the time complexity as low as possible

Totally agree. I had thought that the best solution was to work directly on the composing Tries action.

I would like to play with your pseudo-logic and if the result is interesting I might do a PR, removing the "sort" in favour of #13.

@DanieleFedeli
Copy link
Contributor

Cannot be implemented via sort of a score?

tf-idf can be a good candidate.

@micheleriva
Copy link
Member Author

@DanieleFedeli yes, tf-idf could be the way to go. When inserting a new document, we do that by creating a queue and indexing one doc at a time. That means that on very large documents, we could spend some more time analyzing the tokens using tf-idf or a similar technique (Jaccard index, etc.)

@micheleriva
Copy link
Member Author

FYI I'm working on a prototype with tf-idf, let's see how it goes

@micheleriva
Copy link
Member Author

FYI, the work is proceeding on #63

@micheleriva micheleriva removed the v0.1.0 label Aug 1, 2022
@tlahav
Copy link

tlahav commented Oct 19, 2022

Should this ticket be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

4 participants