Count term occurrences in document #12
Comments
That sounds great @micheleriva! I think it would be a good idea to implement the Occurrences are very important data in a search action; could they be selected by default? |
@mateonunez could you write some pseudocode to explain how the |
@micheleriva I have prepared a branch with working (and incomplete) code for it. The There are 2 methods: |
@mateonunez thanks! Any idea how that affects performance? |
Also, in which scenario would I want to sort the results by "random" order? |
Initially, this implementation has a noticeable impact on the performance of the search action. Current timing: {
elapsed: '114μs',
hits: [
{
id: 'PmLDkPyPWlRPG5qFcmS_7',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft'
}
],
count: 1
} With occurrences implementation: {
elapsed: '158μs',
hits: [
{
id: 'qUMSJWBHvxJ0Islfwpfpz',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2
}
],
count: 1
} That is about 38.6% slower than the current search (114μs to 158μs); put differently, counting occurrences accounts for ~27.8% of the new total. With the sort implementation: {
elapsed: '162μs',
hits: [
{
id: 'KGpFwdn5mbXgEpCRXamhd',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2
}
],
count: 1
} The first {
id: 'kdNXamfKo_CCFpTATPOvk',
quote: 'I saw them alive. Alive man. You got it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3,
occurrencesTime: '46μs'
},
{
id: 'DQ0mXcVNSnv4hJ8Du34k8',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2,
occurrencesTime: '3μs'
},
{
id: '_E5tQaXAXwuNmNFqSoCea',
quote: 'I saw them alive. Alive man. You got it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3,
occurrencesTime: '1μs'
},
{
id: '7HqOPlGMRJtXXtKdh9eXp',
quote: 'alive in the middle of the night.',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '1μs'
},
{
id: 'JJ0Mec0ze_ivNFrZO4grh',
quote: "Oh, I, oh I'm still alive",
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '1μs'
},
{
id: 'gSfPxZwmoY2tcu05aywoS',
quote: 'I feel alive',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
},
{
id: 'NbYe5iNSLkJBwKNRj97ti',
quote: 'Alive is the only thing I know',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
},
{
id: 'fXGKFc6oFdVMFk4OBc7Og',
quote: 'Alive is the only thing I know',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
} Regarding the |
I've tried it with 1M documents. The result is amazing! @micheleriva Without implementation: {
elapsed: '421ms',
hits: [
{
id: 'ZpRIoxvyzosuI6yNW47cx',
quote: 'I saw them alive. Alive man. Did you get it? Alive.',
author: 'H.P. Lovecraft'
}
],
count: 1000000
} With implementation: {
elapsed: '426ms',
hits: [
{
id: '3PEwoThPgs0we8IBlxIoi',
quote: 'I saw them alive. Alive man. Did you get it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3
}
],
count: 1000000
} |
This is pretty rad. Would you mind opening a PR? |
I could work on it, but there is one open question: what should Lyra do in case the Currently, Lyra limits the results by breaking out of the loop that retrieves the documents, and only then computes the occurrences. Simple search with relevance: {
elapsed: '244μs',
hits: [
{
id: 'DhQVMo6RI5IMpJyr8C5xf',
quote: 'Repeat with me: alive, alive, alive',
author: 'Oscar Wilde',
occurrences: 3
},
{
id: 'w6dIlijToS9GOaKEG6fBm',
quote: 'Repeat with me: alive, alive',
author: 'Oscar Wilde',
occurrences: 2
},
{
id: 'ME5jTyPK7slIZK9p0tu6u',
quote: 'Repeat with me: alive',
author: 'Oscar Wilde',
occurrences: 1
}
],
count: 3
} Search with Relevance and Limit set to 1: {
elapsed: '22μs',
hits: [
{
id: 'ME5jTyPK7slIZK9p0tu6u',
quote: 'Repeat with me: alive',
author: 'Oscar Wilde',
occurrences: 1
}
],
count: 3
} What do you think about it? |
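To make the question concrete, here is a minimal sketch of the two possible orderings. The `Hit` shape and both function names are hypothetical, and the retrieval order of the three documents is assumed from the output above; none of this is Lyra's actual code.

```typescript
// Hypothetical hit shape, modelled on the outputs shown in this thread.
interface Hit {
  id: string;
  quote: string;
  occurrences: number;
}

// Truncate first, then sort: only the first `limit` retrieved hits are ranked.
function limitThenSort(hits: Hit[], limit: number): Hit[] {
  return hits.slice(0, limit).sort((a, b) => b.occurrences - a.occurrences);
}

// Sort first, then truncate: the globally best hits survive the limit.
function sortThenLimit(hits: Hit[], limit: number): Hit[] {
  return [...hits].sort((a, b) => b.occurrences - a.occurrences).slice(0, limit);
}

// Hits in (assumed) retrieval order, with counts taken from the example above.
const retrieved: Hit[] = [
  { id: "ME5jTyPK7slIZK9p0tu6u", quote: "Repeat with me: alive", occurrences: 1 },
  { id: "w6dIlijToS9GOaKEG6fBm", quote: "Repeat with me: alive, alive", occurrences: 2 },
  { id: "DhQVMo6RI5IMpJyr8C5xf", quote: "Repeat with me: alive, alive, alive", occurrences: 3 },
];

const truncatedFirst = limitThenSort(retrieved, 1);
const rankedFirst = sortThenLimit(retrieved, 1);
```

With a limit of 1, `truncatedFirst` holds the single-occurrence document while `rankedFirst` holds the one with three occurrences. The second ordering gives the more relevant hit, but it has to know the occurrences of every match before truncating, which is where the extra cost comes from.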
@mateonunez linear search is not a good option; I think we could count the occurrences of each term in a document during the indexing process and use this information to determine which document contains the term most often. This is also related to #13, as the exact order search is more important than the occurrences of a term in a document. I think we could start with this issue and then move on to #13 |
Going "pseudo-logic": private async _insert({ doc, id, language }: QueueDocParams): Promise<void> {
  const index = this.index;
  this.docs.set(id, doc);
  function recursiveTrieInsertion(doc: object) {
    for (const key in doc) {
      if (typeof (doc as any)[key] === "object") {
        throw ERRORS.UNSUPPORTED_NESTED_PROPERTIES();
        // @todo nested properties will be supported in the next versions
        // recursiveTrieInsertion((doc as any)[key]);
      } else if (typeof (doc as any)[key] === "string") {
        const requestedTrie = index.get(key);
        const tokens = tokenize((doc as any)[key], language);
++      const tokenFrequency = getTokenFrequency(tokens); // <-------- new function to count token frequency (assumed to return a token -> count map)
        for (const token of tokens) {
--        requestedTrie?.insert(token, id);
++        requestedTrie?.insert(token, [id, tokenFrequency.get(token)]); // <------ just an idea, it doesn't have to be an array at all costs
        }
      }
    }
  }
  recursiveTrieInsertion(doc);
} I would build on this idea to keep the time complexity as low as possible |
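For illustration, this is roughly what `getTokenFrequency` and the index-time bookkeeping could look like. It is a sketch under assumptions: `tokens` is taken to be an array that still contains duplicates, `naiveTokenize` is a stand-in and not Lyra's `tokenize`, and a plain `Map` replaces the trie so the example stays self-contained. `indexDocument`, `searchByFrequency` and the document ids are made up for the example.

```typescript
// Hypothetical helper named in the pseudo-logic above: counts how often each
// token appears. Assumes `tokens` is an array that still contains duplicates.
function getTokenFrequency(tokens: string[]): Map<string, number> {
  const frequency = new Map<string, number>();
  for (const token of tokens) {
    frequency.set(token, (frequency.get(token) ?? 0) + 1);
  }
  return frequency;
}

// Stand-in tokenizer for this example only; it is not Lyra's `tokenize`.
function naiveTokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// token -> (document id -> frequency); a plain Map standing in for the trie.
type Postings = Map<string, Map<string, number>>;

// Stores the per-document frequency of every token at insertion time.
function indexDocument(index: Postings, id: string, text: string): void {
  getTokenFrequency(naiveTokenize(text)).forEach((count, token) => {
    const docs = index.get(token) ?? new Map<string, number>();
    docs.set(id, count);
    index.set(token, docs);
  });
}

// Returns [id, frequency] pairs for `term`, most frequent first.
function searchByFrequency(index: Postings, term: string): Array<[string, number]> {
  const docs = index.get(term) ?? new Map<string, number>();
  return Array.from(docs.entries()).sort((a, b) => b[1] - a[1]);
}

const index: Postings = new Map();
indexDocument(index, "doc-1", "I saw them alive");
indexDocument(index, "doc-2", "It's alive! It's alive!");
const ranked = searchByFrequency(index, "alive");
```

Here `ranked` lists `doc-2` (frequency 2) before `doc-1` (frequency 1). The counting happens once per document at insertion, so a search only has to order the stored frequencies for the matched term instead of rescanning each document's text.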
Totally agree. I had thought that the best solution was to work directly on the step that builds the tries. I would like to play with your pseudo-logic and, if the result is interesting, I might open a PR, removing the "sort" in favour of #13. |
Couldn't this be implemented via some sort of score? tf-idf could be a good candidate. |
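As a rough sketch of what a tf-idf score would compute (not a proposal for Lyra's internals): the function names are hypothetical, documents are assumed to be pre-tokenized into lowercase arrays, and the idf uses one common smoothed variant, log(1 + N / df).

```typescript
// Term frequency: share of a document's tokens that equal `term`.
function tf(term: string, docTokens: string[]): number {
  if (docTokens.length === 0) return 0;
  const hits = docTokens.filter((t) => t === term).length;
  return hits / docTokens.length;
}

// Inverse document frequency, smoothed as log(1 + N / df) so that a term
// present in every document still gets a positive weight.
function idf(term: string, corpus: string[][]): number {
  const df = corpus.filter((doc) => doc.indexOf(term) !== -1).length;
  return df === 0 ? 0 : Math.log(1 + corpus.length / df);
}

// Score of `term` for one document of the corpus.
function tfIdf(term: string, docTokens: string[], corpus: string[][]): number {
  return tf(term, docTokens) * idf(term, corpus);
}

// Pre-tokenized example corpus (assumed tokenization, illustration only).
const corpus: string[][] = [
  ["it's", "alive", "it's", "alive"],
  ["i", "saw", "them", "alive"],
  ["the", "night", "is", "dark"],
];

const scores = corpus.map((doc) => tfIdf("alive", doc, corpus));
```

In this corpus the first document scores higher than the second, and the third scores 0. Note that this tf is normalized by document length, so a short document with one occurrence can outrank a long one with several; whether that is desirable here is a separate design decision.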
@DanieleFedeli yes, |
FYI I'm working on a prototype with |
FYI, the work is proceeding on #63 |
Should this ticket be closed? |
Is your feature request related to a problem? Please describe.
Right now, we're not considering how many times a term appears in a document.
For instance, given the following strings:
"It's alive! It's alive!"
"I saw them alive"
When searching for "alive", the first string should have priority as the term "alive" appears twice.