Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add query matching terms in a set #1539

Merged
merged 8 commits into from
Sep 28, 2022
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions src/query/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ mod range_query;
mod regex_query;
mod reqopt_scorer;
mod scorer;
mod set_query;
mod term_query;
mod union;
mod weight;
Expand Down Expand Up @@ -58,6 +59,7 @@ pub use self::score_combiner::{
DisjunctionMaxCombiner, ScoreCombiner, SumCombiner, SumWithCoordsCombiner,
};
pub use self::scorer::Scorer;
pub use self::set_query::TermSetQuery;
pub use self::term_query::TermQuery;
pub use self::union::Union;
#[cfg(test)]
Expand Down
198 changes: 198 additions & 0 deletions src/query/set_query.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
use std::collections::BTreeSet;

use tantivy_fst::raw::CompiledAddr;
use tantivy_fst::{Automaton, Map};

use crate::query::{AutomatonWeight, Query, Weight};
use crate::schema::Field;
use crate::{Searcher, Term};

/// A Term Set Query matches all of the documents containing any of the Term provided
///
/// Terms not using the right Field are discared.
#[derive(Debug, Clone)]
pub struct TermSetQuery {
field: Field,
terms: BTreeSet<Term>,
}

impl TermSetQuery {
/// Create a Term Set Query
pub fn new(field: Field, terms: BTreeSet<Term>) -> Self {
Copy link
Collaborator

@adamreichold adamreichold Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTreeSet is a good data structure for a set that is continuously changing while always staying sorted and using the sorting to speed element access. In this case, it seems that the set is accessed only once to produce a weight.

Maybe it would be nicer to simply require terms: T where T: IntoIterator<Item=Term>, collect this into a Vec (where terms.into_iter().collect::<Vec<_>>() would not allocate if the iterator is created from a Vec) and sort and deduplicate this Vec once in this constructor?

I think this could yield nicer API and simpler and hence faster code, but then again it could be insignificant and thereby not worth it especially if one expects the terms to be already presented as a BTreeSet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the BTreeSet personally.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the BTreeSet personally.

I would not say that I dislike it (or any other data structure for that matter), just that it brings more to the table than is required here. I do think it is arguably more complex than say Vec which is why I tried to suggest the simplest possible data structure for this particular task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should I do then? Personally I prefer using a BTreeSet, but I understand an IntoIterator is easier to provide in general

Copy link
Collaborator

@shikhar shikhar Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a potential user of this query*, if the type was a BTreeSet we would have to create one just for this purpose, and the query just needs immutable sorted & deduped data for which Vec suffices. I would vote for IntoIterator.

* currently doing union boolean query over tens of IDs -- looks like this should be much more efficient!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntoIterator it is then...

and then within the function:
IntoIterator -> HashSet -> Vec -> Sort.

TermSetQuery { field, terms }
}

fn specialized_weight(
&self,
searcher: &Searcher,
) -> crate::Result<AutomatonWeight<SetDfaWrapper>> {
let field_entry = searcher.schema().get_field_entry(self.field);
let field_type = field_entry.field_type();
if !field_type.is_indexed() {
let error_msg = format!("Field {:?} is not indexed.", field_entry.name());
return Err(crate::TantivyError::SchemaError(error_msg));
}
let field_type = field_type.value_type();

// In practice this won't fail because:
// - we are writing to memory, so no IoError
// - BTreeSet are ordered, and we limit ourselves to values with a fixed prefix (which we
trinity-1686a marked this conversation as resolved.
Show resolved Hide resolved
// strip), so Map::from_iter get values in order
let map = Map::from_iter(
self.terms
.iter()
.filter(|key| key.field() == self.field && key.typ() == field_type)
.map(|key| (key.value_bytes(), 0)),
)
.map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;

Ok(AutomatonWeight::new(self.field, SetDfaWrapper(map)))
}
}

impl Query for TermSetQuery {
fn weight(
&self,
searcher: &Searcher,
_scoring_enabled: bool,
) -> crate::Result<Box<dyn Weight>> {
Ok(Box::new(self.specialized_weight(searcher)?))
trinity-1686a marked this conversation as resolved.
Show resolved Hide resolved
}
}

struct SetDfaWrapper(Map<Vec<u8>>);

impl Automaton for SetDfaWrapper {
type State = Option<CompiledAddr>;

fn start(&self) -> Self::State {
Some(self.0.as_ref().root().addr())
}
fn is_match(&self, state: &Self::State) -> bool {
state
trinity-1686a marked this conversation as resolved.
Show resolved Hide resolved
.map(|s| self.0.as_ref().node(s).is_final())
.unwrap_or(false)
}
fn accept(&self, state: &Self::State, byte: u8) -> Self::State {
state.and_then(|state| {
trinity-1686a marked this conversation as resolved.
Show resolved Hide resolved
let state = self.0.as_ref().node(state);
let transition = state.find_input(byte)?;
Some(state.transition_addr(transition))
})
}

fn can_match(&self, state: &Self::State) -> bool {
state.is_some()
}
}

#[cfg(test)]
mod tests {
use std::collections::BTreeSet;

use crate::collector::TopDocs;
use crate::query::TermSetQuery;
use crate::schema::{Schema, TEXT};
use crate::{assert_nearly_equals, Index, Term};

#[test]
pub fn test_term_set_query() -> crate::Result<()> {
let mut schema_builder = Schema::builder();
let field1 = schema_builder.add_text_field("field1", TEXT);
let field2 = schema_builder.add_text_field("field1", TEXT);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
{
let mut index_writer = index.writer_for_tests()?;
index_writer.add_document(doc!(
field1 => "doc1",
field2 => "val1",
))?;
index_writer.add_document(doc!(
field1 => "doc2",
field2 => "val2",
))?;
index_writer.add_document(doc!(
field1 => "doc3",
field2 => "val3",
))?;
index_writer.commit()?;
}
let reader = index.reader()?;
let searcher = reader.searcher();

{
// single element
let mut terms = BTreeSet::new();
terms.insert(Term::from_field_text(field1, "doc1"));

let term_set_query = TermSetQuery::new(field1, terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(2))?;
assert_eq!(top_docs.len(), 1, "Expected 1 document");
let (score, _) = top_docs[0];
assert_nearly_equals!(1.0, score);
}

{
// single element, absent
let mut terms = BTreeSet::new();
terms.insert(Term::from_field_text(field1, "doc4"));

let term_set_query = TermSetQuery::new(field1, terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(1))?;
assert!(top_docs.is_empty(), "Expected 0 document");
}

{
// multiple elements
let mut terms = BTreeSet::new();
terms.insert(Term::from_field_text(field1, "doc1"));
terms.insert(Term::from_field_text(field1, "doc2"));

let term_set_query = TermSetQuery::new(field1, terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(2))?;
assert_eq!(top_docs.len(), 2, "Expected 2 documents");
for (score, _) in top_docs {
assert_nearly_equals!(1.0, score);
}
}

{
// single element, wrong field
let mut terms = BTreeSet::new();
terms.insert(Term::from_field_text(field1, "doc1"));

let term_set_query = TermSetQuery::new(field2, terms);
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(1))?;
assert!(top_docs.is_empty(), "Expected 0 document");
}
{
// multiple elements, mixed fields
let mut terms = BTreeSet::new();
terms.insert(Term::from_field_text(field1, "doc1"));
terms.insert(Term::from_field_text(field2, "val2"));

let term_set_query = TermSetQuery::new(field1, terms.clone());
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(2))?;

assert_eq!(top_docs.len(), 1, "Expected 1 document");
let (score, _) = top_docs[0];
assert_nearly_equals!(1.0, score);

let term_set_query = TermSetQuery::new(field2, terms.clone());
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(2))?;

assert_eq!(top_docs.len(), 1, "Expected 1 document");
let (score, _) = top_docs[0];
assert_nearly_equals!(1.0, score);
let term_set_query = TermSetQuery::new(field1, terms.clone());
let top_docs = searcher.search(&term_set_query, &TopDocs::with_limit(2))?;

assert_eq!(top_docs.len(), 1, "Expected 1 document");
let (score, _) = top_docs[0];
assert_nearly_equals!(1.0, score);
}

Ok(())
}
}