Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add query matching terms in a set #1539

Merged
merged 8 commits into from
Sep 28, 2022
Merged

Add query matching terms in a set #1539

merged 8 commits into from
Sep 28, 2022

Conversation

trinity-1686a
Copy link
Contributor

fix the 2nd half of #1494


impl TermSetQuery {
/// Create a Term Set Query
pub fn new(field: Field, terms: BTreeSet<Term>) -> Self {
Copy link
Collaborator

@adamreichold adamreichold Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTreeSet is a good data structure for a set that is continuously changing while always staying sorted and using the sorting to speed element access. In this case, it seems that the set is accessed only once to produce a weight.

Maybe it would be nicer to simply require terms: T where T: IntoIterator<Item=Term>, collect this into a Vec (where terms.into_iter().collect::<Vec<_>>() would not allocate if the iterator is created from a Vec) and sort and deduplicate this Vec once in this constructor?

I think this could yield nicer API and simpler and hence faster code, but then again it could be insignificant and thereby not worth it especially if one expects the terms to be already presented as a BTreeSet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the BTreeSet personally.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the BTreeSet personally.

I would not say that I dislike it (or any other data structure for that matter), just that it brings more to the table than is required here. I do think it is arguably more complex than say Vec which is why I tried to suggest the simplest possible data structure for this particular task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should I do then? Personally I prefer using a BTreeSet, but I understand an IntoIterator is easier to provide in general

Copy link
Collaborator

@shikhar shikhar Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a potential user of this query*, if the type was a BTreeSet we would have to create one just for this purpose, and the query just needs immutable sorted & deduped data for which Vec suffices. I would vote for IntoIterator.

* currently doing union boolean query over tens of IDs -- looks like this should be much more efficient!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntoIterator it is then...

and then within the function:
IntoIterator -> HashSet -> Vec -> Sort.

@codecov-commenter
Copy link

codecov-commenter commented Sep 24, 2022

Codecov Report

Merging #1539 (9648495) into main (d641979) will decrease coverage by 0.08%.
The diff coverage is 95.91%.

@@            Coverage Diff             @@
##             main    #1539      +/-   ##
==========================================
- Coverage   93.92%   93.83%   -0.09%     
==========================================
  Files         249      251       +2     
  Lines       45903    46264     +361     
==========================================
+ Hits        43114    43414     +300     
- Misses       2789     2850      +61     
Impacted Files Coverage Δ
src/query/mod.rs 100.00% <ø> (ø)
src/query/set_query.rs 95.91% <95.91%> (ø)
src/fastfield/bytes/reader.rs 70.58% <0.00%> (-6.84%) ⬇️
src/indexer/delete_queue.rs 94.76% <0.00%> (-3.46%) ⬇️
src/fastfield/multivalued/reader.rs 91.07% <0.00%> (-2.51%) ⬇️
src/directory/mmap_directory.rs 90.37% <0.00%> (-0.71%) ⬇️
common/src/serialize.rs 85.89% <0.00%> (-0.09%) ⬇️
common/src/bitset.rs 98.48% <0.00%> (-0.02%) ⬇️
src/schema/mod.rs 100.00% <0.00%> (ø)
src/fastfield/mod.rs 99.71% <0.00%> (ø)
... and 11 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

src/query/set_query.rs Outdated Show resolved Hide resolved
src/query/set_query.rs Outdated Show resolved Hide resolved
src/query/set_query.rs Outdated Show resolved Hide resolved
src/query/set_query.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pleae have a look at the clippy stuff too.

src/query/set_query.rs Outdated Show resolved Hide resolved
src/query/set_query.rs Outdated Show resolved Hide resolved
support using different fields
use less and_then and more if-let
@trinity-1686a
Copy link
Contributor Author

Should I depreciate BooleanQuery::new_multiterms_query? It does the same thing, but is probably slower as soon as there is more than one term per field being searched

impl TermSetQuery {
/// Create a Term Set Query
pub fn new<T: IntoIterator<Item = Term>>(terms: T) -> Self {
let mut terms_map: HashMap<_, Vec<_>> = terms
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let mut terms_map = HashMap::default();
for term in terms {
terms_map.entry(field).or_default().push(term);
}


is much easier to read IMHO

src/query/set_query.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants