Add ExistsQuery for finding all documents where a certain field exists #1833

Closed
Hodkinson opened this issue Jan 28, 2023 · 6 comments

@Hodkinson

Sometimes it's necessary to find all documents that simply contain a field, regardless of its value.

An ExistsQuery could be added that allows querying for documents containing a certain field.

Currently RegexQuery::from_pattern(".*", field)? can be used for this, and probably a RangeQuery for numeric fields.

Hodkinson pushed a commit to Hodkinson/tantivy that referenced this issue Jan 28, 2023
Hodkinson pushed a commit to Hodkinson/tantivy that referenced this issue Jan 30, 2023
@shikhar
Collaborator

shikhar commented Jan 31, 2023

Another approach could be to model this with a separate field holding a keyword that indicates the data is present. This should perform better, since you don't need to iterate over the term dictionary and all the postings to eagerly accumulate a bitset at query time. You can then simply use a TermQuery, which will access the postings list for this keyword.
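A minimal sketch of the marker-field idea above, using a toy in-memory inverted index and not tantivy's actual data structures. The names `ToyIndex`, `PRESENT_FIELD`, and `docs_with_field` are hypothetical and chosen only for illustration. At indexing time each document also gets one marker term per populated field, so the existence check becomes a single posting-list lookup:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy inverted index: (field, term) -> sorted doc ids.
/// This is an illustration only, not tantivy's real layout.
#[derive(Default)]
struct ToyIndex {
    postings: BTreeMap<(String, String), BTreeSet<u32>>,
}

/// Hypothetical name for the extra marker field.
const PRESENT_FIELD: &str = "_present";

impl ToyIndex {
    /// Index a document. For every populated field, also record a marker
    /// term (the field's name) under the marker field.
    fn add_doc(&mut self, doc_id: u32, fields: &[(&str, &str)]) {
        for (field, value) in fields {
            self.postings
                .entry((field.to_string(), value.to_string()))
                .or_default()
                .insert(doc_id);
            self.postings
                .entry((PRESENT_FIELD.to_string(), field.to_string()))
                .or_default()
                .insert(doc_id);
        }
    }

    /// Existence check as one term lookup: no scan over the field's
    /// term dictionary is needed.
    fn docs_with_field(&self, field: &str) -> Vec<u32> {
        self.postings
            .get(&(PRESENT_FIELD.to_string(), field.to_string()))
            .map(|docs| docs.iter().copied().collect())
            .unwrap_or_default()
    }
}
```

The trade-off is extra index size and the need to populate the marker field consistently at indexing time.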

@PSeitz
Contributor

PSeitz commented Jan 31, 2023

We should have an ExistsQuery, similar to AllQuery, that can optionally be negated to express "not exists".

There are two ways to get this:

  1. Similar to FuzzyTermQuery or RegexQuery: build a BitSetDocSet upfront by scanning all terms and collecting their docs from the posting lists.
  2. Use the upcoming fast field index with support for null values. The fast field index can tell which docs have values. In contrast to option 1, it can be iterated lazily, with no need to construct a BitSetDocSet.
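The difference between the two options can be sketched with plain standard-library types. This is an illustration under simplifying assumptions, not tantivy code: the term dictionary is modelled as a `BTreeMap` from term to doc ids, the fast field as a slice with one optional value per document, and the function names `exists_eager` and `exists_lazy` are made up for the example.

```rust
use std::collections::BTreeMap;

/// Option 1: scan every term of the field and OR its posting list into a
/// bitset. The full cost (all postings) is paid before the first hit.
fn exists_eager(term_postings: &BTreeMap<String, Vec<u32>>, num_docs: usize) -> Vec<bool> {
    let mut bitset = vec![false; num_docs];
    for doc_ids in term_postings.values() {
        for &doc in doc_ids {
            bitset[doc as usize] = true;
        }
    }
    bitset
}

/// Option 2: a column that knows which docs have a value can be walked
/// lazily. Nothing is materialised; work happens only as the caller
/// pulls doc ids from the iterator.
fn exists_lazy(column: &[Option<u64>]) -> impl Iterator<Item = u32> + '_ {
    column
        .iter()
        .enumerate()
        .filter(|(_, value)| value.is_some())
        .map(|(doc, _)| doc as u32)
}
```

With the lazy variant, a caller that only needs the first few matches (or intersects with a very selective clause) never touches the rest of the column.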

@fulmicoton
Collaborator

This has been demanded by several quickwit users already.
@shikhar's solution is the way to go here:
Add an inverted list for the existence of the field. Where to store this field is TBD.
It could be a special _existence field, where the values are (field_name, field_type) pairs, for instance.
The posting list format is a tiny bit lame, as it still takes 1 bit per doc/field even when the field is saturated.

(@PSeitz not sure why you want a fast field for this...)

@PSeitz
Contributor

PSeitz commented Feb 2, 2023

There are different ways and places to store that information, and the fast field index already has it. With the typing in columnar, you could also cover exists_ip, exists_string, and exists_number, which will become more important when combining multiple data sources into a single index via JSON fields.
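To make the typed variants concrete, here is a rough sketch that keys existence information by (field_name, field_type), as suggested earlier in the thread. It says nothing about where tantivy would actually store this; `ExistenceField`, `FieldType`, and the method names are hypothetical. A type-specific check is one lookup, and an untyped check is the union over all types recorded for that field name:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Illustrative subset of value types a JSON field might hold.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum FieldType {
    Str,
    Number,
    Ip,
}

/// Postings of a hypothetical existence field: (field_name, field_type) -> doc ids.
#[derive(Default)]
struct ExistenceField {
    postings: BTreeMap<(String, FieldType), BTreeSet<u32>>,
}

impl ExistenceField {
    /// Record that `doc` has a value of type `ty` for `field`.
    fn record(&mut self, doc: u32, field: &str, ty: FieldType) {
        self.postings
            .entry((field.to_string(), ty))
            .or_default()
            .insert(doc);
    }

    /// Docs where `field` exists with exactly this type (e.g. an "exists_ip").
    fn exists_typed(&self, field: &str, ty: FieldType) -> BTreeSet<u32> {
        self.postings
            .get(&(field.to_string(), ty))
            .cloned()
            .unwrap_or_default()
    }

    /// Docs where `field` exists with any type: union over all recorded types.
    fn exists_any(&self, field: &str) -> BTreeSet<u32> {
        self.postings
            .iter()
            .filter(|((name, _), _)| name.as_str() == field)
            .flat_map(|(_, docs)| docs.iter().copied())
            .collect()
    }
}
```

This matters for JSON fields, where the same path may hold an IP in one document and a string in another.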

@williamho

this exists now, as of #2160:

though it doesn't seem like it supports the "optionally negated for a not exists" functionality mentioned above (in #1833 (comment)), which would be nice to have
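Conceptually, "not exists" is the complement of the exists set over all documents. A small standard-library sketch, assuming the existing doc ids arrive sorted ascending without duplicates (as a posting list would deliver them); the function name `not_exists` is hypothetical:

```rust
/// Given the sorted, de-duplicated doc ids where a field exists, return the
/// doc ids in `0..num_docs` where it does not.
fn not_exists(existing: &[u32], num_docs: u32) -> Vec<u32> {
    let mut present = existing.iter().copied().peekable();
    let mut missing = Vec::new();
    for doc in 0..num_docs {
        if present.peek() == Some(&doc) {
            // Field exists for this doc: consume it and move on.
            present.next();
        } else {
            missing.push(doc);
        }
    }
    missing
}
```

In tantivy itself, the same effect could presumably be obtained by combining an all-documents clause with a must-not clause on the exists query inside a boolean query. That is an assumption on my part and not something this thread confirms.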

@fulmicoton
Collaborator

@williamho thank you for spotting this! Closing the issue.
