Multi-Value Support for Binary DocValues [LUCENE-10666] #11702
Comments
Navneet Verma (migrated from JIRA) Hi, I have been following the PR and would like to work on the Multi-Value Support for Binary DocValues issue (LUCENE-10666). Are there any concerns?
I haven't seen any objections, and it makes sense to me that we may want to have multiple values here, analogous to other doc values types.
The historical objection against multi-value binary support is that it could be easily implemented on top of binary doc values. So multi-value binary support would add API surface and push more complexity on codecs and We've been following this approach of encoding multi-valued fields in a single I wonder if there are intermediate options worth exploring, like adding tooling to make it easier to encode multiple values into a single |
The use-case here is also not great, talking about a doc having multiple locations. It's a pet peeve of mine, I don't think we should add a new major docvalues type for such crap :)
My muscle memory (and gmail filters) is stuck at jira :) So I missed these.
Curious why you think a document can't have multiple locations? Why else would the geo specifications (WKT, JSON, WKB, protobuf) have Multi geometry types? They do because multiple locations can exist for a single document. It happens all the time, especially in data science applications where multiple observations are collected concurrently in a single document scan (RADAR, Multi Hypothesis Tracking). I should be able to have a multi-value doc value for running facets, aggregations, and Spark jobs over the data stored in the Lucene segment, instead of trying to hack together a single encoding that stores all of the observations at once and then post-filtering after decoding that entire binary value.
Except in the case of shape doc values this isn't syntactic sugar. The way the centroid is computed for a multi value shape is based on weighted area of the individual geometries, and in this case the way the centroid is computed and stored in a shape doc value for a multi shape geometry is a hack because of this limitation. Assuming this is syntactic sugar just feels like a lazy way to not support multi-value companion for binary doc values. :/
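For readers unfamiliar with the area-weighted centroid mentioned above, here is a minimal planar sketch. It assumes each part of the multi geometry already has a known centroid and area; the class and method names (`WeightedCentroid`, `of`) are hypothetical and this is not how Lucene's shape doc values compute or store the value.

```java
/** Hypothetical illustration only; not a Lucene API. */
public final class WeightedCentroid {
  private WeightedCentroid() {}

  /**
   * Combines per-part centroids into one centroid, weighting each part by its area.
   *
   * @param centroids one {x, y} pair per part
   * @param areas the area of each part, in the same order
   * @return {x, y} of the combined centroid
   */
  public static double[] of(double[][] centroids, double[] areas) {
    if (centroids.length != areas.length) {
      throw new IllegalArgumentException("centroids and areas must have the same length");
    }
    double sumX = 0, sumY = 0, sumArea = 0;
    for (int i = 0; i < centroids.length; i++) {
      sumX += centroids[i][0] * areas[i];
      sumY += centroids[i][1] * areas[i];
      sumArea += areas[i];
    }
    if (sumArea == 0) {
      throw new IllegalArgumentException("total area must be non-zero");
    }
    return new double[] {sumX / sumArea, sumY / sumArea};
  }
}
```

For example, a part centered at (0, 0) with area 1 and a part centered at (4, 0) with area 3 combine to (3.0, 0.0). The point of the comment above is that this result depends on every individual geometry, so it cannot be recovered from a single merged value without knowing the parts.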
I was involved in a previous issue that is related to this one. The problem was a drop of performance when scanning Therefore, we created a prototype that implements multi-valued binary docvalues which works well. However, having some support for this use case directly in Lucene is preferable, be it a new docvalues or some tooling as proposed by @jpountz . Performance issues of scanning multi-valued binary data is probably something that would affect other use cases, e.g., the ESQL query language/engine proposed by Elastic. |
I don't think ESQL is going to be different from existing faceting support: it will still want to use ordinals when it makes sense such as grouping by term. It will still be up to users to configure their mappings correctly for the sort of aggregation that they plan to run: |
Because in the real world objects can only exist in one place at a time. That's an actual fact. And the way it works in the search engine, doing things like sorting by distance really only makes sense with single-valued fields. This is why I hate all multi-valued docvalues, because it's always so ambiguous. If I have 3 locations for the doc and I'm sorting by distance, which one should I use? Etc. etc. If someone wants to encode multiple values into a binary docvalues, nothing is stopping them. They can encode an integer/byte length up front, do a vint-like encoding, whatever they want.
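As a rough illustration of the "length up front, vint-like encoding" approach described in this comment, here is a minimal sketch using only the JDK. The class `MultiBinaryCodec` and its methods are hypothetical names, not part of Lucene; the layout (a vint count followed by vint-length-prefixed values) is one assumed choice among many.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not a Lucene API: packs several byte[] values into one blob. */
public final class MultiBinaryCodec {
  private MultiBinaryCodec() {}

  /** Layout: vint(count), then for each value vint(length) followed by its bytes. */
  public static byte[] encode(List<byte[]> values) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, values.size());
    for (byte[] value : values) {
      writeVInt(out, value.length);
      out.write(value, 0, value.length);
    }
    return out.toByteArray();
  }

  /** Reverses encode(); every value is materialized, even if the caller needs only one. */
  public static List<byte[]> decode(byte[] blob) {
    int[] pos = {0}; // mutable read cursor shared with readVInt
    int count = readVInt(blob, pos);
    List<byte[]> values = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      int len = readVInt(blob, pos);
      byte[] value = new byte[len];
      System.arraycopy(blob, pos[0], value, 0, len);
      pos[0] += len;
      values.add(value);
    }
    return values;
  }

  // Variable-length int: 7 payload bits per byte, high bit set when more bytes follow.
  private static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  private static int readVInt(byte[] blob, int[] pos) {
    int result = 0;
    int shift = 0;
    byte b;
    do {
      b = blob[pos[0]++];
      result |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return result;
  }
}
```

The resulting array could then be wrapped in a `BytesRef` and stored in a single `BinaryDocValuesField`. Note the trade-off this thread is debating: the whole blob has to be read and walked to reach any one value, and the encoding is private to the application rather than understood by the codec.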
Except in geo search / analysis this depends on spatial resolution of the source data; real world geo data is not precise and often ends up with multiple documents in the same location. Analysis mechanisms (e.g., aggregations) help to dedup or further analyze and score these documents. Clearly (not through a hack) supporting these use cases along with supporting coverage areas as a multi geometry shape only makes Lucene stronger. I don't think there's anything wrong w/ supporting standards like RFC 7946 in our encoding.
It depends on spatial resolution. Besides, we support these use cases already; what we don't support is the multi-shape use case above without an unnecessarily bloated, hacky encoding.
It's software, yes. Nothing is stopping anyone from encoding the Bible and calling it a geo_shape, either.
@scampi Would you be willing to contribute your multi-valued binary doc value implementation here? I think having multi-value parity with other doc value types is good to support multiple use cases like this. Per concerns already raised it would be good to slap warnings in the API doc that communicate potential trappy performance issues.
@jpountz This may be correct for aggregate operations. However, if you wish to support join operations in ESQL at some point, you'll need to scan the binary values rather than the ordinal values (as ordinals are not compatible with a join operation).
@nknize We do not see any problem in sharing the code, but our implementation is based on the Elasticsearch framework (on the |
Just want to point out that objects exist in many different places in space-time.
@rendel I haven't looked at that implementation since Elastic relicensed so my knowledge is dated. I presume you still use the ALv2 version? In other words, does it need anything other than the byte array list and corresponding |
@nknize Yes, that is similar to this implementation.
#11690 introduces a binary doc value format for shapes. Since the geometries are decomposed into triangles, the binary docvalue encoding can technically support multi shapes and geometry collections in a single doc value format; however, it feels like this is hacking around the lack of multi-value support for binary doc values. With multi-value binary, individual geometries can be stored in their own binary doc value, with multiple binary values per doc. I'd like to open this issue to explore adding multi-value support to binary doc values. Are there concerns, limitations, traps?
Migrated from LUCENE-10666 by Nick Knize (@nknize), 2 votes, updated Jul 29 2022