[Feature Request] Add mapping information for single-/multi-valued fields #16420
Labels: enhancement, Indexing, Search:Query Capabilities
Is your feature request related to a problem? Please describe
Fields in an OpenSearch index are all allowed to be multivalued. Any `keyword` field will also accept an array of `keyword` values. This all works, because under the hood, Lucene doesn't really make a distinction between adding a field once and adding it multiple times.
Unfortunately, for cases that try to project a fixed schema (like the SQL plugin or the proposed join support in core), it's useful to make a distinction between a field that represents a single `keyword` and one that represents an array of `keyword`s. We could treat every field as an array, but a lot of fields would come out as arrays of length 1 (since, at least in my experience, the majority of fields are single-valued).
Describe the solution you'd like
It would be great if we could add a property in a mapping that conveys whether a field is single- or multi-valued. Unfortunately, from a backwards compatibility standpoint, we can't just add a new required property, since we would break all existing index mappings.
My suggestion is that we add an optional `multivalued` property for field mappings. Essentially, this property would have three possible values: `true`, meaning the field should be treated as an array; `false`, meaning that the field only has a single value (a document with multiple values for the field will be rejected); or `null`, meaning that we don't know. A `null` value means the field was dynamically added to the mapping or the field was specified in a mapping without a value for the `multivalued` property.
I would also suggest that if a document specifies multiple values for a field where `multivalued` is `null`, we should update the mapping to set `multivalued` to `true`. (Maybe we can't do that if dynamic mapping changes are disabled.)
Going forward, if we add this property in OpenSearch 2.x, maybe we can make it mandatory for new indices created in OpenSearch 3.0. (Of course, we would still need to support the OpenSearch 2.x `null` behavior, at least until OpenSearch 4.0 is released.) Starting in OpenSearch 3.0, we could dynamically infer the property from the first document containing a given field (which would require a bit of work, since we would need to distinguish between `"fieldA":"foo"` and `"fieldA":["foo"]`, where the former would be single-valued and the latter would be multivalued).
Related component
Indexing
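To make the proposed `multivalued` semantics a bit more concrete, here is a minimal Python sketch of the three states and the `null`-to-`true` promotion described above. It is only an illustration under assumptions: `multivalued` is the property proposed in this issue, not an existing OpenSearch mapping parameter, and `apply_value` and `infer_multivalued` are hypothetical helper names (a real implementation would live in the Java mapping code).

```python
from typing import Any, Optional

# Hypothetical mapping fragment. `multivalued` is the property proposed in
# this issue; it is not an existing OpenSearch mapping parameter.
mapping = {
    "properties": {
        "tags": {"type": "keyword", "multivalued": True},
        "status": {"type": "keyword", "multivalued": False},
        "notes": {"type": "keyword"},  # property omitted -> treated as null
    }
}


def infer_multivalued(value: Any) -> bool:
    """Infer the property from the first value seen for a field.

    "fieldA": "foo"   -> single-valued (False)
    "fieldA": ["foo"] -> multivalued (True), even with one element
    """
    return isinstance(value, list)


def apply_value(field_mapping: dict, value: Any, dynamic: bool = True) -> None:
    """Check one document value against the field's `multivalued` setting.

    - False + more than one value: reject the document (ValueError here).
    - None (null) + more than one value: promote the mapping to True, but
      only when dynamic mapping changes are allowed.
    - Everything else: accept without touching the mapping.
    """
    multivalued: Optional[bool] = field_mapping.get("multivalued")
    has_multiple = isinstance(value, list) and len(value) > 1

    if multivalued is False and has_multiple:
        raise ValueError("field is single-valued but document has multiple values")
    if multivalued is None and has_multiple and dynamic:
        field_mapping["multivalued"] = True
```

Note that the sketch only rejects a `false` field when it actually receives more than one value; whether a one-element array such as `["foo"]` should also be rejected is an open question this sketch does not try to settle.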
Describe alternatives you've considered
I was chatting with @anirudha today about an idea of making it a search-time problem, since it's at search time that knowing the schema is useful (since indexing "just works" right now). Essentially, you could take a hint at search time to force an interpretation for a field.
You could also make a best effort to guess whether a field has multiple values by inspecting a sample of documents (the first 500?). Since you may want the coordinator to get a response from each shard with the same interpretation, you could do a preliminary search phase (kind of like `can_match`) to ask each shard to vote on the arity of each field. If any shard says a field is multivalued, we would interpret it as multivalued.
Additional context
I'm categorizing this as "Indexing", but the property is mostly useful at search time. I think I'll add the "Search:Query Capabilities" label too.
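For the search-time alternative, the shard-voting idea could look roughly like the Python sketch below. It assumes each shard can cheaply iterate over a sample of its documents as plain dicts; `shard_vote` and `merge_votes` are hypothetical names, and the sample size of 500 is just the number floated above, not a tested value.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

SAMPLE_SIZE = 500  # the "first 500?" guess from the discussion above


def shard_vote(docs: Iterable[Dict[str, Any]],
               sample_size: int = SAMPLE_SIZE) -> Dict[str, bool]:
    """One shard's vote: field -> True if any sampled doc holds several values."""
    votes: Dict[str, bool] = {}
    for doc in islice(docs, sample_size):
        for field, value in doc.items():
            multi = isinstance(value, list) and len(value) > 1
            votes[field] = votes.get(field, False) or multi
    return votes


def merge_votes(shard_votes: List[Dict[str, bool]]) -> Dict[str, bool]:
    """Coordinator side: a field is multivalued if any shard says so."""
    merged: Dict[str, bool] = {}
    for votes in shard_votes:
        for field, multi in votes.items():
            merged[field] = merged.get(field, False) or multi
    return merged
```

Because the merge is a logical OR, every shard ends up with the same interpretation regardless of which shard happened to see the multivalued documents. The guess is still best effort: a field whose only multivalued documents fall outside every shard's sample would be reported as single-valued.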