[Feature Request] Add mapping information for single-/multi-valued fields #16420

Open
msfroh opened this issue Oct 22, 2024 · 2 comments · May be fixed by #16601
Labels: enhancement, Indexing, Search:Query Capabilities

Comments

@msfroh
Collaborator

msfroh commented Oct 22, 2024

Is your feature request related to a problem? Please describe

Every field in an OpenSearch index is allowed to be multivalued: for example, any keyword field will also accept an array of keyword values. This works because, under the hood, Lucene doesn't distinguish between adding a field once and adding it multiple times.

Unfortunately, for cases that try to project a fixed schema (like the SQL plugin or the proposed join support in core), it's useful to make a distinction between a field that represents a single keyword and one that represents an array of keywords. We could treat every field as an array, but a lot of fields would come out as arrays of length 1 (since, at least in my experience, the majority of fields are single-valued).

Describe the solution you'd like

It would be great if we could add a property in a mapping that conveys whether a field is single- or multi-valued. Unfortunately, from a backwards compatibility standpoint, we can't just add a new required property, since we would break all existing index mappings.

My suggestion is that we add an optional multivalued property for field mappings. Essentially, this property would have three possible values:

  1. true, meaning the field should be treated as an array,
  2. false, meaning that the field only has a single value -- a document with multiple values for the field will be rejected -- or
  3. null, meaning that we don't know. This is the state when the field was dynamically added to the mapping or was specified in a mapping without a value for the multivalued property.

I would also suggest that if a document specifies multiple values for a field where multivalued is null, we should update the mapping to set multivalued to true. (Maybe we can't do that if dynamic mapping changes are disabled.)
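
The three states and the suggested null-to-true promotion could be modeled roughly as follows. This is a minimal Python sketch, not OpenSearch code: the function name `apply_multivalued` and the `dynamic_updates_allowed` flag are hypothetical, and leaving the property as null when dynamic mapping changes are disabled is only one possible answer to the open question above.

```python
from typing import Optional


def apply_multivalued(multivalued: Optional[bool], value_count: int,
                      dynamic_updates_allowed: bool = True) -> Optional[bool]:
    """Return the (possibly updated) `multivalued` setting after indexing a
    document that carries `value_count` values for the field.

    Raises ValueError when the document would have to be rejected.
    """
    # false: the field is single-valued, so multiple values are rejected.
    if multivalued is False and value_count > 1:
        raise ValueError("field is single-valued but document has multiple values")
    # null: unknown so far; seeing multiple values promotes the field to true,
    # assuming (hypothetically) that dynamic mapping updates are permitted.
    if multivalued is None and value_count > 1 and dynamic_updates_allowed:
        return True
    # true stays true; null with a single value stays unknown.
    return multivalued
```

Under this sketch, `apply_multivalued(None, 3)` promotes the field to `True`, while `apply_multivalued(None, 1)` leaves it as `None`, because a single value does not prove the field is single-valued.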

Going forward, if we add this property in OpenSearch 2.x, maybe we can make it mandatory for new indices created in OpenSearch 3.0. (Of course, we would still need to support the OpenSearch 2.x null behavior, at least until OpenSearch 4.0 is released.) Starting in OpenSearch 3.0, we could dynamically infer the property from the first document containing a given field (which would require a bit of work, since we would need to distinguish between "fieldA":"foo" and "fieldA":["foo"], where the former would be single-valued and the latter would be multivalued).
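
To illustrate the distinction that inference would rely on, here is a small Python sketch that looks at the raw JSON source of the first document containing a field. The helper name `infer_multivalued` is hypothetical, and it only handles top-level fields; a real implementation would live in the document parser and would need to handle nested objects as well.

```python
import json
from typing import Optional


def infer_multivalued(doc_source: str, field: str) -> Optional[bool]:
    """Infer the `multivalued` property from one document's JSON source.

    Returns True if the field is written as a JSON array (even of length 1),
    False if it is written as a bare value, and None if the field is absent.
    """
    doc = json.loads(doc_source)
    if field not in doc:
        return None  # nothing to infer from this document
    return isinstance(doc[field], list)
```

With this, `{"fieldA": "foo"}` would be inferred as single-valued and `{"fieldA": ["foo"]}` as multivalued, even though both index identically today.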

Related component

Indexing

Describe alternatives you've considered

I was chatting with @anirudha today about the idea of making it a search-time problem: knowing the schema is useful at search time, whereas indexing "just works" right now. Essentially, you could take a hint at search time to force an interpretation for a field.

You could also make a best effort to guess whether a field has multiple values by inspecting a sample of documents (the first 500?). Since you may want the coordinator to get a response from each shard with the same interpretation, you could do a preliminary search phase (kind of like can_match) to ask each shard to vote on the arity of each field. If any shard says a field is multivalued, we would interpret it as multivalued.
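
The voting idea could look something like the following Python sketch. It is only a model of the logic: `shard_vote` and `coordinator_decision` are hypothetical names, each shard is represented by a sequence of per-document value counts for one field, and the sample size of 500 is the tentative number floated above.

```python
from itertools import islice
from typing import Iterable

SAMPLE_SIZE = 500  # tentative sample size from the discussion above


def shard_vote(value_counts: Iterable[int], sample_size: int = SAMPLE_SIZE) -> bool:
    """A shard votes "multivalued" if any sampled document has more than
    one value for the field. Only the first `sample_size` documents are
    inspected, so this is a best-effort guess."""
    return any(count > 1 for count in islice(value_counts, sample_size))


def coordinator_decision(shard_votes: Iterable[bool]) -> bool:
    """The coordinator treats the field as multivalued if any shard says so."""
    return any(shard_votes)
```

Because each shard only samples, a multivalued document beyond the sample would be missed, which is why this remains a guess rather than a guarantee.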

Additional context

I'm categorizing this as "Indexing", but the property is mostly useful at search time. I think I'll add the "Search:Query Capabilities" label too.

@msfroh added the enhancement and untriaged labels Oct 22, 2024
@github-actions bot added the Indexing label Oct 22, 2024
@RS146BIJAY
Contributor

@msfroh We evaluated this feature request during a triage meeting, and it seems like a nice feature to add to OpenSearch. Looking forward to more discussion on this.

@normanj-bitquill

This would be useful for the SQL plugin.

normanj-bitquill added a commit to Bit-Quill/OpenSearch that referenced this issue Nov 8, 2024
* Can only be used for field types that support multiple values
* If a field has the multivalued property, then new documents must have an array for its value

Signed-off-by: Norman Jordan <norman.jordan@improving.com>
@normanj-bitquill normanj-bitquill linked a pull request Nov 8, 2024 that will close this issue
3 tasks