Queryable object fields #25312

Closed
clintongormley opened this issue Jun 20, 2017 · 19 comments
Assignees
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@clintongormley
Contributor

Often we have large object fields with many sub-fields, only a few of which are needed for aggregations, sorting, or highlighting. Today, we create fields for all sub-fields, but we could greatly reduce the number of required fields if we make object fields queryable.

We would need a specialized analyzer that can accept JSON and transform an object like:

{
    "status": "active",
    "age": 25,
    "city": "New York"
}

into the following terms:

"status:active", "age:25", "city:new", "city:york"

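A minimal sketch of the transformation described above, written in Python rather than as a real Lucene analyzer. The function name object_terms and the tokenization rule (lowercase, split on non-word characters) are assumptions made for illustration; the proposal does not specify them.

```python
import re

def object_terms(obj, prefix=""):
    """Flatten a JSON-like dict into 'key:token' terms (hypothetical analyzer)."""
    terms = []
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects contribute their own keys, joined with dots.
            terms.extend(object_terms(value, prefix=path + "."))
        else:
            # Assumed tokenization: lowercase, then split on non-word characters.
            for token in re.findall(r"\w+", str(value).lower()):
                terms.append(f"{path}:{token}")
    return terms

doc = {"status": "active", "age": 25, "city": "New York"}
print(object_terms(doc))
# ['status:active', 'age:25', 'city:new', 'city:york']
```
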
Then you could search for active statuses with:

{
  "match": {
    "my_object": "status:active"
  }
}

or

{
  "query_string": {
    "query": "my_object:\"status:active\""
  }
}

We could possibly even support searching for "New York" vs "New City of York" with:

{
  "match_phrase": {
    "my_object": "New York"
  }
}

which would be rewritten as my_object:"city:new city:york"
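
The phrase rewrite could work roughly as in the sketch below. It is only an illustration: the helper names are hypothetical, the tokenization is assumed, and since the proposal does not say how the key would be chosen, the sketch simply tries every top-level key of the object.

```python
import re

def tokenize(text):
    # Assumed tokenization: lowercase, split on non-word characters.
    return re.findall(r"\w+", str(text).lower())

def positioned_terms(obj):
    """Map each token position to its 'key:token' term."""
    terms = {}
    position = 0
    for key, value in obj.items():
        for token in tokenize(value):
            terms[position] = f"{key}:{token}"
            position += 1
    return terms

def match_phrase(obj, phrase):
    """True if the phrase occurs at consecutive positions under a single key."""
    words = tokenize(phrase)
    if not words:
        return False
    terms = positioned_terms(obj)
    for key in obj:
        # Rewrite "New York" into ["city:new", "city:york"] for key "city".
        wanted = [f"{key}:{word}" for word in words]
        for start in terms:
            if all(terms.get(start + i) == term for i, term in enumerate(wanted)):
                return True
    return False

print(match_phrase({"city": "New York"}, "New York"))          # True
print(match_phrase({"city": "New City of York"}, "New York"))  # False
```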

If we wanted to be able to aggregate on the age field, the object field could be mapped as:

{
  "my_object": {
    "type": "object",
    "index": true,
    "dynamic": false,
    "properties": {
      "age": {
        "type": "integer"
      }
    }
  }
}

With this mapping, only the my_object.age sub-field would have its own Lucene field (or Elasticsearch field) and the rest of the object would be queryable via the my_object field.

This could even be made to work on the whole document by allowing the _source field to be configurable.

@clintongormley clintongormley added :Search Foundations/Mapping Index mappings, including merging and defining field types discuss >feature labels Jun 20, 2017
@clintongormley
Contributor Author

We could also index terms both with and without the field prefix (e.g. status:active, active) to make it work more like a normal field.

@nik9000
Member

nik9000 commented Jun 20, 2017

I wonder if you could make this more transparent so the normal thing works:

PUT idx/_mapping
{
  "doc": {
    "properties": {
      "my_object": {
        "type": "keyvalue",
        "value_type": "keyword"
      }
    }
  }
}

POST idx/_search
{
  "match": {
    "my_object.status": "active"
  }
}

The prefixing would be an implementation detail of keyvalue. I'd be inclined to only support keyword and text style fields in there. That'd let you tell someone "stick whatever you want in this field and we'll index it and you won't pay a cost every time you introduce a new field". I feel like giving numbers their own lucene field makes this less foolproof. I think for something like this more foolproof is better than more featureful.

@s1monw
Contributor

s1monw commented Jun 21, 2017

> I wonder if you could make this more transparent so the normal thing works:

I think we should not make it complicated or restrictive. It's very simple: we decide how to index and make things work without any special type. You just mark the mapping as "dynamic": false and "indexed": true. I don't necessarily understand what is trappy here; we don't support these parameters there yet.

> The prefixing would be an implementation detail of keyvalue. I'd be inclined to only support keyword and text style fields in there. That'd let you tell someone "stick whatever you want in this field and we'll index it and you won't pay a cost every time you introduce a new field".

I don't get what you mean here. I assume you want to barf on numerics? I think we should just make text out of it and be done with it. We can add some position increment gaps to make phrases work, but from this perspective it's really just one big pile of text.

> I feel like giving numbers their own lucene field makes this less foolproof. I think for something like this more foolproof is better than more featureful.

You know they can't use any kind of numeric-specific features here, since we won't support aggregations etc. Let's keep it simple and not have many exceptions. That's what I'd like, though.
I really don't want to get into the number / date / lat,long detection game: when a value comes in as a string and the next time as a number, it's going to be a bad user experience, and this is supposed to fix stuff that is changing on a regular basis, i.e. Twitter data.
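
To illustrate the position increment gap idea mentioned above, here is a small sketch in which every value of the object is treated as plain text and a gap is left between consecutive values, so that a phrase cannot match across two of them. The function names and the gap of 100 are assumptions for illustration, not the actual implementation.

```python
import re

POSITION_GAP = 100  # assumed gap inserted between consecutive values

def positions(values, gap=POSITION_GAP):
    """Assign a position to every token, leaving a gap after each value."""
    by_position = {}
    position = 0
    for value in values:
        for token in re.findall(r"\w+", str(value).lower()):
            by_position[position] = token
            position += 1
        position += gap
    return by_position

def phrase_matches(values, phrase, gap=POSITION_GAP):
    """True if the phrase tokens appear at consecutive positions."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return False
    by_position = positions(values, gap)
    return any(
        all(by_position.get(start + i) == word for i, word in enumerate(words))
        for start in by_position
    )

values = ["active new", "york 25"]
print(phrase_matches(values, "new york", gap=0))  # True: the phrase spans two values
print(phrase_matches(values, "new york"))         # False: the gap keeps them apart
print(phrase_matches(values, "york 25"))          # True: both tokens are in one value
```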

@nik9000
Member

nik9000 commented Jun 21, 2017

> I think we should just make text out of it and be done with it.

That is what I mean.

The way I read the original proposal I thought that we would automatically give numerics their own Lucene field somehow and I didn't like that. I agree that we shouldn't do numerics.

In Clint's example you can pull out numeric fields. I wonder if we can exclude those fields from the indexed object fields.

Could we make queries to the indexed objects look "normal" like I had in my example?

I think it'd be nice to have an example of configuring the field type of the index - whether the strings are analyzed like text or keyword and how you'd set up multifields. And setting up the analyzer/normalizer/etc.

@s1monw
Contributor

s1monw commented Jun 22, 2017

> The way I read the original proposal I thought that we would automatically give numerics their own Lucene field somehow and I didn't like that. I agree that we shouldn't do numerics.
> In Clint's example you can pull out numeric fields. I wonder if we can exclude those fields from the indexed object fields.

We are on the same page here. Let's not be smart, but simple.

@clintongormley
Contributor Author

Discussed in FixItFriday. We're going to start simple and see what feedback we get. We will index each term alone plus with a path prefix (eg path\0term), so users can query my_object:term but we could also automatically convert my_object.some.path:term to a query for my_object:some.path\0term.
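
A rough sketch of this scheme, assuming a simple lowercase tokenizer and hypothetical helper names (index_terms, rewrite_query). Each token is indexed once on its own and once prefixed with its path and a \0 separator, and a query on a sub-path is rewritten into a query on the root field.

```python
import re

SEPARATOR = "\0"

def index_terms(obj, path=""):
    """Return the set of terms indexed for an object: bare and path-prefixed."""
    terms = set()
    for key, value in obj.items():
        full_path = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            terms |= index_terms(value, full_path)
        else:
            for token in re.findall(r"\w+", str(value).lower()):
                terms.add(token)                             # term alone
                terms.add(f"{full_path}{SEPARATOR}{token}")  # path + separator + term
    return terms

def rewrite_query(field, term, root="my_object"):
    """Rewrite a query on root.some.path into a prefixed term on the root field."""
    if field == root:
        return root, term
    if field.startswith(root + "."):
        sub_path = field[len(root) + 1:]
        return root, f"{sub_path}{SEPARATOR}{term}"
    raise ValueError(f"{field} is not under {root}")

terms = index_terms({"some": {"path": "Term"}, "status": "active"})
field, rewritten = rewrite_query("my_object.some.path", "term")
print(field, rewritten in terms)  # my_object True
print("active" in terms)          # True
```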

@roncohen

This possibly has a nice use-case in APM. We allow users to send up big blobs of custom objects which are not currently indexed. This would be a nice way to make those documents searchable without running the risk of field explosion (Thanks for the ping @ruflin!)

@javanna javanna removed their assignment Nov 15, 2017
@jpountz
Contributor

jpountz commented Mar 14, 2018

cc @elastic/es-search-aggs

@jtibshirani jtibshirani self-assigned this May 17, 2018
@jtibshirani
Contributor

jtibshirani commented May 28, 2018

+1 to @roncohen's comment about handling a potentially unbounded number of unique field names. I've come across a related use case in previous experience: a spreadsheet program where users can create sheet templates with arbitrary column names, and want to be able to search within columns by name.

@jtibshirani
Contributor

jtibshirani commented Aug 15, 2018

I’m now getting started on this in earnest. My main open question is whether it makes sense to add this functionality to objects, as opposed to creating a new data type as @nik9000 alluded to.

Under the current proposal, an object field would be made queryable as follows:

"my_object": {
    "type": "object",
    "dynamic": false,
    "index": true,
    "boost": 0.5,
    "properties": { … }
}

There are a few issues to ponder with this approach. First, it’s a bit subtle that setting dynamic: false and index: true is what indicates that the object is also a field mapping, and allows for other field mapping entries to be supplied. The mappings under my_object could also be misinterpreted as providing defaults for the mappings under properties.

Additionally, mixing in concrete field mappings can make the behavior less clear:

"my_object": {
    "type": "object",
    "dynamic": false,
    "index": true,
    "boost": 0.5,
    "properties": {
        "status": { … }
    }
}

Do we still index the un-prefixed values for status into the object field, so that a search for my_object:active will work (but not index the prefixed values)? If the queryable object field was introduced first, then the status field was introduced much later, which one do we end up searching in a query for my_object.status, as there is now data split across two fields?
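
The split-data question can be made concrete with a toy model. Everything in it is hypothetical: the class name, the term format, and in particular the assumption that a sub-field with its own mapping is no longer indexed into the object field.

```python
class QueryableObject:
    """Toy model of an object field that is also queryable as a whole."""

    def __init__(self):
        self.mapped = set()   # sub-fields that have their own mapping
        self.blob = []        # per document: prefixed terms in the object field
        self.dedicated = []   # per document: values stored in dedicated fields

    def index(self, obj):
        blob_terms, fields = set(), {}
        for key, value in obj.items():
            if key in self.mapped:
                fields[key] = str(value).lower()
            else:
                blob_terms.add(f"{key}:{str(value).lower()}")
        self.blob.append(blob_terms)
        self.dedicated.append(fields)

    def search_blob(self, key, value):
        return [i for i, terms in enumerate(self.blob) if f"{key}:{value}" in terms]

    def search_dedicated(self, key, value):
        return [i for i, fields in enumerate(self.dedicated) if fields.get(key) == value]

index = QueryableObject()
index.index({"status": "active"})   # indexed before "status" had a mapping
index.mapped.add("status")          # the "status" mapping is introduced later
index.index({"status": "active"})   # indexed after the mapping was added

print(index.search_blob("status", "active"))       # [0]
print(index.search_dedicated("status", "active"))  # [1]
```

Neither lookup alone finds both documents, which is the ambiguity a query for my_object.status would have to resolve.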

Finally, this syntax looks tricky to support given how the mapping + document parsing code is currently designed. In particular, an object mapper must now also function as a field mapper in certain contexts.

To avoid these problems, I wonder if it would be better to create a new field type, something similar to the following:

"http-headers": {
    "type": "key_value",
    "index": true,
    "boost": 0.5
}

This directly covers the use cases around handling opaque blobs of data. If certain important keys are known in advance (and should be made available for aggregations, etc.), they can be pulled into a separate field, with no special relation to the object field. We could maybe provide a mechanism similar to copy_to to help users to ‘promote’ certain keys into their own fields (thanks @colings86 for this suggestion).

@jpountz
Contributor

jpountz commented Aug 21, 2018

> Do we still index the un-prefixed values for status into the object field
> which one do we end up searching in a query for my_object.status, as there is now data split across two fields?

These are compelling arguments towards a dedicated indexed object field indeed. I guess we could still make it work on object fields by adding the list of fields that are indexed as kv-pairs to the mapping, but that would also defeat the purpose of this feature?

@jpountz
Contributor

jpountz commented Aug 21, 2018

Or alternatively, we could prevent (both dynamic and explicit) mapping updates to object fields that are indexed so that we could ensure that data is never split across two fields? To be clear, I'm not actually recommending it and still need to weigh pros and cons, I'm mostly adding it to make sure that we explore all options.

@jtibshirani
Contributor

We had a discussion offline and decided to create a new leaf field type for the reasons outlined above. As @jpountz mentioned, we didn't think it made sense to add a new field mapping for each key, as this would not address a major use case of the feature, which is to prevent mapping explosion.

Other conclusions from the discussion can be found on the meta-issue: #33003 (comment)

@jsoriano
Member

Would this allow storing objects with dots in their field names?

In Beats we have some cases where we would benefit from being able to store key-value objects (string to string) with dots in the keys, like in the subfields of docker.container.labels.* mentioned in the design issue, where dots are pretty common. We'd like to be able to store events with fields like:

{
  "docker": {
    "container": {
      "labels": {
        "com.docker.swarm.task": "foo",
        "com.docker.swarm.task.id": "xxxxx",
        "com.docker.swarm.task.name": "yyyyy"
      }
    }
  }
}

On queries, in principle, these fields would be used only for filtering.

At the moment users face mapping errors when they try to store data like this, and the only workarounds they have are to replace the dots with other characters (we offer this "dedotting" in some places), or to rename and/or drop the conflicting fields. This is not a very good user experience (see this topic in discuss for example), and makes them lose the original names of these labels, or completely discard some of them.

It'd be great if this new type could cover this case 🙂
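
The conflict behind those mapping errors can be reproduced with a small sketch that expands dotted keys into nested objects, which is how dots in field names are interpreted. The function expand_dots is a hypothetical stand-in for illustration, not Elasticsearch code.

```python
def expand_dots(flat):
    """Expand dotted keys into nested dicts, treating each dot as an object path."""
    root = {}
    for dotted, value in flat.items():
        node = root
        parts = dotted.split(".")
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # An earlier key already stored a plain value at this path.
                raise ValueError(f"'{part}' is both a value and an object in '{dotted}'")
            node = child
        leaf = parts[-1]
        if leaf in node and isinstance(node[leaf], dict):
            raise ValueError(f"'{leaf}' is both an object and a value in '{dotted}'")
        node[leaf] = value
    return root

labels = {
    "com.docker.swarm.task": "foo",
    "com.docker.swarm.task.id": "xxxxx",
    "com.docker.swarm.task.name": "yyyyy",
}

try:
    expand_dots(labels)
except ValueError as error:
    print("conflict:", error)
```

Here com.docker.swarm.task would have to be a string and an object at the same time, so the expansion fails on the second label.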

@jtibshirani
Contributor

Hi @jsoriano, as currently designed the new field type would support this sort of data. For example, com.docker.swarm.task.name would be treated as a single key, and you could issue a simple query on the field labels.com.docker.swarm.task.name.
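
As a rough illustration of why dotted keys are unproblematic here, the sketch below stores each label as a single key-plus-value term and strips only the field's own name from the query path. The separator, the keyword-style (untokenized) values, and the helper names are assumptions, not the actual implementation.

```python
SEPARATOR = "\0"

def index_labels(labels):
    """Index each label as one term; the dotted key is never split."""
    return {f"{key}{SEPARATOR}{value}" for key, value in labels.items()}

def lookup_term(field, value, root="labels"):
    """Turn a query on labels.<key> into the term to look up in the root field."""
    if not field.startswith(root + "."):
        raise ValueError(f"{field} is not under {root}")
    key = field[len(root) + 1:]  # everything after "labels." is one key
    return f"{key}{SEPARATOR}{value}"

labels = {
    "com.docker.swarm.task": "foo",
    "com.docker.swarm.task.id": "xxxxx",
    "com.docker.swarm.task.name": "yyyyy",
}
terms = index_labels(labels)

print(lookup_term("labels.com.docker.swarm.task.name", "yyyyy") in terms)  # True
print(lookup_term("labels.com.docker.swarm.task", "foo") in terms)         # True
```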

Also, you're not suggesting this in your comment, but just to be really clear -- this field type shouldn't be used as a general approach to handling dots in field names. It supports a much more restrictive set of search functionality than normal fields, and should only be used if it's the right fit for the particular data.

@jsoriano
Member

jsoriano commented Dec 3, 2018

Hi, an update about its possible use in Beats after some conversations offline.

I have started a PR (elastic/beats#9286) with the changes that would be needed, and after trying it a little bit it works quite well for our case.
In principle we'd need to introduce these changes in 7.0, as they would be breaking. Alternative options for our problems with these fields would be breaking too (like "dedotting" by default), so we'd have to make a decision for 7.0 in any case.

There are two main drawbacks to using this type in 7.0 already for labels:

  • It would be an experimental feature. But I think we could live with that, our current situation is quite prone to problems and this could be a good improvement in any case.
  • Lack of terms aggregations. Even if we don't use them directly in our solutions at the moment, it can be seen as a show-stopper soon as some users expect to be able to group by labels (see this topic for example).
    We could live with workarounds (like renaming fields from beats, or duplicating them with different types) till aggregations are implemented, but then it wouldn't be such an improvement for these cases.
    We could delay the adoption of this type till terms aggregation is implemented, but this could mean a breaking change for us in 7.x, or having to wait till 8.0.

If in the end we don't use it for labels, we could still consider using this type for Kubernetes annotations. We are not storing them by default now to avoid loads of dynamic field mappings; this type would help with this. And terms aggregations are less needed there.

I have also opened a discussion about the possible use of these fields in ECS (elastic/ecs#198).

@clintongormley
Contributor Author

@jsoriano Aggregations will not be implemented for queryable object fields. I don't think you should make plans based on this field type, it is too different from normal fields and will never support a number of features that you would expect, eg discoverability of the existence of the field via an API.

I think the correct way to deal with fields like:

        "com.docker.swarm.task": "foo",
        "com.docker.swarm.task.id": "xxxxx",
        "com.docker.swarm.task.name": "yyyyy"

is either to dedot them, or to rewrite com.docker.swarm.task to something like com.docker.swarm.task.main.

That way, these fields end up benefiting from all the features already supported.
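
Both workarounds can be sketched in a few lines. The function names, the underscore replacement character, and the rule for detecting a conflicting key (it is a dotted prefix of another key) are assumptions made for illustration.

```python
def dedot(labels, replacement="_"):
    """Replace dots in keys so that no key is interpreted as an object path."""
    return {key.replace(".", replacement): value for key, value in labels.items()}

def rename_conflicting(labels, suffix="main"):
    """Append a suffix to keys that are also a dotted prefix of another key."""
    renamed = {}
    for key, value in labels.items():
        is_prefix = any(other.startswith(key + ".") for other in labels)
        renamed[f"{key}.{suffix}" if is_prefix else key] = value
    return renamed

labels = {
    "com.docker.swarm.task": "foo",
    "com.docker.swarm.task.id": "xxxxx",
    "com.docker.swarm.task.name": "yyyyy",
}

print(dedot(labels))
print(rename_conflicting(labels))
```

With the second approach only com.docker.swarm.task changes (to com.docker.swarm.task.main), so the remaining labels keep their original names.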

@jtibshirani
Contributor

> Aggregations will not be implemented for queryable object fields.

Just a note that while aggregations are not planned for the first version of the feature, I don't think they're out of the question, and I'm investigating if it'd be possible to support some simple aggregation types like terms.

@jtibshirani
Contributor

The initial version of the feature was merged in #42541 and backported to 7.3. I filed a new issue #43805 to track follow-up improvements.

@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024