Queryable object fields #25312

clintongormley · 2017-06-20T10:13:07Z

Often we have large object fields with many sub-fields, only a few of which are needed for aggregations, sorting, or highlighting. Today, we create fields for all sub-fields, but we could greatly reduce the number of required fields if we make object fields queryable.

We would need a specialised analyzer which can accept JSON and transform an object like:

{
    "status": "active",
    "age": 25,
    "city": "New York"
}

into the following terms:

"status:active", "age:25", "city:new", "city:york"

Then you could search for active statuses with:

{
  "match": {
    "my_object": "status:active"
  }
}

or

{
  "query_string": {
    "query": "my_object:\"status:active\""
  }
}
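The term generation described above can be sketched in a few lines. This is only an illustration of the idea, not the proposed analyzer: the function names `object_to_terms` and `matches` are made up for this example, and lowercasing plus whitespace splitting stands in for whatever analysis chain would really be used.

```python
import json


def object_to_terms(obj):
    """Turn a flat JSON object into lowercase "key:token" terms."""
    terms = []
    for key, value in obj.items():
        # Stringify every value, lowercase it, and split on whitespace,
        # so "New York" becomes the tokens "new" and "york".
        for token in str(value).lower().split():
            terms.append(f"{key}:{token}")
    return terms


def matches(terms, query_term):
    """A match on the object field is then a plain term lookup."""
    return query_term.lower() in terms


doc = json.loads('{"status": "active", "age": 25, "city": "New York"}')
terms = object_to_terms(doc)
print(terms)
print(matches(terms, "status:active"))
```

Because every value is stringified, numbers such as `age` are searchable only as text in this sketch.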

We could possibly even support searching for "New York" vs "New City of York" with:

{
  "match_phrase": {
    "my_object": "New York"
  }
}

which would be rewritten as my_object:"city:new city:york"

If we wanted to be able to aggregate on the age field, the object field could be mapped as:

{
  "my_object": {
    "type": "object",
    "index": true,
    "dynamic": false,
    "properties": {
      "age": {
        "type": "integer"
      }
    }
  }
}

With this mapping, only the my_object.age sub-field would have its own Lucene field (or Elasticsearch field) and the rest of the object would be queryable via the my_object field.
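A rough sketch of how such a mapping might route values at index time is shown below. It assumes that explicitly mapped sub-fields are kept out of the object field's terms, which the proposal does not state outright; `split_fields` and its parameters are hypothetical names.

```python
def split_fields(obj, mapped_properties, object_field="my_object"):
    """Route mapped keys to dedicated fields and everything else to terms."""
    dedicated = {}
    object_terms = []
    for key, value in obj.items():
        if key in mapped_properties:
            # Mapped sub-fields keep their typed value in their own field.
            dedicated[f"{object_field}.{key}"] = value
        else:
            # Assumption: unmapped keys are only indexed as prefixed text terms.
            for token in str(value).lower().split():
                object_terms.append(f"{key}:{token}")
    return dedicated, object_terms


doc = {"status": "active", "age": 25, "city": "New York"}
dedicated, object_terms = split_fields(doc, mapped_properties={"age"})
print(dedicated)
print(object_terms)
```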

This could even be made to work on the whole document by allowing the _source field to be configurable.


clintongormley · 2017-06-20T11:30:38Z

We could also index terms with and without the field prefix (eg status:active, active) to make it work more like a normal field.

nik9000 · 2017-06-20T13:10:10Z

I wonder if you could make this more transparent so the normal thing works:

PUT idx/_mapping
{
  "doc": {
    "properties": {
      "my_object": {
        "type": "keyvalue",
        "value_type": "keyword"
      }
    }
  }
}

POST idx/_search
{
  "match": {
    "my_object.status": "active"
  }
}

The prefixing would be an implementation detail of keyvalue. I'd be inclined to only support keyword and text style fields in there. That'd let you tell someone "stick whatever you want in this field and we'll index it and you won't pay a cost every time you introduce a new field". I feel like giving numbers their own lucene field makes this less foolproof. I think for something like this more foolproof is better than more featureful.

s1monw · 2017-06-21T07:36:28Z

> I wonder if you could make this more transparent so the normal thing works:

I think we should not make it complicated or restrictive. It's very simple: we decide how to index and make things work without any special type. You just mark the mapping as "dynamic": false and "indexed": true. I don't necessarily understand what is trappy here; we don't support these parameters there yet.

> The prefixing would be an implementation detail of keyvalue. I'd be inclined to only support keyword and text style fields in there. That'd let you tell someone "stick whatever you want in this field and we'll index it and you won't pay a cost every time you introduce a new field".

I don't get what you mean here. I assume you want to barf on numerics? I think we should just make text out of it and be done with it. We can add some position increment gaps to make phrases work, but from this perspective it's really just one big pile of text.

> I feel like giving numbers their own lucene field makes this less foolproof. I think for something like this more foolproof is better than more featureful.

You know they can't use any kind of numeric special things here since we won't support aggregations etc. Let's keep it simple and not have many exceptions. That's what I'd like though.
I really don't want to get into the number / date / lat,long detection game. When a value comes in as a string and the next time as a number, it's going to be a bad user experience, and this is supposed to fix stuff that is changing on a regular basis, i.e. Twitter data.
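To illustrate the position increment gap idea mentioned above: if all values end up in one pile of text, a gap between values keeps phrase queries from matching across two different values. The sketch below is a toy model, not Lucene code; `POSITION_GAP`, `index_positions` and `phrase_matches` are hypothetical names and the gap size is arbitrary.

```python
POSITION_GAP = 100  # hypothetical gap inserted between consecutive values


def index_positions(obj):
    """Record token positions, leaving a gap after each value."""
    postings = {}
    position = 0
    for value in obj.values():
        for token in str(value).lower().split():
            postings.setdefault(token, []).append(position)
            position += 1
        # Jump ahead so the next value's tokens are never adjacent to this one's.
        position += POSITION_GAP
    return postings


def phrase_matches(postings, phrase):
    """True if the phrase tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    if not tokens:
        return False
    for start in postings.get(tokens[0], []):
        if all(start + i in postings.get(tok, []) for i, tok in enumerate(tokens)):
            return True
    return False


same_value = index_positions({"city": "New York"})
split_values = index_positions({"state": "new", "team": "york"})
print(phrase_matches(same_value, "New York"))
print(phrase_matches(split_values, "New York"))
```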

nik9000 · 2017-06-21T13:12:12Z

> I think we should just make text out of it and be done with it.

That is what I mean.

The way I read the original proposal I thought that we would automatically give numerics their own Lucene field somehow and I didn't like that. I agree that we shouldn't do numerics.

In Clint's example you can pull out numeric fields. I wonder if we can exclude those fields from the indexed object fields.

Could we make queries to the indexed objects look "normal" like I had in my example?

I think it'd be nice to have an example of configuring the field type of the index - whether the strings are analyzed like text or keyword and how you'd set up multifields. And setting up the analyzer/normalizer/etc.

s1monw · 2017-06-22T14:46:16Z

> The way I read the original proposal I thought that we would automatically give numerics their own Lucene field somehow and I didn't like that. I agree that we shouldn't do numerics.
> In Clint's example you can pull out numeric fields. I wonder if we can exclude those fields from the indexed object fields.

We are on the same page here. Let's not be smart but simple.

clintongormley · 2017-06-30T14:12:01Z

Discussed in FixItFriday. We're going to start simple and see what feedback we get. We will index each term alone plus with a path prefix (eg path\0term), so users can query my_object:term but we could also automatically convert my_object.some.path:term to a query for my_object:some.path\0term.
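The scheme agreed on here can be sketched as follows. This is a simplified model under stated assumptions: `index_terms` and `rewrite_query` are hypothetical names, nested objects are walked recursively with dot-joined paths, and simple lowercase whitespace tokenisation stands in for real analysis.

```python
SEPARATOR = "\0"  # the null byte separates the path from the term


def index_terms(obj, path=""):
    """Emit each term on its own and again prefixed with its path."""
    terms = []
    for key, value in obj.items():
        full_path = f"{path}.{key}" if path else key
        if isinstance(value, dict):
            terms.extend(index_terms(value, full_path))
        else:
            for token in str(value).lower().split():
                terms.append(token)
                terms.append(f"{full_path}{SEPARATOR}{token}")
    return terms


def rewrite_query(field, term, object_field="my_object"):
    """Rewrite my_object.some.path:term to my_object:some.path\\0term."""
    if field == object_field:
        return object_field, term
    prefix = object_field + "."
    if field.startswith(prefix):
        return object_field, f"{field[len(prefix):]}{SEPARATOR}{term}"
    raise ValueError(f"{field} is not under {object_field}")


terms = index_terms({"some": {"path": "term"}})
field, rewritten = rewrite_query("my_object.some.path", "term")
print(rewritten in terms)
```

A query on `my_object` itself keeps the bare term, so both query styles resolve to a lookup in the same field.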

roncohen · 2017-11-15T13:52:16Z

This possibly has a nice use-case in APM. We allow users to send up big blobs of custom objects which are not currently indexed. This would be a nice way to make those documents searchable without running the risk of field explosion (Thanks for the ping @ruflin!)

jpountz · 2018-03-14T09:08:41Z

cc @elastic/es-search-aggs

jtibshirani · 2018-05-28T20:23:53Z

+1 to @roncohen's comment about handling a potentially unbounded number of unique field names. I've come across a related use case in previous experience: a spreadsheet program where users can create sheet templates with arbitrary column names, and want to be able to search within columns by name.

jtibshirani · 2018-08-15T19:48:28Z

I’m now getting started on this in earnest. My main open question is whether it makes sense to add this functionality to objects, as opposed to creating a new data type as @nik9000 alluded to.

Under the current proposal, an object field would be made queryable as follows:

"my_object": {
    "type": "object",
    "dynamic": false,
    "index": true,
    "boost": 0.5,
    "properties": { … }
}

There are a few issues to ponder with this approach. First, it’s a bit subtle that setting dynamic: false and index: true is what indicates that the object is also a field mapping, and allows for other field mapping entries to be supplied. The mappings under my_object could also be misinterpreted as providing defaults for the mappings under properties.

Additionally, mixing in concrete field mappings can make the behavior less clear:

"my_object": {
    "type": "object",
    "dynamic": false,
    "index": true,
    "boost": 0.5,
    "properties": {
        "status": { … }
    }
}

Do we still index the un-prefixed values for status into the object field, so that a search for my_object:active will work (but not index the prefixed values)? If the queryable object field was introduced first, then the status field was introduced much later, which one do we end up searching in a query for my_object.status, as there is now data split across two fields?

Finally, this syntax looks tricky to support given how the mapping + document parsing code is currently designed. In particular, an object mapper must now also function as a field mapper in certain contexts.

To avoid these problems, I wonder if it would be better to create a new field type, something similar to the following:

"http-headers": {
    "type": "key_value",
    "index": true,
    "boost": 0.5
}

This directly covers the use cases around handling opaque blobs of data. If certain important keys are known in advance (and should be made available for aggregations, etc.), they can be pulled into a separate field, with no special relation to the object field. We could maybe provide a mechanism similar to copy_to to help users to ‘promote’ certain keys into their own fields (thanks @colings86 for this suggestion).
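As a rough illustration of the "promote" idea, the sketch below copies selected keys out of the opaque blob into their own top-level fields before indexing, leaving the blob itself untouched (similar in spirit to copy_to). No such mechanism is specified in this thread; `promote_keys`, its parameters and the sample field names are all hypothetical.

```python
def promote_keys(doc, source_field, promotions):
    """Copy selected keys from the blob into separate top-level fields.

    `promotions` maps a key inside `source_field` to the name of the
    target field that should receive a copy of its value.
    """
    promoted = dict(doc)
    blob = doc.get(source_field, {})
    for key, target in promotions.items():
        if key in blob:
            promoted[target] = blob[key]
    return promoted


doc = {"http-headers": {"user-agent": "curl/7.58", "accept": "*/*"}}
result = promote_keys(doc, "http-headers", {"user-agent": "user_agent"})
print(result)
```

The promoted field could then be mapped as a normal keyword field and used for aggregations, while the remaining keys stay in the key/value field.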

jpountz · 2018-08-21T09:32:08Z

> Do we still index the un-prefixed values for status into the object field
> which one do we end up searching in a query for my_object.status, as there is now data split across two fields?

These are compelling arguments towards a dedicated indexed object field indeed. I guess we could still make it work on object fields by adding the list of fields that are indexed as kv-pairs to the mapping, but that would also defeat the purpose of this feature?

jpountz · 2018-08-21T09:43:35Z

Or alternatively, we could prevent (both dynamic and explicit) mapping updates to object fields that are indexed so that we could ensure that data is never split across two fields? To be clear, I'm not actually recommending it and still need to weigh pros and cons, I'm mostly adding it to make sure that we explore all options.

jtibshirani · 2018-09-07T19:07:07Z

We had a discussion offline and decided to create a new leaf field type for the reasons outlined above. As @jpountz mentioned we didn't think it made sense to add a new field mapping for each key, as this would not solve a major use case of the feature, which is to prevent mapping explosion.

Other conclusions from the discussion can be found on the meta-issue: #33003 (comment)

jsoriano · 2018-11-28T10:02:42Z

Would this allow storing objects with dots in their field names?

On Beats we have some cases where we would benefit from being able to store key-value objects (string to string) with dots in the keys, like in the subfields of docker.container.labels.* mentioned in the design issue, where dots are quite common. We'd like to be able to store events with fields like:

{
  "docker": {
    "container": {
      "labels": {
        "com.docker.swarm.task": "foo",
        "com.docker.swarm.task.id": "xxxxx",
        "com.docker.swarm.task.name": "yyyyy"
      }
    }
  }
}

On queries, in principle, these fields would be used only for filtering.

At the moment users face mapping errors when they try to store data like this, and the only workarounds they have are to replace the dots with other characters (we offer this "dedotting" in some places), or to rename and/or drop the conflicting fields. This is not a very good user experience (see this topic in discuss for example), and it makes them lose the original names of these labels, or completely discard some of them.

It'd be great if this new type could cover this case 🙂
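The conflict described above can be reproduced with a small model of how dotted names are expanded into nested objects. This is a sketch of the general mechanism, not the actual mapping code; `expand_dots` and `dedot` are hypothetical helpers, and the underscore is just one possible replacement character.

```python
def expand_dots(flat):
    """Expand dotted keys into nested dicts, failing on leaf/object clashes."""
    root = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = root
        for part in parents:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"'{part}' in '{dotted_key}' already holds a value")
            node = child
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"'{dotted_key}' is already an object")
        node[leaf] = value
    return root


def dedot(flat, replacement="_"):
    """Workaround: replace dots so every key stays a single leaf name."""
    return {key.replace(".", replacement): value for key, value in flat.items()}


labels = {
    "com.docker.swarm.task": "foo",
    "com.docker.swarm.task.id": "xxxxx",
    "com.docker.swarm.task.name": "yyyyy",
}
try:
    expand_dots(labels)
except ValueError as err:
    print("conflict:", err)
print(expand_dots(dedot(labels)))
```

Here `com.docker.swarm.task` would have to be both a string and an object, which is why the first call fails, while the dedotted keys no longer collide but also no longer match the original label names.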

jtibshirani · 2018-11-28T17:49:41Z

Hi @jsoriano, as currently designed the new field type would support this sort of data. For example, com.docker.swarm.task.name would be treated as a single key, and you could issue a simple query on the field labels.com.docker.swarm.task.name.

Also, you're not suggesting this in your comment, but just to be really clear -- this field type shouldn't be used as a general approach to handling dots in field names. It supports a much more restrictive set of search functionality than normal fields, and should only be used if it's the right fit for the particular data.

jsoriano · 2018-12-03T16:28:01Z

Hi, an update about its possible use in Beats after some conversations offline.

I have started a PR (elastic/beats#9286) with the changes that would be needed, and after trying it a little bit it works quite well for our case.
In principle we'd need to introduce these changes in 7.0, as they would be breaking. Alternative options for our problems with these fields would be breaking too (like "dedotting" by default), so we'd have to make a decision for 7.0 in any case.

There are two main cons to using this type already in 7.0 for labels:

1. It would be an experimental feature. But I think we could live with that; our current situation is quite prone to problems and this could be a good improvement in any case.
2. Lack of terms aggregations. Even if we don't use them directly in our solutions at the moment, this could become a show-stopper soon, as some users expect to be able to group by labels (see this topic for example).
   - We could live with workarounds (like renaming fields from Beats, or duplicating them with different types) until aggregations are implemented, but then it wouldn't be such an improvement for these cases.
   - We could delay the adoption of this type until terms aggregations are implemented, but this could mean a breaking change for us in 7.x, or having to wait until 8.0.

If in the end we don't use it for labels, we could still consider using this type for Kubernetes annotations. We are not storing them by default now to avoid loads of dynamic field mappings, and this type would help with that. Terms aggregations are also less of a requirement there.

I have also opened a discussion about the possible use of these fields in ECS (elastic/ecs#198).

clintongormley · 2018-12-03T16:39:59Z

@jsoriano Aggregations will not be implemented for queryable object fields. I don't think you should make plans based on this field type, it is too different from normal fields and will never support a number of features that you would expect, eg discoverability of the existence of the field via an API.

I think the correct way to deal with fields like:

        "com.docker.swarm.task": "foo",
        "com.docker.swarm.task.id": "xxxxx",
        "com.docker.swarm.task.name": "yyyyy"

is either to dedot them, or to rewrite com.docker.swarm.task to something like com.docker.swarm.task.main.

That way, these fields end up benefiting from all the features already supported.
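The renaming suggested above could look roughly like this. It is a sketch under assumptions: `rename_conflicting_keys` is a hypothetical helper, the `main` suffix follows the example in this comment, and the case where the renamed key already exists is not handled.

```python
def rename_conflicting_keys(labels, suffix="main"):
    """Append a suffix to any key that is also the parent path of another key."""
    renamed = {}
    for key, value in labels.items():
        # A key conflicts when another key continues it with a further dot.
        is_parent = any(other.startswith(key + ".") for other in labels)
        # Assumption: "<key>.<suffix>" is not already present in the input.
        renamed[f"{key}.{suffix}" if is_parent else key] = value
    return renamed


labels = {
    "com.docker.swarm.task": "foo",
    "com.docker.swarm.task.id": "xxxxx",
    "com.docker.swarm.task.name": "yyyyy",
}
print(rename_conflicting_keys(labels))
```

After the rename every key is a leaf, so the labels can be mapped as ordinary sub-fields.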

jtibshirani · 2018-12-04T23:22:09Z

> Aggregations will not be implemented for queryable object fields.

Just a note that while aggregations are not planned for the first version of the feature, I don't think they're out of the question, and I'm investigating if it'd be possible to support some simple aggregation types like terms.

jtibshirani · 2019-07-01T09:21:05Z

The initial version of the feature was merged in #42541 and backported to 7.3. I filed a new issue #43805 to track follow-up improvements.

clintongormley added :Search Foundations/Mapping Index mappings, including merging and defining field types discuss >feature labels Jun 20, 2017

clintongormley added help wanted adoptme >enhancement and removed discuss >enhancement labels Jun 30, 2017

colings86 assigned javanna Oct 31, 2017

colings86 removed the help wanted adoptme label Oct 31, 2017

javanna removed their assignment Nov 15, 2017

jtibshirani self-assigned this May 17, 2018

jtibshirani mentioned this issue Aug 20, 2018

Flattened object fields design + implementation #33003

Closed

11 tasks

colings86 mentioned this issue Oct 30, 2018

Enforce a limit on the depth of the JSON object. #35063

Merged

andrewkroh mentioned this issue Nov 28, 2018

Refactors for 7.0 (Breaking changes I want to make) elastic/beats#6106

Closed

20 tasks

jsoriano mentioned this issue Nov 29, 2018

Use new json type for labels and annotations elastic/beats#9286

Closed

roncohen mentioned this issue Dec 3, 2018

Investigate the use of Object Fields (JSONF) elastic/apm-server#1477

Closed

ruflin mentioned this issue Jan 30, 2019

Add timeseries.instance to allow automatic downsampling on query elastic/beats#10293

Merged

jtibshirani mentioned this issue May 24, 2019

Add support for 'flattened object' fields. #42541

Merged

3 tasks

jtibshirani closed this as completed Jul 1, 2019

Mpdreamz mentioned this issue Aug 7, 2019

[meta] 7.3 Release elastic/elasticsearch-net#4001

Closed

16 tasks

javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024
