Support to get just populated fields from a Kibana index pattern #100779

Closed
walterra opened this issue May 27, 2021 · 6 comments
Labels
enhancement New value added to drive a business result Feature:Data Views Data Views code and UI - index patterns before 8.0 impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:medium Medium Level of Effort

Comments

@walterra
Contributor

Follow up to #78590 and #98259.

To reduce the number of fields passed on to components like the data grid for large indices such as filebeat, we implemented custom code that retrieves a random sample of documents and finds out which fields are actually populated.

For example, for an out of the box metricbeat index, this reduces the list of passed on fields from 3000+ to ~120 fields.

This has both usability and "work-around" reasons. Some React components we consume (for example the data grid's dropdown to select visible columns) aren't well optimized for a large number of fields and slow down pages. Additionally, indices with lots of fields might contain empty ones, depending on the use case. A user might have a hard time finding fields that actually contain data by trial and error.

It would be great if the Kibana index pattern could expose a method getPopulatedFields() that encapsulates functionality like this.
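
As a rough illustration of the sampling idea, here is a minimal sketch that assumes the sampled documents are already available as plain objects (for example the `_source` of search hits). The name `getPopulatedFieldNames` and its signature are hypothetical and not an existing Kibana API. It flattens nested objects into dotted paths and treats `null`, `undefined`, and empty arrays as unpopulated.

```typescript
// Hypothetical helper, not part of Kibana: collects dotted field paths
// that hold at least one value somewhere in a sample of documents.
type Doc = Record<string, unknown>;

function isPlainObject(value: unknown): value is Doc {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function collectPopulated(value: unknown, path: string, out: Set<string>): void {
  if (value === null || value === undefined) return;
  if (Array.isArray(value)) {
    // An array counts as populated only if one of its elements is populated.
    for (const item of value) collectPopulated(item, path, out);
    return;
  }
  if (isPlainObject(value)) {
    for (const key of Object.keys(value)) {
      collectPopulated(value[key], path ? `${path}.${key}` : key, out);
    }
    return;
  }
  // Leaf value (string, number, boolean): the field is populated.
  if (path) out.add(path);
}

function getPopulatedFieldNames(sampleDocs: Doc[]): string[] {
  const populated = new Set<string>();
  for (const doc of sampleDocs) collectPopulated(doc, '', populated);
  return Array.from(populated).sort();
}
```

The result could then be intersected with the index pattern's field list before handing fields to components such as the data grid.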

This feature is related to the discussion in #95558.

@botelastic botelastic bot added the needs-team Issues missing a team label label May 27, 2021
@peteharverson peteharverson added Feature:Data Views Data Views code and UI - index patterns before 8.0 and removed needs-team Issues missing a team label labels May 27, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label May 27, 2021
@peteharverson peteharverson added enhancement New value added to drive a business result Team:AppServices labels May 27, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-app-services (Team:AppServices)

@botelastic botelastic bot removed the needs-team Issues missing a team label label May 27, 2021
@mattkime
Contributor

@timroes @flash1293 This sounds similar to what you're doing in discover and lens. We should verify that the needs are the same and find a shared solution.

@flash1293
Contributor

flash1293 commented May 27, 2021

We just talked about this a bit and I think the index filter is a bit different in what it does.

field caps index filter

  • No false-negatives (if field caps says a field doesn't exist, it's guaranteed to not show up in results)
  • false-positives (it's always possible a field is reported as available, but doesn't hold any data for the current time range)
  • works very well for the "default way" data is indexed in data streams (per index most fields always hold data)
  • doesn't work well for messy unorganized mappings which grew organically over time and contain obsolete definitions

sample documents

  • No false-positives (if there are values in the sample documents, there will be at least some results)
  • false-negatives (the sample documents might not include a field that other documents do contain, so queries on it would still return data)
  • works well if documents are relatively homogeneous (high chance to get a good sample)
  • doesn't work well for some special cases (e.g. just started ingesting a new field in a large index and only a few documents have it yet, but the user knows for sure there is some data)

Suggestion (for places which use a form of document sampling today)

Given these pros and cons of the approaches, I don't think simply switching over to field caps index filters instead of sampling documents is a viable approach. In very common real-world cases (there is just a single mapping and it contains many more fields than necessary), the outcome would be much worse.

There is however additional information in the field caps index filter - whether or not it's even possible there is any data in fields.

One option would be to do both - sample some documents and query the field caps API with an index filter to get three categories of fields:

  • Available - field caps and sample documents confirm these fields hold data
  • (Probably) empty - field caps reports this field as part of the mapping of the current indices, but there was no data in the sample documents
  • Definitely empty - field caps didn't include this in the index-filtered response - the field is in one of the mappings of the entire index pattern, but not in the indices selected by the current time range and filter

The app could use these three categories to power the UI, e.g. in Lens:

  • Available - show prominently
  • (Probably) empty - show de-emphasized (collapsed by default or sorted to the bottom of the list)
  • Definitely empty - don't show in fields list at all, but don't treat their use in a config as an error
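
The three categories above can be sketched as a small classification step. This is only an illustration under stated assumptions: `classifyFields`, the `FieldAvailability` type, and the three input lists are hypothetical names, not Kibana APIs. The inputs are assumed to be all fields mapped in the index pattern, the fields returned by an index-filtered field caps request, and the fields found populated in the sampled documents.

```typescript
// Hypothetical sketch: combine field caps (with index filter) and
// document sampling into three availability categories.
type FieldAvailability = 'available' | 'probably_empty' | 'definitely_empty';

function classifyFields(
  allMappedFields: string[], // every field in the mappings of the index pattern
  indexFilteredFields: string[], // fields reported by field caps with an index filter
  sampledFields: string[] // fields that held values in the sampled documents
): Record<string, FieldAvailability> {
  const filtered = new Set(indexFilteredFields);
  const sampled = new Set(sampledFields);
  const result: Record<string, FieldAvailability> = {};

  for (const field of allMappedFields) {
    if (sampled.has(field)) {
      // Sampling has no false positives, so a sampled value is conclusive.
      result[field] = 'available';
    } else if (filtered.has(field)) {
      // In the mapping of a matching index, but not seen in the sample.
      result[field] = 'probably_empty';
    } else {
      // Field caps has no false negatives, so the field cannot hold data here.
      result[field] = 'definitely_empty';
    }
  }
  return result;
}
```

Checking the sample first is a design choice of this sketch: a field seen in the sample is treated as available even if the field caps list passed in happens to be incomplete.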

@rayafratkina
Contributor

Thanks for the detailed notes, @flash1293
One clarification: you mentioned for field caps index filter

false-positives (it's always possible a field is reported as available, but doesn't hold any data for the current time range)

Can you explain why that is? Is the filtering not respecting all the criteria (including date range)?

@flash1293
Contributor

flash1293 commented Jun 2, 2021

@rayafratkina The field caps API is not checking individual documents for values - it operates on the mappings. This means if there is an index which includes a field in its mapping, field caps will report this field and Kibana will show it even if there isn't a single document which actually has a value indexed for this field (which means it's useless for most purposes).

The "index filter" aspect is about only checking the mappings of indices which are known to have data for certain filters, based on index-level metadata. This is an optimization Elasticsearch uses to not query indices unnecessarily - e.g. the index metadata stores the minimum and maximum date of any document in the index, so it's possible to exclude indices (and the fields specified in their mappings) without looking at the data itself. The same is done for different datasets (e.g. separate indices for system metrics vs. apache metrics, see https://www.elastic.co/fr/blog/an-introduction-to-the-elastic-data-stream-naming-scheme), so in some cases it's possible to drastically reduce the number of fields relative to all fields in all mappings matching the whole index pattern.

Coming back to your question, false positives can happen because the granularity of the filter is limited to indices instead of individual documents. But AFAIK it's also not possible to reliably exclude indices for all kinds of filters: date ranges and filters on constant keyword fields definitely work, but I think most other types of filters are simply ignored in this case.

@jimczi can definitely explain this better.
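
For reference, a field caps request with an index filter might be built roughly like this. The `index_filter` body parameter and the `_field_caps` endpoint are part of the Elasticsearch field capabilities API; the helper name `buildFieldCapsRequest`, the `FieldCapsRequest` shape, and the example values are assumptions made for this sketch. It only constructs the request and does not send it.

```typescript
// Hypothetical helper: builds a field caps request that only considers
// indices which may contain documents in the given time range.
interface FieldCapsRequest {
  method: 'POST';
  path: string;
  body: {
    index_filter: {
      range: Record<string, { gte: string; lte: string }>;
    };
  };
}

function buildFieldCapsRequest(
  indexPattern: string, // e.g. 'metricbeat-*'
  timeField: string, // e.g. '@timestamp'
  from: string, // e.g. 'now-15m'
  to: string // e.g. 'now'
): FieldCapsRequest {
  return {
    method: 'POST',
    path: `/${indexPattern}/_field_caps?fields=*`,
    body: {
      // Indices that cannot match this range are skipped entirely, so
      // fields that only exist in their mappings are not reported.
      index_filter: {
        range: { [timeField]: { gte: from, lte: to } },
      },
    },
  };
}
```

Because the filter works at index granularity, the response can still list fields that have no values in any matching document.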

@exalate-issue-sync exalate-issue-sync bot added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Jun 21, 2021
@exalate-issue-sync exalate-issue-sync bot added loe:medium Medium Level of Effort and removed loe:small Small Level of Effort labels Apr 4, 2022
@ppisljar
Member

resolved by #121367

7 participants