[rollup] Why the documents of the rollup contain <field>.<agg_type>._count having the same values? #47876

Closed
lucabelluccini opened this issue Oct 10, 2019 · 4 comments
Labels: :StorageEngine/Rollup (Turn fine-grained time-based data into coarser-grained data), Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo))

Comments

@lucabelluccini
Contributor

Elasticsearch version: 7.4.0 (and previous)

Description of the problem including expected versus actual behavior:

The documents generated by a rollup job always contain, for each field used in the terms (and other groupings), a field named <field>.<agg_type>._count.
Within a given rollup document, the value of these fields is always the same.
Why isn't it written just once at the root of the document?

Steps to reproduce:

  1. Add the demo dataset kibana_sample_data_logs
  2. Tweak the data to contain an extra field (my_field):
POST kibana_sample_data_logs/_update_by_query
{
  "script": {
    "source": "ctx._source.my_field = 3991",
    "lang": "painless"
  },
  "query": {
    "term": {
      "extension": "deb"
    }
  }
}
  3. Run a rollup job that includes my_field
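
The report does not include the job definition. As a rough reconstruction from the field names in the result of this report, a job body along the following lines (sent with PUT _rollup/job/dasdsa) would produce documents of a similar shape. The rollup index name, the cron schedule, and the page size are assumptions; the job id, the 60m interval, and the histogram interval of 10 come from the result. The helper `expected_count_fields` is hypothetical and only applies the naming pattern visible in the result to list the `_count` fields such a job would write.

```python
import json


def build_rollup_job() -> dict:
    """Body for PUT _rollup/job/dasdsa (a reconstruction, not the reporter's actual job)."""
    return {
        "index_pattern": "kibana_sample_data_logs",
        "rollup_index": "rollup_logs",  # assumption: any index not matched by index_pattern
        "cron": "0 * * * * ?",          # assumption: trigger at second 0 of every minute
        "page_size": 1000,              # assumption
        "groups": {
            "date_histogram": {
                "field": "@timestamp",
                "fixed_interval": "60m",
                "time_zone": "UTC",
            },
            "terms": {
                "fields": [
                    "my_field",
                    "response.keyword",
                    "request.keyword",
                    "machine.os.keyword",
                ]
            },
            "histogram": {"fields": ["bytes", "my_field", "machine.ram"], "interval": 10},
        },
        "metrics": [
            {"field": "my_field", "metrics": ["min", "max", "sum", "avg", "value_count"]},
            {"field": "bytes", "metrics": ["min", "max", "sum", "avg", "value_count"]},
            {"field": "phpmemory", "metrics": ["sum", "value_count"]},
        ],
    }


def expected_count_fields(job: dict) -> list:
    """List the <field>.<agg_type>._count names, following the pattern seen in the result."""
    groups = job["groups"]
    names = [f'{groups["date_histogram"]["field"]}.date_histogram._count']
    for agg in ("terms", "histogram"):
        names += [f"{field}.{agg}._count" for field in groups.get(agg, {}).get("fields", [])]
    # In the result, only the avg metric carries its own _count.
    names += [f'{m["field"]}.avg._count' for m in job["metrics"] if "avg" in m["metrics"]]
    return names


if __name__ == "__main__":
    job = build_rollup_job()
    print(json.dumps(job, indent=2))
    print(len(expected_count_fields(job)), "_count fields per rollup document")
```

Every grouping adds one `_count` field per field it covers, so the number of repeated counts grows with the size of the job configuration.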

Result:

        "_source" : {
          "@timestamp.date_histogram.time_zone" : "UTC",
          "my_field.max.value" : 3991.0,
          "bytes.value_count.value" : 2.0,
          "@timestamp.date_histogram._count" : 2,
          "my_field.avg.value" : 7982.0,
          "phpmemory.value_count.value" : 0.0,
          "phpmemory.sum.value" : 0.0,
          "bytes.histogram.interval" : 10,
          "my_field.avg._count" : 2.0,
          "bytes.histogram.value" : 6210.0,
          "bytes.sum.value" : 12438.0,
          "bytes.min.value" : 6219.0,
          "my_field.terms.value" : 3991,
          "_rollup.id" : "dasdsa",
          "my_field.min.value" : 3991.0,
          "response.keyword.terms.value" : "200",
          "@timestamp.date_histogram.timestamp" : 1569715200000,
          "my_field.value_count.value" : 2.0,
          "bytes.max.value" : 6219.0,
          "machine.os.keyword.terms.value" : "win 8",
          "my_field.histogram._count" : 2,
          "my_field.histogram.value" : 3990.0,
          "@timestamp.date_histogram.interval" : "60m",
          "my_field.histogram.interval" : 10,
          "bytes.avg.value" : 12438.0,
          "machine.ram.histogram.value" : 8.58993459E9,
          "bytes.avg._count" : 2.0,
          "request.keyword.terms._count" : 2,
          "request.keyword.terms.value" : "/elasticsearch/elasticsearch-6.3.2.deb",
          "my_field.terms._count" : 2,
          "my_field.sum.value" : 7982.0,
          "bytes.histogram._count" : 2,
          "_rollup.version" : 2,
          "machine.os.keyword.terms._count" : 2,
          "machine.ram.histogram._count" : 2,
          "response.keyword.terms._count" : 2,
          "machine.ram.histogram.interval" : 10
        }
      },
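
To check the observation on any rollup document, a small sketch like the one below collects every key ending in `._count` and reports the distinct values. The function names are hypothetical, and `sample` is a hand-copied subset of the result above, not a full document.

```python
def count_values(source: dict) -> dict:
    """Return only the <field>.<agg_type>._count entries of a rollup document."""
    return {key: value for key, value in source.items() if key.endswith("._count")}


def distinct_counts(source: dict) -> set:
    """Distinct count values; 2 and 2.0 compare equal and collapse to one entry."""
    return set(count_values(source).values())


# Subset of the _source shown above.
sample = {
    "@timestamp.date_histogram._count": 2,
    "my_field.avg._count": 2.0,
    "my_field.terms._count": 2,
    "bytes.histogram._count": 2,
    "my_field.sum.value": 7982.0,
    "_rollup.id": "dasdsa",
}

if __name__ == "__main__":
    print(len(count_values(sample)), "count fields,", len(distinct_counts(sample)), "distinct value(s)")
```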
@lucabelluccini added the :StorageEngine/Rollup label on Oct 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Rollup)

@rjernst added the Team:Analytics label on May 4, 2020
@rtyley

rtyley commented Jul 8, 2020

I see that @polyfractal mentioned in #45187 that Rollup 'saves all the metrics "in isolation" without seeing if a different metric provides the same value' and that the situation could be improved.

Here's an example where we see the same count repeated over and over again, taking up almost 50% of the rolled-up document size:

{
  "pageview.location.countryCode.terms._count": 9,
  "pageview.indexedTags.tag2.terms._count": 9,
  "pageview.indexedTags.tag0.terms.value": null,
  "pageview.indexedTags.tag4.terms._count": 9,
  "pageview.attentionBucket.terms._count": 9,
  "pageview.indexedTags.tag1.terms.value": null,
  "pageview.dt.date_histogram.time_zone": "UTC",
  "pageview.dt.date_histogram.timestamp": 1592789400000,
  "pageview.userAgent.mobileClass.terms.value": null,
  "pageview.indexedTags.tag0.terms._count": 9,
  "pageview.userAgent.rendererInfo.rendererType.terms.value": null,
  "_rollup.id": "pageviews_historical_with_renderer_mobile_info_and_10_tags_job",
  "pageview.section.terms._count": 9,
  "pageview.userAgent.rendererInfo.thirdPartyRenderer.terms._count": 9,
  "pageview.referrer.site.terms._count": 9,
  "pageview.indexedTags.tag6.terms._count": 9,
  "pageview.indexedTags.tag5.terms._count": 9,
  "pageview.userAgent.isMobileAsString.terms._count": 9,
  "pageview.indexedTags.tag7.terms._count": 9,
  "pageview.userAgent.rendererInfo.nonWebRenderer.terms.value": null,
  "_rollup.version": 2,
  "pageview.location.countryCode.terms.value": null,
  "pageview.userAgent.rendererInfo.guardianNativeAppFamily.terms.value": null,
  "pageview.indexedTags.tag9.terms._count": 9,
  "pageview.indexedTags.tag8.terms._count": 9,
  "pageview.userAgent.rendererInfo.nonWebRenderer.terms._count": 9,
  "pageview.userAgent.rendererInfo.rendererType.terms._count": 9,
  "pageview.contentType.terms._count": 9,
  "pageview.userAgent.mobileClass.terms._count": 9,
  "pageview.contentType.terms.value": null,
  "pageview.indexedTags.tag3.terms._count": 9,
  "pageview.indexedTags.tag1.terms._count": 9,
  "pageview.productionOffice.terms.value": null,
  "pageview.indexedTags.tag7.terms.value": null,
  "pageview.userAgent.rendererInfo.thirdPartyRenderer.terms.value": null,
  "pageview.userAgent.rendererInfo.guardianNativeAppFamily.terms._count": 9,
  "pageview.productionOffice.terms._count": 9,
  "pageview.indexedTags.tag8.terms.value": null,
  "pageview.indexedTags.tag9.terms.value": null,
  "pageview.attentionBucket.terms.value": 7,
  "pageview.indexedTags.tag2.terms.value": null,
  "pageview.referrer.site.terms.value": null,
  "pageview.dt.date_histogram.interval": "10m",
  "pageview.indexedTags.tag3.terms.value": null,
  "pageview.indexedTags.tag6.terms.value": null,
  "pageview.dt.date_histogram._count": 9,
  "pageview.indexedTags.tag4.terms.value": null,
  "pageview.path.terms.value": null,
  "pageview.indexedTags.tag5.terms.value": null,
  "pageview.section.terms.value": null,
  "pageview.userAgent.isMobileAsString.terms.value": null,
  "pageview.path.terms._count": 9
}
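
One way to put a number on that overhead is to compare the serialized size of the `._count` entries with the size of the whole document. The sketch below measures compact JSON text only, which is an assumption about what "document size" means here; it says nothing about on-disk index size. `count_share` is a hypothetical helper, and `sample` is a small subset of the document above.

```python
import json


def count_share(source: dict) -> float:
    """Fraction of the compact JSON text that is taken up by ._count entries."""
    total = len(json.dumps(source, separators=(",", ":")))
    count_chars = 0
    for key, value in source.items():
        if key.endswith("._count"):
            # "key":value plus one separating comma
            count_chars += len(json.dumps(key)) + 1 + len(json.dumps(value)) + 1
    return count_chars / total


# Subset of the document shown above.
sample = {
    "pageview.section.terms._count": 9,
    "pageview.section.terms.value": None,
    "pageview.path.terms._count": 9,
    "pageview.path.terms.value": None,
    "pageview.dt.date_histogram._count": 9,
    "pageview.dt.date_histogram.timestamp": 1592789400000,
}

if __name__ == "__main__":
    print(f"{count_share(sample):.0%} of the JSON text is ._count entries")
```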

There is some duplication (e.g. sum.value and avg.value) which we could improve. Today Rollup saves all the metrics "in isolation" without seeing if a different metric provides the same value. There's an issue for that here: #47876

@polyfractal
Contributor

As an aside, this is something that should be drastically improved in the Rollup V2 refactor we're working on. We're introducing a dedicated "doc_count" field mapper, which will store the count once for the whole doc instead of duplicating it repeatedly as you see with v1.
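
To illustrate the idea only (this is not the Rollup V2 design, which is not shown in this thread), the sketch below collapses the per-grouping counts of a v1-style document into a single `doc_count` key. The key name is borrowed from the comment above; the function and the list of grouping aggregations are hypothetical. The `avg._count` entries are left in place, on the assumption that they count values of that field rather than documents in the bucket.

```python
GROUP_AGGS = ("terms", "histogram", "date_histogram")


def collapse_group_counts(source: dict, target: str = "doc_count") -> dict:
    """Replace identical per-grouping ._count entries with one document-level count."""
    group_keys = [
        key
        for key in source
        if key.endswith("._count") and key.split(".")[-2] in GROUP_AGGS
    ]
    values = {source[key] for key in group_keys}
    if len(values) != 1:
        # Nothing to collapse, or the counts disagree: return an unchanged copy.
        return dict(source)
    slim = {key: value for key, value in source.items() if key not in group_keys}
    slim[target] = values.pop()
    return slim


if __name__ == "__main__":
    doc = {
        "@timestamp.date_histogram._count": 2,
        "my_field.terms._count": 2,
        "my_field.histogram._count": 2,
        "my_field.avg._count": 2.0,
        "my_field.avg.value": 7982.0,
    }
    print(collapse_group_counts(doc))
```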

@wchaparro
Member

With the 8.7 release of Elasticsearch, the new downsampling capability, which is associated with the new time series data streams functionality, is generally available (GA). This capability had been in tech preview in ILM since 8.5. Downsampling provides a method to reduce the footprint of your time series data by storing it at reduced granularity. The downsampling process rolls up documents within a fixed time interval into a single summary document. Each summary document includes statistical representations of the original data: the min, max, sum, value_count, and average for each metric. Data stream time series dimensions are stored unchanged.
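
As a simplified model of the summary documents described above (not the Elasticsearch implementation), the sketch below buckets the samples of a single metric from one time series into fixed intervals and keeps min, max, sum, and value_count per bucket. Dimensions are left out, and the function names are hypothetical. The average is derived from sum and value_count rather than stored.

```python
from collections import defaultdict


def downsample(points, interval_ms: int) -> dict:
    """points: iterable of (timestamp_ms, value). Returns {bucket_start_ms: summary}."""
    buckets = defaultdict(list)
    for timestamp_ms, value in points:
        # Align each sample to the start of its fixed interval.
        buckets[timestamp_ms - timestamp_ms % interval_ms].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "value_count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


def average(summary: dict) -> float:
    """Average recovered from the stored sum and value_count."""
    return summary["sum"] / summary["value_count"]


if __name__ == "__main__":
    hour = 3_600_000
    samples = [(0, 1.0), (1_000, 3.0), (hour, 5.0)]
    for start, summary in downsample(samples, hour).items():
        print(start, summary, average(summary))
```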

Downsampling is superior to rollup because:

  • Downsampled indices are searched through the _search API
  • It is possible to query multiple downsampled indices together with raw data indices
  • The pre-aggregation is based on the metrics and time series definitions in the index mapping, so very little configuration is required (i.e. it is much easier to add new time series)
  • Downsampling is managed as an action in ILM
  • It is possible to downsample a downsampled index, and reduce granularity as the index ages
  • The performance of the pre-aggregation process is superior in downsampling, as it builds on the time_series index mode infrastructure

Because of the introduction of this new capability, we are deprecating the rollups functionality, which never left the Tech Preview/Experimental status, in favor of downsampling and thus we are closing this issue. We encourage you to migrate your solution to downsampling and take advantage of the new TSDB functionality.
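
For readers migrating, a minimal sketch of an ILM policy body that downsamples in two steps as the index ages could look like the following. The `downsample` action with a `fixed_interval` setting is the ILM action referred to above; the phase ages, the intervals, and the helper names are assumptions chosen for illustration. The check that the later interval is a multiple of the earlier one reflects my understanding of the requirement for downsampling an already downsampled index.

```python
import json
import re

_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def to_ms(interval: str) -> int:
    """Convert a fixed interval such as '1h' or '30m' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", interval)
    if match is None:
        raise ValueError(f"unsupported interval: {interval!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def build_policy(warm_interval: str = "1h", cold_interval: str = "1d") -> dict:
    """Body for an ILM policy that downsamples twice (illustrative values only)."""
    if to_ms(cold_interval) % to_ms(warm_interval) != 0:
        raise ValueError("cold interval must be a multiple of the warm interval")
    return {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_age": "1d"}}},
                "warm": {
                    "min_age": "7d",  # assumption
                    "actions": {"downsample": {"fixed_interval": warm_interval}},
                },
                "cold": {
                    "min_age": "30d",  # assumption
                    "actions": {"downsample": {"fixed_interval": cold_interval}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(), indent=2))
```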

@wchaparro closed this as not planned on Jun 23, 2023