
Add better support for metric data types (TSDB) #74660

Closed
imotov opened this issue Jun 28, 2021 · 11 comments
Assignees
Labels
>feature Meta :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

imotov (Contributor) commented Jun 28, 2021

Phase 0 - Inception

  • Obtain schemas annotated with dimensions and metrics from the Metrics team (small) @nik9000
  • Prototyping the Lucene data pull mechanism (medium) @imotov
  • Prototyping the data pull mechanism in Elasticsearch @imotov

Phase 1 - Mappings

Phase 2 - Ingest

Phase 2.1 Ingest follow ups

- [ ] Build the _id from dimension values
- [ ] Investigate moving timestamp to the front of the _id to automatically get an optimization on _id searches. Not sure if worth it - but possible. #84928 could be an alternative

  • Bring back something in the spirit of the append-only optimization, but that works for tsdb. That'd greatly improve write performance. Extract append-only optimization from Engine #84771 is a partial prototype
  • We store the _id in lucene stored fields. We could regenerate it from the _source or from doc values for the @timestamp and the _tsid. That'd save some bytes per document.
  • Move IndexRequest#autoGeneratId? It's a bit spooky where it is, but I don't like it in any other place either.
  • Improve error messages in _update_by_query when modifying the dimensions or @timestamp
  • On translog replay, during recovery, and on replicas we regenerate the _id and assert that it matches the _id from the primary. Should we? Probably. Let's make sure.
  • Add tsdb benchmarks to the nightlies
    - [ ] Document best practices for using dimensions-based ID generator including how to use this with component templates
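The two _id items above (building the _id from dimension values, and putting the timestamp at the front) can be sketched roughly as follows. This is a hypothetical illustration only, not the actual Elasticsearch implementation; the hashing scheme, field layout, and function name are assumptions:

```python
import base64
import hashlib
import struct


def build_tsdb_id(dimensions: dict, timestamp_millis: int) -> str:
    """Derive a document _id from the dimension values plus the timestamp.

    Putting the timestamp first means documents in the same time range share
    an _id prefix, which is the optimization idea discussed above.
    """
    # Hash the sorted dimension name/value pairs into a stable _tsid-like value.
    digest = hashlib.sha256()
    for name in sorted(dimensions):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(str(dimensions[name]).encode("utf-8"))
        digest.update(b"\0")
    tsid = digest.digest()[:16]
    # Timestamp first, big-endian, so the raw bytes sort in time order.
    raw = struct.pack(">q", timestamp_millis) + tsid
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


doc_id = build_tsdb_id({"host": "a1", "pod": "web-0"}, 1_656_400_000_000)
```

The key property is determinism: the same dimension values and timestamp always produce the same _id, which is what makes regenerating the _id from doc values (instead of storing it) possible.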

Phase 3.1 QL storage API (Postponed)

Phase 3.2 - Search MVP

Plans for time series support in the _search API are superseded by plans for this in ES|QL.

Phase 3.3 - Rollup / Downsampling

Phase 3.4 - TSID aggs (superseded by tsdb in ES|QL)

~~- [ ] Update min, max, sum, avg pipeline aggs for intermediate result filtering optimization~~
~~- [ ] Sliding window aggregation~~
~~- [ ] A way to filter to windows within the sliding window. Like "measurements taken in the last 30 seconds of the window".~~
~~- [ ] Open transform issue for newly added time series aggs~~
~~- [ ] Benchmarks for the tsid agg~~

Phase 3.5 - Downsampling follow ups

  • Handling histograms
  • SQL support for downsampling

Phase 4.0 - Compression

Phase 5.0 - Follow-ups and Nice-to-have-s

  • Default the setting's value to all of the keyword dimensions
  • Support shard splitting on time_series indices
  • Make an object or interface for _id's values. Right now it's a String that we encode with Uid.encodeId. That was reasonable. Maybe it still is. But it feels complex, especially for tsdb, whose _id is always some bytes. And encoding it also wastes a byte about 1/128 of the time. It's a common prefix byte, so this is probably not really an issue. But still. This is a big change, but it'd make ES easier to read. Probably wouldn't really improve the storage though.
  • Figure out how to specify tsdb settings in component templates. For example, index.routing_path can be specified in a composable index template if the data stream template's index.mode is set to time_series. But if this setting is specified in a component template, then it is required to also set the index.mode index setting. This feels backwards. @martijnvg
  • In order to retrieve the routing values (defined in index.routing_path), the source needs to be parsed on the coordinating node. However, in the case that an ingest pipeline is executed, the source of the document will be parsed a second time. Ideally the routing values should be extracted when ingest is performed, similar to how the @timestamp field is already retrieved from a document during pipeline execution.
  • In order to determine the backing index a document should be added to, its timestamp is parsed into an Instant. The format being used is: strict_date_optional_time_nanos||strict_date_optional_time||epoch_millis. This allows the regular date format, the date nanos format, and epoch millis defined as a string. We can optimise the date parsing if we know the exact format being used. For example, if the data stream has a parameter that indicates the exact date format, we can optimise parsing by using just strict_date_optional_time_nanos, strict_date_optional_time, or epoch_millis.
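The format-fallback idea in the last item can be sketched in a few lines of Python. This is not the actual Java parsing in Elasticsearch (and nanosecond precision is dropped, since Python datetimes stop at microseconds); it just illustrates why knowing the exact format up front avoids the try-each-format-in-order cost:

```python
from datetime import datetime, timezone

# Date formats tried in order, most specific first (a rough stand-in for
# strict_date_optional_time_nanos || strict_date_optional_time).
FORMATS = ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z")


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp the way the fallback chain above describes."""
    if value.isdigit():
        # epoch_millis supplied as a string of digits
        return datetime.fromtimestamp(int(value) / 1000.0, tz=timezone.utc)
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unparseable timestamp: {value}")
```

If the exact format were known per data stream, the loop (and the exception-driven fallback) could be replaced with a single direct parse.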
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 29, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (Team:Analytics)

csoulios added a commit that referenced this issue Jul 9, 2021
This PR adds the following constraints to dimension fields:

    It must be an indexed field and must have doc values
    It cannot be multi-valued
    The number of dimension fields in the index mapping must not be more than 16. This should be configurable through an index property (index.mapping.dimension_fields.limit)
    keyword fields cannot be more than 1024 bytes long
    keyword fields must not use a normalizer

Based on the code added in PR #74450
Relates to #74660
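For illustration, a mapping that satisfies these constraints might look like the following. The index and field names are hypothetical, and the parameter is shown under its later name time_series_dimension (it was still called dimension when this commit landed):

```json
PUT my-tsdb-index
{
  "settings": {
    "index.mapping.dimension_fields.limit": 16
  },
  "mappings": {
    "properties": {
      "host": { "type": "keyword", "time_series_dimension": true },
      "pod":  { "type": "keyword", "time_series_dimension": true }
    }
  }
}
```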
csoulios added a commit that referenced this issue Sep 27, 2021
…rameters (#78265)

Backports the following PRs:

* Add dimension mapping parameter (#74450)

Added the dimension parameter to the following field types:

    keyword
    ip
    Numeric field types (integer, long, byte, short)

The dimension parameter is of type boolean (default: false) and is used
to mark that a field is a time series dimension field.

Relates to #74014

* Add constraints to dimension fields (#74939)

This PR adds the following constraints to dimension fields:

    It must be an indexed field and must have doc values
    It cannot be multi-valued
    The number of dimension fields in the index mapping must not be more than 16. This should be configurable through an index property (index.mapping.dimension_fields.limit)
    keyword fields cannot be more than 1024 bytes long
    keyword fields must not use a normalizer

Based on the code added in PR #74450
Relates to #74660

* Expand DocumentMapperTests (#76368)

Adds a test for setting the maximum number of dimensions setting and
tests the names and types of the metadata fields in the index.
Previously we just asserted the count of metadata fields. That made it
hard to read failures.

* Fix broken test for dimension keywords (#75408)

Test was failing because it was testing a 1024-byte-long keyword and the assertion was failing.

Closes #75225

* Checkstyle

* Add time_series_metric parameter (#76766)

This PR adds the time_series_metric parameter to the following field types:

    Numeric field types
    histogram
    aggregate_metric_double

* Rename `dimension` mapping parameter to `time_series_dimension` (#78012)

This PR renames dimension mapping parameter to time_series_dimension to make it consistent with time_series_metric parameter (#76766)

Relates to #74450 and #74014

* Add time series params to `unsigned_long` and `scaled_float` (#78204)

    Added the time_series_metric mapping parameter to the unsigned_long and scaled_float field types
    Added the time_series_dimension mapping parameter to the unsigned_long field type

Fixes #78100

Relates to #76766, #74450 and #74014

Co-authored-by: Nik Everett <nik9000@gmail.com>
imotov added a commit to imotov/elasticsearch that referenced this issue Oct 6, 2021
Exposes information about dimensions and metrics via field caps. This
information will be needed for PromQL support.

Relates to elastic#74660
imotov added a commit that referenced this issue Oct 13, 2021
Exposes information about dimensions and metrics via field caps. This
information will be needed for PromQL support.

Relates to #74660
@jrodewig jrodewig self-assigned this Nov 5, 2021
imotov added a commit that referenced this issue Nov 11, 2021
Adds basic support for selectors in TimeSeriesMetricsService

Relates to #74660
csoulios added a commit that referenced this issue Sep 7, 2022
This PR renames all public APIs for downsampling so that they contain the downsample
keyword instead of the rollup keyword that we had until now.

1. The API endpoint for the downsampling action is renamed to:

/source-index/_downsample/target-index

2. The ILM action is renamed to

PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "warm": {
        "actions": {
          "downsample": {
            "fixed_interval": "24h"
          }
        }
      }
    }
  }
}

3.  unsupported_aggregation_on_rollup_index was renamed to unsupported_aggregation_on_downsampled_index

4. Internal transport actions were renamed:

    indices:admin/xpack/rollup -> indices:admin/xpack/downsample
    indices:admin/xpack/rollup_indexer -> indices:admin/xpack/downsample_indexer

5. Renamed the following index settings:

    index.rollup.source.uuid -> index.downsample.source.uuid
    index.rollup.source.name -> index.downsample.source.name
    index.rollup.status -> index.downsample.status

Finally, we renamed many internal variables and classes from *Rollup* to *Downsample*.
However, this effort will be completed in more than one PR so that we minimize conflicts with other in-flight PRs.

Relates to #74660
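For reference, the renamed endpoint from point 1 is invoked like this (source-index and target-index are placeholders, as in the commit message):

```json
POST /source-index/_downsample/target-index
{
  "fixed_interval": "1h"
}
```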
csoulios added a commit that referenced this issue Sep 15, 2022
This PR modifies downsampling operation so that it uses global ordinal to track tsid changes

PR depends on the work done in #90035

Relates to #74660
csoulios added a commit that referenced this issue Sep 19, 2022
This PR removes the feature flag for the time-series data ingestion and downsampling functionality,
making time-series indices and downsampling available

For more information about the released functionality, see #74660

Aggregation time_series still remains behind the feature flag
@oatkiller commented:

Sorry for my newbie question. Is this the same as https://www.elastic.co/guide/en/elasticsearch///reference/master/tsds.html ? Thanks

jrodewig (Contributor) commented Sep 19, 2022

@oatkiller Yes. It's the same thing. :) (I helped write the initial docs.)

Hope you enjoy Elastic! It's an awesome place to work.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Nov 8, 2022
Currently, the key is a map, which can make reducing large responses
more memory intensive than it should be. Also, the data structures
used during the reduce are not backed by big arrays, so they are not accounted for.

This commit changes how the key is represented internally,
by using a BytesRef instead of a Map. This commit doesn't change
how the key is represented in the response.
It also changes the reduce method to make use of the fact that
bucket keys are now bytes refs.

Relates to elastic#74660
juliaElastic (Contributor) commented:

@martijnvg Hi! I am working on a feature in the Fleet UI to enable the TSDB index setting, and trying to leave routing_path empty to rely on Elasticsearch's auto generation.

I'm getting this error when trying to set index.mode=time_series, tried on index template and also component template. Is there any way to work around this error and trigger the auto generation? Thanks!

   "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "[index.mode=time_series] requires a non-empty [index.routing_path]"

martijnvg (Member) commented:

Hey @juliaElastic, can you point me to the composable index templates and component templates? Composable index templates are the place where this setting can be used. Typically with component templates, not all settings / mappings are present, and each component template needs to be valid on its own. So if the index.mode index setting has been specified in one component template, and the mappings or index.routing_path are in another component or composable index template, then storing the component template with the index.mode index setting fails: during validation it isn't valid on its own, because the index.mode index setting validation fails when there is no index.routing_path. Also, in the case index.routing_path is missing, the auto generation of the index.routing_path setting is only performed for composable index templates.
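A minimal sketch of a composable index template that passes this validation, keeping index.mode and index.routing_path together in the same template (the template name and field names are hypothetical):

```json
PUT _index_template/metrics-template
{
  "index_patterns": ["metrics-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["host", "pod"]
    },
    "mappings": {
      "properties": {
        "host": { "type": "keyword", "time_series_dimension": true },
        "pod":  { "type": "keyword", "time_series_dimension": true }
      }
    }
  }
}
```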

juliaElastic (Contributor) commented Nov 10, 2022

I've tried to add it to the integrations index template here:
(screenshot)

As discussed on slack, the setting works fine on installing a package, and the routing_path is generated correctly on the data stream:
(screenshot)

I did see some errors when trying to add TSDB on existing templates, will check that again.

martijnvg added a commit that referenced this issue Nov 10, 2022
Currently, the key is a map, which can make reducing large responses
more memory intensive than it should be. Also, the map used during the
reduce to detect duplicate buckets is not taken into account by the circuit breaker.
This map can become very large when reducing large shard level responses.

This commit changes how the key is represented internally,
by using a BytesRef instead of a Map. This commit doesn't change
how the key is represented in the response. The reduce is also
changed to merge the shard responses without creating intermediate
data structures for detected duplicate buckets. This is possible
because the buckets in the shard level responses are sorted by tsid.

Relates to #74660
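The merge described in this commit can be sketched in a few lines. This is not the Java implementation, just the idea: assuming each shard response is a list of (tsid, doc_count) pairs sorted by tsid, duplicates can be collapsed during a streaming merge, with no intermediate map:

```python
import heapq


def reduce_sorted_buckets(shard_responses):
    """Merge shard-level bucket lists, each sorted by tsid, summing doc
    counts for duplicate tsids without building an intermediate map."""
    merged = []
    # heapq.merge streams the k sorted lists in globally sorted order,
    # so equal tsids arrive adjacently and can be combined on the fly.
    for tsid, doc_count in heapq.merge(*shard_responses):
        if merged and merged[-1][0] == tsid:
            merged[-1] = (tsid, merged[-1][1] + doc_count)
        else:
            merged.append((tsid, doc_count))
    return merged
```

Because duplicates are detected by adjacency rather than by lookup, peak memory is bounded by the output size instead of by a map over all distinct tsids.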
juliaElastic (Contributor) commented Nov 15, 2022

@martijnvg So I managed to add the "index.mode=time_series" setting without routing_path to the metrics-system.cpu index template without an issue; however, I am running into an error when trying to modify the component template metrics-system.cpu@custom, which is the parent of the index template.

Is there any workaround for this issue?

{
  "name": "ResponseError",
  "meta": {
    "body": {
      "error": {
        "root_cause": [
          {
            "type": "illegal_argument_exception",
            "reason": "updating component template [metrics-system.cpu@custom] results in invalid composable template [metrics-system.cpu] after templates are merged"
          }
        ],
        "type": "illegal_argument_exception",
        "reason": "updating component template [metrics-system.cpu@custom] results in invalid composable template [metrics-system.cpu] after templates are merged",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "[index.mode=time_series] requires a non-empty [index.routing_path]"
        }
      },
      "status": 400
    },
    "statusCode": 400,
    "headers": {
      "x-opaque-id": "59e2d33e-d6c8-4ed4-8d4a-14c412f64871;kibana::management:",
      "x-elastic-product": "Elasticsearch",
      "content-type": "application/json;charset=utf-8",
      "content-length": "549"
    },
    "meta": {
      "context": null,
      "request": {
        "params": {
          "method": "PUT",
          "path": "/_component_template/metrics-system.cpu%40custom",
          "body": "{\"template\":{\"settings\":{},\"mappings\":{\"properties\":{\"dummy\":{\"type\":\"text\"}}}},\"_meta\":{\"package\":{\"name\":\"system\"},\"managed_by\":\"fleet\",\"managed\":true}}",
          "querystring": "",
          "headers": {
            "user-agent": "Kibana/8.6.0",
            "x-elastic-product-origin": "kibana",
            "authorization": "Basic ZWxhc3RpYzpjaGFuZ2VtZQ==",
            "x-opaque-id": "59e2d33e-d6c8-4ed4-8d4a-14c412f64871;kibana::management:",
            "x-elastic-client-meta": "es=8.4.0p,js=16.18.1,t=8.2.0,hc=16.18.1",
            "content-type": "application/vnd.elasticsearch+json; compatible-with=8",
            "accept": "application/vnd.elasticsearch+json; compatible-with=8",
            "content-length": "154"
          }
        },
        "options": {
          "opaqueId": "59e2d33e-d6c8-4ed4-8d4a-14c412f64871;kibana::management:",
          "headers": {
            "x-elastic-product-origin": "kibana",
            "user-agent": "Kibana/8.6.0",
            "authorization": "Basic ZWxhc3RpYzpjaGFuZ2VtZQ==",
            "x-opaque-id": "59e2d33e-d6c8-4ed4-8d4a-14c412f64871",
            "x-elastic-client-meta": "es=8.4.0p,js=16.18.1,t=8.2.0,hc=16.18.1"
          }
        },
        "id": 2
      },
      "name": "elasticsearch-js",
      "connection": {
        "url": "http://localhost:9200/",
        "id": "http://localhost:9200/",
        "headers": {},
        "status": "alive"
      },
      "attempts": 0,
      "aborted": false
    },
    "warnings": null
  }
}

(screenshots)

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Nov 16, 2022
Typically the time_series aggregation is wrapped by a date histogram aggregation.
This commit explores ideas around making things more efficient for the time series agg when this is the case.

This commit explores two main ideas:
* With the time series index searcher, docs are emitted in tsid and timestamp order. Because of this, within the docs of a tsid, the date histogram buckets are also emitted in order to sub aggs. This allows the time series aggregator to only keep track of the bucket belonging to the current tsid and bucket ordinal. This removes the need for using BytesKeyedBucketOrds, which in production is very heavy, given the fact that the tsid is a high cardinality field. For each tsid and bucket ordinal combination we keep track of the doc count and delegate to the sub agg. When the tsid / bucket ordinal combination changes, the time series agg creates a new bucket on the fly. Sub aggs of the time series agg only ever contain buckets for a single parent bucket ordinal, which allows always using a bucket ordinal of value 0. After each bucket has been created, the sub agg is cleared.
* If the buckets that the date histogram creates are contained within the index boundaries of the backing index that the searched shard belongs to, then reduction/pipeline aggregation can happen locally, on the fly, as the time series buckets are created. In order to support this, a TimestampBoundsAware interface was added, which can tell a sub agg of a date histogram whether the bounds of the parent bucket are within the bounds of the backing index. In this experiment the terms aggregator was hard coded to use the min bucket pipeline agg, which gets fed a time series bucket (with sub agg buckets) each time the tsid / bucket ordinal combo changes. If buckets are outside the backing index boundary, then buckets are kept around and the pipeline agg is executed in the reduce method of the InternalTimeSeries response class. This fundamentally changes the time series agg, since the response depends on the pipeline agg used.

The `TimeSeriesAggregator3` contains both of these changes.

Extra notes:
* Date histogram could use `AggregationExecutionContext#getTimestamp()` as the source for rounding values into buckets.
* I think there is no need for doc count if pipeline aggs reduce the buckets created by the time series agg on the fly.
* The date agg's filter-by-filter optimization has been disabled when the agg requires in-order execution. The time series index searcher doesn't work with the filter-by-filter optimization.

Relates to elastic#74660
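The first idea above (emit a bucket whenever the tsid / bucket ordinal combination changes, instead of keeping a BytesKeyedBucketOrds-style map) can be sketched as follows. This is an illustration only, not the Java aggregator; docs are assumed to be (tsid, timestamp_seconds) pairs already sorted by tsid and timestamp, and an hourly date-histogram bucket is assumed:

```python
def time_series_buckets(docs):
    """Stream docs sorted by (tsid, timestamp) and emit one
    (tsid, bucket_key, doc_count) triple per bucket, tracking only
    the current bucket instead of a map over all combinations."""
    buckets = []
    current = None  # the (tsid, bucket_key) being accumulated
    count = 0
    for tsid, timestamp in docs:
        key = (tsid, timestamp // 3600)  # hourly date-histogram bucket
        if key != current:
            if current is not None:
                # Combination changed: emit the finished bucket on the fly.
                buckets.append((current[0], current[1], count))
            current, count = key, 0
        count += 1
    if current is not None:
        buckets.append((current[0], current[1], count))
    return buckets
```

Because the input ordering guarantees each (tsid, bucket) combination arrives contiguously, a single running counter replaces the heavy per-tsid bookkeeping.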
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Nov 22, 2022
…t ordinal and buck ordinal.

This avoids needlessly adding the same parent bucket ordinal or TSIDs to `BytesKeyedBucketOrds`.

Relates to elastic#74660
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Nov 23, 2022
…that docids are emitted in tsid and parent bucket ordinal.

This is true when the parent aggregation is a date histogram (which is typical),
due to the fact that TimeSeriesIndexSearcher emits docs in tsid and timestamp order.

Relates to elastic#74660
martijnvg added a commit that referenced this issue Nov 30, 2022
…t ordinal and buck ordinal (#91784)

This avoids needlessly adding the same parent bucket ordinal or TSIDs to `BytesKeyedBucketOrds`.

Relates to #74660
@martijnvg martijnvg mentioned this issue Aug 25, 2023
14 tasks
martijnvg (Member) commented:

Initial TSDB support was added a while ago. I moved the leftover tasks to #98877
