Add better support for metric data types (TSDB) #74660
Pinging @elastic/es-analytics-geo (Team:Analytics)
This PR adds the following constraints to dimension fields:
- It must be an indexed field and must have doc values
- It cannot be multi-valued
- The number of dimension fields in the index mapping must not be more than 16. This should be configurable through an index property (index.mapping.dimension_fields.limit)
- keyword fields cannot be more than 1024 bytes long
- keyword fields must not use a normalizer

Based on the code added in PR #74450. Relates to #74660
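The constraints listed in this PR description can be sketched as a small validator. This is only an illustration, not the Elasticsearch implementation: the function names, the dict-based mapping shape, and the error strings are hypothetical, and the `dimension` parameter name is the one used at the time of this PR.

```python
# Sketch only: a hypothetical validator, not Elasticsearch's implementation.
DEFAULT_DIMENSION_LIMIT = 16   # default for index.mapping.dimension_fields.limit
MAX_KEYWORD_BYTES = 1024       # upper bound for keyword dimension values


def validate_dimension_mappings(properties, limit=DEFAULT_DIMENSION_LIMIT):
    """Return a list of constraint violations for the mapped dimension fields."""
    errors = []
    dimensions = {name: cfg for name, cfg in properties.items() if cfg.get("dimension")}
    if len(dimensions) > limit:
        errors.append(f"{len(dimensions)} dimension fields exceed the limit of {limit}")
    for name, cfg in dimensions.items():
        # Assumption: index and doc_values count as enabled when not set explicitly.
        if not cfg.get("index", True) or not cfg.get("doc_values", True):
            errors.append(f"[{name}] must be indexed and have doc values")
        if cfg.get("type") == "keyword" and cfg.get("normalizer") is not None:
            errors.append(f"[{name}] must not use a normalizer")
    return errors


def validate_dimension_value(name, value):
    """Check the per-document constraints for a single dimension value."""
    if isinstance(value, (list, tuple)):
        return f"[{name}] cannot be multi-valued"
    if isinstance(value, str) and len(value.encode("utf-8")) > MAX_KEYWORD_BYTES:
        return f"[{name}] is longer than {MAX_KEYWORD_BYTES} bytes"
    return None
```

The mapping-level checks and the per-document checks are kept apart because the first group can be verified when the mapping is created, while the multi-value and length limits can only be checked against incoming documents.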
…rameters (#78265)

Backports the following PRs:

* Add dimension mapping parameter (#74450)
  Added the dimension parameter to the following field types:
  - keyword
  - ip
  - Numeric field types (integer, long, byte, short)

  The dimension parameter is of type boolean (default: false) and is used to mark that a field is a time series dimension field. Relates to #74014
* Add constraints to dimension fields (#74939)
  This PR adds the following constraints to dimension fields:
  - It must be an indexed field and must have doc values
  - It cannot be multi-valued
  - The number of dimension fields in the index mapping must not be more than 16. This should be configurable through an index property (index.mapping.dimension_fields.limit)
  - keyword fields cannot be more than 1024 bytes long
  - keyword fields must not use a normalizer

  Based on the code added in PR #74450. Relates to #74660
* Expand DocumentMapperTests (#76368)
  Adds a test for setting the maximum number of dimensions setting and tests the names and types of the metadata fields in the index. Previously we just asserted the count of metadata fields. That made it hard to read failures.
* Fix broken test for dimension keywords (#75408)
  The test was failing because it was testing a 1024 bytes long keyword and the assertion was failing. Closes #75225
* Checkstyle
* Add time_series_metric parameter (#76766)
  This PR adds the time_series_metric parameter to the following field types:
  - Numeric field types
  - histogram
  - aggregate_metric_double
* Rename `dimension` mapping parameter to `time_series_dimension` (#78012)
  This PR renames the dimension mapping parameter to time_series_dimension to make it consistent with the time_series_metric parameter (#76766). Relates to #74450 and #74014
* Add time series params to `unsigned_long` and `scaled_float` (#78204)
  Added the time_series_metric mapping parameter to the unsigned_long and scaled_float field types. Added the time_series_dimension mapping parameter to the unsigned_long field type. Fixes #78100. Relates to #76766, #74450 and #74014

Co-authored-by: Nik Everett <nik9000@gmail.com>
Exposes information about dimensions and metrics via field caps. This information will be needed for PromQL support. Relates to #74660
Adds basic support for selectors in TimeSeriesMetricsService Relates to #74660
This PR renames all public APIs for downsampling so that they contain the downsample keyword instead of the rollup keyword that we had until now.
1. The API endpoint for the downsampling action is renamed to: /source-index/_downsample/target-index
2. The ILM action is renamed to downsample: PUT _ilm/policy/my_policy { "policy": { "phases": { "warm": { "actions": { "downsample": { "fixed_interval": "24h" } } } } } }
3. unsupported_aggregation_on_rollup_index was renamed to unsupported_aggregation_on_downsampled_index
4. Internal transport actions were renamed:
   - indices:admin/xpack/rollup -> indices:admin/xpack/downsample
   - indices:admin/xpack/rollup_indexer -> indices:admin/xpack/downsample_indexer
5. Renamed the following index settings:
   - index.rollup.source.uuid -> index.downsample.source.uuid
   - index.rollup.source.name -> index.downsample.source.name
   - index.rollup.status -> index.downsample.status

Finally, we renamed many internal variables and classes from *Rollup* to *Downsample*. However, this effort will be completed in more than one PR so that we minimize conflicts with other in-flight PRs. Relates to #74660
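For readers who script against these APIs, the renames in this PR description can be captured in a few lines. This is a rough sketch that only builds request paths and bodies locally. The helper names (`downsample_endpoint`, `ilm_downsample_policy`, `migrate_settings`) are hypothetical and are not part of any Elasticsearch client.

```python
import json

# Old -> new index setting names, copied from the rename list in the PR description.
RENAMED_SETTINGS = {
    "index.rollup.source.uuid": "index.downsample.source.uuid",
    "index.rollup.source.name": "index.downsample.source.name",
    "index.rollup.status": "index.downsample.status",
}


def downsample_endpoint(source_index, target_index):
    """Path of the renamed downsampling action."""
    return f"/{source_index}/_downsample/{target_index}"


def ilm_downsample_policy(fixed_interval, phase="warm"):
    """Body for PUT _ilm/policy/<name> using the renamed `downsample` action."""
    return {
        "policy": {
            "phases": {
                phase: {"actions": {"downsample": {"fixed_interval": fixed_interval}}}
            }
        }
    }


def migrate_settings(settings):
    """Hypothetical helper: rewrite old rollup setting keys to the new names."""
    return {RENAMED_SETTINGS.get(key, key): value for key, value in settings.items()}


print(downsample_endpoint("source-index", "target-index"))
print(json.dumps(ilm_downsample_policy("24h")))
```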
This PR removes the feature flag for the time-series data ingestion and downsampling functionality, making time-series indices and downsampling available. For more information about the released functionality, see #74660. The time_series aggregation still remains behind the feature flag.
Sorry for my newbie question. Is this the same as https://www.elastic.co/guide/en/elasticsearch///reference/master/tsds.html ? Thanks
@oatkiller Yes. It's the same thing. :) (I helped write the initial docs.) Hope you enjoy Elastic! It's an awesome place to work.
Currently, the key is a map, which can make reducing large responses more memory intensive than it should be. Also, data structures used during reduce are not backed by bigarrays, so they are not accounted for. This commit changes how the key is represented internally, by using BytesRef instead of Map. This commit doesn't change how the key is represented in the response. It also changes the reduce method to make use of the fact that the bucket keys are now bytes refs. Relates to elastic#74660
@martijnvg Hi! I am working on a feature on Fleet UI to enable TSDB index setting, and trying to leave I'm getting this error when trying to set
Hey @juliaElastic, can you point me to the composable index templates and component templates? A composable index template is the place where this setting can be used. Typically with component templates, not all settings / mappings are present there and each component template needs to be valid on its own. So if
Currently, the key is a map, which can make reducing large responses more memory intensive than it should be. Also, the map used during the reduce to detect duplicate buckets is not taken into account by the circuit breaker. This map can become very large when reducing large shard level responses. This commit changes how the key is represented internally, by using BytesRef instead of Map. This commit doesn't change how the key is represented in the response. The reduce is also changed to merge the shard responses without creating intermediate data structures for detecting duplicate buckets. This is possible because the buckets in the shard level responses are sorted by tsid. Relates to #74660
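The reduce strategy described in this commit message, merging shard responses that are already sorted by tsid, can be illustrated with a k-way merge. This is a minimal sketch and not the actual `InternalTimeSeries` code: buckets are modelled as hypothetical `(tsid, doc_count)` tuples with the tsid as raw bytes.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_sorted_shards(shard_responses):
    """Merge per-shard bucket lists that are each sorted by tsid.

    Each bucket is a (tsid: bytes, doc_count: int) pair. Because the inputs are
    sorted, equal tsids come out of the merge next to each other, so duplicates
    can be combined on the fly without a map of all seen keys.
    """
    merged = heapq.merge(*shard_responses, key=itemgetter(0))
    for tsid, group in groupby(merged, key=itemgetter(0)):
        yield tsid, sum(doc_count for _, doc_count in group)
```

The merge only holds one pending bucket per shard at a time, which is the property that makes a duplicate-detection map unnecessary.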
@martijnvg So I managed to add the Is there any workaround for this issue?
Typically the time_series aggregation is wrapped by a date histogram aggregation. This commit explores ideas around making things more efficient for the time series agg if this is the case. This commit explores two main ideas:
* With the time series index searcher, docs are emitted in tsid and timestamp order. Because of this, within the docs of a tsid, the date histogram buckets are also emitted in order to sub aggs. This allows the time series aggregator to only keep track of the bucket belonging to the current tsid and bucket ordinal. This removes the need for using BytesKeyedBucketOrds, which in production is very heavy, given that the tsid is a high cardinality field. For each tsid and bucket ordinal combination we keep track of the doc count and delegate to the sub agg. When the tsid / bucket ordinal combination changes, the time series agg creates a new bucket on the fly. Sub aggs of the time series agg only ever contain buckets for a single parent bucket ordinal, which allows always using a bucket ordinal of value 0. After each bucket has been created, the sub agg is cleared.
* If the buckets that the date histogram creates are contained within the index boundaries of the backing index that the shard the search is executed on belongs to, then reduction/pipeline aggregation can happen locally on the fly when the time series buckets are created. In order to support this, a TimestampBoundsAware interface was added, which can tell a sub agg of a date histogram whether the bounds of the parent bucket are within the bounds of the backing index. In this experiment the terms aggregator was hard coded to use the min bucket pipeline agg, which gets fed a time series bucket (with sub agg buckets) each time the tsid / bucket ordinal combo changes. If buckets are outside the backing index boundary, then the buckets are kept around and the pipeline agg is executed in the reduce method of the InternalTimeSeries response class. This fundamentally changes the time series agg, since the response depends on the pipeline agg used.

The `TimeSeriesAggregator3` contains both of these changes.

Extra notes:
* Date histogram could use `AggregationExecutionContext#getTimestamp()` as the source for rounding values into buckets.
* I think there is no need for the doc count if pipeline aggs reduce on the fly the buckets created by the time series agg.
* The date agg's filter by filter optimization has been disabled when the agg requires in order execution. The time series index searcher doesn't work with the filter by filter optimization.

Relates to elastic#74660
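The first idea in this commit message, keeping state only for the current tsid / bucket combination, can be sketched as a streaming loop. This is an illustration under stated assumptions, not the `TimeSeriesAggregator3` code: docs are modelled as hypothetical `(tsid, timestamp_millis)` tuples that arrive sorted by tsid and then timestamp, and a fixed interval stands in for the date histogram rounding.

```python
def stream_time_series_buckets(docs, interval_millis):
    """Yield (tsid, bucket_start, doc_count) from docs sorted by (tsid, timestamp).

    Only the current tsid / bucket combination is kept in memory; a bucket is
    emitted as soon as the combination changes.
    """
    current = None
    doc_count = 0
    for tsid, timestamp in docs:
        key = (tsid, timestamp - timestamp % interval_millis)
        if key != current:
            if current is not None:
                yield (*current, doc_count)
            current, doc_count = key, 0
        doc_count += 1
    if current is not None:
        yield (*current, doc_count)
```

Because no keyed structure over all tsids is built, memory use stays constant regardless of how many time series the shard contains, which is the motivation given for dropping `BytesKeyedBucketOrds`.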
…t ordinal and bucket ordinal. This avoids needlessly adding the same parent bucket ordinal or TSIDs to `BytesKeyedBucketOrds`. Relates to elastic#74660
…that docids are emitted in tsid and parent bucket ordinal. This is true when the parent aggregation is a date histogram (which is typical), due to the fact that TimeSeriesIndexSearcher emits docs in tsid and timestamp order. Relates to elastic#74660
Initial TSDB support was added a while ago. I moved the leftover tasks to #98877
Phase 0 - Inception
Phase 1 - Mappings
- Add `time_series_dimension` mapping parameter to fields
- Rename `dimension` mapping parameter to `time_series_dimension` #78012 @csoulios
Phase 2 - Ingest
Dimension-based tsid generator
- `_tsid` field to time_series indices #80276 @csoulios
- `_tsid` #81382 @csoulios (prototype)
- `_tsid` field #81998 @csoulios
Routing
- `BulkOperation`. Maybe we can make this simpler.
- `ids` query on time series index #81436 @csoulios
- `_id` for tsid (TSDB: Support GET and DELETE and doc versioning #82633)
- `_id` is automatically generated (TSDB: improve document description on error #84903, TSDB: Add dimensions and timestamp to parse errors #84962)
- `_tsid` and `@timestamp` (TSDB: Expand _id on version conflict #84957)
- `@timestamp` component of the `_id` from little endian to big endian. That should mean there are more common prefixes. TSDB: shrink _id inverted index #85008 cuts the size of the inverted index for `_id` by 37%. That's not a lot of the index in total, but it sure does feel good for such a small change.
- `_id` in `RecoverySourceHandlerTests.java` and `EngineTests.java` Test time series id in RecoverySourceHandlerTests #84996, Use tsdb's id in Engine tests #85055
- `@timestamp` or dimensions in reindex TSDB: Initial reindex fix #86647 + Reindex support for TSDB creating indices #86704
- `_id` with the security `create_doc` privilege. Can a user with `create_doc` (only) ingest new TSDB docs? Does `create_doc` prevent a user from overwriting an existing TSDB doc? (`create_doc` relies on the `OpType` of the `IndexRequest`, which is automatically set to `CREATE` for docs with auto-generated ids) TSDB: Test `create_doc` permission #86638
Handling Time Boundaries
- `start_time`, `end_time` index settings) @weizijun
- `index.time_series.start_time` and `index.time_series.end_time` index settings) that don't match with the `@timestamp` range in a search request. Skip backing indices with a disjoint range on @timestamp field. #85162 (@martijnvg)
Other tasks
- `index_mode` setting isn't good enough. It requires additional config to be specified (`time_series_dimension` attribute in mappings and `index.routing_path` as index settings) elsewhere and it doesn't allow the data stream tsdb features (routing based on `@timestamp` field) to be enabled without enabling the index level tsdb features.
- `index.mode` setting is set to `time_series`.
- `index.routing_path` index setting if not defined in composable index template that creates a tsdb data stream. All mapped fields of type `keyword` and `time_series_dimension` enabled will be included in the generated `index.routing_path` index setting. Auto generate index.routing_path from mapping #86790 (@martijnvg)
- [ ] The `index.routing_path` index setting generation doesn't kick in when index.mode and dimension fields are defined in component templates. (@martijnvg).
Phase 2.1 Ingest follow ups
- [ ] Build the `_id` from dimension values
- [ ] Investigate moving timestamp to the front of the `_id` to automatically get an optimization on `_id` searches. Not sure if worth it - but possible. #84928 could be an alternative
- `_id` in lucene stored fields. We could regenerate it from the `_source` or from doc values for the `@timestamp` and the `_tsid`. That'd save some bytes per document.
- `IndexRequest#autoGeneratId`? It's a bit spooky where it is but I don't like it any other place.
- `_update_by_query` when modifying the dimensions or `@timestamp`
- `_id` and assert that it matches the `_id` from the primary. Should we? Probably. Let's make sure.
- [ ] Document best practices for using dimensions-based ID generator including how to use this with component templates
Phase 3.1 QL storage API (Postponed)
- [ ] Reimplement QL storage API for TSDB database (depends on completion of Phase 2 and 3.2) (Postponed)
Phase 3.2 - Search MVP
Plans for time series support in the _search API are superseded by plans for this in ES|QL.
- [ ] Aggregation results filtering
- [ ] Retrieve the last value for a time series metric within a parent bucket
- [ ] Add a new histogram field subtype to support Prometheus-style histograms
- [ ] TSDB indices could speed up cardinality aggregations on dimension fields #85523
- [ ] Should the _tsid agg return doc_counts by default?
- [ ] Shortcut aggs for TSDB #90423
Phase 3.3 - Rollup / Downsampling
- `TimeSeriesIndexSearcher` and compute rollups docs and add them to the rollup index
- `aggregate_metric_double` fields #90029 @csoulios
- `fixed_interval` vs `calendar_interval`
- `time_zone`
- `date_histogram` resolution
- `aggregate_metric_double` fields as their own field type instead of `double` #87849 @csoulios
Phase 3.4 - TSID aggs (superseded by tsdb in ES|QL)
~~ - [ ] Update min, max, sum, avg pipeline aggs for intermediate result filtering optimization ~~
~~ - [ ] Sliding window aggregation ~~
~~ - [ ] A way to filter to windows within the sliding window. Like "measurements taken in the last 30 seconds of the window". ~~
~~ - [ ] Open transform issue for newly added time series aggs ~~
~~ - [ ] Benchmarks for the tsid agg ~~
Phase 3.5 - Downsampling follow ups
Phase 4.0 - Compression
- `_source` @nik9000 Synthetic Source #86603
Phase 5.0 - Follow-ups and Nice-to-have-s
- `_id`'s values. Right now it's a `String` that we encode with `Uid.encodeId`. That was reasonable. Maybe it still is. But it feels complex for tsdb, whose `_id` is always some bytes. And encoding it also wastes a byte about 1/128 of the time. It's a common prefix byte so this is probably not really an issue. But still. This is a big change but it'd make ES easier to read. Probably wouldn't really improve the storage though.
- `index.routing_path`), the source needs to be parsed on the coordinating node. However, in the case that an ingest pipeline is executed, the source of the document will be parsed for the second time. Ideally the routing values should be extracted when ingest is performed, similar to how the `@timestamp` field is already retrieved from a document during pipeline execution.
- `Instant`. The format being used is: `strict_date_optional_time_nanos||strict_date_optional_time||epoch_millis`. This is to allow the regular date format, the date nanos format and epoch millis defined as a string. We can optimise the date parsing if we know the exact format being used. For example, if on the data stream there is a parameter that indicates the exact date format, we can optimise parsing by using either `strict_date_optional_time_nanos`, `strict_date_optional_time` or `epoch_millis`.
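The date parsing idea in the last item can be illustrated with a small sketch: a fallback chain that tries each format in turn, versus a single parser chosen up front when the format is declared. Everything here is an assumption for illustration. The `strptime` patterns are only a rough stand-in for `strict_date_optional_time` (nanosecond precision is not modelled), and `parser_for` with its `declared_format` argument is a hypothetical helper, not an existing data stream parameter.

```python
from datetime import datetime, timezone

# Rough stand-ins for strict_date_optional_time; the real format accepts more variants.
ISO_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z")


def parse_epoch_millis(value):
    """Parse epoch milliseconds given as a string."""
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


def parse_iso(value):
    """Parse an ISO-8601 style timestamp with an explicit offset or 'Z'."""
    for fmt in ISO_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"not an ISO timestamp: {value!r}")


def parse_any(value):
    """Fallback chain in the spirit of `a||b||c`: try each format in turn."""
    for parser in (parse_iso, parse_epoch_millis):
        try:
            return parser(value)
        except ValueError:
            continue
    raise ValueError(f"unparseable timestamp: {value!r}")


def parser_for(declared_format=None):
    """Pick a single parser when the exact format is known up front."""
    known = {"strict_date_optional_time": parse_iso, "epoch_millis": parse_epoch_millis}
    return known.get(declared_format, parse_any)
```

With a declared format, each document pays for exactly one parse attempt. With the fallback chain, documents in the last format of the chain pay for every failed attempt before it.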