
Hash the tsid to overcome dimensions limits #98023

Conversation

@salvatore-campagna salvatore-campagna commented Jul 28, 2023

A Lucene limitation on doc values for UTF-8 fields does not allow
us to write keyword fields whose size is larger than 32K. This limits
our ability to map more than a certain number of dimension fields
for time series indices. Before this change the tsid was created
as a concatenation of dimension field names and values into
a keyword field.

To overcome this limitation we hash the tsid. This PR is intended
to be used as a draft to test different options.

Note that, as a side effect, this reduces the size of the tsid field,
since far less data is stored when the tsid is hashed. However,
we expect tsid hashing to hurt doc values compression and to
result in a larger storage footprint. The effect on query latency
needs to be evaluated too.

Resolves #93564

A Lucene limitation on doc values for UTF-8 fields does not
allow us to write string fields whose size is larger than
32K. This limits our ability to map more than a certain
number of dimension fields for time series indices.

To overcome this limitation we hash the tsid if the size of
the fields exceeds the limit, and prepend the hash with a prefix
that makes it easier to query the field and investigate how many
documents are using a hashed tsid.
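
A minimal sketch of the idea (class name, prefix and exact limit are illustrative assumptions, not the actual mapper code):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

final class TsidHashingSketch {
    // Lucene rejects SORTED doc values terms larger than ~32 KB (32766 bytes).
    private static final int DOC_VALUES_LIMIT = 32766;
    // Hypothetical marker so hashed tsids are easy to query and count.
    private static final String HASH_PREFIX = "hash:";

    static String tsid(byte[] encodedDimensions) throws Exception {
        if (encodedDimensions.length <= DOC_VALUES_LIMIT) {
            // Small enough: keep the plain concatenation of dimension names and values.
            return new String(encodedDimensions, StandardCharsets.UTF_8);
        }
        // Too large: store a fixed-size hash instead, marked with the prefix.
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(encodedDimensions);
        return HASH_PREFIX + Base64.getEncoder().encodeToString(digest);
    }
}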

salvatore-campagna commented Jul 28, 2023

This definitely breaks the downsampling persistent task's assumption about _tsid ordering.
See #97557.

If we decide to proceed with this tsid hashing solution we can address the issue in the
downsampling PR by checking whether the source index includes hashed tsids and, if so,
restarting that specific downsampling task from scratch. The idea is that tsid hashing
would be used rarely and, as a result, we would incur this performance penalty rarely too.

assert context.getDynamicMappers().isEmpty() == false
|| context.getDynamicRuntimeFields().isEmpty() == false
|| id.equals(indexRouting.createId(TimeSeriesIdFieldMapper.decodeTsid(tsid), suffix));
// assert context.getDynamicMappers().isEmpty() == false

@salvatore-campagna salvatore-campagna Jul 28, 2023


I need to comment out this assertion, at least for now, because when we hash the tsid we do not serialise dimensions and routing fields, which later results in a failure since createId throws an exception due to the missing routing fields.

Member


Maybe this is a matter of where we do the hashing? Maybe we should alter the TimeSeriesIdBuilder class.
Just hash the timeSeriesId in TimeSeriesIdFieldMapper#postParse(...) after line 148, only if it is beyond the 32KB limit, and store that as a SortedDocValuesField. And just pass down the original id to TsidExtractingIdFieldMapper#createField(...)?
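
Roughly something like this (just a sketch of the suggestion; SortedDocValuesField and BytesRef are the Lucene classes, the hash choice and the surrounding method are assumptions):

import java.security.MessageDigest;
import java.util.Arrays;

import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

final class PostParseSketch {
    private static final int LIMIT = 32 * 1024;

    // Hash only when the encoded tsid exceeds the limit; the original, un-hashed
    // tsid would still be passed down for _id creation.
    static SortedDocValuesField tsidField(BytesRef timeSeriesId) throws Exception {
        BytesRef stored = timeSeriesId;
        if (timeSeriesId.length > LIMIT) {
            byte[] raw = Arrays.copyOfRange(timeSeriesId.bytes, timeSeriesId.offset,
                timeSeriesId.offset + timeSeriesId.length);
            stored = new BytesRef(MessageDigest.getInstance("SHA-256").digest(raw));
        }
        return new SortedDocValuesField("_tsid", stored);
    }
}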

Member


Can this assertion be re-enabled?


@salvatore-campagna salvatore-campagna Nov 27, 2023


indexRouting.createId requires a Map as input... decodeTsid does not return a Map anymore.
I don't think we can have that assertion as it was before. Note that decodeTsid returns a Base64 encoding of the tsid and I see no way to retrieve the id given a Base64 encoding of the tsid. Maybe I can do something using the plain (not base64 encoded) tsid.

@felixbarny

Not saying it's better, but one alternative to hashing only when the tsid is over 32k would be to always hash, making the hash consist of two parts: ${similarityHash(_tsid)}-${hash(_tsid)}.

This would have two potential advantages:

  • Always use the same hashing approach, regardless of the length of the input. This avoids having a rarely executed special case for large _tsids.
  • Potentially does a better job at co-locating similar time series close to one another. IINM, this is desirable as it reduces disk usage. Currently, if the first dimension of two time series differs but the rest of the dimensions are the same, they're not co-located on disk because of the lexicographic sort order based on the _tsid. Using a similarity hash as the first part of the _tsid may not have this limitation.

However, I suppose that this approach would need more testing and tuning than the approach in this PR as it has an impact on all cases, not just edge cases. Not sure what impact this would have on downsampling and its assumptions around ordering. Also, this would change the _tsid of all time series when upgrading from a previous version to that new strategy.

I'll leave it up to you and the team to decide if it's worth investigating that approach.
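
For illustration, the composed value could be built roughly like this (the similarity hash is left abstract; all names here are assumptions, not a concrete proposal):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.List;
import java.util.function.Function;

final class TwoPartTsidSketch {
    // ${similarityHash(_tsid)}-${hash(_tsid)}: the prefix clusters similar time series
    // under the index sort, the suffix keeps the overall value unique.
    static String tsid(List<String> sortedDimensions, Function<List<String>, String> similarityHash)
            throws Exception {
        byte[] raw = String.join(",", sortedDimensions).getBytes(StandardCharsets.UTF_8);
        String prefix = similarityHash.apply(sortedDimensions);
        String suffix = Base64.getEncoder()
            .encodeToString(MessageDigest.getInstance("SHA-256").digest(raw));
        return prefix + "-" + suffix;
    }
}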

@salvatore-campagna

Right now running the cardinality aggregation on one of the dimension fields fails.

@salvatore-campagna

@elasticmachine test this please

@salvatore-campagna

@elasticmachine run elasticsearch-ci/bwc

@salvatore-campagna

@felixbarny do you have something specific in mind for similarityHash?

I understand the goal is to favor clustering of similar tsid but I was wondering if you are thinking of a specific hash function.

@felixbarny

I was thinking of a locality-sensitive hash (LSH). Maybe there's already something in the ES code base that we could re-use, not sure. Maybe something related to nearest neighbor search or maybe we can reuse parts of the MinHash token filter.

I bet folks like @jpountz or @benwtrent would know best what's an appropriate hash function for that use case.
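
As a toy illustration of the general idea (not the Lucene MinHash filter, just a k-slot min-hash signature over the dimension strings):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

final class MinHashSketch {
    // Similar dimension sets agree on many of the k slots, so sorting by this signature
    // tends to place similar time series next to each other.
    static long[] signature(List<String> dimensions, int k) {
        long[] mins = new long[k];
        Arrays.fill(mins, Long.MAX_VALUE);
        for (String dim : dimensions) {
            CRC32 crc = new CRC32();
            crc.update(dim.getBytes(StandardCharsets.UTF_8));
            long base = crc.getValue();
            for (int i = 0; i < k; i++) {
                // Derive k cheap hash functions from one base hash.
                long h = base * (2L * i + 1) + 0x9E3779B97F4A7C15L * i;
                mins[i] = Math.min(mins[i], h);
            }
        }
        return mins;
    }
}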

@felixbarny

Seems like this is somewhat related: apache/lucene#12489


salvatore-campagna commented Aug 30, 2023

@felixbarny I was considering something a bit simpler, for instance generating the first piece of the hash from just a subset of the dimensions. I was considering including in this subset at least the set of fields used in the routing_path, so that co-location is also related to the documents' actual physical shard location.

So something like

hash(routing_path)

putting an upper bound on the number of fields included, at least to deal with the fact that, if I understand correctly, integrations typically use routing_path = dimensions.

So something more like

hash(subset(dimensions, LIMIT))

where LIMIT is some integer value lower than the total number of dimensions.

However, this probably would not work well if the set of fields in the routing_path is small... which means that besides an upper bound I would also need a lower bound.
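
Roughly what I mean (LIMIT, names and the hash choice are placeholders):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

final class SubsetHashSketch {
    // Hash only the first `limit` dimensions (e.g. the routing_path fields), so documents
    // that share those fields also share this part of the hash.
    static byte[] hashSubset(List<String> sortedDimensions, int limit) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (int i = 0; i < sortedDimensions.size() && i < limit; i++) {
            sha.update(sortedDimensions.get(i).getBytes(StandardCharsets.UTF_8));
        }
        return sha.digest();
    }
}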


salvatore-campagna commented Aug 30, 2023

Another reason I was considering the above is that it would allow us, maybe later, to add an optional attribute to dimension fields so that applications can control which dimensions to hash, since applications may know better which time series are used together.

@benwtrent

@felixbarny @salvatore-campagna do we know the tsids ahead of time? I am guessing not.

Could we get away with only hashing PART of the tsid? What constitutes the biggest part of it?

There are various order-preserving hashes out there, I am honestly surprised we don't have one implemented in Elasticsearch yet.

I don't think you want to "cluster" things but instead you want the order of things to be preserved.

A variation on MinHashFilter could work. But I don't think it would be trivial to implement. Probably something that takes inspiration from that, or one of the order-preserving near-perfect hashes out there.


felixbarny commented Aug 31, 2023

> I don't think you want to "cluster" things but instead you want the order of things to be preserved.

Overall, the goal is to reduce storage space for the index data structures. As TSDB has a built-in index sorting on the _tsid field, the order of that field determines which time series are co-located to one another. The assumption is that if you co-locate _tsids that have similar dimensions, the index data structures can be more compact.

I'm not sure if we've validated that assumption. I think it would be useful to check what happens to the storage size when comparing the current implementation where the _tsid is a concatenation of all dimensions (scenario a) vs a simple hash for the full _tsid (scenario b). We'd expect to see less storage needed for the index data structure in scenario a, since in scenario b the distribution of time series is basically random.

However, scenario a is also not perfect: if you have two time series for a metric with 10 dimensions each and only the first dimension differs, the two time series will probably not be co-located, even though the rest of the dimensions may be the same. That's where the idea with the similarity hash comes in: if we can cluster time series whose dimensions are similar to each other, we may see less storage needed for index data structures.

However, I'm not sure if the index sort is enough for that. We may also need to think about the shard _routing. But we need to be careful that this doesn't result in shard hot-spotting in case they're different sets of clusters that have an uneven distribution.


Having said all of that, I can propose an approach that should have similar properties to the current mechanism but allows us to remove the dimension limit. The basic idea is that we construct the _tsid similarly to today but instead of using the full keys and values for the dimensions, we're hashing each key and value individually with a short hash (for example 32 bit hashes like Adler32, Crc32c, or even String#hashCode) on the first n dimensions. N can be a relatively high number like 1024 and we'd still be well under Lucene's 32kb keyword token length limit. To avoid hash collisions, we append a larger hash (like 256 bit, such as sha256) at the end.

Pseudo code:

sort(dimensions)
_tsid = new ByteBuffer()
for (i = 0; i < dimensions.length && i < 1024; i++) {
  _tsid.put(hash32(dimensions[i].key))
  _tsid.put('=')
  _tsid.put(hash32(dimensions[i].value))
  _tsid.put(':')
}
_tsid.put(hash256(dimensions))
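
In Java terms, the sketch could look roughly like this (hash choices and sizes as assumed above; not production code):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Comparator;
import java.util.List;
import java.util.zip.CRC32;

final class PerDimensionHashSketch {
    record Dimension(String key, String value) {}

    static byte[] buildTsid(List<Dimension> dims) throws Exception {
        List<Dimension> sorted = dims.stream().sorted(Comparator.comparing(Dimension::key)).toList();
        int n = Math.min(sorted.size(), 1024);
        // 4 bytes per key hash + 4 per value hash + 2 separator bytes, plus a 32-byte sha256
        // suffix: at most ~10 KB, well under the 32 KB keyword limit.
        ByteBuffer tsid = ByteBuffer.allocate(n * 10 + 32);
        for (int i = 0; i < n; i++) {
            tsid.putInt(crc32(sorted.get(i).key()));
            tsid.put((byte) '=');
            tsid.putInt(crc32(sorted.get(i).value()));
            tsid.put((byte) ':');
        }
        // Full-fidelity hash over all dimensions, appended to avoid collisions.
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        for (Dimension d : sorted) {
            sha256.update(d.key().getBytes(StandardCharsets.UTF_8));
            sha256.update(d.value().getBytes(StandardCharsets.UTF_8));
        }
        tsid.put(sha256.digest());
        return tsid.array();
    }

    private static int crc32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }
}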


salvatore-campagna commented Aug 31, 2023

> I don't think you want to "cluster" things but instead you want the order of things to be preserved.
>
> Overall, the goal is to reduce storage space for the index data structures. As TSDB has a built-in index sorting on the _tsid field, the order of that field determines which time series are co-located to one another. The assumption is that if you co-locate _tsids that have similar dimensions, the index data structures can be more compact.
>
> I'm not sure if we've validated that assumption. I think it would be useful to check what happens to the storage size when comparing the current implementation where the _tsid is a concatenation of all dimensions (scenario a) vs a simple hash for the full _tsid (scenario b). We'd expect to see less storage needed for the index data structure in scenario a, since in scenario b the distribution of time series is basically random.
>
> However, scenario a is also not perfect: if you have two time series for a metric with 10 dimensions each and only the first dimension differs, the two time series will probably not be co-located, even though the rest of the dimensions may be the same. That's where the idea with the similarity hash comes in: if we can cluster time series whose dimensions are similar to each other, we may see less storage needed for index data structures.
>
> However, I'm not sure if the index sort is enough for that. We may also need to think about the shard _routing. But we need to be careful that this doesn't result in shard hot-spotting in case they're different sets of clusters that have an uneven distribution.
>
> Having said all of that, I can propose an approach that should have similar properties to the current mechanism but allows us to remove the dimension limit. The basic idea is that we construct the _tsid similarly to today but instead of using the full keys and values for the dimensions, we're hashing each key and value individually with a short hash (for example 32 bit hashes like Adler32, Crc32c, or even String#hashCode) on the first n dimensions. N can be a relatively high number like 1024 and we'd still be well under Lucene's 32kb keyword token length limit. To avoid hash collisions, we append a larger hash (like 256 bit, such as sha256) at the end.

Pseudo code:

_tsid = ""
for (i = 0; i < dimensions.length && i < 1024; i++) {
  _tsid += hash32(dimensions[i].key)
  _tsid += ':'
  _tsid += hash32(dimensions[i].value)
  _tsid += ';'
}
_tsid += hash256(dimensions)

This is similar to what I was considering above, where the subset of dimensions corresponds to the first 1024 dimensions here (LIMIT =1024).

@felixbarny

The difference is that I was proposing to hash each dimension key and value individually (a), whereas I think you were proposing to take a subset of the dimensions and create a single hash (b) out of that.

In a case where only the last dimension value is different, with approach b the hash is entirely different, but with approach a only the last part will differ. Therefore, approach a potentially does a better job at clustering similar time series next to one another.

But I'm still not sure whether we're chasing ghosts here and whether that clustering actually makes a big difference in storage size.


salvatore-campagna commented Aug 31, 2023

> The difference is that I was proposing to hash each dimension key and value individually (a), whereas I think you were proposing to take a subset of the dimensions and create a single hash (b) out of that.
>
> In a case where only the last dimension value is different, with approach b the hash is entirely different, but with approach a only the last part will differ. Therefore, approach a potentially does a better job at clustering similar time series next to one another.
>
> But I'm still not sure whether we're chasing ghosts here and whether that clustering actually makes a big difference in storage size.

To answer the last part I would like to write the PR so that we have a branch where, by changing a setting, we can test the following:

  1. hash without similarity for _tsid > 32 kb
  2. hash with similarity for _tsid > 32 kb
  3. hash always without similarity
  4. hash always with similarity

@salvatore-campagna salvatore-campagna added the auto-merge-without-approval label Feb 1, 2024
@elasticsearchmachine elasticsearchmachine merged commit bdd3a4f into elastic:main Feb 1, 2024
14 checks passed
@salvatore-campagna salvatore-campagna deleted the fix/93564-time-series-dimensions-limit branch February 1, 2024 13:25
davismcphee added a commit to elastic/kibana that referenced this pull request Feb 2, 2024
## Summary

This PR updates two `data.json.gz` files to remove all doc IDs (which is
unsupported according to
#176109 (comment)),
which started causing failures after tsid hashing was implemented by ES
in elastic/elasticsearch#98023.

Resolves #176103.
Resolves #176102.
Resolves #176100.
Resolves #176107.
Resolves #176108.
Resolves #176109.

### Checklist

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)

### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
fkanout pushed a commit to fkanout/kibana that referenced this pull request Feb 7, 2024
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
fkanout pushed a commit to fkanout/kibana that referenced this pull request Mar 4, 2024
martijnvg added a commit that referenced this pull request Mar 13, 2024
After tsid hashing was introduced (#98023), the time series aggregator generates the tsid (from all dimension fields) instead of using the value from the _tsid field directly. This generation of the tsid happens for every time series, parent bucket and segment combination.

This change alters that by only generating the tsid once per time series and segment. This is done by just locally recording the current tsid.
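
Conceptually the change amounts to memoizing the last computed value per segment, along these lines (a sketch, not the actual aggregator code):

import java.util.function.LongFunction;
import org.apache.lucene.util.BytesRef;

final class TsidPerSegmentCache {
    private long lastTsidOrd = -1;
    private BytesRef lastTsid;

    // Recompute the tsid only when the ordinal changes within the current segment;
    // consecutive docs of the same time series reuse the cached value.
    BytesRef tsidFor(long tsidOrd, LongFunction<BytesRef> buildTsid) {
        if (tsidOrd != lastTsidOrd) {
            lastTsidOrd = tsidOrd;
            lastTsid = buildTsid.apply(tsidOrd);
        }
        return lastTsid;
    }
}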
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 13, 2024
elasticsearchmachine pushed a commit that referenced this pull request Mar 13, 2024
dnhatn pushed a commit that referenced this pull request Mar 29, 2024
Missing a check on the transport version results in an unreadable cluster state
if it includes a serialized instance of DownsampleShardTaskParams.
#98023 introduced an optional string array holding the dimensions used by time
series indices.
Reading an optional array requires first reading a boolean that indicates whether
the array is present in serialized form. From 8.13 on we try to read that boolean,
but it is not there because older versions write neither the boolean nor the string
array. Here we add a check on the transport version for backward compatibility,
skipping reading the boolean and the array whenever they cannot be present.
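
The fix boils down to gating the read on the transport version, roughly like this (readOptionalStringArray and getTransportVersion are the existing stream methods; the rest is a sketch, with the version constant passed in as an assumption):

import java.io.IOException;
import org.elasticsearch.TransportVersion;
import org.elasticsearch.common.io.stream.StreamInput;

final class ParamsReadSketch {
    // Read the optional dimensions array only when the stream was written by a version
    // that actually serializes it, so cluster states from older versions stay readable.
    static String[] readDimensions(StreamInput in, TransportVersion introducedIn) throws IOException {
        if (in.getTransportVersion().onOrAfter(introducedIn)) {
            return in.readOptionalStringArray();
        }
        return null;
    }
}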

Customers using downsampling might have cluster states including such serialized
objects and would be unable to upgrade to version 8.13. They will be able to
upgrade to any version that includes this fix.

This fix has a side effect #106880
Labels
auto-merge-without-approval, >non-issue, :StorageEngine/TSDB, Team:StorageEngine, v8.13.0