
Hash the tsid to overcome dimensions limits #98023

Conversation

@salvatore-campagna salvatore-campagna commented Jul 28, 2023

A Lucene limitation on doc values for UTF-8 fields does not allow
us to write keyword fields whose size is larger than 32K. This limits
our ability to map more than a certain number of dimension fields
for time series indices. Before this change the tsid was created
as a concatenation of dimension field names and values into
a keyword field.

To overcome this limitation we hash the tsid. This PR is intended
to be used as a draft to test different options.

Note that, as a side effect, this reduces the size of the tsid field,
since far less data is stored when the tsid is hashed. However,
we expect tsid hashing to hurt doc values compression and to
result in a larger storage footprint. The effect on query latency
needs to be evaluated too.

Resolves #93564

A Lucene limitation on doc values for UTF-8 fields does not
allow us to write string fields whose size is larger than
32K. This limits our ability to map more than a certain
number of dimension fields for time series indices.

To overcome this limitation we hash the tsid if the size of
the fields exceeds the limit, and prepend the hash with a prefix
that makes it easier to query the field and investigate how many
documents are using a hashed tsid.
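
A minimal sketch of the idea (class name, prefix and exact limit are illustrative assumptions, not the actual mapper code):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

final class TsidHashingSketch {
    // Lucene rejects SORTED doc values terms larger than ~32 KB (32766 bytes).
    private static final int DOC_VALUES_LIMIT = 32766;
    // Hypothetical marker so hashed tsids are easy to query and count.
    private static final String HASH_PREFIX = "hash:";

    static String tsid(byte[] encodedDimensions) throws Exception {
        if (encodedDimensions.length <= DOC_VALUES_LIMIT) {
            // Small enough: keep the plain concatenation of dimension names and values.
            return new String(encodedDimensions, StandardCharsets.UTF_8);
        }
        // Too large: store a fixed-size hash instead, marked with the prefix.
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(encodedDimensions);
        return HASH_PREFIX + Base64.getEncoder().encodeToString(digest);
    }
}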

salvatore-campagna commented Jul 28, 2023

This definitely breaks the downsampling persistent task's assumption about _tsid ordering.
See #97557.

If we decide to proceed with this tsid hashing solution we can address the issue in the
downsampling PR by checking whether the source index includes hashed tsids and, if so,
restarting that specific downsampling task from scratch. The idea is that tsid hashing
would be used rarely and, as a result, we would incur this performance penalty rarely too.

assert context.getDynamicMappers().isEmpty() == false
|| context.getDynamicRuntimeFields().isEmpty() == false
|| id.equals(indexRouting.createId(TimeSeriesIdFieldMapper.decodeTsid(tsid), suffix));
// assert context.getDynamicMappers().isEmpty() == false

@salvatore-campagna salvatore-campagna Jul 28, 2023


I need to comment out this assertion, at least for now, because when we hash the tsid we do not serialise dimensions and routing fields, which later results in a failure since createId throws an exception due to the missing routing fields.

Member


Maybe this is a matter of where we do the hashing? Maybe we should alter the TimeSeriesIdBuilder class.
Just hash the timeSeriesId in TimeSeriesIdFieldMapper#postParse(...) after line 148, only if it is beyond the 32KB limit, and store that as a SortedDocValuesField. And just pass down the original id to TsidExtractingIdFieldMapper#createField(...)?
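
Roughly something like this (just a sketch of the suggestion; SortedDocValuesField and BytesRef are the Lucene classes, the hash choice and the surrounding method are assumptions):

import java.security.MessageDigest;
import java.util.Arrays;

import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

final class PostParseSketch {
    private static final int LIMIT = 32 * 1024;

    // Hash only when the encoded tsid exceeds the limit; the original, un-hashed
    // tsid would still be passed down for _id creation.
    static SortedDocValuesField tsidField(BytesRef timeSeriesId) throws Exception {
        BytesRef stored = timeSeriesId;
        if (timeSeriesId.length > LIMIT) {
            byte[] raw = Arrays.copyOfRange(timeSeriesId.bytes, timeSeriesId.offset,
                timeSeriesId.offset + timeSeriesId.length);
            stored = new BytesRef(MessageDigest.getInstance("SHA-256").digest(raw));
        }
        return new SortedDocValuesField("_tsid", stored);
    }
}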

Member


Can this assertion be re-enabled?


@salvatore-campagna salvatore-campagna Nov 27, 2023


indexRouting.createId requires a Map as input... decodeTsid does not return a Map anymore.
I don't think we can have that assertion as it was before. Note that decodeTsid returns a Base64 encoding of the tsid and I see no way to retrieve the id given a Base64 encoding of the tsid. Maybe I can do something using the plain (not base64 encoded) tsid.

@felixbarny

Not saying it's better, but one alternative to hashing only when the tsid is over 32k would be to always hash, making the hash consist of two parts: ${similarityHash(_tsid)}-${hash(_tsid)}.

This would have two potential advantages:

  • Always use the same hashing approach, regardless of the length of the input. This avoids having a rarely executed special case for large _tsids.
  • Potentially does a better job at co-locating similar time series close to one another. IINM, this is desirable as it reduces disk usage. Currently, if the first dimension of two time series differs but the rest of the dimensions are the same, they're not co-located on disk because of the lexicographic sort order based on the _tsid. Using a similarity hash as the first part of the _tsid may not have this limitation.

However, I suppose that this approach would need more testing and tuning than the approach in this PR as it has an impact on all cases, not just edge cases. Not sure what impact this would have on downsampling and its assumptions around ordering. Also, this would change the _tsid of all time series when upgrading from a previous version to that new strategy.

I'll leave it up to you and the team to decide if it's worth investigating that approach.
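
For illustration, the composed value could be built roughly like this (the similarity hash is left abstract; all names here are assumptions, not a concrete proposal):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.List;
import java.util.function.Function;

final class TwoPartTsidSketch {
    // ${similarityHash(_tsid)}-${hash(_tsid)}: the prefix clusters similar time series
    // under the index sort, the suffix keeps the overall value unique.
    static String tsid(List<String> sortedDimensions, Function<List<String>, String> similarityHash)
            throws Exception {
        byte[] raw = String.join(",", sortedDimensions).getBytes(StandardCharsets.UTF_8);
        String prefix = similarityHash.apply(sortedDimensions);
        String suffix = Base64.getEncoder()
            .encodeToString(MessageDigest.getInstance("SHA-256").digest(raw));
        return prefix + "-" + suffix;
    }
}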

@salvatore-campagna

Right now running the cardinality aggregation on one of the dimension fields fails.

@salvatore-campagna

@elasticmachine test this please

@salvatore-campagna

@elasticmachine run elasticsearch-ci/bwc

@salvatore-campagna

@felixbarny do you have something specific in mind for similarityHash?

I understand the goal is to favor clustering of similar tsid but I was wondering if you are thinking of a specific hash function.

@felixbarny

I was thinking of a locality-sensitive hash (LSH). Maybe there's already something in the ES code base that we could re-use, not sure. Maybe something related to nearest neighbor search or maybe we can reuse parts of the MinHash token filter.

I bet folks like @jpountz or @benwtrent would know best what's an appropriate hash function for that use case.
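
As a toy illustration of the general idea (not the Lucene MinHash filter, just a k-slot min-hash signature over the dimension strings):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

final class MinHashSketch {
    // Similar dimension sets agree on many of the k slots, so sorting by this signature
    // tends to place similar time series next to each other.
    static long[] signature(List<String> dimensions, int k) {
        long[] mins = new long[k];
        Arrays.fill(mins, Long.MAX_VALUE);
        for (String dim : dimensions) {
            CRC32 crc = new CRC32();
            crc.update(dim.getBytes(StandardCharsets.UTF_8));
            long base = crc.getValue();
            for (int i = 0; i < k; i++) {
                // Derive k cheap hash functions from one base hash.
                long h = base * (2L * i + 1) + 0x9E3779B97F4A7C15L * i;
                mins[i] = Math.min(mins[i], h);
            }
        }
        return mins;
    }
}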

@felixbarny

Seems like this is somewhat related: apache/lucene#12489


salvatore-campagna commented Aug 30, 2023

@felixbarny I was considering something a bit simpler, for instance generating the first piece of the hash from just a subset of the dimensions. I was considering including in this subset at least the set of fields used in the routing_path, so that co-location is also related to the documents' actual physical shard location.

So something like

hash(routing_path)

putting an upper bound on the number of fields included, at least to deal with the fact that, if I understand correctly, integrations typically use routing_path = dimensions.

So something more like

hash(subset(dimensions, LIMIT))

where LIMIT is some integer value lower than the total number of dimensions.

However, this probably would not work well if the set of fields in the routing_path is small... which means that besides an upper bound I would also need a lower bound.
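
Roughly what I mean (LIMIT, names and the hash choice are placeholders):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

final class SubsetHashSketch {
    // Hash only the first `limit` dimensions (e.g. the routing_path fields), so documents
    // that share those fields also share this part of the hash.
    static byte[] hashSubset(List<String> sortedDimensions, int limit) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (int i = 0; i < sortedDimensions.size() && i < limit; i++) {
            sha.update(sortedDimensions.get(i).getBytes(StandardCharsets.UTF_8));
        }
        return sha.digest();
    }
}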


salvatore-campagna commented Aug 30, 2023

Another reason I was considering the above is that it would allow us, maybe later, to add an optional attribute to dimension fields so that applications can control which dimensions to hash, since applications may know better which time series are used together.

@benwtrent

@felixbarny @salvatore-campagna do we know the tsids ahead of time? I am guessing not.

Could we get away with only hashing PART of the tsid? What constitutes the biggest part of it?

There are various order-preserving hashes out there, I am honestly surprised we don't have one implemented in Elasticsearch yet.

I don't think you want to "cluster" things but instead you want the order of things to be preserved.

A variation on MinHashFilter could work. But I don't think it would be trivial to implement. Probably something that takes inspiration from that, or one of the order-preserving near-perfect hashes out there.


felixbarny commented Aug 31, 2023

> I don't think you want to "cluster" things but instead you want the order of things to be preserved.

Overall, the goal is to reduce storage space for the index data structures. As TSDB has a built-in index sorting on the _tsid field, the order of that field determines which time series are co-located to one another. The assumption is that if you co-locate _tsids that have similar dimensions, the index data structures can be more compact.

I'm not sure if we've validated that assumption. I think it would be useful to check what happens to the storage size when comparing the current implementation where the _tsid is a concatenation of all dimensions (scenario a) vs a simple hash for the full _tsid (scenario b). We'd expect to see less storage needed for the index data structure in scenario a, since in scenario b the distribution of time series is basically random.

However, scenario a is also not perfect: if you have two time series for a metric with 10 dimensions each and only the first dimension differs, the two time series will probably not be co-located, even though the rest of the dimensions may be the same. That's where the idea with the similarity hash comes in: if we can cluster time series whose dimensions are similar to each other, we may see less storage needed for index data structures.

However, I'm not sure if the index sort is enough for that. We may also need to think about the shard _routing. But we need to be careful that this doesn't result in shard hot-spotting in case they're different sets of clusters that have an uneven distribution.


Having said all of that, I can propose an approach that should have similar properties to the current mechanism but allows us to remove the dimension limit. The basic idea is that we construct the _tsid similarly to today but instead of using the full keys and values for the dimensions, we're hashing each key and value individually with a short hash (for example 32 bit hashes like Adler32, Crc32c, or even String#hashCode) on the first n dimensions. N can be a relatively high number like 1024 and we'd still be well under Lucene's 32kb keyword token length limit. To avoid hash collisions, we append a larger hash (like 256 bit, such as sha256) at the end.

Pseudo code:

sort(dimensions)
_tsid = new ByteBuffer()
for (i = 0; i < dimensions.length && i < 1024; i++) {
  _tsid.put(hash32(dimensions[i].key))
  _tsid.put('=')
  _tsid.put(hash32(dimensions[i].value))
  _tsid.put(':')
}
_tsid.put(hash256(dimensions))
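
In Java terms, the sketch could look roughly like this (hash choices and sizes as assumed above; not production code):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Comparator;
import java.util.List;
import java.util.zip.CRC32;

final class PerDimensionHashSketch {
    record Dimension(String key, String value) {}

    static byte[] buildTsid(List<Dimension> dims) throws Exception {
        List<Dimension> sorted = dims.stream().sorted(Comparator.comparing(Dimension::key)).toList();
        int n = Math.min(sorted.size(), 1024);
        // 4 bytes per key hash + 4 per value hash + 2 separator bytes, plus a 32-byte sha256
        // suffix: at most ~10 KB, well under the 32 KB keyword limit.
        ByteBuffer tsid = ByteBuffer.allocate(n * 10 + 32);
        for (int i = 0; i < n; i++) {
            tsid.putInt(crc32(sorted.get(i).key()));
            tsid.put((byte) '=');
            tsid.putInt(crc32(sorted.get(i).value()));
            tsid.put((byte) ':');
        }
        // Full-fidelity hash over all dimensions, appended to avoid collisions.
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        for (Dimension d : sorted) {
            sha256.update(d.key().getBytes(StandardCharsets.UTF_8));
            sha256.update(d.value().getBytes(StandardCharsets.UTF_8));
        }
        tsid.put(sha256.digest());
        return tsid.array();
    }

    private static int crc32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }
}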


salvatore-campagna commented Aug 31, 2023

> I don't think you want to "cluster" things but instead you want the order of things to be preserved.
>
> Overall, the goal is to reduce storage space for the index data structures. As TSDB has a built-in index sorting on the _tsid field, the order of that field determines which time series are co-located to one another. The assumption is that if you co-locate _tsids that have similar dimensions, the index data structures can be more compact.
>
> I'm not sure if we've validated that assumption. I think it would be useful to check what happens to the storage size when comparing the current implementation where the _tsid is a concatenation of all dimensions (scenario a) vs a simple hash for the full _tsid (scenario b). We'd expect to see less storage needed for the index data structure in scenario a, since in scenario b the distribution of time series is basically random.
>
> However, scenario a is also not perfect: if you have two time series for a metric with 10 dimensions each and only the first dimension differs, the two time series will probably not be co-located, even though the rest of the dimensions may be the same. That's where the idea with the similarity hash comes in: if we can cluster time series whose dimensions are similar to each other, we may see less storage needed for index data structures.
>
> However, I'm not sure if the index sort is enough for that. We may also need to think about the shard _routing. But we need to be careful that this doesn't result in shard hot-spotting in case they're different sets of clusters that have an uneven distribution.
>
> Having said all of that, I can propose an approach that should have similar properties to the current mechanism but allows us to remove the dimension limit. The basic idea is that we construct the _tsid similarly to today but instead of using the full keys and values for the dimensions, we're hashing each key and value individually with a short hash (for example 32 bit hashes like Adler32, Crc32c, or even String#hashCode) on the first n dimensions. N can be a relatively high number like 1024 and we'd still be well under Lucene's 32kb keyword token length limit. To avoid hash collisions, we append a larger hash (like 256 bit, such as sha256) at the end.

Pseudo code:

_tsid = ""
for (i = 0; i < dimensions.length && i < 1024; i++) {
  _tsid += hash32(dimensions[i].key)
  _tsid += ':'
  _tsid += hash32(dimensions[i].value)
  _tsid += ';'
}
_tsid += hash256(dimensions)

This is similar to what I was considering above, where the subset of dimensions corresponds to the first 1024 dimensions here (LIMIT =1024).

@felixbarny

The difference is that I was proposing to hash each dimension key and value individually (a), whereas I think you were proposing to take a subset of the dimensions and create a single hash (b) out of that.

In a case where only the last dimension value is different, with approach b the hash is entirely different, but with approach a only the last part will differ. Therefore, approach a potentially does a better job at clustering similar time series next to one another.

But I'm still not sure whether we're chasing ghosts here and whether that clustering actually makes a big difference in storage size.


salvatore-campagna commented Aug 31, 2023

> The difference is that I was proposing to hash each dimension key and value individually (a), whereas I think you were proposing to take a subset of the dimensions and create a single hash (b) out of that.
>
> In a case where only the last dimension value is different, with approach b the hash is entirely different, but with approach a only the last part will differ. Therefore, approach a potentially does a better job at clustering similar time series next to one another.
>
> But I'm still not sure whether we're chasing ghosts here and whether that clustering actually makes a big difference in storage size.

To answer the last part I would like to write the PR so that we have a branch where, by changing a setting, we can test the following:

  1. hash without similarity for _tsid > 32 kb
  2. hash with similarity for _tsid > 32 kb
  3. hash always without similarity
  4. hash always with similarity

@salvatore-campagna salvatore-campagna added the auto-merge-without-approval label Feb 1, 2024
@elasticsearchmachine elasticsearchmachine merged commit bdd3a4f into elastic:main Feb 1, 2024
14 checks passed
@salvatore-campagna salvatore-campagna deleted the fix/93564-time-series-dimensions-limit branch February 1, 2024 13:25
davismcphee added a commit to elastic/kibana that referenced this pull request Feb 2, 2024
## Summary

This PR updates two `data.json.gz` files to remove all doc IDs (which is
unsupported according to
#176109 (comment)),
which started causing failures after tsid hashing was implemented by ES
in elastic/elasticsearch#98023.

Resolves #176103.
Resolves #176102.
Resolves #176100.
Resolves #176107.
Resolves #176108.
Resolves #176109.

### Checklist

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)

### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
fkanout pushed a commit to fkanout/kibana that referenced this pull request Feb 7, 2024
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this pull request Feb 15, 2024
fkanout pushed a commit to fkanout/kibana that referenced this pull request Mar 4, 2024
martijnvg added a commit that referenced this pull request Mar 13, 2024
After tsid hashing was introduced (#98023), the time series aggregator generates the tsid (from all dimension fields) instead of using the value from the _tsid field directly. This generation of the tsid happens for every time series, parent bucket and segment combination.

This change alters that by only generating the tsid once per time series and segment. This is done by just locally recording the current tsid.
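
Conceptually the change amounts to memoizing the last computed value per segment, along these lines (a sketch, not the actual aggregator code):

import java.util.function.LongFunction;
import org.apache.lucene.util.BytesRef;

final class TsidPerSegmentCache {
    private long lastTsidOrd = -1;
    private BytesRef lastTsid;

    // Recompute the tsid only when the ordinal changes within the current segment;
    // consecutive docs of the same time series reuse the cached value.
    BytesRef tsidFor(long tsidOrd, LongFunction<BytesRef> buildTsid) {
        if (tsidOrd != lastTsidOrd) {
            lastTsidOrd = tsidOrd;
            lastTsid = buildTsid.apply(tsidOrd);
        }
        return lastTsid;
    }
}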
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 13, 2024
elasticsearchmachine pushed a commit that referenced this pull request Mar 13, 2024
dnhatn pushed a commit that referenced this pull request Mar 29, 2024
Missing a check on the transport version results in an unreadable cluster state
if it includes a serialized instance of DownsampleShardTaskParams.
#98023 introduced an optional string array holding the dimensions used by time
series indices.
Reading an optional array requires first reading a boolean that indicates whether
the array is present in serialized form. From 8.13 on we try to read that boolean,
but it is not there because older versions write neither the boolean nor the string
array. Here we add a check on the transport version for backward compatibility,
skipping reading the boolean and the array whenever they cannot be present.
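
The fix boils down to gating the read on the transport version, roughly like this (readOptionalStringArray and getTransportVersion are the existing stream methods; the rest is a sketch, with the version constant passed in as an assumption):

import java.io.IOException;
import org.elasticsearch.TransportVersion;
import org.elasticsearch.common.io.stream.StreamInput;

final class ParamsReadSketch {
    // Read the optional dimensions array only when the stream was written by a version
    // that actually serializes it, so cluster states from older versions stay readable.
    static String[] readDimensions(StreamInput in, TransportVersion introducedIn) throws IOException {
        if (in.getTransportVersion().onOrAfter(introducedIn)) {
            return in.readOptionalStringArray();
        }
        return null;
    }
}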

Customers using downsampling might have cluster states including such serialized
objects and would be unable to upgrade to version 8.13. They will be able to
upgrade to any version that includes this fix.

This fix has a side effect #106880
Labels
auto-merge-without-approval, >non-issue, :StorageEngine/TSDB, Team:StorageEngine, v8.13.0