Azure Monitor: adjust grouping logic and avoid duplicating documents to make the metricset TSDB-friendly #36823
Conversation
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
Force-pushed from 60fe7f7 to 432999d
This is a rough draft for a revision of the Azure Metrics grouping logic to make it more TSDB-friendly.
The ID is unique for each metricset collection (a call to the `Fetch()` function).
We can't use a random value for a dimensions field. It would create a new time series on each collection.
The dimension name has a different case in the definition ("containerName") and in the value ("containername") structures. We must pick the name from the definition to avoid losing essential information later used to build the actual field name.
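A minimal sketch of that lookup, assuming a hypothetical `dimensionNameFromDefinition()` helper and plain string slices (the real metricset types differ):

```go
package azure

import "strings"

// dimensionNameFromDefinition returns the dimension name as spelled in the
// metric definition (e.g. "containerName"), matching the value's spelling
// (e.g. "containername") case-insensitively, so the original casing is
// preserved when building the field name.
func dimensionNameFromDefinition(definitionNames []string, valueName string) string {
	for _, name := range definitionNames {
		if strings.EqualFold(name, valueName) {
			return name
		}
	}
	// Fall back to the value's spelling if the definition has no match.
	return valueName
}
```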
Unfortunately, dimensions in the metric definition and metric values are not in the same order.
We keep track of the timestamp/timegrain when we collected a metric value so we know when to collect it again.
Force-pushed from 432999d to 26d365e
Added more context about why we introduced `mapToKeyValuePoints()` instead of making assumptions about timestamp and dimensions. We may consider making these assumptions later to remove this function and the `KeyValuePoint` struct.
Also, removed the `else` clause in the error-check `if` to make the code more readable.
This pull request is now in conflicts. Could you fix it? 🙏
💚 Build Succeeded
Hey @zmoog, should this be backported to older versions? As I remember, we discussed that these changes should be considered a bug fix, not a new feature, right?
Yeah, I remember the conversation about backporting it to 8.11. Two of the three changes are bug fixes; the remaining one is a tweak of an existing feature (metrics grouping). So it qualifies for a backport to 8.11. Shall we proceed?
…to make the metricset TSDB-friendly (#36823)

## Overview

(WHAT) Here is a summary of the changes introduced with this PR.

- Update the metric grouping logic
- Track metrics collection info
- Adjust collection interval

### Update the metric grouping logic

Streamlines the metrics grouping logic to include all the fields the TSDB team identified as dimensions for the Azure Metrics events. Here are the current components of the grouping key:

- timestamp
- namespace
- resource ID
- resource Sub ID
- dimensions
- time grain

It also tries to make the grouping simpler to read.

(WHY) When TSDB is enabled, it drops events with the same timestamp and dimensions. The metricset must group all metric values by timestamp+dimensions and create one event for each group to avoid data loss.

### Track metrics collection info

The metricset tracks the timestamp and time grain for each metrics collection. At the beginning of each iteration, it skips collecting a value if the metricset has already collected a value for the (time grain) period.

(WHY) The metricset usually collects one data point for each collection period. When the time grain is larger than the collection period, the metricset collects the identical data point multiple times. For example, consider a `PT1H` (one hour) time grain and a collection period of five minutes: without tracking, the metricset would collect the identical `PT1H` data point 12 times.

### Adjust collection interval

Change the collection interval to `[{-2 x INTERVAL},{-1 x INTERVAL})` with a delay of `{INTERVAL}`.

(WHY) The collection interval was [2x the collection period or time grain](https://github.com/elastic/beats/blob/ed34c37f59c7bc0cf9e8051f7b5327c861b59467/x-pack/metricbeat/module/azure/client.go#L110-L116). This interval is too large, and we collected multiple data points for the same metric. There was some code to drop the additional data points, but it wasn't working in all cases.

Glossary:

- collection interval: the time range used to fetch metrics values.
- collection period: the time between metric collections (e.g., with a 5 min period, the metricset collects new metrics every 5 minutes).

(cherry picked from commit 886d078)
…d duplicating documents to make the metricset TSDB-friendly (#37177): backport of #36823 (see above); also removes unintentional changelog entries. (cherry picked from commit 886d078) Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
…trics (#40367)

Move the timespan logic into a dedicated `buildTimespan()` function with a test for each supported use case.

Some Azure services have longer latency between service usage and metric availability. For example, the Storage Account capacity metrics (Blob capacity, etc.) have a PT1H time grain and become available after one hour. Service X also has PT1H metrics, which however become available after a few minutes.

This PR restores the core of the [older timespan logic](https://github.com/elastic/beats/blob/d3facc808d2ba293a42b2ad3fc8e21b66c5f2a7f/x-pack/metricbeat/module/azure/client.go#L110-L116) the Azure Monitor metricset was using before the regression introduced by PR #36823. However, `buildTimespan()` does not restore the `interval * (-2)` part, because doubling the interval causes duplicates.
…trics (#40367) (#40414): backport of #40367 (see above). (cherry picked from commit 5fccb0d) Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
…trics (#40367) (#40413): backport of #40367 (see above). (cherry picked from commit 5fccb0d) Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
This is a rough draft for a revision of the Azure Metrics grouping logic to make it more TSDB-friendly.
Overview
(WHAT) Here is a summary of the changes introduced with this PR.
Update the metric grouping logic
Streamlines the metrics grouping logic to include all the fields the TSDB team identified as dimensions for the Azure Metrics events.
Here are the current components of the grouping key:
- timestamp
- namespace
- resource ID
- resource Sub ID
- dimensions
- time grain
It also tries to make the grouping simpler to read.
(WHY)
When TSDB is enabled, it drops events with the same timestamp and dimensions. The metricset must group all metrics values by timestamp+dimensions and create one event for each group to avoid data loss.
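To make the idea concrete, here is a minimal Go sketch of grouping by a composite key. The `keyValuePoint` struct and its field names are illustrative assumptions, not the metricset's actual types:

```go
package azure

import (
	"sort"
	"strings"
	"time"
)

// keyValuePoint is a hypothetical flattened metric value.
type keyValuePoint struct {
	Timestamp     time.Time
	Namespace     string
	ResourceID    string
	ResourceSubID string
	Dimensions    map[string]string
	TimeGrain     string
}

// groupingKey joins all the TSDB dimensions into one key, so every group
// maps to exactly one event and no two events share timestamp+dimensions.
func groupingKey(p keyValuePoint) string {
	dims := make([]string, 0, len(p.Dimensions))
	for name, value := range p.Dimensions {
		dims = append(dims, name+"="+value)
	}
	sort.Strings(dims) // map iteration order is random; sort for stability

	return strings.Join([]string{
		p.Timestamp.UTC().Format(time.RFC3339),
		p.Namespace,
		p.ResourceID,
		p.ResourceSubID,
		strings.Join(dims, ","),
		p.TimeGrain,
	}, "|")
}

// group buckets points by key; the metricset then emits one event per bucket.
func group(points []keyValuePoint) map[string][]keyValuePoint {
	groups := make(map[string][]keyValuePoint)
	for _, p := range points {
		k := groupingKey(p)
		groups[k] = append(groups[k], p)
	}
	return groups
}
```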
Track metrics collection info
The metricset tracks the timestamp and time grain for each metrics collection. At the beginning of each iteration, it skips collecting a value if the metricset has already collected a value for the (time grain) period.
(WHY)
The metricset usually collects one data point for each collection period. When the time grain is larger than the collection period, the metricset collects the identical data point multiple times.
For example, consider a `PT1H` (one hour) time grain and a collection period of five minutes: without tracking, the metricset would collect the identical `PT1H` data point 12 times.
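A rough sketch of that bookkeeping, assuming a simple in-memory map keyed by metric identity (the names here are hypothetical, not the metricset's actual implementation):

```go
package azure

import "time"

// lastCollection remembers when each (metric, time grain) pair was last
// collected; keys are illustrative, e.g. "resourceID|namespace|metric|PT1H".
var lastCollection = map[string]time.Time{}

// needsCollection reports whether a full time grain has elapsed since the
// last collection. With a PT1H grain and a 5-minute collection period, the
// value is fetched once per hour instead of 12 times.
func needsCollection(key string, now time.Time, grain time.Duration) bool {
	last, ok := lastCollection[key]
	if !ok || now.Sub(last) >= grain {
		lastCollection[key] = now
		return true
	}
	return false
}
```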
Adjust collection interval
Change the collection interval to `[{-2 x INTERVAL},{-1 x INTERVAL})` with a delay of `{INTERVAL}`.
(WHY)
The collection interval was 2x the collection period or time grain. This interval is too large, and we collected multiple data points for the same metric. There was some code to drop the additional data points, but it wasn't working in all cases.
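Expressed as code, the interval arithmetic looks roughly like the following. This is only an illustration of the description above (the function name and signature are assumptions, not the metricset's actual code):

```go
package azure

import "time"

// collectionWindow returns the half-open window
// [now - 2*interval, now - interval), i.e. exactly one interval wide and
// delayed by one interval so the data points in it are complete.
func collectionWindow(now time.Time, interval time.Duration) (start, end time.Time) {
	end = now.Add(-interval)   // delay of {INTERVAL}
	start = end.Add(-interval) // window width of exactly one {INTERVAL}
	return start, end
}
```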
Glossary:
- collection interval: the time range used to fetch metrics values.
- collection period: the time between metric collections (e.g., with a 5 min period, the metricset collects new metrics every 5 minutes).
Checklist
- I have made corresponding changes to the default configuration files
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Author's Checklist
How to test this PR locally
Related issues
- `kube_node_status_condition` metric (integrations#7160)
Use cases
Screenshots
Logs