Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure Monitor: adjust grouping logic and avoid duplicating documents to make the metricset TSDB-friendly #36823

Merged
merged 16 commits into from
Nov 21, 2023

Conversation

zmoog
Copy link
Contributor

@zmoog zmoog commented Oct 11, 2023

This is a rough draft for a revision of the Azure Metrics grouping logic to make it more TSDB friendly.

Overview

(WHAT) Here is a summary of the changes introduced with this PR.

  • Update the metric grouping logic
  • Track metrics collection info
  • Adjust collection interval

Update the metric grouping logic

Streamlines the metrics grouping logic to include all the fields the TSDB team identified as dimensions for the Azure Metrics events.

Here are the current components of the grouping key:

  • timestamp
  • namespace
  • resource ID
  • resource Sub ID
  • dimensions
  • time grain

It also tries to make the grouping simpler to read.

WHY:

When TSDB is enabled, it drops events with the same timestamp and dimensions. The metricset must group all metrics values by timestamp+dimensions and create one event for each group to avoid data loss.

Track metrics collection info

The metricset tracks the timestamp and time grain for each metrics collection. At the beginning of each iteration, it skips collecting a value if the metricset has already collected a value for the (time grain) period.

(WHY)

The metricset usually collects one data point for each collection period. When the time grain is larger than the collection period, the metricset collects the identical data point multiple times.

For example, consider a PT1H (one hour) time grain and a collection period of five minutes: without tracking, the metrics would collect the identical PT1H data point 12 times.

Adjust collection interval

Change the collection interval to [{-2 x INTERVAL},{-1 x INTERVAL}) with a delay of {INTERVAL}.

(WHY)

The collection interval was 2x the collection period or time grain. This interval is too large, and we collected multiple data points for the same metric. There was some code to drop the additional data points, but it wasn't working in all cases.

Glossary:

  • collection interval: the time range used to fetch metrics values.
  • collection period: time between metric collections (e.g., with a 5 min period, the metricset collects new metrics every 5 minutes)

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 11, 2023
@zmoog zmoog self-assigned this Oct 11, 2023
@zmoog zmoog added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Oct 11, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 11, 2023
@mergify
Copy link
Contributor

mergify bot commented Oct 11, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @zmoog? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Oct 11, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-11-14T15:53:15.130+0000

  • Duration: 51 min 14 sec

Test stats 🧪

Test Results
Failed 0
Passed 1518
Skipped 96
Total 1614

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

zmoog added 9 commits November 4, 2023 19:01
This is a rough draft for a revision of the Azure Metrics grouping
logic to make it more TSDB friendly.
The ID is unique for each metricset collection, a call to the
`Fetch()` function.
We can't use a random value for a dimensions field. It would create
a new time series on each collection.
The dimension name has a different case in the definition
("containerName") and in the value ("containername") structures.

We must pick the name from the definition to avoid losing an essential
information later used to build the actual field name.
Unfortunately, dimensions in the metric definition and metric values
are not in the same order.
We keep track of the timestamp/timegrain when we collected a metric
value so we know when to collect it again.
@zmoog zmoog force-pushed the zmoog/group-azure-metrics-for-tsdb branch from 432999d to 26d365e Compare November 6, 2023 08:53
zmoog added 2 commits November 6, 2023 10:53
Added more context about why we introduced the `mapToKeyValuePoints()`
vs. make assumptions about timestamp and dimensions.

We may consider making these assumption to remove this function and
the `KeyValuePoint` struct.
@zmoog zmoog changed the title WIP: make the grouping logic TSDB friendly Azure Monitor: adjust grouping logic and avoid duplicating documents to make the metricset TSDB-friendly Nov 6, 2023
@zmoog zmoog marked this pull request as ready for review November 6, 2023 12:26
@zmoog zmoog requested a review from a team as a code owner November 6, 2023 12:26
Also, remove the `else` clause in the error check `if` to make the
code more readable.
Copy link
Contributor

mergify bot commented Nov 20, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b zmoog/group-azure-metrics-for-tsdb upstream/zmoog/group-azure-metrics-for-tsdb
git merge upstream/main
git push upstream zmoog/group-azure-metrics-for-tsdb

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-11-20T23:00:50.445+0000

  • Duration: 57 min 42 sec

Test stats 🧪

Test Results
Failed 0
Passed 1518
Skipped 96
Total 1614

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@zmoog zmoog merged commit 886d078 into elastic:main Nov 21, 2023
8 checks passed
@zmoog zmoog deleted the zmoog/group-azure-metrics-for-tsdb branch November 21, 2023 08:27
@tetianakravchenko
Copy link
Contributor

Hey @zmoog should it be backported to the older versions? As I remember we discussed that those changes should be considered as a bug, not as a new feature, isn't it?

@zmoog
Copy link
Contributor Author

zmoog commented Nov 21, 2023

Yeah, I remember the conversation about backpointing it to 8.11. Two of the three changes are bug fixes; the remaining is a tweak of an existing feature (metrics grouping).

So, it qualifies for a backport to 8.11.

Shall we proceed?

@zmoog zmoog added the backport-v8.11.0 Automated backport with mergify label Nov 22, 2023
mergify bot pushed a commit that referenced this pull request Nov 22, 2023
…to make the metricset TSDB-friendly (#36823)

## Overview

(WHAT) Here is a summary of the changes introduced with this PR.

- Update the metric grouping logic
- Track metrics collection info
- Adjust collection interval

### Update the metric grouping logic

Streamlines the metrics grouping logic to include all the fields the TSDB team identified as dimensions for the Azure Metrics events.

Here are the current components of the grouping key:

- timestamp
- namespace
- resource ID
- resource Sub ID
- dimensions
- time grain

It also tries to make the grouping simpler to read.

(WHY)

When TSDB is enabled, it drops events with the same timestamp and dimensions. The metricset must group all metrics values by timestamp+dimensions and create one event for each group to avoid data loss.

### Track metrics collection info

The metricset tracks the timestamp and time grain for each metrics collection. At the beginning of each iteration, it skips collecting a value if the metricset has already collected a value for the (time grain) period.

(WHY)

The metricset usually collects one data point for each collection period. When the time grain is larger than the collection period, the metricset collects the identical data point multiple times.

For example, consider a `PT1H` (one hour) time grain and a collection period of five minutes: without tracking, the metrics would collect the identical `PT1H` data point 12 times.

### Adjust collection interval

Change the collection interval to `[{-2 x INTERVAL},{-1 x INTERVAL})` with a delay of `{INTERVAL}`.

(WHY)

The collection interval was [2x the collection period or time grain](https://github.com/elastic/beats/blob/ed34c37f59c7bc0cf9e8051f7b5327c861b59467/x-pack/metricbeat/module/azure/client.go#L110-L116). This interval is too large, and we collected multiple data points for the same metric. There was some code to drop the additional data points, but it wasn't working in all cases.

Glossary:
- collection interval: the time range used to fetch metrics values.
- collection period: time between metric collections (e.g., with a 5 min period, the metricset collects new metrics every 5 minutes)

(cherry picked from commit 886d078)
zmoog added a commit that referenced this pull request Nov 28, 2023
…d duplicating documents to make the metricset TSDB-friendly (#37177)

* Azure Monitor: adjust grouping logic and avoid duplicating documents to make the metricset TSDB-friendly (#36823)

## Overview

(WHAT) Here is a summary of the changes introduced with this PR.

- Update the metric grouping logic
- Track metrics collection info
- Adjust collection interval

### Update the metric grouping logic

Streamlines the metrics grouping logic to include all the fields the TSDB team identified as dimensions for the Azure Metrics events.

Here are the current components of the grouping key:

- timestamp
- namespace
- resource ID
- resource Sub ID
- dimensions
- time grain

It also tries to make the grouping simpler to read.

(WHY)

When TSDB is enabled, it drops events with the same timestamp and dimensions. The metricset must group all metrics values by timestamp+dimensions and create one event for each group to avoid data loss.

### Track metrics collection info

The metricset tracks the timestamp and time grain for each metrics collection. At the beginning of each iteration, it skips collecting a value if the metricset has already collected a value for the (time grain) period.

(WHY)

The metricset usually collects one data point for each collection period. When the time grain is larger than the collection period, the metricset collects the identical data point multiple times.

For example, consider a `PT1H` (one hour) time grain and a collection period of five minutes: without tracking, the metrics would collect the identical `PT1H` data point 12 times.

### Adjust collection interval

Change the collection interval to `[{-2 x INTERVAL},{-1 x INTERVAL})` with a delay of `{INTERVAL}`.

(WHY)

The collection interval was [2x the collection period or time grain](https://github.com/elastic/beats/blob/ed34c37f59c7bc0cf9e8051f7b5327c861b59467/x-pack/metricbeat/module/azure/client.go#L110-L116). This interval is too large, and we collected multiple data points for the same metric. There was some code to drop the additional data points, but it wasn't working in all cases.

Glossary:
- collection interval: the time range used to fetch metrics values.
- collection period: time between metric collections (e.g., with a 5 min period, the metricset collects new metrics every 5 minutes)

(cherry picked from commit 886d078)

* Remove unintentional changelog entries

---------

Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
…to make the metricset TSDB-friendly (elastic#36823)


## Overview

(WHAT) Here is a summary of the changes introduced with this PR.

- Update the metric grouping logic
- Track metrics collection info
- Adjust collection interval 

### Update the metric grouping logic

Streamlines the metrics grouping logic to include all the fields the TSDB team identified as dimensions for the Azure Metrics events. 

Here are the current components of the grouping key:

- timestamp
- namespace
- resource ID
- resource Sub ID
- dimensions
- time grain

It also tries to make the grouping simpler to read. 

(WHY)

When TSDB is enabled, it drops events with the same timestamp and dimensions. The metricset must group all metrics values by timestamp+dimensions and create one event for each group to avoid data loss.

### Track metrics collection info

The metricset tracks the timestamp and time grain for each metrics collection. At the beginning of each iteration, it skips collecting a value if the metricset has already collected a value for the (time grain) period.

(WHY)

The metricset usually collects one data point for each collection period. When the time grain is larger than the collection period, the metricset collects the identical data point multiple times.

For example, consider a `PT1H` (one hour) time grain and a collection period of five minutes: without tracking, the metrics would collect the identical `PT1H` data point 12 times.


### Adjust collection interval 

Change the collection interval to `[{-2 x INTERVAL},{-1 x INTERVAL})` with a delay of `{INTERVAL}`. 

(WHY)

The collection interval was [2x the collection period or time grain](https://github.com/elastic/beats/blob/ed34c37f59c7bc0cf9e8051f7b5327c861b59467/x-pack/metricbeat/module/azure/client.go#L110-L116). This interval is too large, and we collected multiple data points for the same metric. There was some code to drop the additional data points, but it wasn't working in all cases. 

Glossary:
- collection interval: the time range used to fetch metrics values.
- collection period: time between metric collections (e.g., with a 5 min period, the metricset collects new metrics every 5 minutes)
zmoog added a commit that referenced this pull request Aug 1, 2024
…trics (#40367)

Move the timespan logic into a dedicated `buildTimespan()` function with a test for each supported use case. 

Some Azure services have longer latency between service usage and metric availability. For example, the Storage Account capacity metrics (Blob capacity, etc.) have a PT1H time grain and become available after one hour. Service X also has PT1H metrics, however become available after a few minutes.

This PR restores the core of the [older timespan logic](https://github.com/elastic/beats/blob/d3facc808d2ba293a42b2ad3fc8e21b66c5f2a7f/x-pack/metricbeat/module/azure/client.go#L110-L116) the Azure Monitor metricset was using before the regression introduced by the PR #36823. 

However, the `buildTimespan()` does not restore the `interval * (-2)` part because doubling the interval causes duplicates.
mergify bot pushed a commit that referenced this pull request Aug 1, 2024
…trics (#40367)

Move the timespan logic into a dedicated `buildTimespan()` function with a test for each supported use case.

Some Azure services have longer latency between service usage and metric availability. For example, the Storage Account capacity metrics (Blob capacity, etc.) have a PT1H time grain and become available after one hour. Service X also has PT1H metrics, however become available after a few minutes.

This PR restores the core of the [older timespan logic](https://github.com/elastic/beats/blob/d3facc808d2ba293a42b2ad3fc8e21b66c5f2a7f/x-pack/metricbeat/module/azure/client.go#L110-L116) the Azure Monitor metricset was using before the regression introduced by the PR #36823.

However, the `buildTimespan()` does not restore the `interval * (-2)` part because doubling the interval causes duplicates.

(cherry picked from commit 5fccb0d)
mergify bot pushed a commit that referenced this pull request Aug 1, 2024
…trics (#40367)

Move the timespan logic into a dedicated `buildTimespan()` function with a test for each supported use case.

Some Azure services have longer latency between service usage and metric availability. For example, the Storage Account capacity metrics (Blob capacity, etc.) have a PT1H time grain and become available after one hour. Service X also has PT1H metrics, however become available after a few minutes.

This PR restores the core of the [older timespan logic](https://github.com/elastic/beats/blob/d3facc808d2ba293a42b2ad3fc8e21b66c5f2a7f/x-pack/metricbeat/module/azure/client.go#L110-L116) the Azure Monitor metricset was using before the regression introduced by the PR #36823.

However, the `buildTimespan()` does not restore the `interval * (-2)` part because doubling the interval causes duplicates.

(cherry picked from commit 5fccb0d)
zmoog added a commit that referenced this pull request Aug 1, 2024
…trics (#40367) (#40414)

Move the timespan logic into a dedicated `buildTimespan()` function with a test for each supported use case.

Some Azure services have longer latency between service usage and metric availability. For example, the Storage Account capacity metrics (Blob capacity, etc.) have a PT1H time grain and become available after one hour. Service X also has PT1H metrics, however become available after a few minutes.

This PR restores the core of the [older timespan logic](https://github.com/elastic/beats/blob/d3facc808d2ba293a42b2ad3fc8e21b66c5f2a7f/x-pack/metricbeat/module/azure/client.go#L110-L116) the Azure Monitor metricset was using before the regression introduced by the PR #36823.

However, the `buildTimespan()` does not restore the `interval * (-2)` part because doubling the interval causes duplicates.

(cherry picked from commit 5fccb0d)

Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
zmoog added a commit that referenced this pull request Aug 1, 2024
…trics (#40367) (#40413)

Move the timespan logic into a dedicated `buildTimespan()` function with a test for each supported use case.

Some Azure services have longer latency between service usage and metric availability. For example, the Storage Account capacity metrics (Blob capacity, etc.) have a PT1H time grain and become available after one hour. Service X also has PT1H metrics, however become available after a few minutes.

This PR restores the core of the [older timespan logic](https://github.com/elastic/beats/blob/d3facc808d2ba293a42b2ad3fc8e21b66c5f2a7f/x-pack/metricbeat/module/azure/client.go#L110-L116) the Azure Monitor metricset was using before the regression introduced by the PR #36823.

However, the `buildTimespan()` does not restore the `interval * (-2)` part because doubling the interval causes duplicates.

(cherry picked from commit 5fccb0d)

Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.11.0 Automated backport with mergify Team:Cloud-Monitoring Label for the Cloud Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants