Azure Monitor: adjust grouping logic and avoid duplicating documents to make the metricset TSDB-friendly #36823
Conversation
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
Force-pushed from 60fe7f7 to 432999d
This is a rough draft for a revision of the Azure Metrics grouping logic to make it more TSDB-friendly.
The ID is unique for each metricset collection (a call to the `Fetch()` function).
We can't use a random value for a dimensions field. It would create a new time series on each collection.
The dimension name has a different case in the definition ("containerName") and in the value ("containername") structures. We must pick the name from the definition to avoid losing essential information later used to build the actual field name.
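A minimal sketch of that lookup, assuming a hypothetical `dimensionNameFromDefinition()` helper and plain string slices (the real metricset types differ):

```go
package azure

import "strings"

// dimensionNameFromDefinition returns the dimension name as spelled in the
// metric definition (e.g. "containerName"), matching the value's spelling
// (e.g. "containername") case-insensitively, so the original casing is
// preserved when building the field name.
func dimensionNameFromDefinition(definitionNames []string, valueName string) string {
	for _, name := range definitionNames {
		if strings.EqualFold(name, valueName) {
			return name
		}
	}
	// Fall back to the value's spelling if the definition has no match.
	return valueName
}
```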
Unfortunately, dimensions in the metric definition and metric values are not in the same order.
We keep track of the timestamp/timegrain when we collected a metric value so we know when to collect it again.
Force-pushed from 432999d to 26d365e
Added more context about why we introduced `mapToKeyValuePoints()` instead of making assumptions about timestamp and dimensions. We may consider making these assumptions later to remove this function and the `KeyValuePoint` struct.
Also, removed the `else` clause in the error-check `if` to make the code more readable.
This pull request is now in conflicts. Could you fix it? 🙏
💚 Build Succeeded
Hey @zmoog, should this be backported to older versions? As I remember, we discussed that these changes should be considered a bug fix, not a new feature, right?
Yeah, I remember the conversation about backporting it to 8.11. Two of the three changes are bug fixes; the remaining one is a tweak of an existing feature (metrics grouping). So it qualifies for a backport to 8.11. Shall we proceed?
…to make the metricset TSDB-friendly (#36823)

## Overview

(WHAT) Here is a summary of the changes introduced with this PR.

- Update the metric grouping logic
- Track metrics collection info
- Adjust collection interval

### Update the metric grouping logic

Streamlines the metrics grouping logic to include all the fields the TSDB team identified as dimensions for the Azure Metrics events. Here are the current components of the grouping key:

- timestamp
- namespace
- resource ID
- resource Sub ID
- dimensions
- time grain

It also tries to make the grouping simpler to read.

(WHY) When TSDB is enabled, it drops events with the same timestamp and dimensions. The metricset must group all metric values by timestamp+dimensions and create one event for each group to avoid data loss.

### Track metrics collection info

The metricset tracks the timestamp and time grain for each metrics collection. At the beginning of each iteration, it skips collecting a value if the metricset has already collected a value for the (time grain) period.

(WHY) The metricset usually collects one data point for each collection period. When the time grain is larger than the collection period, the metricset collects the identical data point multiple times. For example, consider a `PT1H` (one hour) time grain and a collection period of five minutes: without tracking, the metricset would collect the identical `PT1H` data point 12 times.

### Adjust collection interval

Change the collection interval to `[{-2 x INTERVAL},{-1 x INTERVAL})` with a delay of `{INTERVAL}`.

(WHY) The collection interval was [2x the collection period or time grain](https://github.com/elastic/beats/blob/ed34c37f59c7bc0cf9e8051f7b5327c861b59467/x-pack/metricbeat/module/azure/client.go#L110-L116). This interval is too large, and we collected multiple data points for the same metric. There was some code to drop the additional data points, but it wasn't working in all cases.

Glossary:

- collection interval: the time range used to fetch metrics values.
- collection period: the time between metric collections (e.g., with a 5 min period, the metricset collects new metrics every 5 minutes).

(cherry picked from commit 886d078)
…d duplicating documents to make the metricset TSDB-friendly (#37177): backport of #36823 (see above); also removes unintentional changelog entries. (cherry picked from commit 886d078) Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
…trics (#40367)

Move the timespan logic into a dedicated `buildTimespan()` function with a test for each supported use case.

Some Azure services have longer latency between service usage and metric availability. For example, the Storage Account capacity metrics (Blob capacity, etc.) have a PT1H time grain and become available after one hour. Service X also has PT1H metrics, which however become available after a few minutes.

This PR restores the core of the [older timespan logic](https://github.com/elastic/beats/blob/d3facc808d2ba293a42b2ad3fc8e21b66c5f2a7f/x-pack/metricbeat/module/azure/client.go#L110-L116) the Azure Monitor metricset was using before the regression introduced by PR #36823. However, `buildTimespan()` does not restore the `interval * (-2)` part, because doubling the interval causes duplicates.
…trics (#40367) (#40414): backport of #40367 (see above). (cherry picked from commit 5fccb0d) Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
…trics (#40367) (#40413): backport of #40367 (see above). (cherry picked from commit 5fccb0d) Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
This is a rough draft for a revision of the Azure Metrics grouping logic to make it more TSDB-friendly.
Overview
(WHAT) Here is a summary of the changes introduced with this PR.
Update the metric grouping logic
Streamlines the metrics grouping logic to include all the fields the TSDB team identified as dimensions for the Azure Metrics events.
Here are the current components of the grouping key:
- timestamp
- namespace
- resource ID
- resource Sub ID
- dimensions
- time grain
It also tries to make the grouping simpler to read.
(WHY)
When TSDB is enabled, it drops events with the same timestamp and dimensions. The metricset must group all metrics values by timestamp+dimensions and create one event for each group to avoid data loss.
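To make the idea concrete, here is a minimal Go sketch of grouping by a composite key. The `keyValuePoint` struct and its field names are illustrative assumptions, not the metricset's actual types:

```go
package azure

import (
	"sort"
	"strings"
	"time"
)

// keyValuePoint is a hypothetical flattened metric value.
type keyValuePoint struct {
	Timestamp     time.Time
	Namespace     string
	ResourceID    string
	ResourceSubID string
	Dimensions    map[string]string
	TimeGrain     string
}

// groupingKey joins all the TSDB dimensions into one key, so every group
// maps to exactly one event and no two events share timestamp+dimensions.
func groupingKey(p keyValuePoint) string {
	dims := make([]string, 0, len(p.Dimensions))
	for name, value := range p.Dimensions {
		dims = append(dims, name+"="+value)
	}
	sort.Strings(dims) // map iteration order is random; sort for stability

	return strings.Join([]string{
		p.Timestamp.UTC().Format(time.RFC3339),
		p.Namespace,
		p.ResourceID,
		p.ResourceSubID,
		strings.Join(dims, ","),
		p.TimeGrain,
	}, "|")
}

// group buckets points by key; the metricset then emits one event per bucket.
func group(points []keyValuePoint) map[string][]keyValuePoint {
	groups := make(map[string][]keyValuePoint)
	for _, p := range points {
		k := groupingKey(p)
		groups[k] = append(groups[k], p)
	}
	return groups
}
```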
Track metrics collection info
The metricset tracks the timestamp and time grain for each metrics collection. At the beginning of each iteration, it skips collecting a value if the metricset has already collected a value for the (time grain) period.
(WHY)
The metricset usually collects one data point for each collection period. When the time grain is larger than the collection period, the metricset collects the identical data point multiple times.
For example, consider a `PT1H` (one hour) time grain and a collection period of five minutes: without tracking, the metricset would collect the identical `PT1H` data point 12 times.
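A rough sketch of that bookkeeping, assuming a simple in-memory map keyed by metric identity (the names here are hypothetical, not the metricset's actual implementation):

```go
package azure

import "time"

// lastCollection remembers when each (metric, time grain) pair was last
// collected; keys are illustrative, e.g. "resourceID|namespace|metric|PT1H".
var lastCollection = map[string]time.Time{}

// needsCollection reports whether a full time grain has elapsed since the
// last collection. With a PT1H grain and a 5-minute collection period, the
// value is fetched once per hour instead of 12 times.
func needsCollection(key string, now time.Time, grain time.Duration) bool {
	last, ok := lastCollection[key]
	if !ok || now.Sub(last) >= grain {
		lastCollection[key] = now
		return true
	}
	return false
}
```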
Adjust collection interval
Change the collection interval to `[{-2 x INTERVAL},{-1 x INTERVAL})` with a delay of `{INTERVAL}`.
(WHY)
The collection interval was 2x the collection period or time grain. This interval is too large, and we collected multiple data points for the same metric. There was some code to drop the additional data points, but it wasn't working in all cases.
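Expressed as code, the interval arithmetic looks roughly like the following. This is only an illustration of the description above (the function name and signature are assumptions, not the metricset's actual code):

```go
package azure

import "time"

// collectionWindow returns the half-open window
// [now - 2*interval, now - interval), i.e. exactly one interval wide and
// delayed by one interval so the data points in it are complete.
func collectionWindow(now time.Time, interval time.Duration) (start, end time.Time) {
	end = now.Add(-interval)   // delay of {INTERVAL}
	start = end.Add(-interval) // window width of exactly one {INTERVAL}
	return start, end
}
```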
Glossary:
- collection interval: the time range used to fetch metrics values.
- collection period: the time between metric collections (e.g., with a 5 min period, the metricset collects new metrics every 5 minutes).
Checklist
- I have made corresponding changes to the default configuration files
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Author's Checklist
How to test this PR locally
Related issues
- `kube_node_status_condition` metric (integrations#7160)
Use cases
Screenshots
Logs