feature/add-kafka-monitoring #625
Conversation
Walkthrough

The pull request implements extensive updates across Kafka monitoring and deployment configurations. It introduces a new Grafana dashboard for Strimzi Kafka, updates Helm charts and version tracking, and enhances various Kubernetes CRDs with new fields and features, such as CSI volume support and offset management. Additionally, RBAC rules, deployment templates, image mappings, and Prometheus alert rules have been improved. These changes aim to improve observability, configuration flexibility, and resource management within the Kafka operator ecosystem.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant U as User
    participant G as Grafana Dashboard
    participant P as Prometheus
    participant K as Kafka Pod
    participant VS as VMPodScrape
    U->>G: Open Kafka dashboard
    G->>P: Query metrics for brokers, controllers, partitions
    P->>VS: Trigger pod metric scraping
    VS->>K: Fetch JMX metrics via ConfigMap settings
    K-->>P: Return Prometheus metrics
    P-->>G: Deliver queried metrics
    G-->>U: Display updated dashboard
```
Actionable comments posted: 0
🧹 Nitpick comments (27)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/060-Deployment-strimzi-cluster-operator.yaml (1)
143-146: Review Conditional Logic for New Pod Disruption Budget Environment Variable

A new conditional block has been introduced to set the environment variable `STRIMZI_POD_DISRUPTION_BUDGET_GENERATION`, which is only included when `.Values.generatePodDisruptionBudget` is not equal to `true`. Please verify that this is the intended behavior. If the aim is to override or provide a specific configuration when pod disruption budget generation is disabled, this condition is correct. However, to improve clarity and safeguard against potential type or evaluation issues, consider one of the following refinements:

- Explicit boolean check: if `.Values.generatePodDisruptionBudget` is a boolean, you might use an explicit inverse condition such as `if not .Values.generatePodDisruptionBudget` to clearly indicate that the variable should be set when budget generation is disabled.
- Comment clarification: adding a YAML comment above this block to explain the use case can aid future maintainers in quickly understanding the condition.
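A minimal sketch of the suggested explicit inverse check (a hypothetical excerpt, not the chart's exact code; the value `"false"` shown here is illustrative):

```yaml
# Hypothetical excerpt from the deployment template: include the env var
# only when pod disruption budget generation is disabled.
{{- if not .Values.generatePodDisruptionBudget }}
- name: STRIMZI_POD_DISRUPTION_BUDGET_GENERATION
  value: "false"
{{- end }}
```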
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml (1)
5-5: Fix trailing whitespace.

Remove the trailing space at the end of line 5:

```diff
-{{- $watchNamespaces = list -}} 
+{{- $watchNamespaces = list -}}
```

🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 5-5: trailing spaces
(trailing-spaces)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-RoleBinding-strimzi-cluster-operator.yaml (1)
5-5: Trailing Whitespace Cleanup

A YAMLlint warning indicates trailing spaces on this line. Please remove any extraneous spaces to maintain clean formatting.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 5-5: trailing spaces
(trailing-spaces)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/031-RoleBinding-strimzi-cluster-operator-entity-operator-delegation.yaml (1)
1-38: Refined Namespace Selection in Entity Operator RoleBinding

Similar to the other RoleBinding templates, the introduction of `$watchNamespaces` with the conditional reset (when `watchAnyNamespace` is true) enhances the RBAC configuration's consistency across different templates. Please remove the trailing spaces reported on line 5 to ensure proper YAML formatting.

🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
[error] 5-5: trailing spaces
(trailing-spaces)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-ClusterRole-strimzi-cluster-operator-role.yaml (1)
82-89: New Deletion Permission for KafkaRebalances

A new rule granting delete permissions on the `kafkarebalances` resource has been added, which supports the operator's auto-rebalancing workflow. However, YAMLlint warns that the indentation for the list items on lines 86 and 88 is incorrect (expected 2 spaces but found 4). Please adjust the indentation as shown below to maintain consistency:

```diff
@@
-    - kafkarebalances
+  - kafkarebalances
@@
-    - delete
+  - delete
```

🧰 Tools
🪛 YAMLlint (1.35.1)
[warning] 86-86: wrong indentation: expected 2 but found 4
(indentation)
[warning] 88-88: wrong indentation: expected 2 but found 4
(indentation)
packages/system/kafka-operator/charts/strimzi-kafka-operator/values.yaml (2)
183-185: Enable Pod Disruption Budget Generation

The changes introduce explicit configuration for `connectBuildTimeoutMs` (which remains at 300000) and, more importantly, the new option `generatePodDisruptionBudget: true`. This provides enhanced control over pod availability during disruptions. Make sure that the corresponding deployment templates (which conditionally set environment variables based on this flag) are updated and tested.

🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 185-185: no new line character at the end of file
(new-line-at-end-of-file)
185-185: Ensure Newline at End of File

The static analysis tool reported a missing newline at the end of the file. Adding a newline will improve compatibility with YAML linters and editors.

🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 185-185: no new line character at the end of file
(new-line-at-end-of-file)
hack/download-dashboards.sh (1)
83-83: Add Strimzi Kafka Dashboard Reference

The new dashboard reference for the Strimzi Kafka dashboard (`modules/340-monitoring-kubernetes/monitoring/grafana-dashboards//kafka/strimzi-kafka.json`) has been added. Note the double slash (`//`) before `kafka`; please verify whether this is intentional or if it should be a single slash to avoid potential file path issues.

packages/apps/kafka/templates/metrics-configmap.yaml (1)
1-5: Initialize New Metrics ConfigMap

A new ConfigMap is created for Kafka and Zookeeper metrics configurations. The metadata section using the Helm template variable (`{{ .Release.Name }}-metrics`) looks appropriate; however, consider quoting the name value (e.g., `"{{ .Release.Name }}-metrics"`) if YAML lint issues persist in certain environments.

🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 4-4: syntax error: expected , but found ''
(syntax)
packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/047-Crd-kafkaconnector.yaml (1)
95-118: Introduce Offset Management Configuration

The new `listOffsets` and `alterOffsets` properties extend the CRD schema to support offset management via ConfigMaps. The structure and required fields are clear. It may be helpful to add examples or further documentation in the CRD's description to guide users on how to configure these settings.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/049-Crd-kafkarebalance.yaml (1)
105-120: Define moveReplicasOffVolumes Structure

The new property `moveReplicasOffVolumes` outlines an array of objects with `brokerId` and `volumeIds`. The minimum item restrictions help ensure valid configuration. Consider including an example snippet in the CRD documentation to clarify its usage.

packages/system/kafka-operator/templates/prometheus-rules.yaml (1)
134-152: Entity Operator Alerts – Pod Health Monitoring

The alerts for the entity operator (TopicOperatorContainerDown and UserOperatorContainerDown) correctly capture critical issues related to the operator pods. The expressions are clear; consider whether a longer "for" duration might help avoid firing on transient issues.

packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md (2)
9-16: Migration and Deprecation Notice Update

The revised section clearly states that Strimzi 0.45 is the final version to support ZooKeeper-based clusters and MirrorMaker 1 deployments, and it urges migration to KRaft and MirrorMaker 2. The deprecation of the EnvVar Configuration Provider and MirrorMaker 2 Extensions is concisely noted. This update is clear and aligns with the migration objectives.
68-68: Markdown Unordered List Style

Static analysis suggests using asterisks (`*`) for unordered lists rather than dashes. Consider updating the prerequisites list (line 68) for consistency with the project's markdown style guidelines.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
68-68: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
packages/system/kafka-operator/charts/strimzi-kafka-operator/files/grafana-dashboards/strimzi-kraft.json (2)
1076-1158: New "Current Raft State" Panel Addition

A new panel titled "Current Raft State" has been added. It queries `max(kafka_server_raftmetrics_current_state) by (kubernetes_pod_name, current_state)` and is configured as a table with transformations to rename fields. This addition will improve KRaft observability. Please verify that the query returns data in the expected format and that the transformation options align with your monitoring needs.

1224-1440: Revised Grid Positions for Leadership Panels

The grid positions for the "Current Quorum Leader", "Current Voted Leader", and "Current Quorum Epoch" panels have been updated (to widths of 6 instead of 8, with adjusted x-coordinates) to improve dashboard layout and visual coherence. Ensure that the new arrangement fits well with other panels and that there is no overlap or excessive white space.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/046-Crd-kafkabridge.yaml (2)
1192-1224: Addition of "valueFrom" in bridgeContainer Environment Variables

The environment variable specification for the bridge container now supports a `valueFrom` field. This allows sourcing environment variable values from a Secret or a ConfigMap. The use of a oneOf validation (ensuring either a literal `value` or a `valueFrom` is provided) is a good design choice.
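For illustration, an environment variable sourced via `valueFrom` could look like the following sketch (the container name, Secret, and ConfigMap names are hypothetical; only one of `value` or `valueFrom` may be set per entry):

```yaml
template:
  bridgeContainer:
    env:
      # Literal value (mutually exclusive with valueFrom).
      - name: LOG_LEVEL
        value: "INFO"
      # Sourced from a Secret key (hypothetical secret name/key).
      - name: BRIDGE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: bridge-credentials
            key: password
      # Sourced from a ConfigMap key (hypothetical names).
      - name: EXTRA_CONFIG
        valueFrom:
          configMapKeyRef:
            name: bridge-config
            key: extra.properties
```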
1371-1413: Addition of "valueFrom" in initContainer Environment Variables

Similarly, the init container's env section now includes support for a `valueFrom` property. This change enhances flexibility in provisioning dynamic configurations during container startup. Verify that the oneOf construct correctly enforces the required fields and that both secretKeyRef and configMapKeyRef configurations work as expected.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/045-Crd-kafkamirrormaker.yaml (1)
1335-1368: Review of Added `valueFrom` Property for mirrorMakerContainer Environment Variables

A new `valueFrom` property has been introduced under the container's environment variable configuration. It supports sourcing values from either a Secret (`secretKeyRef`) or a ConfigMap (`configMapKeyRef`), wrapped in a `oneOf` clause to enforce mutual exclusivity. This enhances the flexibility of configuring environment variables. Please verify that this pattern is consistent with similar configurations elsewhere in your codebase, and consider adding notes to help users understand when to use each reference type.

dashboards/kafka/strimzi-kafka.json (1)
1-2941: Overall New Grafana Dashboard for Strimzi Kafka

This new JSON file introduces a comprehensive Grafana dashboard for monitoring a Strimzi Kafka cluster. The structure includes well-defined sections for annotations, panels, templating, and refresh settings. Key highlights:

- Panels and queries: multiple panels display metrics (e.g., broker status, controller activity, replica counts, network and disk metrics) via Prometheus queries that use dynamic variables (such as `$kubernetes_namespace`, `$strimzi_cluster_name`, `$kafka_broker`, etc.). Ensure that these Prometheus expressions return the expected time series data in your environment.
- Templating: the templating variables (datasource, namespace, cluster name, broker, topic, and partition) are correctly defined. Verify that the regular expressions used to extract label values match your actual metrics.
- Refresh and layout: with a refresh rate of 5 seconds and careful grid positioning, the dashboard appears designed to provide near-real-time insights without overwhelming the backend.
- Usability: since this dashboard is newly added, consider including user documentation (or inline panel descriptions) to guide users on interpreting each metric.

Overall, the JSON is well organized and consistent with Grafana dashboard best practices. If the queries have been validated in your environment, no further changes are required.
packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/041-Crd-kafkaconnect.yaml (3)
1106-1132: CSI Volume Type Addition in Volumes

The new `csi` block is added as an option for volume configuration. This change increases flexibility by allowing CSI-specified storage options. Please double-check that its properties (driver, fsType, nodePublishSecretRef, readOnly, and volumeAttributes) match the expected schema and align with similar changes in related CRDs.
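A hedged sketch of what a `csi` volume entry in a pod template could look like (the driver name and volume attributes are illustrative, not taken from this PR):

```yaml
template:
  pod:
    volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io   # illustrative CSI driver name
          readOnly: true
          volumeAttributes:
            secretProviderClass: my-provider # illustrative driver-specific attribute
```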
1214-1247: Enhanced Environment Variable Reference in Connect Container

Introducing the `valueFrom` property under the connect container's environment variable configuration enables externalizing secrets or configuration values via Kubernetes Secrets or ConfigMaps. Ensure that documentation clearly explains both the usage and limitations for users and that the oneOf schema correctly enforces a single source reference.

1357-1390: Enhanced Environment Variable Reference in Init Container

Similar to the connect container, the init container now supports specifying environment variables via `valueFrom` (using either a Secret or ConfigMap). Verify that this change is consistent with other CRDs and that the resulting behavior is well documented for operators.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/048-Crd-kafkamirrormaker2.yaml (4)
2332-2358: CSI Volume Support in Pod Template for MirrorMaker2

A new `csi` option has been added to the volumes section within the pod template. This addition brings MirrorMaker2's volume configuration in line with other components. Please verify that:

- the set of properties (`driver`, `fsType`, `nodePublishSecretRef`, `readOnly`, `volumeAttributes`) conforms to the CSI VolumeSource spec,
- the oneOf clause (ensuring that only one volume source is used) remains valid.

1431-1463: Dynamic Environment Variable Sourcing in Connect Container

The connect container now supports specifying environment variable values using a `valueFrom` block that allows either a Secret or a ConfigMap reference. Make sure that the oneOf condition correctly enforces that only one of these is provided and that users are aware of this new option in the operator's documentation.

1574-1607: Dynamic Environment Variable Sourcing in Init Container

The init container has also been updated to include the new `valueFrom` approach in its environment variable definitions. This change provides consistency throughout the pods and improves configuration flexibility. Please ensure compatibility with any consuming scripts or automated tests.

2378-2410: Dynamic Environment Variable Sourcing in Build Container

The build container now allows environment variables to be set via `valueFrom`, similar to the other container specifications. This enables users to source configuration or secrets dynamically during the build process. Verify that the schema's oneOf clause functions as intended and that the change is documented in the build configuration guidelines.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (26)

- dashboards/kafka/strimzi-kafka.json (1 hunks)
- hack/download-dashboards.sh (1 hunks)
- packages/apps/kafka/Chart.yaml (1 hunks)
- packages/apps/kafka/templates/kafka.yaml (2 hunks)
- packages/apps/kafka/templates/metrics-configmap.yaml (1 hunks)
- packages/apps/kafka/templates/podscrape.yaml (1 hunks)
- packages/apps/versions_map (1 hunks)
- packages/extra/monitoring/dashboards.list (1 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/Chart.yaml (2 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md (5 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml (19 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/041-Crd-kafkaconnect.yaml (8 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/045-Crd-kafkamirrormaker.yaml (2 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/046-Crd-kafkabridge.yaml (3 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/047-Crd-kafkaconnector.yaml (1 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/048-Crd-kafkamirrormaker2.yaml (10 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/049-Crd-kafkarebalance.yaml (3 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/files/grafana-dashboards/strimzi-kraft.json (4 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml (1 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-ClusterRole-strimzi-cluster-operator-role.yaml (1 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-RoleBinding-strimzi-cluster-operator.yaml (1 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/031-RoleBinding-strimzi-cluster-operator-entity-operator-delegation.yaml (1 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/060-Deployment-strimzi-cluster-operator.yaml (1 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/_kafka_image_map.tpl (1 hunks)
- packages/system/kafka-operator/charts/strimzi-kafka-operator/values.yaml (3 hunks)
- packages/system/kafka-operator/templates/prometheus-rules.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- packages/apps/kafka/Chart.yaml
- packages/system/kafka-operator/charts/strimzi-kafka-operator/Chart.yaml
🧰 Additional context used
🪛 YAMLlint (1.35.1)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml
[error] 5-5: trailing spaces
(trailing-spaces)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/031-RoleBinding-strimzi-cluster-operator-entity-operator-delegation.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
[error] 5-5: trailing spaces
(trailing-spaces)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-ClusterRole-strimzi-cluster-operator-role.yaml
[warning] 86-86: wrong indentation: expected 2 but found 4
(indentation)
[warning] 88-88: wrong indentation: expected 2 but found 4
(indentation)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-RoleBinding-strimzi-cluster-operator.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
[error] 5-5: trailing spaces
(trailing-spaces)
packages/apps/kafka/templates/podscrape.yaml
[warning] 10-10: wrong indentation: expected 8 but found 6
(indentation)
[error] 4-4: syntax error: expected , but found ''
(syntax)
packages/system/kafka-operator/charts/strimzi-kafka-operator/values.yaml
[error] 185-185: no new line character at the end of file
(new-line-at-end-of-file)
packages/apps/kafka/templates/metrics-configmap.yaml
[error] 4-4: syntax error: expected , but found ''
(syntax)
🪛 LanguageTool
packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md
[style] ~34-~34: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...fka® cluster via the HTTP protocol. * Includes Cruise Control - Automates the proces...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🪛 markdownlint-cli2 (0.17.2)
packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md
68-68: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
🔇 Additional comments (44)
packages/extra/monitoring/dashboards.list (1)
37-37: LGTM! The Strimzi Kafka dashboard has been added.

The addition aligns with the PR's objective of implementing Kafka monitoring and maintains the list's alphabetical order.
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/_kafka_image_map.tpl (2)
8-31: LGTM! Version updates are consistent across all components.

The changes maintain version consistency by:

- updating default images to 3.9.0,
- adding intermediate version 3.8.1,
- adding latest version 3.9.0.

3-4: Verify the file generation process.

The file header indicates this is a generated file. Please ensure that `helm-charts/Makefile` has been updated accordingly.

❌ Verification inconclusive

🏁 Script executed:

```shell
#!/bin/bash
# Description: Check if Makefile exists and contains relevant generation rules
# Look for the Makefile
fd --type f "Makefile$" helm-charts/ --exec cat {} | grep -A 5 "kafka.*image.*map"
```

Length of output: 186

I'll run an improved search across all Makefiles to locate any reference to the generated file. This should help confirm whether the file generation rules for `_kafka_image_map.tpl` in the Makefile have been updated to reflect the new version changes.

If this search returns no relevant output, it may indicate that the generation process isn't directly referencing `_kafka_image_map.tpl` (or the Makefile might be located in a non-standard directory). In that case, please verify manually that the generation logic (as mentioned in the file header comment "generated in helm-charts/Makefile") has been updated accordingly with the version changes. Based on the current automated findings, additional manual verification is needed to validate the Makefile update.

🏁 Script executed:

```shell
#!/bin/bash
# Searching all Makefiles in the repository for references to _kafka_image_map.tpl
fd --type f "Makefile$" | xargs grep -H "kafka_image_map.tpl"
```

Length of output: 61

Manual verification required for file generation rules.

The automated search did not locate any generation rule for `_kafka_image_map.tpl` in Makefiles (including the expected `helm-charts/Makefile`). Please manually verify that the Makefile handling file generation is updated to reflect the version updates for the Kafka components.

packages/apps/versions_map (1)
26-27: LGTM! Version updates follow the established pattern.

The changes appropriately:

- lock version 0.3.1 to specific commit b7375f7,
- add new version 0.3.2 tracking HEAD.
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/020-RoleBinding-strimzi-cluster-operator.yaml (1)
3-23: LGTM! Well-structured conditional namespace handling.

The implementation correctly:

- initializes the `$watchNamespaces` variable,
- handles the `watchAnyNamespace` condition,
- switches between ClusterRoleBinding and RoleBinding,
- manages namespace assignment appropriately.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 5-5: trailing spaces
(trailing-spaces)
[error] 20-20: trailing spaces
(trailing-spaces)
packages/system/kafka-operator/charts/strimzi-kafka-operator/templates/023-RoleBinding-strimzi-cluster-operator.yaml (1)
1-7: Enhanced Namespace Handling in RBAC Configuration

The introduction of the `$watchNamespaces` variable along with the conditional check for `$root.Values.watchAnyNamespace` improves the flexibility of namespace management in the RoleBinding. This change helps ensure that when `watchAnyNamespace` is active, the list of watched namespaces defaults appropriately (to an empty list), while otherwise the specified `.Values.watchNamespaces` is used. Note that YAML lint tools may incorrectly flag templating syntax (e.g. the leading `{{-` and trailing `-}}`) as errors; please verify that the rendered YAML is valid.

🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
[error] 5-5: trailing spaces
(trailing-spaces)
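The namespace-selection pattern described above can be sketched as follows (a simplified reconstruction for clarity, not the chart's exact code):

```yaml
{{/* Simplified sketch: default to .Values.watchNamespaces, but reset to an
     empty list (watch everything) when watchAnyNamespace is enabled. */}}
{{- $watchNamespaces := .Values.watchNamespaces -}}
{{- if .Values.watchAnyNamespace -}}
{{- $watchNamespaces = list -}}
{{- end -}}
```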
packages/apps/kafka/templates/podscrape.yaml (1)
1-41: New VMPodScrape Resource for Enhanced Metrics Collection

The new `VMPodScrape` resource is well defined and structured to enable Prometheus to scrape pod metrics via the specified endpoints. The use of templated metadata (e.g. `{{ .Release.Name }}-pod-scrape`) fits the Helm convention; however, please verify that these values render correctly during deployment. Also, YAMLlint reports a warning regarding the indentation of the `- separator: ;` line (line 10) and a syntax issue on line 4, which are likely false positives due to the mix of Helm templating and YAML. It is recommended to double-check the rendered output.

🧰 Tools
🪛 YAMLlint (1.35.1)
[warning] 10-10: wrong indentation: expected 8 but found 6
(indentation)
[error] 4-4: syntax error: expected , but found ''
(syntax)
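For reference, a rendered VMPodScrape might look roughly like this hedged sketch (the release name, port name, and selector labels are illustrative and should be checked against the actual template):

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
  name: my-release-pod-scrape   # rendered from {{ .Release.Name }}-pod-scrape
spec:
  selector:
    matchLabels:
      strimzi.io/cluster: my-release   # illustrative label selector
  podMetricsEndpoints:
    - port: tcp-prometheus             # illustrative metrics port name
      scheme: http
```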
packages/apps/kafka/templates/kafka.yaml (2)
60-66: Addition of Kafka Metrics Configuration

The new `metricsConfig` field for Kafka, using `jmxPrometheusExporter`, is a strong enhancement for observability. Ensure that the referenced ConfigMap (named `{{ .Release.Name }}-metrics`) exists and contains the expected key (`kafka-metrics-config.yml`) to provide the proper configuration for metrics scraping.
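The Strimzi `metricsConfig` wiring referenced here typically takes the following shape (a sketch based on the ConfigMap name and key mentioned above; verify against the actual template):

```yaml
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          # ConfigMap name renders from the Helm release name.
          name: "{{ .Release.Name }}-metrics"
          key: kafka-metrics-config.yml
```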
77-82: Addition of Zookeeper Metrics Configuration

Similarly, the inclusion of a `metricsConfig` section for Zookeeper aligns with the extended monitoring capabilities intended by this PR. Please verify that the ConfigMap reference and key are correct and that the configuration structure matches what is required for the JMX Prometheus Exporter.

packages/system/kafka-operator/charts/strimzi-kafka-operator/values.yaml (2)
13-13: Update Default Image Tag Consistency

The default image tag for the operator has been updated to `0.45.0`, aligning the operator version with the expected release. Please verify that downstream templates and documentation reflect this change.

129-129: Update Kafka Bridge Image Tag

The Kafka Bridge image tag is now set to `0.31.1`. Ensure that this version is compatible with the rest of the Kafka operator components and that any related documentation is updated accordingly.

packages/apps/kafka/templates/metrics-configmap.yaml (2)
6-168: Comprehensive Kafka Metrics Rules Set

The rules under `kafka-metrics-config.yml` detail a wide range of patterns and metric types. The configurations appear thorough. It might be beneficial to add inline comments or examples for some of the more complex regex patterns to aid future maintainers.
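For maintainers unfamiliar with the JMX Prometheus Exporter rule format, a representative (illustrative, not necessarily from this file) rule looks like:

```yaml
rules:
  # Map a JMX MBean attribute to a Prometheus gauge. For example,
  # kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
  # becomes kafka_server_replicamanager_underreplicatedpartitions.
  - pattern: kafka.server<type=(.+), name=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
```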
169-190: Zookeeper Metrics Configuration

The Zookeeper metrics rules are defined in a similarly detailed manner. Just ensure that any changes or updates to JMX exporter rule formats are tracked in documentation.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/049-Crd-kafkarebalance.yaml (3)
34-37: Add Template Printer Column

The new additional printer column "Template" (using the JSONPath `.metadata.annotations.strimzi\.io/rebalance-template`) is added. Verify that this annotation is consistently applied to resources that serve as templates.

38-41: Add Status Printer Column

The addition of a "Status" printer column using the JSONPath `.status.conditions[*].type` offers a concise view of rebalancing status. Consider whether aggregating multiple condition types in one column meets your operational needs or if a custom formatter might provide clearer insight in the UI.

57-64: Expand Supported Rebalance Modes

The `mode` enum has been extended to include `remove-disks`, which permits intra-broker disk balancing. The updated description is detailed; please ensure that supporting logic in the operator reconciler accommodates the nuances of disk rebalancing.

packages/system/kafka-operator/templates/prometheus-rules.yaml (8)
1-8: Resource Definition and Metadata Setup

The PrometheusRule resource is newly introduced and is configured with an appropriate API version, kind, and metadata (including labels such as `role: alert-rules` and `app: strimzi`). This is well structured and follows standard conventions.
10-92: Kafka Group Alerts – Comprehensive Coverage

The "kafka" group defines a broad set of alerts (e.g. for low disk space, under-replicated partitions, abnormal controller state, offline partitions, etc.). The PromQL expressions are detailed and incorporate dynamic labels in the annotations for context. Please double-check that thresholds (e.g. available disk space below 15%) and the durations (mostly 10s) align with operational expectations.
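As one illustrative example of the shape of these rules (not necessarily the exact expression used in this PR; note that Prometheus annotation templating may need escaping inside a Helm template):

```yaml
- alert: UnderReplicatedPartitions
  expr: kafka_server_replicamanager_underreplicatedpartitions > 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka under-replicated partitions detected'
```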
93-133: Zookeeper Group Alerts – Latency and Resource Conditions

Alerts in the "zookeeper" group (e.g. AvgRequestLatency, OutstandingRequests, running out of space, container restart frequency) are properly defined. Verify that the chosen numerical thresholds (such as `> 10` for latency or `> 10` for outstanding requests) are well calibrated for your cluster's performance characteristics.
152-178: Connect Group Alerts – Operational Failures Detection

The rules for the "connect" group (covering container downtime and connector/task failures) are straightforward and useful for flagging issues. They should help in maintaining Kafka Connect reliability.
178-228: Bridge Group Alerts – HTTP and Latency Metrics

The "bridge" group includes several alerts to monitor container availability and HTTP error rates as well as latency metrics for both producers and consumers. The configuration appears well balanced; double-check that the alert durations and threshold values (e.g. error rate > 10) are appropriate for your environment's noise level.
228-238: MirrorMaker Alert – Container Health Check

The single alert for MirrorMaker ensures that container downtime is caught if present for more than 3 minutes. This rule is direct and effective.
238-264: Kafka Exporter Alerts – Replication and Lag Monitoring

The "kafkaExporter" group covers alerts for under-replicated partitions, excessive consumer group lag, and a lack of message updates. These rules use concise PromQL expressions and dynamic annotations. Please confirm that the thresholds (e.g. consumer lag > 1000) have been validated in your production setting.
264-276: Certificates Alert – Expiry Notification

The "certificates" group alerts administrators when a certificate is nearing expiration. Using a fixed time window of 30 days seems reasonable; ensure that this duration matches your operational policy.

packages/system/kafka-operator/charts/strimzi-kafka-operator/README.md (7)
27-27: KRaft Support Emphasis

The highlighted "KRaft support" line explicitly informs users about running Kafka clusters in KRaft mode. This concise message is helpful given the migration push.
32-32: MirrorMaker Inclusion Clarity

The updated description for Kafka MirrorMaker inclusion is clear. It helps set the user's expectations regarding mirroring functionality between Kafka clusters.
35-35: Auto-Rebalancing Feature Announcement

The note on auto-rebalancing when scaling is an attractive feature, ensuring that cluster health is maintained during scale operations. Confirm that the operational behavior meets your design goals.
36-36: Tiered Storage Feature Description

The "Tiered storage" update provides clear insight into offloading older data. Ensure that the documentation explains any configuration nuances associated with this feature.
68-68
: Prerequisites Update – Kubernetes Version Requirement
Updating the prerequisite to “Kubernetes 1.25+” is an important operational change. Please verify that all deployment scripts and instructions reflect this updated requirement.🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
68-68: Unordered list style
Expected: asterisk; Actual: dash(MD004, ul-style)
105-105
: Default Image Tag Update
The default image tag for all images (except Kafka Bridge) has been updated to0.45.0
. This is consistent with the operator upgrade and should be tested to ensure full compatibility with the new features.
169-169
: Kafka Bridge Image Tag Update
The update of the Kafka Bridge image tag to0.31.1
is clearly documented. Ensure that any downstream configurations or integrations referencing the previous tag are updated accordingly.packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/046-Crd-kafkabridge.yaml (1)
1095-1121: CSI Volume Support in Volumes Section
A new `csi` volume type has been added to the volumes section. The properties include `driver`, `fsType`, `nodePublishSecretRef`, `readOnly`, and `volumeAttributes`. This enhancement will allow more flexible storage configurations. Ensure that all required fields are correctly populated during deployment and that it’s compatible with your CSI drivers.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/045-Crd-kafkamirrormaker.yaml (2)
1271-1290: Review of CSI Volume Type Addition
The new `csi` volume definition is clearly added with its own object schema. The properties (`driver`, `fsType`, `nodePublishSecretRef`, `readOnly`, and `volumeAttributes`) are defined in a straightforward manner. Please verify that these fields match the Kubernetes CSI specification and your operator’s expectations. It may also be beneficial to consider explicitly listing required keys as per your validation needs.
1291-1297: Ensure Inclusion of CSI in Volume Options
The `oneOf` block now explicitly requires the presence of the `csi` property when that volume type is used. This approach clearly distinguishes it from other volume types (such as `secret`, `configMap`, etc.), and it aligns with your documentation note of replacing an empty required array with a firm requirement for the `csi` type.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/048-Crd-kafkamirrormaker2.yaml (3)
340-363: Offset Management in Source Connector Configuration
The new properties `listOffsets` and `alterOffsets` enable dynamic offset management for the source connector. In this section, the schema now expects a `toConfigMap` (for listing offsets) and a `fromConfigMap` (for altering offsets), each with a required `name` field. Please verify that the naming convention for the referenced ConfigMaps is clearly communicated to users and that error messages are helpful if misconfigured.
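Assuming the schema reads as described (a `toConfigMap` reference for listing and a `fromConfigMap` reference for altering, each with a required `name`), a connector fragment could look like this sketch; the ConfigMap name and surrounding fields are illustrative:

```yaml
sourceConnector:
  tasksMax: 1
  config:
    replication.factor: -1
  listOffsets:
    toConfigMap:
      name: mm2-offsets        # operator writes the current offsets here
  alterOffsets:
    fromConfigMap:
      name: mm2-offsets        # desired offsets are read back from here
```

Using the same ConfigMap for both operations, as sketched here, makes the list-edit-alter workflow straightforward, but the schema does not require it.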
396-419: Offset Management in Heartbeat Connector Configuration
Similar to the source connector, the heartbeat connector now includes `listOffsets` and `alterOffsets` settings. This enhancement supports more granular control of connector offset management. Ensure that both sections use consistent validation rules and that the change is reflected in any related documentation.
452-475: Offset Management in Checkpoint Connector Configuration
The addition of `listOffsets` and `alterOffsets` in the checkpoint connector mirrors the changes seen in the source and heartbeat connectors. Confirm that these additions consistently require a respective ConfigMap reference and that the expected behavior is thoroughly tested.

packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml (7)
483-495: LGTM: Host template configuration for broker hostnames
The new `hostTemplate` and `advertisedHostTemplate` properties provide a flexible way to customize broker hostnames using placeholders like `{nodeId}` and `{nodePodName}`. The implementation looks good with clear descriptions.
1576-1594: LGTM: CSI volume source support added
The addition of CSI volume source support enhances storage flexibility by allowing the use of Container Storage Interface drivers. The implementation includes all necessary fields like driver, fsType, nodePublishSecretRef, and volumeAttributes.
1832-1874: LGTM: Environment variable value sourcing from secrets and configmaps
The new `valueFrom` property enables environment variables to be populated from secrets and configmaps, providing better security and configuration management. The implementation is consistent across all container templates.

Also applies to: 1975-2017, 2826-2868, 3278-3320, 3540-3582, 3683-3725, 4014-4056, 4157-4199, 4540-4582, 4683-4725, 5014-5056, 5157-5199, 5540-5582, 5683-5725, 6014-6056, 6157-6199, 6826-6868, 7165-7207
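If the new `valueFrom` support mirrors the core Kubernetes `env[].valueFrom` shape (an assumption worth verifying against the CRD), a container template fragment would look like this; all names are illustrative:

```yaml
template:
  kafkaContainer:
    env:
      - name: EXTERNAL_PASSWORD      # illustrative variable name
        valueFrom:
          secretKeyRef:
            name: my-kafka-secret
            key: password
      - name: LOG_LEVEL
        valueFrom:
          configMapKeyRef:
            name: kafka-logging
            key: level
```

Sourcing values this way keeps credentials out of the Kafka CR itself and lets them be rotated independently of the cluster spec.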
6411-6432: LGTM: Auto-rebalancing configuration for scaling operations
The new `autoRebalance` property provides configuration for automatic rebalancing when brokers are added or removed. The implementation includes:
- Mode specification (`add-brokers` or `remove-brokers`)
- Template reference for rebalance configuration
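Based on the mode and template fields described above, an `autoRebalance` block might be sketched as follows; the exact placement within the Kafka CR and the template resource name should be checked against the CRD:

```yaml
autoRebalance:
  - mode: add-brokers
    template:
      name: my-rebalance-template   # KafkaRebalance configuration template, illustrative
  - mode: remove-brokers
    template:
      name: my-rebalance-template
```

Listing both modes means the cluster rebalances automatically in either scaling direction; either entry can be omitted to restrict auto-rebalancing to one direction.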
8333-8364: LGTM: Auto-rebalancing status tracking
The new `autoRebalance` status field provides comprehensive tracking of rebalancing operations:
- Current state (Idle/RebalanceOnScaleDown/RebalanceOnScaleUp)
- Last transition timestamp
- Mode-specific broker lists for running/queued operations
1576-1594: Verify CSI volume source compatibility
Let's verify the CSI volume source implementation across different Kubernetes versions and check for any known issues.

Also applies to: 3130-3148, 3917-3935, 4498-4516, 5917-5935, 7123-7141, 8021-8039
✅ Verification successful
🌐 Web query:
What is the minimum Kubernetes version required for CSI volume source support in CRDs?
💡 Result:
The minimum Kubernetes version required for CSI volume source support in Custom Resource Definitions (CRDs) is v1.14. This is based on:
The `CSIDriver` object (a CRD used for CSI configuration) became available as a built-in Kubernetes API in v1.14 as a beta feature[8]. This eliminated the need for manual CRD installation through YAML manifests.

Key implementation details:
- Beta status meant the feature was enabled by default starting from v1.14[8]
- CSIDriver API integration allowed cluster administrators to configure driver behaviors through native Kubernetes objects rather than external CRDs[8]
- This version alignment matches when CSI itself reached GA status in Kubernetes v1.13[1], with subsequent improvements to CRD handling
For production use cases requiring full stability, it's recommended to use Kubernetes v1.16+, as this brought additional maturity to both CSI and CRD versioning features[10]. However, basic functionality exists since v1.14[8].
Citations:
- 1: https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/
- 2: https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/
- 3: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
- 4: https://kubernetes-csi.github.io/docs/project-policies.html
- 5: https://kubernetes.io/docs/concepts/storage/volumes/?emptydir
- 6: https://github.com/kubernetes-csi/external-snapshotter/releases
- 7: https://github.com/kubernetes-csi/external-snapshotter
- 8: https://kubernetes-csi.github.io/docs/csi-driver-object.html
- 9: https://techdocs.broadcom.com/us/en/vmware-tanzu/standalone-components/tanzu-mission-control/saas/tanzu-mission-control-documentation/tanzumc-using-GUID-197D36F5-D96A-4104-8067-2C7ABE18EE37.html
- 10: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/
CSI Volume Source Compatibility Verified
The web query confirmed that CSI volume source support within CRDs requires a minimum Kubernetes version of v1.14 (with production setups recommending v1.16+). Thus, the definition in the YAML snippet (and the similar blocks at lines 3130-3148, 3917-3935, 4498-4516, 5917-5935, 7123-7141, 8021-8039) is compatible with Kubernetes versions that provide CSI volume source support.
483-495: Verify auto-rebalancing configuration and status handling
Let's verify that the auto-rebalancing feature is properly documented and that there are no conflicting configurations.
Also applies to: 6411-6432, 8333-8364
✅ Verification successful
🏁 Script executed:
```shell
#!/bin/bash
# Search for auto-rebalancing related documentation and examples
rg -A 5 "autoRebalance" --type md
```

Length of output: 33
Below is a new shell script to search the entire repository for any occurrence of “autoRebalance” (not just in Markdown files) so that we can check if the auto-rebalancing configuration and status handling are documented or defined elsewhere:
Please run this script and review the output. If nothing shows up (or you only find irrelevant matches), we will need to manually verify the intended documentation and configuration of the auto-rebalancing feature.
🏁 Script executed:
```shell
#!/bin/bash
# Search the entire repository (all file types) for references to "autoRebalance"
rg -R "autoRebalance"
```

Length of output: 426
Below is a revised shell script that correctly searches the entire repository for any reference to "autoRebalance". Please execute the script and share the output for further analysis:
After running this script, please provide the output so we can continue verifying the auto-rebalancing configuration and status handling.
🏁 Script executed:
```shell
#!/bin/bash
# Correctly search the entire repository for references to "autoRebalance"
rg "autoRebalance"
```

Length of output: 254
Below is a script that will extract additional context around the `autoRebalance` occurrences as well as the contents of the referenced line ranges (6411–6432 and 8333–8364) in the CRD file. This will help us verify that the auto-rebalancing feature is properly documented and that there are no conflicting configurations. Please run the script and share its output.
🏁 Script executed:
```shell
#!/bin/bash
echo "---- Retrieving autoRebalance context in CRD ----"
rg -C 5 "autoRebalance:" packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml
echo
echo "---- Extracting lines 6411-6432 ----"
sed -n '6411,6432p' packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml
echo
echo "---- Extracting lines 8333-8364 ----"
sed -n '8333,8364p' packages/system/kafka-operator/charts/strimzi-kafka-operator/crds/040-Crd-kafka.yaml
```

Length of output: 6102
Auto-Rebalancing Documentation Verified
The CRD clearly distinguishes between the auto-rebalancing configuration (lines 6411–6432) and status handling (lines 8333–8364). Both sections include well-defined types, enumerations, and descriptions without conflicts.
f96a044 to d234d58 (Compare)
Do we support KRaft Kafka instances? If yes, there should be different monitoring for them.
Also, please clean up the history; it looks like it should be a single commit to main.
I don't know which part of this code was simply taken from open sources and which is our own. I don't much like the idea of maintaining our own fork of the code either, given that our naming convention is `pod` while the open-source upstream uses `kubernetes_pod_name`, but if these are our pieces of code, they should be brought into line.
```yaml
        severity: warning
      annotations:
        summary: 'Kafka under replicated partitions'
        description: 'There are {{ $value }} under replicated partitions on {{ $labels.kubernetes_pod_name }}'
```
Apparently, once my comments above are taken into account, the Prometheus rules will need to be fixed as well.
Actionable comments posted: 4
🧹 Nitpick comments (1)
packages/system/kafka-operator/templates/prometheus-rules.yaml (1)
212-227: Improve HTTP error rate alerts.
The current implementation has several limitations:
- Uses absolute thresholds, which don't scale with traffic
- Only excludes favicon.ico from 4xx errors
Consider using error rates instead of absolute counts:
```diff
   - alert: Http4xxErrorRate
-    expr: strimzi_bridge_http_server_requestCount_total{code=~"^4..$", container=~"^.+-bridge", path !="/favicon.ico"} > 10
+    expr: sum(rate(strimzi_bridge_http_server_requestCount_total{code=~"^4..$", container=~"^.+-bridge"}[5m])) by (path) / sum(rate(strimzi_bridge_http_server_requestCount_total{container=~"^.+-bridge"}[5m])) by (path) > 0.05
     for: 1m
     labels:
       severity: warning
     annotations:
       summary: 'Kafka Bridge returns code 4xx too often'
-      description: 'Kafka Bridge returns code 4xx too much ({{ $value }}) for the path {{ $labels.path }}'
+      description: 'Kafka Bridge error rate is {{ $value | humanizePercentage }} for the path {{ $labels.path }}'
   - alert: Http5xxErrorRate
-    expr: strimzi_bridge_http_server_requestCount_total{code=~"^5..$", container=~"^.+-bridge"} > 10
+    expr: sum(rate(strimzi_bridge_http_server_requestCount_total{code=~"^5..$", container=~"^.+-bridge"}[5m])) by (path) / sum(rate(strimzi_bridge_http_server_requestCount_total{container=~"^.+-bridge"}[5m])) by (path) > 0.01
     for: 1m
```
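The difference between the absolute and rate-based thresholds can be sketched numerically (plain Python standing in for the PromQL `rate()` ratio):

```python
def error_rate(error_increase: float, total_increase: float) -> float:
    """Windowed error ratio, analogous to
    rate(errors[5m]) / rate(total[5m]) in PromQL."""
    if total_increase == 0:
        return 0.0
    return error_increase / total_increase


# 12 4xx responses out of 400 requests in the window:
rate = error_rate(12, 400)
print(rate)  # 0.03
# The absolute rule (> 10 errors) fires on this busy bridge,
# while a 5% rate threshold correctly stays quiet.
print(12 > 10, rate > 0.05)  # True False
```

This is why the suggestion above switches to a ratio: the same 12 errors that trip an absolute count are only 3% of traffic on a busy bridge.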
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- packages/apps/kafka/templates/podscrape.yaml (1 hunks)
- packages/system/kafka-operator/templates/prometheus-rules.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/apps/kafka/templates/podscrape.yaml
🔇 Additional comments (1)
packages/system/kafka-operator/templates/prometheus-rules.yaml (1)
264-276: LGTM! Well-designed certificate expiration monitoring.
The certificate expiration alert is well-implemented with appropriate thresholds and context.
```yaml
- alert: ZookeeperRunningOutOfSpace
  expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 5368709120
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Zookeeper is running out of free disk space'
    description: 'There are only {{ $value }} bytes available at {{ $labels.persistentvolumeclaim }} PVC'
```
🛠️ Refactor suggestion
Align Zookeeper disk space alert with Kafka's approach.
The Zookeeper disk space alert uses an absolute value (5GB) while Kafka uses a percentage. This inconsistency could cause issues with different volume sizes.
```diff
 - alert: ZookeeperRunningOutOfSpace
-  expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 5368709120
+  expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 15
   for: 10s
   labels:
     severity: warning
   annotations:
     summary: 'Zookeeper is running out of free disk space'
-    description: 'There are only {{ $value }} bytes available at {{ $labels.persistentvolumeclaim }} PVC'
+    description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
- alert: ZookeeperRunningOutOfSpace
  expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-(.+)-zookeeper-[0-9]+"} < 15
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Zookeeper is running out of free disk space'
    description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
```
```yaml
- alert: KafkaRunningOutOfSpace
  expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka is running out of free disk space'
    description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
```
🛠️ Refactor suggestion
Adjust disk space alert threshold and duration.
The disk space alert might generate too many false positives:
- 15% threshold might be too low for large volumes
- 10s duration is too short for disk space trends
```diff
 - alert: KafkaRunningOutOfSpace
   expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
-  for: 10s
+  for: 5m
   labels:
     severity: warning
   annotations:
     summary: 'Kafka is running out of free disk space'
     description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
```
📝 Committable suggestion

```yaml
- alert: KafkaRunningOutOfSpace
  expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} * 100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data(-[0-9]+)?-(.+)-kafka-[0-9]+"} < 15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: 'Kafka is running out of free disk space'
    description: 'There are only {{ $value }} percent available at {{ $labels.persistentvolumeclaim }} PVC'
```
```yaml
- alert: UnderReplicatedPartitions
  expr: kafka_server_replicamanager_underreplicatedpartitions > 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka under replicated partitions'
    description: 'There are {{ $value }} under replicated partitions on {{ $labels.pod }}'
- alert: AbnormalControllerState
  expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka abnormal controller state'
    description: 'There are {{ $value }} active controllers in the cluster'
- alert: OfflinePartitions
  expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka offline partitions'
    description: 'One or more partitions have no leader'
- alert: UnderMinIsrPartitionCount
  expr: kafka_server_replicamanager_underminisrpartitioncount > 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka under min ISR partitions'
    description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.pod }}'
- alert: OfflineLogDirectoryCount
  expr: kafka_log_logmanager_offlinelogdirectorycount > 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka offline log directories'
    description: 'There are {{ $value }} offline log directories on {{ $labels.pod }}'
```
🛠️ Refactor suggestion
Increase alert durations for replication and partition issues.
The 10s duration for replication and partition alerts is too short and may cause alert noise. These issues often require time to stabilize after temporary blips.
Apply this change to all alerts in this range:
```diff
 - alert: UnderReplicatedPartitions
   expr: kafka_server_replicamanager_underreplicatedpartitions > 0
-  for: 10s
+  for: 2m
```
📝 Committable suggestion

```yaml
- alert: UnderReplicatedPartitions
  expr: kafka_server_replicamanager_underreplicatedpartitions > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: 'Kafka under replicated partitions'
    description: 'There are {{ $value }} under replicated partitions on {{ $labels.pod }}'
- alert: AbnormalControllerState
  expr: sum(kafka_controller_kafkacontroller_activecontrollercount) by (strimzi_io_name) != 1
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka abnormal controller state'
    description: 'There are {{ $value }} active controllers in the cluster'
- alert: OfflinePartitions
  expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka offline partitions'
    description: 'One or more partitions have no leader'
- alert: UnderMinIsrPartitionCount
  expr: kafka_server_replicamanager_underminisrpartitioncount > 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka under min ISR partitions'
    description: 'There are {{ $value }} partitions under the min ISR on {{ $labels.pod }}'
- alert: OfflineLogDirectoryCount
  expr: kafka_log_logmanager_offlinelogdirectorycount > 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'Kafka offline log directories'
    description: 'There are {{ $value }} offline log directories on {{ $labels.pod }}'
```
```yaml
- alert: NoMessageForTooLong
  expr: changes(kafka_topic_partition_current_offset[10m]) == 0
  for: 10s
  labels:
    severity: warning
  annotations:
    summary: 'No message for 10 minutes'
    description: 'There is no messages in topic {{ $labels.topic}}/partition {{ $labels.partition }} for 10 minutes'
```
🛠️ Refactor suggestion
Refine the no-message alert logic.
The current implementation might generate false positives for topics with naturally low message rates or during off-peak hours.
```diff
 - alert: NoMessageForTooLong
   expr: changes(kafka_topic_partition_current_offset[10m]) == 0
-  for: 10s
+  for: 30m
   labels:
     severity: warning
+    topic_pattern: '{{ $labels.topic }}'
   annotations:
     summary: 'No message for 10 minutes'
-    description: 'There is no messages in topic {{ $labels.topic}}/partition {{ $labels.partition }} for 10 minutes'
+    description: 'No messages in topic {{ $labels.topic }}/partition {{ $labels.partition }} for 30 minutes. Verify if this is expected for this topic pattern.'
```
✅ Verification successful
🏁 Script executed:
```shell
#!/bin/bash
# Find topic patterns or naming conventions that might help identify topics with different message rate expectations
rg -g '*.{yaml,yml,json}' -i 'topic.*pattern|topic.*name' --no-filename | sort -u
```
Length of output: 7477
Refine No-Message Alert Logic
The suggested changes improve the alert in two important ways:
- Extend the duration: Increasing the “for” duration from 10 seconds to 30 minutes reduces the risk of false positives in topics with naturally low or off-peak message rates.
- Add topic-based filtering: Introducing the `topic_pattern` label (using the current topic label) aligns with existing conventions (as seen in other config fields like `topicsPattern`) and helps refine which topics should trigger the alert.
Please ensure that the alert’s summary and description consistently reflect the new duration (i.e., update “No message for 10 minutes” if the check duration is set to 30 minutes).
📝 Committable suggestion

```yaml
- alert: NoMessageForTooLong
  expr: changes(kafka_topic_partition_current_offset[10m]) == 0
  for: 30m
  labels:
    severity: warning
    topic_pattern: '{{ $labels.topic }}'
  annotations:
    summary: 'No message for 10 minutes'
    description: 'No messages in topic {{ $labels.topic }}/partition {{ $labels.partition }} for 30 minutes. Verify if this is expected for this topic pattern.'
```
```json
"pluginVersion": "7.4.5",
"targets": [
  {
    "expr": "max(kafka_server_raftmetrics_current_state) by (kubernetes_pod_name, current_state)",
```
Looks like the dashboard will need fixing too =(
Actionable comments posted: 1
🔭 Outside diff range comments (1)
dashboards/kafka/strimzi-kafka.json (1)
1-2941: 💡 Verification agent 🧩 Analysis chain
Verify KRaft mode compatibility.
Based on the PR comments, verify whether this dashboard supports monitoring KRaft-mode Kafka clusters, as the metrics may differ from ZooKeeper-based clusters.
🌐 Web query:
What metrics are different between ZooKeeper-based Kafka and KRaft mode Kafka?
💡 Result:
The shift from ZooKeeper-based Kafka to KRaft mode introduces significant differences in performance, scalability, and operational metrics. Here’s a breakdown of key metric variations:
Metadata Read Latency
KRaft reduces metadata read latency by storing state changes as ordered events in an internal Kafka topic (`__cluster_metadata`). Brokers recover by replaying events from their last offset instead of reloading state from ZooKeeper[1][4]. This lowers recovery latency during failures[1][5].

Maximum Partition Scalability
ZooKeeper-based clusters historically capped partitions at ~200k due to coordination bottlenecks between ZooKeeper and Kafka controllers[5]. KRaft eliminates this limitation by integrating metadata management directly into Kafka’s architecture, supporting up to 2 million partitions with faster controller failover[4][5].
Failover Delay
ZooKeeper’s message propagation inconsistencies often caused delayed failovers during leader elections[1]. KRaft’s Raft consensus protocol ensures faster leader election (near-instantaneous) and reduces unavailability windows during controller transitions[2][4]. Confluent benchmarks show 10x faster recovery times for clusters with 2M partitions under KRaft[5].
Throughput and Operational Efficiency
- Metadata Propagation: ZooKeeper used RPCs for state synchronization, while KRaft leverages an event-driven log for efficient metadata replication[4][9].
- Startup/Shutdown Times: KRaft avoids redundant state reloads during restarts since controllers retain in-memory metadata[2][4].
| Metric | ZooKeeper Mode | KRaft Mode |
| --- | --- | --- |
| Max Partitions | ~200k | 2M+ |
| Controller Failover | Seconds to minutes | Near-instantaneous |
| Metadata Storage | External (ZooKeeper) | Internal (`__cluster_metadata`) |
| Recovery Latency | Higher due to ZK sync | Lower via event replay[1][5] |
| Security Complexity | Dual models (ZK + Kafka) | Simplified single model[1][3] |

Operational Overhead
KRaft eliminates the need for separate ZooKeeper clusters, reducing deployment complexity (single vs. dual systems)[3][7]. Tools like `kafka-topics` shift from `--zookeeper` flags to `--bootstrap-server` endpoints[2][6].

For existing clusters migrating to KRaft, phased rollouts are required to transition metadata storage while maintaining dual-write compatibility until full cutover[3][6].
Citations:
- 1: https://www.redpanda.com/guides/kafka-alternatives-kafka-raft
- 2: https://developer.confluent.io/learn/kraft/
- 3: https://docs.redhat.com/en/documentation/red_hat_streams_for_apache_kafka/2.8/html/deploying_and_managing_streams_for_apache_kafka_on_openshift/assembly-kraft-mode-str
- 4: https://developer.confluent.io/courses/architecture/control-plane/
- 5: https://docs.confluent.io/platform/current/kafka-metadata/kraft.html
- 6: https://www.redpanda.com/blog/migration-apache-zookeeper-kafka-kraft
- 7: https://docs.aws.amazon.com/msk/latest/developerguide/metadata-management.html
- 8: https://strimzi.io/blog/2024/03/21/kraft-migration/
- 9: https://socprime.com/blog/kraft-apache-kafka-without-zookeeper/
- 10: https://www.baeldung.com/kafka-shift-from-zookeeper-to-kraft
KRaft Compatibility: Update Required for ZooKeeper-Specific Panels
- The dashboard currently queries several ZooKeeper-specific metrics (e.g., the panel using `kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms`) that won’t be available in KRaft mode.
- In KRaft mode, Kafka manages metadata internally (via the `__cluster_metadata` topic) and exhibits different metrics for controller behavior (e.g., reduced metadata read latency and faster failover). This means some panels may return no data or misleading values when monitoring KRaft clusters.
- Please verify that all panels relying on legacy ZooKeeper metrics have KRaft-compatible alternatives or include conditionals to handle both deployment modes.
- Consider updating the documentation to clarify that this dashboard is designed for ZooKeeper-based clusters unless the necessary modifications for KRaft mode are implemented.
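One way to carry out this verification mechanically is to scan the dashboard JSON for panel queries that reference ZooKeeper-only metrics. The following helper is a sketch (not part of the PR); the substring hints and the assumption that panels keep their PromQL in `targets[].expr` reflect common Grafana dashboard layouts:

```python
import json

# Metric-name fragments that indicate a ZooKeeper-mode-only query.
ZK_METRIC_HINTS = ("zookeeper", "zk_")

def find_zookeeper_panels(dashboard: dict) -> list[str]:
    """Return titles of panels whose PromQL expressions reference ZK metrics."""
    flagged = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr", "").lower()
            if any(hint in expr for hint in ZK_METRIC_HINTS):
                flagged.append(panel.get("title", "<untitled>"))
                break  # one match is enough to flag the panel
    return flagged
```

Running it over `dashboards/kafka/strimzi-kafka.json` (e.g., `find_zookeeper_panels(json.load(open(path)))`) would list the panels that need a KRaft-compatible alternative or a deployment-mode conditional.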
🧰 Tools
🪛 Biome (1.9.4)
[error] 2670-2670: The key pod was already declared.
This is where a duplicated key was declared again.
If a key is defined multiple times, only the last definition takes effect. Previous definitions are ignored.
(lint/suspicious/noDuplicateObjectKeys)
[error] 2694-2694: The key pod was already declared.
This is where a duplicated key was declared again.
If a key is defined multiple times, only the last definition takes effect. Previous definitions are ignored.
(lint/suspicious/noDuplicateObjectKeys)
🧹 Nitpick comments (4)
dashboards/kafka/strimzi-kafka.json (4)
18-22: Consider adding dashboard links for related dashboards.

The dashboard links array is empty. Consider adding links to related dashboards (e.g., Zookeeper, Kafka Connect) for better navigation between different components of the Kafka ecosystem.
```diff
  "links": [
+   {
+     "title": "Zookeeper Dashboard",
+     "type": "dashboard",
+     "tags": ["Strimzi", "Zookeeper"]
+   },
+   {
+     "title": "Kafka Connect Dashboard",
+     "type": "dashboard",
+     "tags": ["Strimzi", "Kafka Connect"]
+   }
  ],
```
2816-2818: Consider adjusting the refresh interval based on load.

The dashboard is set to refresh every 5 seconds, which might be too frequent for high-traffic clusters and could impact Prometheus performance.
```diff
- "refresh": "5s",
+ "refresh": "30s",
```
2812-2812: Fix typo in panel title.

There's a typo in the ZooKeeper latency panel title.
```diff
- "title": "ZK request latecy",
+ "title": "ZK Request Latency",
```
23-655: Consider adding alert thresholds for critical metrics.

The broker health metrics panel group provides essential monitoring but lacks alert thresholds for critical conditions. Consider adding alert rules for:
- Under-replicated partitions > 0
- Offline partitions > 0
- Active controllers != 1
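These three conditions can be sketched as a Prometheus rule fragment. The metric names below are the ones the Strimzi JMX exporter conventionally produces; verify them against the metrics actually exposed by this PR's scrape config before adopting:

```yaml
groups:
  - name: kafka-broker-health
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka has under-replicated partitions"
      - alert: KafkaOfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka has offline partitions"
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Number of active Kafka controllers is not exactly one"
```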
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dashboards/kafka/strimzi-kafka.json
(1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
dashboards/kafka/strimzi-kafka.json
[error] 2670-2670: The key pod was already declared.
This is where a duplicated key was declared again.
If a key is defined multiple times, only the last definition takes effect. Previous definitions are ignored.
(lint/suspicious/noDuplicateObjectKeys)
[error] 2694-2694: The key pod was already declared.
This is where a duplicated key was declared again.
If a key is defined multiple times, only the last definition takes effect. Previous definitions are ignored.
(lint/suspicious/noDuplicateObjectKeys)
🔇 Additional comments (2)
dashboards/kafka/strimzi-kafka.json (2)
1-17: LGTM! Well-structured dashboard annotations.

The dashboard annotations are properly configured with appropriate data source and visibility settings.
2823-2929: LGTM! Well-structured templating variables.

The templating variables are properly configured with appropriate data sources, regex patterns, and dependencies between variables. This allows for flexible filtering and drill-down capabilities.
"pod": true,
"strimzi_io_broker_role": true,
"strimzi_io_cluster": true,
"strimzi_io_component_type": true,
"strimzi_io_controller": true,
"strimzi_io_controller_name": true,
"strimzi_io_controller_role": true,
"strimzi_io_kind": true,
"strimzi_io_name": true,
"strimzi_io_pod_name": true,
"strimzi_io_pool_name": true
Fix duplicate 'pod' key in transformations.
The transformations section contains duplicate 'pod' keys which could lead to unexpected behavior.
```diff
  "excludeByName": {
    "Time": true,
    "__name__": true,
    "container": true,
    "endpoint": true,
    "instance": true,
    "job": true,
    "pod": true,
    "namespace": true,
    "node_ip": true,
    "node": true,
-   "pod": true,
    "strimzi_io_broker_role": true,
```
Also applies to: 2694-2694
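This class of bug is easy to catch before merge because `json.loads` silently keeps only the last value for a repeated key. A small sketch of a strict loader (hypothetical tooling, not part of the PR) that mirrors Biome's `noDuplicateObjectKeys` check:

```python
import json

def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises on a repeated key instead of
    silently letting the last occurrence win."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

def load_strict(text: str):
    """Parse JSON, rejecting objects that declare the same key twice."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)
```

Running `load_strict` over the dashboard file in CI would have flagged the duplicated `"pod"` entries at the lines Biome reports.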
Summary by CodeRabbit
New Features
Chores
Documentation