Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Correctly set byte and message out totals per subscription #18451

Merged

Conversation

pgier
Copy link
Contributor

@pgier pgier commented Nov 13, 2022

Fixes #15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of pulsar-admin topic stats.

Signed-off-by: Paul Gier paul.gier@datastax.com

Fixes #15819

Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

Verifying this change

  • Make sure that the change passes the CI checks.

Added a unit test to cover this case.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: pgier#2

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Nov 13, 2022
@pgier pgier force-pushed the fix-pulsar-messages-out-metric-reset branch 2 times, most recently from ce05f9f to fa9b71d Compare November 14, 2022 15:04
@pgier pgier changed the title [fix][broker] Correctly set byte and message out totals per topic. [fix][broker] Correctly set byte and message out totals per subscription. Nov 14, 2022
@pgier pgier changed the title [fix][broker] Correctly set byte and message out totals per subscription. [fix][broker] Correctly set byte and message out totals per subscription Nov 14, 2022
@pgier pgier force-pushed the fix-pulsar-messages-out-metric-reset branch 3 times, most recently from 4afc9c4 to 216ec00 Compare November 14, 2022 17:19
@codecov-commenter
Copy link

codecov-commenter commented Nov 16, 2022

Codecov Report

Merging #18451 (62514d5) into master (aeb4503) will increase coverage by 15.88%.
The diff coverage is 0.00%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #18451       +/-   ##
=============================================
+ Coverage     31.39%   47.27%   +15.88%     
- Complexity     6651    10456     +3805     
=============================================
  Files           697      697               
  Lines         68015    68015               
  Branches       7285     7285               
=============================================
+ Hits          21353    32157    +10804     
+ Misses        43667    32267    -11400     
- Partials       2995     3591      +596     
Flag Coverage Δ
unittests 47.27% <0.00%> (+15.88%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ker/stats/prometheus/NamespaceStatsAggregator.java 0.00% <0.00%> (ø)
...java/org/apache/pulsar/proxy/stats/TopicStats.java 58.82% <0.00%> (-41.18%) ⬇️
...a/org/apache/pulsar/proxy/server/ProxyService.java 80.00% <0.00%> (-0.94%) ⬇️
...apache/pulsar/proxy/server/LookupProxyHandler.java 57.75% <0.00%> (-0.44%) ⬇️
...kkeeper/mledger/impl/ManagedLedgerFactoryImpl.java 49.90% <0.00%> (-0.20%) ⬇️
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 37.11% <0.00%> (-0.06%) ⬇️
.../apache/pulsar/broker/namespace/LookupOptions.java 87.50% <0.00%> (ø)
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 53.93% <0.00%> (+0.08%) ⬆️
...rg/apache/pulsar/proxy/server/ProxyConnection.java 56.33% <0.00%> (+1.12%) ⬆️
.../apache/pulsar/common/naming/NamespaceBundles.java 69.87% <0.00%> (+1.20%) ⬆️
... and 164 more

Copy link
Contributor

@dlg99 dlg99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @pgier. The fix looks good to me. I think we might want to break this out into two PRs though because the PrometheusMetricStreams might require larger changes. Let me know what you think. Thanks!

@eolivelli eolivelli self-requested a review November 17, 2022 09:28
@eolivelli
Copy link
Contributor

I second @michaeljmarshall comments

@pgier pgier force-pushed the fix-pulsar-messages-out-metric-reset branch from 216ec00 to b88a386 Compare November 17, 2022 16:15
@pgier
Copy link
Contributor Author

pgier commented Nov 17, 2022

@michaeljmarshall @eolivelli I've removed the changes to the metric type and only left in the changes to fix the metric value. However, I rebased on the latest master, and some recent PR seems to have broken a lot of the tests around prometheus metrics. There are some new metrics introduced that look like this:

pulsar_out_bytes_total: Metric{tags={cluster=test, namespace=my-property/use/my-ns, subscription=__compaction, topic=persistent://my-property/use/my-ns/__transaction_buffer_snapshot}, value=0.0}

And this breaks the count checks in several unit tests. I'm not sure which change introduced this issue, but possibly this one: #17905

Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total
per subscription metrics by adding the values from the currently connected consumers.
This produces incorrect values as soon as one or more of the consumers disconnects
from the subscription.

This updates these two metrics to directly use the subscription stats for these
values, and match the output of `pulsar-admin topic stats`.  It also changes
these metrics to be defined as `counter` type.

Signed-off-by: Paul Gier <paul.gier@datastax.com>
@pgier pgier force-pushed the fix-pulsar-messages-out-metric-reset branch from b88a386 to 62514d5 Compare November 17, 2022 20:07
Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@michaeljmarshall
Copy link
Member

@pgier - which test was failing? I can't seem to reproduce that metric output.

@pgier
Copy link
Contributor Author

pgier commented Nov 18, 2022

@michaeljmarshall You should be able to reproduce the issue by running the broker metrics tests in the pulsar-broker subdir.

mvn test -Dtest=PrometheusMetricsTest

They fail on current master but work fine if you revert 3715934.
There is a new PR which fixes the issue: #18539

@michaeljmarshall
Copy link
Member

Thanks @pgier. I realized that I wasn't able to reproduce it in my basic test because I wasn't enabling the transaction coordinator.

@michaeljmarshall michaeljmarshall merged commit c03e33e into apache:master Nov 19, 2022
@michaeljmarshall michaeljmarshall added this to the 2.12.0 milestone Nov 19, 2022
michaeljmarshall pushed a commit that referenced this pull request Nov 19, 2022
…ion (#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2

(cherry picked from commit c03e33e)
michaeljmarshall pushed a commit that referenced this pull request Nov 19, 2022
…ion (#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes #15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2

(cherry picked from commit c03e33e)
@michaeljmarshall michaeljmarshall added cherry-picked/branch-2.10 release/2.10.3 type/bug The PR fixed a bug or issue reported a bug labels Nov 19, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Nov 23, 2022
…ion (apache#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes apache#15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2

(cherry picked from commit c03e33e)
(cherry picked from commit 54dccf9)
lifepuzzlefun pushed a commit to lifepuzzlefun/pulsar that referenced this pull request Dec 9, 2022
…ion (apache#18451)

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes apache#15819

The existing code calculates the pulsar_out_bytes_total and pulsar_out_messages_total per subscription metrics by adding the values from the currently connected consumers. This produces incorrect values as soon as one or more of the consumers disconnects from the subscription.

This changes these two metrics to directly use the subscription stats for these values, and match the output of `pulsar-admin topic stats`.

Signed-off-by: Paul Gier <paul.gier@datastax.com>

Fixes apache#15819

### Motivation

The prometheus metrics for pulsar_out_bytes_total and pulsar_out_messages_total should never decrease,
and they should match the output seen when using pulsar-admin.

### Modifications

Changed the calculation of pulsar_out_bytes_total and pulsar_out_messages_total to directly use the
subscription stats instead of calculating these values by summing the values of the currently connected
consumers.

### Verifying this change

- [X] Make sure that the change passes the CI checks.

Added a unit test to cover this case.

### Does this pull request potentially affect one of the following parts:

*If the box was checked, please highlight the changes*

- [ ] Dependencies (add or upgrade a dependency)
- [ ] The public API
- [ ] The schema
- [ ] The default values of configurations
- [ ] The threading model
- [ ] The binary protocol
- [ ] The REST endpoints
- [ ] The admin CLI options
- [ ] Anything that affects deployment

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. Please attach the local preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR description, or else your PR might not get merged. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [X] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

### Matching PR in forked repository

PR in forked repository: pgier#2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pulsar_out_bytes_total and pulsar_out_messages_total metrics act as gauges instead of counters
5 participants