Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Tomcat metric definitions to aggregate multiple MBeans. #1366

Merged
merged 5 commits into from
Sep 16, 2024

Conversation

jefchien
Copy link
Contributor

@jefchien jefchien commented Jul 8, 2024

Description: Bug fix - The current Tomcat metric definitions use the otel.mbean constructor function, which expects the query to the MBean server to return a single MBean. If multiple mbeans are returned, it takes the first one and drops the rest, which results in inaccurate/incomplete metric values. The tomcat metrics are all defined with wildcards (*), which are meant to match multiple MBeans.

Changed the Tomcat metric definitions to use otel.mbeans instead, so metric values aren't being dropped. Added an aggregation for doubleValueCallback/longValueCallback since both of those observable instruments will only accept the first value per collection interval.

Existing Issue(s): #1360

Testing: Added a unit test in the InstrumenterHelperTest to account validate the aggregation.

image image

Documentation: N/A

Outstanding items: Some activemq and jetty metric definitions need to be adjusted to otel.mbeans and tested.

@breedx-splk
Copy link
Contributor

Do we really think that aggregating is the correct approach here? For counter instruments (like the sessions originally reported in #1360) maybe this makes sense, but not necessarily for all instruments, right?

Isn't it possible that two different MBeans registered with the same name could be returning separate metrics with different units? Units aren't really a first-class thing in JMX, so it requires some out-of-band knowledge I think, and aggregating across different units seems problematic.

@jefchien
Copy link
Contributor Author

jefchien commented Jul 11, 2024

Do we really think that aggregating is the correct approach here? For counter instruments (like the sessions originally reported in #1360) maybe this makes sense, but not necessarily for all instruments, right?

Yeah, this is currently only going to be done for gauge instruments (doubleValueCallback/longValueCallback). Will not impact the other types.

Isn't it possible that two different MBeans registered with the same name could be returning separate metrics with different units? Units aren't really a first-class thing in JMX, so it requires some out-of-band knowledge I think, and aggregating across different units seems problematic.

I think this comes down to the metric definition files. It's possible to have multiple MBeans grouped together that share the same attribute name that are unrelated and shouldn't be aggregated, but that would be a case of misconfiguration. Even without this change, the current behavior would still be wrong since it'd only report the value from one of the MBeans.

@breedx-splk
Copy link
Contributor

Hey @jefchien . We talked this over in today's SIG meeting and it sounds like we'd prefer a slightly different approach.

In the case where the user has deployed multiple apps to one tomcat instance (or however they get multiple contexts, each with its own MBean manager), rather than aggregating in code like in this PR, we'd be better served by doing multiple recordings with different attributes, and the context name should be an attribute.

This allows the metric to be correctly aggregated later, and maintains the per-context data points.

Make sense?

@jefchien
Copy link
Contributor Author

jefchien commented Jul 18, 2024

Wouldn't this result in different metrics altogether since that would mean tomcat.sessions without attributes would no longer exist? Doesn't that break current users of the JMX gatherer who collect the metric?

Does this mean that collecting multiple MBeans into a single aggregated metric is not supported? Should setting a metric definition with a wildcard without dimensions for that wildcard be disallowed? Like for example, the ActiveMQ disk metrics where the intention is to aggregate all brokers into a single metric. Should this aggregation always be handled at the collector?

def activemqMetricsNoDestination = otel.mbean(
"org.apache.activemq:type=Broker,brokerName=*"
)
otel.instrument(activemqMetricsNoDestination,
"activemq.connection.count",
"The total number of current connections.",
"connections",
"CurrentConnectionsCount",
otel.&longUpDownCounterCallback)
otel.instrument(activemqMetricsNoDestination,
"activemq.disk.store_usage",
"The percentage of configured disk used for persistent messages.",
"%",
"StorePercentUsage",
otel.&doubleValueCallback)
otel.instrument(activemqMetricsNoDestination,
"activemq.disk.temp_usage",
"The percentage of configured disk used for non-persistent messages.",
"%",
"TempPercentUsage",
otel.&doubleValueCallback)

(setting aside that the definition is incorrect and otel.mbean should be otel.mbeans)

@breedx-splk
Copy link
Contributor

(Sorry for hashing this out in your PR, should have done this originally in the issue)
I want to try and clarify a few points, before trying to answer your questions.

First, I agree that the functionality today is not ideal and users are getting surprising and misleading data when there are multiple MBeans that match (like in the tomcat example above). I believe that the problem is no limited to tomcat, and many other targets use wildcard beans, so the problem can exist there as well. I also think that we should work toward making this better.

Today, in the cases where a jmx ObjectName search expression uses a wildcard and, at runtime , happens to match more than one MBean, we currently have no good good way to resolve this. The code, as originally written, expects to be able to record measurements for all of them (at least when passed as a list of bean names, as you discovered), but this fails in the SDK because it amounts to multiple recordings, which is not allowed and a warning is issued.

So maybe there are other options, but I see 3 ways the code could resolve this:

  1. Continue doing the wrong thing. 😿
  2. Aggregate within the groovy code
  3. Report the measurement from each bean, in a meaningful way that allows them to be filtered or aggregated later.

#2 above is what this PR is proposing. It assumes that the user always wants the values aggregated when multiple MBeans match the object name query. I have two primary concerns with this:

  1. I have low confidence that this is what users always want. I have no idea if there are other multiple MBeans in other targets who report very different things, such that aggregation is problematic or confusing.
  2. I am concerned that there could be two MBeans that report the same attribute, but with different implied units. This is caused in no small part by the fact that jmx wasn't designed with units in mind. For example, imagine two MBeans that report disk size or memory use or something. Doesn't matter. If one MBean provides the attribute in bytes and the other provides the attribute in megabytes, this becomes problematic. These two should not be aggregated.

#3 above is what the SIG basically concluded would be a better approach. For each measurement from each bean, we would include a set of attributes that are different, thereby working around the "it's just using the first one" problem, but also maintaining per-bean granularity that could be externally aggregated if later desired. The simplest way of doing this, IMO, is to add an attribute called jmx.mbean.name whose values is the name of the mbean. We'd probably want to add semconv for this eventually.

Ok, now to address the questions above:

Wouldn't this result in different metrics altogether since that would mean tomcat.sessions without attributes would no longer exist?

Attributes are technically not part of metric identity, so the short answer is no. The attributes do create different "metric streams" within a single metric. The name/value of each attribute is a metric dimension. This is a decent resource that describes this.

Doesn't that break current users of the JMX gatherer who collect the metric?

It could. We do not yet have stability guarantees for this component, it has alpha in the version, and so we are allowed to make breaking changes. Of course, it's never great to introduce breaking changes, but keep in mind that it's definitely reporting the WRONG thing now already, so fixing that should be considered a net improvement.

Does this mean that collecting multiple MBeans into a single aggregated metric is not supported?

The way I read this today before this PR, the answer is no. I didn't see anything that would aggregate across mbeans.

Should setting a metric definition with a wildcard without dimensions for that wildcard be disallowed?

I don't think so. I think there's value in being able to use multiples...like in the tomcat example above, you could filter just the contexts you care about (eg. remove /docs or whatever) and you could still aggregate, either in a collector or a vendor/backend tool. Similarly, the activemq example you cited would allow the user to see per-broker disk metrics, and aggregate them later if it's safe and/or desirable to do so.

@breedx-splk
Copy link
Contributor

@jefchien I hate to be the solo reviewer and blocker on this, so I want to make a proposal to move this forward.

Additional background context: We may see this thing get rewritten without groovy (native java) in the coming months, so expect some additional instability.

To move this forward, I'd like to propose that this PR includes the following:

  1. adds a commandline flag called aggregateAcrossMBeans that defaults to false. If the user then opts into this, then the functionality in this PR should be active. Otherwise we keep the old behavior.
  2. document the new commandline flag and related behavior in the README for this module.

Thanks!

@jefchien
Copy link
Contributor Author

What you're saying makes a lot of sense and #3 does seem like a better approach, but I think the details for what that looks like needs to be fleshed out a bit more. I appreciate the proposal to unblock this and will add the command line argument to make this opt-in.

Copy link
Contributor

@breedx-splk breedx-splk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this @jefchien ! Appreciate it. Looks like you just need to reformat with ./gradlew spotlessApply and you should be good.

@trask trask added this to the v1.39.0 milestone Sep 16, 2024
@trask trask merged commit 41c8891 into open-telemetry:main Sep 16, 2024
14 checks passed
@jefchien jefchien deleted the fix-tomcat-metrics branch September 16, 2024 20:04
robsunday pushed a commit to robsunday/opentelemetry-java-contrib that referenced this pull request Sep 17, 2024
@SylvainJuge SylvainJuge mentioned this pull request Sep 17, 2024
2 tasks
SylvainJuge added a commit to SylvainJuge/opentelemetry-java-contrib that referenced this pull request Sep 17, 2024
commit b84a73e
Merge: 352202f 7df9862
Author: SylvainJuge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 11:13:11 2024 +0200

    Merge pull request #2 from SylvainJuge/jmx-scraper-it

    basic JMX client implementation + tests

commit 7df9862
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 11:11:50 2024 +0200

    fix bad merge again

commit e4a8f83
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 11:10:26 2024 +0200

    fix bad merge again

commit dad197b
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 11:09:18 2024 +0200

    fix bad merge

commit f8a73d7
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 11:08:42 2024 +0200

    spotless

commit eab01e9
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 11:01:47 2024 +0200

    spotless

commit f6667dd
Merge: 5b3ef9d 352202f
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 10:55:34 2024 +0200

    Merge branch 'jmx-scrapper' of github.com:SylvainJuge/opentelemetry-java-contrib into jmx-scraper-it

commit 5b3ef9d
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 10:51:11 2024 +0200

    add TODOs

commit 352202f
Merge: 5a82aac 40a2591
Author: SylvainJuge <763082+SylvainJuge@users.noreply.github.com>
Date:   Tue Sep 17 10:50:50 2024 +0200

    Merge pull request #1 from robsunday/jmx-scrapper

    Initial commit of JmxScraper code.

commit 40a2591
Author: jack-berg <34418638+jack-berg@users.noreply.github.com>
Date:   Mon Sep 16 16:05:50 2024 -0500

    Add declarative config support for RuleBasedRoutingSampler (open-telemetry#1440)

commit 6d39e93
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Mon Sep 16 10:40:16 2024 -0700

    Update dependency com.uber.nullaway:nullaway to v0.11.3 (open-telemetry#1457)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

commit eb5dc6f
Author: Bruno Baptista <brunobat@gmail.com>
Date:   Mon Sep 16 18:26:53 2024 +0100

    Fix native mode error cause by static init of random (open-telemetry#862)

commit 5ffa348
Author: jack-berg <34418638+jack-berg@users.noreply.github.com>
Date:   Mon Sep 16 12:23:41 2024 -0500

    Add declarative config support for aws xray propagators (open-telemetry#1442)

commit 151b744
Author: SylvainJuge <763082+SylvainJuge@users.noreply.github.com>
Date:   Mon Sep 16 19:23:19 2024 +0200

    add span stacktrace config option (open-telemetry#1414)

    Co-authored-by: jackshirazi <jacks@fasterj.com>

commit 9ee5fb1
Author: Yury Bubnov <ybubnov@gmail.com>
Date:   Mon Sep 16 10:18:51 2024 -0700

    Issue-1034 Short XRay Trace (open-telemetry#1036)

commit 11a2e1a
Author: Jeffrey Chien <chienjef@amazon.com>
Date:   Mon Sep 16 13:18:10 2024 -0400

    Fix Tomcat metric definitions to aggregate multiple MBeans. (open-telemetry#1366)

commit 9fa44be
Author: Pranav Sharma <sharmapranav@google.com>
Date:   Mon Sep 16 11:22:32 2024 -0400

    Fix incorrect cloud.platform value for GCF (open-telemetry#1454)

commit 11f2111
Author: Peter Findeisen <pfindeis@cisco.com>
Date:   Mon Sep 16 08:21:46 2024 -0700

    Composite Samplers prototype (open-telemetry#1443)

    Co-authored-by: Otmar Ertl <otmar.ertl@gmail.com>

commit b3386de
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Mon Sep 16 08:20:44 2024 -0700

    Update plugin com.squareup.wire to v5.1.0 (open-telemetry#1452)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

commit 27b631d
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Mon Sep 16 08:20:21 2024 -0700

    Update errorProneVersion to v2.32.0 (open-telemetry#1453)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

commit 791df4b
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Mon Sep 16 09:50:03 2024 -0500

    Update dependency io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom-alpha to v2.8.0-alpha (open-telemetry#1456)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
    Co-authored-by: Jack Berg <jberg@newrelic.com>

commit 19357e6
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Tue Sep 10 15:20:46 2024 -0700

    Update dependency com.gradle.enterprise:com.gradle.enterprise.gradle.plugin to v3.18.1 (open-telemetry#1450)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

commit 460470b
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Tue Sep 10 15:20:23 2024 -0700

    Update plugin com.gradle.develocity to v3.18.1 (open-telemetry#1451)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

commit b45cdab
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Tue Sep 10 12:32:56 2024 +0300

    Update dependency com.linecorp.armeria:armeria-bom to v1.30.1 (open-telemetry#1449)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

commit d36106e
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Tue Sep 10 09:11:26 2024 +0300

    Update micrometer packages to v1.13.4 (open-telemetry#1448)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

commit fb25c79
Author: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Date:   Mon Sep 9 15:44:53 2024 +0300

    Update dependency gradle to v8.10.1 (open-telemetry#1447)

    Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

commit f847561
Author: robsunday <rniedziela@splunk.com>
Date:   Tue Sep 17 09:22:21 2024 +0200

    Code review changes:
    Argument parsing moved to the main class.
    Argument validation now throw exception.
    More tests added for config factory.
    Created unit tests for JmxScraper class.

commit 03788ff
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Mon Sep 16 17:26:12 2024 +0200

    add TODO for testing server SSL support

commit a3fbeb5
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Mon Sep 16 17:08:16 2024 +0200

    add TODO to enable server-side SSL

commit 1619595
Author: Sylvain Juge <763082+SylvainJuge@users.noreply.github.com>
Date:   Mon Sep 16 16:59:17 2024 +0200

    wip

commit 164679f
Author: robsunday <rniedziela@splunk.com>
Date:   Wed Sep 11 12:34:12 2024 +0200

    Cleanup of config.
    Refactoring of JmxScraperConfigFactory.
    Added first unit tests.

commit 165fcb8
Author: robsunday <rniedziela@splunk.com>
Date:   Tue Sep 10 12:36:29 2024 +0200

    Initial commit of JmxScraper code. Application compiles and runs as standalone and can read config file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants