Skip to content

Conversation

@lahsivjar
Copy link
Contributor

@lahsivjar lahsivjar commented May 13, 2024

Hostmetrics remappers remap metrics produced by hostmetricsreceiver to Elastic system metrics. This allows powering Elastic's curated UIs based on OTel metrics.

The code from the remappers is adopted from the PoC done here.

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a high level review - looks good overall, main questions are whether we should be resilient to changes in the set of host metrics scraped, and whether this should be implementing the processor API directly rather than creating a new "remapper" abstraction.

// remapped metrics could be trivially converted into Elastic system metrics.
// The current remapping logic assumes that each Metric in the ScopeMetric
// will have datapoints for a single timestamp only.
func (r *Remapper) Remap(src pmetric.ScopeMetrics, out pmetric.MetricSlice) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for not implementing an opentelemetry-collector processor directly here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the full processor interface (ref)? -- I didn't do this because we didn't need to. The component.Component felt unnecessary complexity for our use-case.

The consumer interface (ref) was a better alternative but I decided against it because it would be too restrictive. My goal was to have an interface that can be used in a multiple ways and easily adapted to implement a consumer interface if needed - the reason why I separated the output pmetric.MetricSlice from the source's pmetric.ScopeMetrics. Let me know if it doesn't make sense, I can adopt the consumer interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What triggered me to ask is the assumption of using zap for logging. That is currently the case in opentelemetry-collector, but what if it changed? Then what if we wanted to add metrics (i.e. metrics measuring remapping)? So I was thinking maybe we should use the telemetry config struct and its logger.

It's not a big deal, and we can refactor as needed. I was wondering if there was something fundamentally preventing us - sounds like no.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal, and we can refactor as needed. I was wondering if there was something fundamentally preventing us - sounds like no.

Yeah, no blockers in doing that.

@lahsivjar
Copy link
Contributor Author

Putting this PR on draft after talking with @ishleenk17. Once #4 is merged I will rebase the PR and update the code.

@lahsivjar lahsivjar marked this pull request as draft May 14, 2024 11:47
@lahsivjar lahsivjar marked this pull request as ready for review May 14, 2024 14:16
@lahsivjar
Copy link
Contributor Author

Apologies for the force push, I have updated the PR after the original code for the remappers was merged. It is ready for review now.

@lahsivjar lahsivjar closed this May 14, 2024
@lahsivjar lahsivjar reopened this May 14, 2024
@lahsivjar lahsivjar requested a review from axw May 15, 2024 10:24
Copy link
Member

@ishleenk17 ishleenk17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Vishal!
I've left my initial comments on the PR. I'll be doing a more thorough review soon.

Also, let's change the title of the PR to something more appropriate.

if !r.Valid(src) {
return
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metricset.period calculation from the migrated code has been left out here. Was that intentional ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not intentional actually. Can you elaborate on how the metricset.period is used?

Copy link
Contributor Author

@lahsivjar lahsivjar May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I perused the actual usage here and I think this piece of code would better live outside the remappers (in the processor or in APM/MIS if required there) so that the interfaces don't get bloated. Let me know what you think.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added metricset.period because it's used in some of the UIs as a scale factor. i agree it's kinda clunky here - ideally you'd be able to read it from the receiver config (the scrape period) directly, but there's no way i could find to it.

we could add it in a downstream component, but i figured as it was a direct requirement of some of the elastic host UIs, it was ok to stick it here for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think the issue with doing it in library (or even in APM/MIS) would be that we would need to maintain a cache for all the hosts we see and calculate the metricset.period for each of it. @axw WDYT? Do you see a better way here?

but i figured as it was a direct requirement of some of the elastic host UIs, it was ok to stick it here for now

Do you have a list of UIs that break due to this? I checked the main inventory UI and that seems to work fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the poc implementation we just stored a single value for the whole processor. this worked fine - it assumes that we only have one pmetrics struct per scrape (across all the datasets), which is currently true. could you think of a configuration which would break it?

@tommyers-elastic But in APM/MIS we would get data from more than one collector instances, so we would need to calculate the metricset.period for all such instances even if it is 1 value per instance. Did I answer your question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lahsivjar : Lets open a ticket to discuss about metricset.period, as this should not block the PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are dealing with cumulative sums (in network and process remappers) :(

Right, that's a pain. One option, which is a little brittle but probably not worse than other options, would be to extract the value from any of the non-cumulative sum metrics. All of the host metrics are scraped on the same interval IIANM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@axw : I have created a separate issue for this discussion to unblock this PR.
Lets take it forward there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lahsivjar : Lets open a ticket to discuss about metricset.period, as this should not block the PR

The ticket for metricset.period.

@lahsivjar lahsivjar changed the title Add hostmetrics remappers Introduce an interface to translate hostmetrics and add tests May 15, 2024
Copy link

@tommyers-elastic tommyers-elastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good

if !r.Valid(src) {
return
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use the data point's timestamp - start_timestamp to calculate metricset.period? For gauges and delta sums, the start timestamp should be the beginning of the collection interval. It doesn't look like we're dealing with cumulative sums in this code.

Copy link
Member

@ishleenk17 ishleenk17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Member

@shmsr shmsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments.

m.SetName(metric.name)

var dp pmetric.NumberDataPoint
switch metric.dataType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't there be a default case to skip the metric if any other dataType comes other than MetricTypeGauge/ MetricTypeSum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code shows as diff here but it is same as what we have in main. IMO, the best approach would be to throw an error (or even panic here) since skipping would mean we have appended an empty metric to the slice with a name but didn't handle it -- this would only occur when there is a bug in the library and the tests don't catch it for both the cases. I will leave it as is for now and will open another PR to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will take it as part of next refactoring Subham.

@lahsivjar lahsivjar requested a review from shmsr May 17, 2024 09:00
@ishleenk17 ishleenk17 merged commit 161b0f7 into elastic:main May 17, 2024
@lahsivjar lahsivjar deleted the hostmetrics branch May 17, 2024 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants