Dispatch the metrics via telemetry #503

slashmili · 2022-05-11T11:51:15Z

Hello!

I'm wondering if you are open to a contribution to introduce telemetry into this project?

There lots of important metric that I'd like to tap into especially when consumer connect and rebalance and would be a great addition to this library.

zmstone · 2022-05-11T14:02:59Z

Hi @slashmili

It'd have to be configurable though.
WDYT? @ieQu1 @qzhuyan @id
and @kjellwinblad @starbelly @vikas15bhardwaj inputs are appreciated.

slashmili · 2022-05-11T14:42:31Z

@zmstone configureable to disable it completely? Ok fair enough.

Do you prefer it to be done via Macros or via a condition that only runs during runtime? I'm asking since if Macro is they way to go, I need to do some study first.

zmstone · 2022-05-11T16:23:48Z

Hi @slashmili
Ideally the dependency should be optional for users who do not need beam-telemetry.
We can probably introduce a brod_telemetry module to confine the telemetry dependencies
and use a macro to control if it's a no-op or call telemetry instead.
e.g.

-ifndef(BROD_USE_TELEMETRY).
execute(_Name, _Measurements, _Metadata) -> ok.
-else.
execute(Name, Measurements, Metadata) -> telemetry:exeucte(Name, Measurements, Metadata).
-endif.

starbelly · 2022-05-13T13:30:34Z

Telemetry in brod would be a big win 👍 💯

spencerdcarlson · 2022-05-13T15:26:20Z

I'd love to help on this wherever I can. I think librdkafka's STATISTICS page could be helpful when trying to figure out exactly what metrics should be emitted.

spencerdcarlson · 2022-05-13T17:14:26Z

@zmstone regarding

Hi @slashmili Ideally the dependency should be optional for users who do not need beam-telemetry.

Does this actually need to be the case? I thought by nature telemetry uses dynamic dispatching to delegate the execution of the handler or "work" process to the the subscriber so the framework has an "optional" built in. If you subscribe to the events then you "opt in" and some overhead is added, if you do not subscribe, then there is no handler for the given event and there is no overhead added.

It looks like the only overhead added to brod would be an ets lookup to see if a handler for an event exists -- This seems like a very small cost to pay in favor of the added complexity of an optional dependency and macro.

emit event - If there is no handler for an event then lists:foreach(ApplyFun, Handlers). will not even execute since Handlers =:= []
handler lookup - returns []

I don't see telemetry as an optional dependency for any large packages like ecto_sql which can have an extremely high throughput.

zmstone · 2022-05-14T06:54:10Z

I’m thinking about it more from dependency hygiene perspective but not concerned about performance.

Given it’s essentially two apis to integrate, I assumed it’s not a big effort to implement.

Btw, the compression libs are also made optional (at runtime).

One org may use beam-telemetry, another may have a different.
Being able to “plug” different instrumentation code at compile time makes it easy to extend in the future, also less intrusive to the wrapping project’s dependency management

victorolinasc · 2022-05-24T12:41:50Z

Hi @zmstone ! I think the purpouse of the beam-telemetry is to be agnostic of the handling details. It is just a dispatch library that is maintained by the community and endorsed by the observability working group of the ErlEF https://github.com/erlef/observability-wg .

I think that it's goal is exactly to be this standard glueing layer of telemetry. It seems to be aligned with your goal of dependency hygiene (if no handlers are provided, it is almost a pass-through call with no side-effect) and to allow consumers to choose which instrumentation library. From the ones I know it seems they all integrate nicely with telemetry.

Compression is a different case as there is no standard endorsed by the ErlEF currently.

What do you think?

zmstone · 2022-05-24T18:04:57Z

fair enough,
Make sense to have the instrumentation code defined as BROD_ prefixed macros?

So a wrapping project still has the chance to change to other calls without going though beam-telemetry

slashmili · 2022-05-25T19:41:03Z

TBH I don't mind wrapping around a brod_telemetry module either way works for me. telemetry has 3 functions we are interested in :

-module(brod_telemetry).

-export([execute/2,
         execute/3,
         span/3]).

execute(EventName, Measurements, Metadata) -> telemetry:execute(EventName, Measurements, Metadata).
execute(EventName, Measurements) -> telemetry:execute(EventName, Measurements).
span(EventPrefix, StartMetadata, SpanFunction) -> telemetry:span(EventPrefix, StartMetadata, SpanFunction).

What I've been struggling to do are :

Where to add these telemetries. At @highmobility we are interested in brod_group_coordinator activities. Like rebalancing and the time it takes to join the group and so on. I suppose others have interest in other parts. I was thinking to start introducing telemetry in places I'm familiar with and then the rest could be added as separate PRs by others(and me ofc).

let me know if you think the first PR should cover telemetry in any other module.

The way to use telemetry:

We can use telemetry in two ways:
A. telemetrye:execute
B. telemetrye:span

While execute is easy and requires less change in the code. I found myself most of the time reaching for span. for example this block of code using span will be like :
FROM:

do_connect(Endpoint, State) ->
  ConnConfig = conn_config(State),
  kpro:connect(Endpoint, ConnConfig).

TO:

do_connect(Endpoint, #state{client_id = ClientId} = State) ->
  StartMetadata =  #{client_id => ClientId},
  brod_telemetry:span(
    [brod, client, connect],
    StartMetadata,
    fun() ->
      ConnConfig = conn_config(State),
      Result = kpro:connect(Endpoint, ConnConfig),
      Metadata = maps:merge(StartMetadata, connect_result_to_metadata(Result)),
      {Result, Metadata}
    end
  ).

connect_result_to_metadata({ok, _}) -> #{status => ok, reason => nil};
connect_result_to_metadata({error, Reason}) -> #{status => error, reason => Reason}.

It's easier to use span since it does keep track of time and also triggers: start, stop and exception if it happens.

do_connect/2 is really a small function, imagine if it's a more complex function. To keep the code simple I have to move a lot of codes from their main function to another function. e.g brod_group_coordinator:join_group/1 becomes like :

join_group(#state{ groupId                   = GroupId
                   , client                  = ClientId
                 } = State0) ->
  StartMetadata = #{client_id => ClientId, group_id => GroupId} ,
  brod_telemetry:span(
    [brod, group_coordinator, join_group],
    StartMetadata,
    fun () ->
        Result = do_join_group(State0),
        {Result, maps:merge(StartMetadata, join_group_to_metadata(Result))}
    end).

join_group_to_metadata({ok, #state{
                memberId     = MemberId
                , leaderId     = LeaderId
                , generationId = GenerationId
                , members      = Members
                }}) ->
  #{
    member_id => MemberId
    , leader_id => LeaderId
    , generation_id => GenerationId
    , members => Members}.

do_join_group....

The code becomes too much noisy that I know I'll hate my PR already 😭

What do you think? Is this kind of changes are inevitable ? or do you have an other suggestions?

spencerdcarlson · 2022-05-27T20:03:14Z

Just looking at ecto/db_connection suite as a reference — their approach seems to be to time the relevant parts and emit the timed event with execute. sql.ex#L1101-L1103

slashmili · 2022-05-28T03:48:33Z

Thanks for the pointer!

I think it was kinda easier choice for Ecto since it already has the data for logging.

I also looked at:
broadway which uses only span
Finch which uses execute to simulate span like events

Ok so it's not uncommon to prevent messing with the code run execute with start, stop manually!

I'll take that approach then

btkostner · 2022-06-13T19:19:56Z

This would be a very welcomed feature and would also help integrate open telemetry tracing similar to how they do the oban integration.

spencerdcarlson · 2022-06-19T04:27:12Z

@slashmili let me know what you think of #512. Just starting to take a very light stab at adding telemetry.

tsloughter · 2022-06-23T20:20:32Z

On the topic of telemetry (the general term, not the library) there is also a need for propagating distributed trace context in kafka messages and instrumentation with telemetry does not help with this.

So if adding some abstraction that maybe calls telemetry depending on user configuration, it'd be great if there is also a way for the user to set a callback or middleware for updating messages being produced so they can add the context as kafka headers.

If a non-telemetry based way to instrument was provided then we'd probably also create an OpenTelemetry instrumentation instead of doing brod->telemetry->opentelemetry. But having to use a different module for calls to make them instrumented is annoying since it means you can't instrument libraries you are using without forking and changing them.

The way OpenTelemetry tries to make integration zero-cost for a user who doesn't use it is the separation of the API and SDK. Including only opentelemetry_api as a library, like brod or a new app opentelemetry_brod would do, results in calls to start spans or record metrics are no-op's. If the user includes the SDK, opentelemetry, then it hooks itself up so those calls actually do something.

And just in case, the Otel semantic conventions for Kafka may be of use https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/instrumentation/kafka.md

zmstone · 2022-07-08T20:19:42Z

Thanks you all for the ideas.
I have pushed a branch dev/telemetry, let's start instrumenting the code in this branch,
when everybody is happy with the instrumentations, we can merge it to master and bump a minor.

@tsloughter
to make the instrumentation nearly no-op for people who do not need it,
my original naive idea was to use a compile flag to be switched on by the wrapping project.

I only had a quick look at opentelemetry_api, do I understand it correctly that: the overhead
for someone who does not need it is a few reads from persistent_term?

zmstone · 2022-07-08T20:24:13Z

Hi again @tsloughter
I am not quite sure what would opentelemetry_brod do, or why you brought it up,
is there any reason why someone should not include opentelemetry_api but create a opentelemetry_appname instead ?
There is probably a link somewhere I can read up? :D

tsloughter · 2022-07-11T18:56:39Z

@zmstone compile time flag can be nice for some, but being able to switch between no-op and active without recompiling is useful/easier for some.

And yes, the opentelemetry_api macros will lookup the tracer to use with persistent terms and get the no-op tracer in the cases that there is no SDK installed.

There isn't a reason to not include opentelemetry_api. It is just common to have an opentelemetry_appname for various reasons, like the author not wanting to tie themselves to a particular library or just to have it faster/easier to iterate on the instrumentation.

Graborg · 2024-03-06T09:07:09Z

Any updates on this? It would be very valuable.

dimitarvp · 2024-03-09T04:57:39Z

Just recently I picked kaffe which steps on brod and I was wondering if OTel is still considered or in progress?

zmstone added the help wanted label Mar 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dispatch the metrics via telemetry #503

Dispatch the metrics via telemetry #503

slashmili commented May 11, 2022

zmstone commented May 11, 2022

slashmili commented May 11, 2022

zmstone commented May 11, 2022

starbelly commented May 13, 2022

spencerdcarlson commented May 13, 2022 •

edited

Loading

spencerdcarlson commented May 13, 2022 •

edited

Loading

zmstone commented May 14, 2022

victorolinasc commented May 24, 2022

zmstone commented May 24, 2022

slashmili commented May 25, 2022

spencerdcarlson commented May 27, 2022 •

edited

Loading

slashmili commented May 28, 2022

btkostner commented Jun 13, 2022

spencerdcarlson commented Jun 19, 2022 •

edited

Loading

tsloughter commented Jun 23, 2022

zmstone commented Jul 8, 2022

zmstone commented Jul 8, 2022

tsloughter commented Jul 11, 2022

Graborg commented Mar 6, 2024

dimitarvp commented Mar 9, 2024

Dispatch the metrics via telemetry #503

Dispatch the metrics via telemetry #503

Comments

slashmili commented May 11, 2022

zmstone commented May 11, 2022

slashmili commented May 11, 2022

zmstone commented May 11, 2022

starbelly commented May 13, 2022

spencerdcarlson commented May 13, 2022 • edited Loading

spencerdcarlson commented May 13, 2022 • edited Loading

zmstone commented May 14, 2022

victorolinasc commented May 24, 2022

zmstone commented May 24, 2022

slashmili commented May 25, 2022

spencerdcarlson commented May 27, 2022 • edited Loading

slashmili commented May 28, 2022

btkostner commented Jun 13, 2022

spencerdcarlson commented Jun 19, 2022 • edited Loading

tsloughter commented Jun 23, 2022

zmstone commented Jul 8, 2022

zmstone commented Jul 8, 2022

tsloughter commented Jul 11, 2022

Graborg commented Mar 6, 2024

dimitarvp commented Mar 9, 2024

spencerdcarlson commented May 13, 2022 •

edited

Loading

spencerdcarlson commented May 13, 2022 •

edited

Loading

spencerdcarlson commented May 27, 2022 •

edited

Loading

spencerdcarlson commented Jun 19, 2022 •

edited

Loading