Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ActivitySource to start producer and consumer activities (spans) to support OpenTelemetry observability #2460

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

sgryphon
Copy link

OpenTelemetry is a cross-platform open-source standard for distributed tracing, which allows you to collect and analyze data about the performance of your systems.

OpenTelemetry is now the default for new .NET applications: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/observability-with-otel

These changes add a Hangfire ActivitySource to start a Producer activity (span) when a job (either background or recurring) is created, and then a Consumer activity when it is processed.

It addresses part of the integration to Aspire, requested in #2408

It also provides a built in core solution, compared to using filters discussed in #2017

The creation context information is persisted with the job, so that the distributed activities can be correlated, even if processed on a different server.

No library user involvement or changes are needed except to register a listener for the ActivitySource, such as via OpenTelemetry. The Producer activity will pick up any existing context (such as from an ASP.NET request), or create a new one as needed.

The NetCore example has been updated to show how TraceId is correlated between job creation and execution. The PR also includes unit tests and a brief mention in the Readme.

Note: Some of the work has been based on the MassTransit implementation of ActivitySource.

@sgryphon
Copy link
Author

sgryphon commented Nov 5, 2024

@odinserj should I target this at the Main or Dev branch? Thanks.

@sgryphon sgryphon force-pushed the feature/add-activity-source branch 2 times, most recently from f9f305b to db9e404 Compare November 15, 2024 00:16
@odinserj
Copy link
Member

Hi @sgryphon! I like the idea, but the number of changes is really high. How does it compare with the following gist that's implemented as a job filter? https://gist.github.com/odinserj/06e18950b5bf3083a5aed0ed06d3d18a

@sgryphon
Copy link
Author

How does it compare with the following gist that's implemented as a job filter?

Thanks for reviewing.

I am currently use a similar (but limited) filter, as discussed in #2017

I do think that activity tracing, like metrics, is better as part of the core solution, turned on by default / with a standard name. I based some of that on MassTransit, where you just subscribe to a known event source; the .NET elements (e.g. HttpClient) are similar in that you can just easily enable them.

Decoupling from the flow is however a good idea, so happy with a different approach. My main thoughts are:

  1. Can the filter be a default system filter, i.e. turn on by default (with a fixed Activity Source name).
  2. Does it support both background jobs and recurring jobs.

Apart from that, some specific feedback on the gist:

  • OpenTelemetry has some specific conventions around span naming, e.g. operation.name is usually a single token (i.e. 'perform_job' instead of 'perform job'). See https://opentelemetry.io/docs/specs/semconv/messaging/messaging-spans/#span-name
  • While you can add custom tags, like 'job.state' or 'job.storage', there are semantic conventions for many common tags, e.g. 'messaging.destination.name' should match the second part of the span name, 'message.operation.type' should be from a specific list, etc.
  • I don't think you need to store created at, because spans are timestamped anyway.
  • Not sure why OnCreating has two StartActivity; the Consumer span should be a child of the Producer span, i.e. only one span needed (that is passed through as the parent to the second)
  • Should pass through both activity.Id and activity.TraceStateString (even though we aren't using trace state, an upstream trace may).
  • For consistency (e.g. with HTTP) the job parameter names should be "traceparent" and "tracestate" (HTTP is not case sensitive, so often as "TraceParent")
  • Not sure if isRemote needs to be set (if it is, maybe check it is remote??)
  • Just double check that activity creates correctly is parent is null/empty (e.g. from a client that doesn't set it). I think older versions of the library didn't handle empty/null well.
  • SetStatus(OK) ignores the description string (so not needed).
  • SetStatus(Error) is only in newer (6+) library; otherwise set attributes separately (in my code). Fine if you are using the newer library. There is also an AddException() in .NET 9.

If you can answer the first two items, I'd be happy to work on a revised solution using the filter.

@sgryphon
Copy link
Author

  1. Can the filter be a default system filter, i.e. turn on by default (with a fixed Activity Source name).
  2. Does it support both background jobs and recurring jobs.

I found GlobalJobFilters.cs, which I think answers my first question.

I will work on using the alternative approach. It is nicer, as it neatly decouples the work. I also like putting the activity in the context (doesn't work for job parameters, which may be serialized to another machine, but I presume the start and end events run in the same process).

@sgryphon sgryphon force-pushed the feature/add-activity-source branch from db9e404 to c7a1681 Compare November 17, 2024 03:13
@sgryphon
Copy link
Author

I've re-written the code as a global filter. I think it is much nicer to decouple like that -- and is shows the good architecture/structure of Hangfire that it is easy to use.

I used your filter code as a base, and then merged with my code (& applied my suggestions).

I added it as default to global filters, and then checked that the samples run okay (updated readme & samples for new location of the constant).

Some notes:

  • I didn't use the IApplyStateFilter. It seems largely separate to the job processing. (One of the main goals was to get a TraceID for the performed job).
  • Answering my question no. 2, it doesn't correlate recurring jobs. Although I was kind of on the fence about that anyway, i.e. each separate run of a recurring job gets its own TraceID (which meets the goal of having a TraceID when performing). In some respects this is better than the same TraceID for all recurring jobs.

The one thing I haven't updated yet is the unit tests, as the previous code was built into Worker, etc so I had tests for those. I'll have a look at the tests for other filters and see if I can add anything valuable.

@sgryphon
Copy link
Author

New unit tests, for the filter, now added.

Addresses tracing aspects of HangfireIO#2408 for integration with Aspire, as well as all other OpenTelemetery based diagnostics,  and addresses HangfireIO#2017.

Add a default filter to start producer activities (spans) when jobs created, and consumer
activities when jobs performed. Pass the creation context through as TraceParent and TraceState job parameters, so that distributed tracing works across job scheduling.

Note that activity supports is only from netstandard2.0 onwards, and only creates activities if there is a configured listener.
Update NetCoreSample with an OpenTelemetry based listener to enable
the Hangfire activity source.

Add initial activity creation along with logging of TraceId to show
job correlation in log output. Add background job examples, including
error examples.
@sgryphon sgryphon force-pushed the feature/add-activity-source branch from cbd3534 to 3925ba9 Compare November 23, 2024 02:04
@odinserj
Copy link
Member

Thanks @sgryphon, looks really good! One caveat here is related to a new dependency included in Hangfire.Core, and it's version is deprecated as per NuGet. Given .NET 9.0 became strict against transitive dependencies (as in #2468), I'm thinking on the following steps:

  1. Bring net6.0 platform to Hangfire.Core in the dev branch.
  2. Rebase this PR onto the new branch.
  3. Enable OpenTelemetry support only on net6.0 platform with the corresponding version of System.Diagnostics.DiagnosticSource that's not deprecated.

So on the modern platforms we'll have out-of-box support for telemetry without deprecation-related problems, for other platforms it will be still possible to use https://www.nuget.org/packages/OpenTelemetry.Instrumentation.Hangfire.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants