feat: Adding queue latency metrics for ActiveJob #414
Conversation
```
@@ -81,6 +86,10 @@ def job_attributes(job)
  otel_attributes.compact
end

def milliseconds_since(timestamp)
```
I'm a little concerned about "polluting" the ActiveJob method namespace with this method, so I'm open to alternatives. It would also seem okay to me to inline this logic within the `job_attributes` method 🤷‍♂️
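For illustration, a minimal sketch of what the inlined alternative could look like. This is hypothetical rather than the PR's code: the attribute list is abbreviated, and it assumes `job.enqueued_at` may be either a `Time` or an ISO8601 string depending on the Rails version (and that `require 'time'` is available, as it is in Rails).

```ruby
# Hypothetical sketch: drop the milliseconds_since helper and inline the
# latency math directly into job_attributes.
def job_attributes(job)
  enqueued_at = job.enqueued_at.is_a?(String) ? Time.iso8601(job.enqueued_at) : job.enqueued_at

  otel_attributes = {
    'messaging.destination' => job.queue_name,
    'messaging.message_id' => job.job_id,
    'messaging.active_job.enqueued_at' => job.enqueued_at,
    # Inlined: milliseconds elapsed since the job was (re-)enqueued.
    'messaging.active_job.enqueued_duration_ms' =>
      enqueued_at && ((Time.now.utc - enqueued_at.to_time.utc) * 1000).round
  }

  otel_attributes.compact
end
```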
```
retry_on StandardError, wait: 0, attempts: 2

def perform
  sleep(0.01)
```
I initially tried setting `wait` on this job, but then the retry attempt was not running in the tests. I think that makes sense, as the retry would go back into the queue. I don't love throwing a `sleep` in here, and I am open to suggestions. This seemed like the simplest way to simulate the passage of time for the `first_enqueued_at` tests, though.
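For context, a hedged sketch of the full shape of a test job like the one being discussed (illustrative only; the PR's actual test job may differ in name and failure mechanics):

```ruby
require 'active_job'

# Hypothetical test job: wait: 0 keeps the retry runnable inside the test's
# queue adapter, and the short sleep guarantees measurable time passes between
# the first enqueue and the retry attempt.
class RetryJob < ActiveJob::Base
  retry_on StandardError, wait: 0, attempts: 2

  def perform
    # Simulate the passage of time so first_enqueued_at-based durations grow
    # between attempts.
    sleep(0.01)
    # Fail on the first execution only, so the retry path is exercised.
    raise StandardError, 'fail once to trigger a retry' if executions < 2
  end
end
```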
```
'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
'messaging.active_job.first_enqueued_at' => job.first_enqueued_at,
'messaging.active_job.enqueued_and_retry_duration_ms' => milliseconds_since(job.first_enqueued_at),
```
I have no strong opinion on the naming of `enqueued_duration_ms` or `enqueued_and_retry_duration_ms`. My other thought was `enqueued_wait_ms` and `enqueued_and_retry_wait_ms`. I chose `duration` because OTel already captures a `duration_ms` metric for spans, so I opted for consistency with the naming here.
```
@@ -72,6 +73,10 @@ def job_attributes(job)
  'messaging.destination' => job.queue_name,
  'messaging.message_id' => job.job_id,
  'messaging.active_job.provider_job_id' => job.provider_job_id,
  'messaging.active_job.enqueued_at' => job.enqueued_at,
  'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
```
The time that the job spent enqueued before this execution.
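For reference, a minimal sketch of what a `milliseconds_since` helper along these lines could look like (hypothetical; the PR's actual implementation may differ, and as the PR description notes, it relies on wall-clock time):

```ruby
require 'time'

# Hypothetical helper: milliseconds elapsed between a recorded timestamp and
# now. Uses wall-clock time, so it is subject to clock skew between the
# process that enqueued the job and the worker performing it.
def milliseconds_since(timestamp)
  return nil if timestamp.nil?

  # enqueued_at may be a Time or an ISO8601 string depending on the Rails
  # version, so normalize defensively.
  time = timestamp.is_a?(String) ? Time.iso8601(timestamp) : timestamp.to_time
  ((Time.now.utc - time.utc) * 1000).round
end
```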
```
'messaging.active_job.enqueued_at' => job.enqueued_at,
'messaging.active_job.enqueued_duration_ms' => milliseconds_since(job.enqueued_at),
'messaging.active_job.first_enqueued_at' => job.first_enqueued_at,
'messaging.active_job.enqueued_and_retry_duration_ms' => milliseconds_since(job.first_enqueued_at),
```
The time since the job was first enqueued. This captures:
- Time spent enqueued
- Time spent waiting to be retried
- Duration of prior executions
This should not capture:
- Time elapsed because the Job was enqueued with a `wait` or `wait_until` parameter
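To make the retry-spanning behavior concrete, here is a rough sketch (not the PR's implementation; the names and serialization details are assumptions) of how a `first_enqueued_at` value could be preserved across retries by round-tripping it through ActiveJob's serialized payload:

```ruby
require 'time'
require 'active_job'

# Hypothetical sketch: remember the timestamp of the very first enqueue so
# that retries report latency from the original enqueue, not the re-enqueue.
class ApplicationJob < ActiveJob::Base
  attr_writer :first_enqueued_at

  def first_enqueued_at
    # Set lazily on the first serialization (i.e. the first enqueue).
    @first_enqueued_at ||= Time.now.utc.iso8601(3)
  end

  def serialize
    # Retries re-serialize the job, so an already-present value is kept.
    super.merge('first_enqueued_at' => first_enqueued_at)
  end

  def deserialize(job_data)
    super
    self.first_enqueued_at = job_data['first_enqueued_at'] if job_data['first_enqueued_at']
  end
end
```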
@jdlubrano Thank you for taking the time to contribute to OTel Ruby instrumentation.
We are not currently accepting instrumentation changes that add metric data as span attributes. Normally I would recommend that we capture these as OTel Metrics, but our implementation is incomplete at this time.
Some options that I am able to recommend:
- Represent latency metrics as spans or span events. It will be symmetric with how we represent this data in other instrumentations, e.g. Sidekiq. The challenge here, I think, is that there are few well-established semantics around representing intermediaries, e.g. retries or time spent in the queue: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md
- Enrich or create custom spans: Trickier to do, but in your `ApplicationJob` code perhaps you can grab the current span and add these attributes (see the sketch below). The challenge I see here is that your custom implementation may not be compatible with new features your vendor may build around Intermediary message semconv and instrumentation.
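A minimal sketch of what that second option could look like, assuming the application already configures the OpenTelemetry SDK; the callback, the attribute name, and the latency math are illustrative assumptions, not part of this PR or of the ActiveJob instrumentation:

```ruby
require 'time'
require 'active_job'
require 'opentelemetry-sdk'

# Hypothetical sketch of option 2: enrich the span that instrumentation has
# already made current for this job execution.
class ApplicationJob < ActiveJob::Base
  around_perform do |job, block|
    span = OpenTelemetry::Trace.current_span

    if job.enqueued_at
      # enqueued_at may be a Time or an ISO8601 string depending on the Rails version.
      enqueued_at = job.enqueued_at.is_a?(String) ? Time.iso8601(job.enqueued_at) : job.enqueued_at
      span.set_attribute('messaging.active_job.enqueued_duration_ms',
                         ((Time.now.utc - enqueued_at.to_time.utc) * 1000).round)
    end

    block.call
  end
end
```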
Thanks for the suggestions @arielvalentin! Closing this to pursue the options that you recommended within my own application 👍
Thank you for understanding. I really hope this does not discourage you from submitting future contributions!
Firstly, thank you, both, for this discussion. I'm once again on my semi-regular journey to figure out a semantic and consistent way to extract latency data from traces for background jobs/messaging. 😆

Regarding the first alternative proposed above, latency by way of Span Events: that is currently how Sidekiq works, but it still might not be enough to be usable for SLOs, triggers, etc., depending on your OTel vendor. I haven't checked in a while, but to my knowledge, Honeycomb, for example, cannot extract those event timestamps into something you can use to calculate latency (e.g., a derived column).

Given that, I'm now considering the second option you proposed, @arielvalentin, wherein we enrich the current span (in Sidekiq-land, via a Server Middleware) to add the attributes ourselves. I agree that a downside of this is drift between our custom attributes/code and potential features our vendor (i.e., Honeycomb) might build around the Intermediary message semantic conventions.

The other downside, of course, is that we're all left to build this bespoke custom-attributes machinery ourselves. Which might be preferable to an …
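For what it's worth, a rough sketch of that Server Middleware idea (hypothetical: the middleware class, the attribute name, and the assumption that the job payload's `enqueued_at` is epoch seconds are mine, not part of this PR or of the Sidekiq instrumentation):

```ruby
require 'sidekiq'
require 'opentelemetry-sdk'

# Hypothetical Sidekiq server middleware that copies queue-latency data onto
# the span made current by the Sidekiq instrumentation.
class JobLatencyMiddleware
  include Sidekiq::ServerMiddleware if defined?(Sidekiq::ServerMiddleware)

  def call(_worker, job, _queue)
    span = OpenTelemetry::Trace.current_span

    if job['enqueued_at']
      # Sidekiq stores enqueued_at as epoch seconds (verify for your version).
      latency_ms = ((Time.now.to_f - job['enqueued_at'].to_f) * 1000).round
      span.set_attribute('messaging.sidekiq.job_latency_ms', latency_ms)
    end

    yield
  end
end

# Registration, e.g. in config/initializers/sidekiq.rb:
# Sidekiq.configure_server do |config|
#   config.server_middleware { |chain| chain.add(JobLatencyMiddleware) }
# end
```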
I just fiddled with using OTel Collector Contrib (v0.83.0) to hoist times from these span events onto the parent span with the transform processor. Maybe more experimentation can happen there to see what fits and to inform suggestions for semantic conventions around jobs and/or messaging. It went something like this:

A Rails app with these bits inside:

```ruby
class HomeController < ApplicationController
  def home
    HardJob.perform_in(2.seconds) if params[:run_job] == 'true'
    …
  end
  …
end

class HardJob
  include Sidekiq::Job

  def perform(*args)
    sleep(0.03)
  end
end
```

Run ApacheBench to abuse it:

```
❯ ab -n 10000 -c 20 "http://localhost:3000/?run_job=true"
```

OTel Collector Contrib running with …

```yaml
# collector config snippet
processors:
  …
  transform:
    error_mode: ignore
    trace_statements:
      - context: spanevent
        statements:
          - set(span.attributes["job.created_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"
          - set(span.attributes["job.enqueued_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
          - set(span.attributes["job.latency_ms"], ((span.start_time_unix_nano - time_unix_nano) / 1000000)) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
```

👈🏻 (and if you'd like to remove those span events after getting what data you want off them)

```yaml
processors:
  batch:
  transform/joblatency:
    error_mode: ignore
    trace_statements:
      - context: spanevent
        statements:
          - set(span.attributes["job.created_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"
          - set(span.attributes["job.enqueued_at"], time_unix_nano) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
          - set(span.attributes["job.latency_ms"], ((span.start_time_unix_nano - time_unix_nano) / 1000000)) where IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"
  filter/dropspanevents:
    error_mode: ignore
    traces:
      spanevent:
        - 'IsMatch(instrumentation_scope.name, "Sidekiq") and name == "created_at"'
        - 'IsMatch(instrumentation_scope.name, "Sidekiq") and name == "enqueued_at"'

service:
  pipelines:
    traces:
      receivers: […]
      processors: [transform/joblatency, filter/dropspanevents, batch]
      exporters: […]
```

🎉
Hello 👋. I'm looking for feedback on whether OpenTelemetry would be willing to support some metrics for ActiveJobs that I have found useful throughout my career. If the maintainers have no interest in supporting these or think they are generally a bad idea, then no hard feelings ☺️.
I think the best way that I can explain these metrics is through a scenario. So, suppose a User needs to reset their password. Typically they click a link or button or whatever and eventually a job gets enqueued to send the User an email to reset their password. Now, we don't want our User waiting around very long to receive that email because that's an awfully frustrating user experience. Typically my teams have come up with SLOs like, "We should send a User a reset password email within 5 minutes of them requesting one."
Sometimes things can go wrong when we try to send an email. For example:
(Note: It can also take an email service provider an unexpectedly long time to send an email even after our application submits a request, but I don't know of a reasonable way to measure that...I consider that to be well outside of the scope of this PR 😀.)
So, essentially, when the User requests a password reset, how long does it take for our system to initiate that email? There's a (hopefully small) amount of time for the User's request to reach our application server, there's a (hopefully negligible) amount of time for the background job to get enqueued, there's the time the job spends enqueued, the time the job spends waiting to be retried (if necessary), and the time it takes to run the job. The metrics introduced here allow us to measure the last three legs of that journey.
For my own use, I'm trying to use the attributes introduced in this PR to enable SLOs around `enqueued_and_retry_duration_ms + duration_ms` for my ActiveJobs. I'm curious how much or how little of that would be of interest to the OTel community.

Thank you in advance for any feedback that you have!
Shortcomings

- These duration calculations rely on `Time.now`, which is not ensured to be monotonic and not ensured to be consistent between web and worker processes.

Example Trace Attributes from Honeycomb
(Please note that `messaging.active_job.enqueue_to_performed_duration_ms` is a derived column and is not included in this PR. It can be if that's of interest!)