Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document reliability guarantees #59

Closed
dv opened this issue Jul 25, 2020 · 8 comments
Closed

Document reliability guarantees #59

dv opened this issue Jul 25, 2020 · 8 comments

Comments

@dv
Copy link

dv commented Jul 25, 2020

What reliability specs is good_job built for? Is there a chance of jobs getting "lost" in certain (edge) cases such as workers being killed by the host, or network connections dieing, or is there a guarantee of at-least-once or exactly-once?

Your README right now mentions "run-once safety" but I'm not sure if you mean "exactly once" or "at least once", or if that's not a guarantee at all, since it's not explicit.

Also, it'd be great to know what behaviour to expect when a worker dies. I.e. does the in-progress job become available immediately for another worker to pick up, after the worker goes away, is there a timeout involved, does it require a cleanup process, etc...

In general, a paragraph in your documentation on the reliability of the jobs would be great. As it stands, I'm a bit hesitant to use it in production without having a clear idea of what behaviour to expect in failure cases.

@bensheldon
Copy link
Owner

Thanks for flagging this. That's important to document better. I'll write out the way it currently works here, maybe we can have some back and forth, and then I'll capture that in the readme. And probably definitely I'll discover some things to be improved.

GoodJob currently

An ActiveJob job enqueued with GoodJob can have 3 states:

  • Enqueued/Unfinished. The job will be queried and picked up by the next available execution thread or after the polling interval if no execution threads are awake.
  • Locked/Executing/Being performed. This is when the job is being executed. Jobs are locked with a session-level Advisory Lock and should (theoretically as far as I know) prevent the job from being executed more than once concurrently by multiple execution threads. Once the job has finished successfully executing, the job record is updated with a finished_at or is deleted. It is then unlocked.
  • Finished/Discarded/Deleted. This is when the job has "successfully" finished executing

The recommendation in the readme is to add this to the base ApplicationJob:

retry_on StandardError, wait: :exponentially_longer, attempts: Float::INFINITY`

This has the effect of rescuing any StandardErrors and results in ActiveJob enqueuing a new GoodJob job; to GoodJob, a retry_on (or a discard_on) appears as if the initial job ran successfully. That core GoodJob::Job#perform wrapper lives here:

begin
result = ActiveJob::Base.execute(params)
rescue StandardError => e
error = e
end
end
if error.nil? && result.is_a?(Exception)
error = result
result = nil
end
error_message = "#{error.class}: #{error.message}" if error
self.error = error_message
self.finished_at = Time.current
if destroy_after
destroy!
else
save!
end

When errors do bubble up to the GoodJob adapter:

  • When a ~StandardError bubbles all the way up to the GoodJob adapter, the job will be discarded (marked as finished, or destroyed). (I am committed to changing this. See notes below)
  • When an ~Exception bubbles all the way up to the GoodJob adapter, the job will not be marked as finished/deleted, the exception will bubble up through the Advisory Lock and unlock it, and the record will be left in a state where another execution thread can pick it up immediately.

ActiveJob is complicated

The core challenge here is that the interaction of ActiveJob and the backend can be ambiguous.

When the retry_on attempts: are exhausted (e.g. when retry_on StandardError, attempts: 5 is used, and the job fails a 6th time), ActiveJob will re-raise the error to the backend. This creates an unclear situation on the backend, that I've seen challenge other adapters too.

Where to go from here

Having written all of this, I think I've changed my mind about how GoodJob should work in the default state.

  • When a ~StandardError bubbles all the way up to the GoodJob adapter, the job will be discarded (marked as finished, or destroyed) the job will be preserved and re-run, unless GoodJob.standard_error_results_in_finished_job (maybe with more words-mithing 😅 ).

Your issue also triggered some more research and it seems that currently ActionMailer::MailDeliveryJob which is used when calling deliver_later on a Mailer) inherits from ActiveJob::Base which means that mailer errors may be incorrectly marked as finished/discarded. Yikes! I'll write a separate bug ticket for that.

https://github.com/rails/rails/blob/a9336a67b0b1f25784f5304b4477235352fc53c5/actionmailer/lib/action_mailer/delivery_job.rb#L10

What needs more details?

  • The behavior should change to be maximally safe by default
  • How much of GoodJob's documentation should explain Ruby exception handling practices?
  • How much of GoodJob's documentation needs to be explaining the ins-and-outs of ActiveJob?

@bensheldon
Copy link
Owner

@dv I've updated the README and made the behavior safer by default. Please let me know if it answers your questions. Thanks again!

@DannyBen
Copy link

I have a question. The README states "GoodJob guarantees at-least-once performance of jobs".

What if I want "once and once only"?

I have jobs that send messages, and I never want them to be performed more or less than once.

@bensheldon
Copy link
Owner

@DannyBen that's a very good question that unfortunately falls into the gap between the backend (like GoodJob, Sidekiq, Que, etc) and how you write your own job's perform logic. In other words, you point out that "job" is ambiguous in our documentation.

GoodJob guarantees that a completely performed job (e.g. no exception is raised in the middle of execution) will run once and only once. But can only guarantee that any particular step of the business logic within your job will be performed "at-least-once".

"Idempotency" is something all backends try to answer:

@DannyBen
Copy link

DannyBen commented Jul 28, 2020

Thanks for replying.

You point out that "job" is ambiguous in our documentation.

I am not sure it is. At least to me, a job means all the code invoked by the Job's perform method.

GoodJob guarantees that a completely performed job ... will run once and only once

That is great to know. This perhaps is the part that can be added to the README. I already assumed this was the case, but the text about "at least once" confused me.

As for other job "steps" that might run "at least once" - that is completely understandable, assuming I understand it correctly: If a job fails at "line 3" of its perform method, then lines 1 and 2 will be executed again (and again, until the break condition is met).

@bensheldon
Copy link
Owner

@DannyBen thank you for the feedback! I’ll integrate that into the Readme.

@dv
Copy link
Author

dv commented Aug 3, 2020

Thank you @bensheldon, you've answered my questions and more. I'll let you close this issue in case you want to continue this discussion with the other participants.

@bensheldon
Copy link
Owner

Thank you everyone for the feedback and discussion! 🙌 This has been very helpful for improving GoodJob's documentation and my own understanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants