
Postpone reenqueuing the iteration job until after callbacks are done #345

Merged (2 commits) on Mar 1, 2021

Conversation

adrianna-chang-shopify (Contributor) commented on Feb 25, 2021

For: #335

This PR fixes a bug that occurs when a job is interrupted and reenqueued faster than the original job takes to shut down. This causes race conditions on the @run, producing errors from invalid status transitions (and, notably, a run that shows up as interrupted when it is actually running).

To solve this, we delay reenqueuing the job until after all callbacks have completed. We do this by calling #reenqueue_iteration_job in a prepended after_perform callback in TaskJob, and skipping the call that JobIteration performs in #iterate_with_enumerator.
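A rough sketch of the idea (hedged: the @reenqueue flag and the exact callback body are illustrative assumptions, not the literal patch):

    module MaintenanceTasks
      class TaskJob < ActiveJob::Base
        include JobIteration::Iteration

        # Prepended so it runs last among the after_perform callbacks,
        # i.e. only once on_shutdown and the status transition are done.
        after_perform(prepend: true) do
          reenqueue_iteration_job(should_ignore: false) if @reenqueue
        end

        private

        # Intercepts the call JobIteration makes from #iterate_with_enumerator.
        # The implicit call only records that a reenqueue was requested; the
        # explicit call from the callback above performs the actual reenqueue.
        def reenqueue_iteration_job(should_ignore: true)
          if should_ignore
            @reenqueue = true
          else
            super()
          end
        end
      end
    end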

Considerations

Arguably, callback ordering, and ensuring that a job shuts down and runs its callbacks successfully before the new one starts up, is something that should be upstreamed to JobIteration. I experimented with a PR for this, but a number of jobs in Core that rely on JobIteration also depend on the point at which the job is reenqueued in order to resume batch processing jobs where an error has occurred. (These jobs are currently able to resume from the last cursor when something goes wrong, but changing the order results in the new job not being pushed back to the queue, and things simply failing.)

This is worth reinvestigating upstream at a later point, but I think we should get this patch out in the meantime, given that it's affecting a number of users at the moment.

Tophatting

Try the steps without the changes, and then with them, to see the difference.
Steps to reproduce:

  • bundle add sidekiq
+++ test/dummy/config/application.rb
     # Application configuration can go into files in config/initializers
     # -- all .rb files in that directory are automatically loaded after loading
     # the framework and any gems in your application.
+    config.active_job.queue_adapter = :sidekiq
   end
 end
+++ app/jobs/maintenance_tasks/task_job.rb
     def on_shutdown
       if @run.cancelling?
         @run.status = :cancelled
         @run.ended_at = Time.now
       else
+        sleep(3)
         @run.status = @run.pausing? ? :paused : :interrupted
+++ test/dummy/config/initializers/maintenance_tasks.rb
+JobIteration.max_job_runtime = 4.seconds

And then run Maintenance::UpdatePostsTask from the UI
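To observe the difference, watch the run's status while the job shuts down. One way to do that (a hypothetical console session; the exact output depends on timing):

    # bin/rails console in test/dummy, while the task is running
    run = MaintenanceTasks::Run.last
    run.reload.status
    # Without the patch, this can report "interrupted" while the reenqueued
    # job is already iterating, and the original job's shutdown can raise an
    # invalid status transition error.
    # With the patch, the new job is only enqueued after the callbacks have
    # finished, so the status transitions stay valid.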

    unless private_method_defined?(:reenqueue_iteration_job)
      raise 'JobIteration::Iteration#reenqueue_iteration_job must be defined'
    end

    def reenqueue_iteration_job(should_ignore: true)
Member commented:
Technically we're not ignoring it, more like postponing it but ok 😅

adrianna-chang-shopify (author) replied:

I thought about using "postpone" instead, but then figured it's not really postponed, because we call it again explicitly; it's more just outright ignored the first time 😛
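Concretely, the two call sites behave differently (a sketch of the semantics under discussion, not a quote of the patch):

    # Implicit call made by JobIteration from #iterate_with_enumerator:
    reenqueue_iteration_job
    # should_ignore defaults to true, so the reenqueue is skipped.

    # Explicit call made from the prepended after_perform callback:
    reenqueue_iteration_job(should_ignore: false)
    # Actually pushes the job back onto the queue.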

@@ -9,6 +9,10 @@
require 'pagy'
require 'pagy/extras/bulma'

# Force the TaskJob class to load so we can verify upstream compatibility with
# the JobIteration gem
require_relative '../app/jobs/maintenance_tasks/task_job'
Member commented:

Maybe instead of requiring the file we could do the check here instead? But that separates it from the patch itself, so it's probably worse.

adrianna-chang-shopify (author) replied:

Yeah, I think it might be less confusing to keep them all together for now.
