
Postpone reenqueuing the iteration job until after callbacks are done #345

Merged (2 commits) on Mar 1, 2021

Conversation

adrianna-chang-shopify (Contributor) commented on Feb 25, 2021

For: #335

This PR fixes a bug that occurs when a job is interrupted and reenqueued faster than the original job takes to shut down. This causes race conditions on the @run, producing errors from invalid status transitions (and, notably, a run that shows up as interrupted when it is actually running).

To solve this, we delay reenqueuing the job until after all callbacks have completed. We do this by calling #reenqueue_iteration_job in a prepended after_perform callback in TaskJob, and skipping the call that JobIteration performs in #iterate_with_enumerator.
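A rough sketch of the idea (hedged: the @reenqueue flag and the exact callback body are illustrative assumptions, not the literal patch):

    module MaintenanceTasks
      class TaskJob < ActiveJob::Base
        include JobIteration::Iteration

        # Prepended so it runs last among the after_perform callbacks,
        # i.e. only once on_shutdown and the status transition are done.
        after_perform(prepend: true) do
          reenqueue_iteration_job(should_ignore: false) if @reenqueue
        end

        private

        # Intercepts the call JobIteration makes from #iterate_with_enumerator.
        # The implicit call only records that a reenqueue was requested; the
        # explicit call from the callback above performs the actual reenqueue.
        def reenqueue_iteration_job(should_ignore: true)
          if should_ignore
            @reenqueue = true
          else
            super()
          end
        end
      end
    end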

Considerations

Arguably, callback ordering, and ensuring that a job shuts down and runs its callbacks successfully before the new one starts up, is something that should be upstreamed to JobIteration. I experimented with a PR for this, but a number of jobs in Core that rely on JobIteration also depend on the point at which the job is reenqueued in order to resume batch processing jobs where an error has occurred. (These jobs are currently able to resume from the last cursor when something goes wrong, but changing the order results in the new job not being pushed back to the queue, and things simply failing.)

This is worth reinvestigating upstream at a later point, but I think we should get this patch out in the meantime, given that it's affecting a number of users at the moment.

Tophatting

Try the steps without the changes, and then with them, to see the difference.
Steps to reproduce:

  • bundle add sidekiq
+++ test/dummy/config/application.rb
     # Application configuration can go into files in config/initializers
     # -- all .rb files in that directory are automatically loaded after loading
     # the framework and any gems in your application.
+    config.active_job.queue_adapter = :sidekiq
   end
 end
+++ app/jobs/maintenance_tasks/task_job.rb
     def on_shutdown
       if @run.cancelling?
         @run.status = :cancelled
         @run.ended_at = Time.now
       else
+        sleep(3)
         @run.status = @run.pausing? ? :paused : :interrupted
+++ test/dummy/config/initializers/maintenance_tasks.rb
+JobIteration.max_job_runtime = 4.seconds

And then run Maintenance::UpdatePostsTask from the UI
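To observe the difference, watch the run's status while the job shuts down. One way to do that (a hypothetical console session; the exact output depends on timing):

    # bin/rails console in test/dummy, while the task is running
    run = MaintenanceTasks::Run.last
    run.reload.status
    # Without the patch, this can report "interrupted" while the reenqueued
    # job is already iterating, and the original job's shutdown can raise an
    # invalid status transition error.
    # With the patch, the new job is only enqueued after the callbacks have
    # finished, so the status transitions stay valid.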

    unless private_method_defined?(:reenqueue_iteration_job)
      raise 'JobIteration::Iteration#reenqueue_iteration_job must be defined'
    end

    def reenqueue_iteration_job(should_ignore: true)
Member commented:
Technically we're not ignoring it, more like postponing it but ok 😅

adrianna-chang-shopify (author) replied:

I thought about using "postpone" instead, but then figured it's not really postponed, because we call it again explicitly; it's more just outright ignored the first time 😛
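Concretely, the two call sites behave differently (a sketch of the semantics under discussion, not a quote of the patch):

    # Implicit call made by JobIteration from #iterate_with_enumerator:
    reenqueue_iteration_job
    # should_ignore defaults to true, so the reenqueue is skipped.

    # Explicit call made from the prepended after_perform callback:
    reenqueue_iteration_job(should_ignore: false)
    # Actually pushes the job back onto the queue.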

@@ -9,6 +9,10 @@
require 'pagy'
require 'pagy/extras/bulma'

# Force the TaskJob class to load so we can verify upstream compatibility with
# the JobIteration gem
require_relative '../app/jobs/maintenance_tasks/task_job'
Member commented:

Maybe instead of requiring the file we could do the check here instead? But that separates it from the patch itself, so it's probably worse.

adrianna-chang-shopify (author) replied:

Yeah, I think it might be less confusing to keep them all together for now.
