Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workers eventually stop working #204

Open
carsoncole opened this issue Apr 10, 2024 · 42 comments
Open

Workers eventually stop working #204

carsoncole opened this issue Apr 10, 2024 · 42 comments
Assignees

Comments

@carsoncole
Copy link

carsoncole commented Apr 10, 2024

I seem to have a persistent problem of jobs being worked but new jobs just get enqueued and workers stop pulling jobs to be worked. If I restart Solid Queue (rake solid_queue:start) then the jobs are immediately worked.

This happens with both recurring jobs as well as jobs that are "perform_later".

Starting the solid queue looks all good and there aren't any subsequent log items that indicate any issues.

cc@Mac-mini funstuff % rake solid_queue:start
[SolidQueue] Starting Dispatcher(pid=76984, hostname=Mac-mini.local, metadata={:polling_interval=>1, :batch_size=>500, :concurrency_maintenance_interval=>600, :recurring_schedule=>{:"Expire subs"=>{:schedule=>"every day at 00:01", :class_name=>"ExpireSubsJob", :arguments=>[]}, :send_notifications_subs_expired=>{:schedule=>"every day at 00:15", :class_name=>"SendSubsExpirationNotificationJob", :arguments=>[]}, :send_notifications_subs_expiring=>{:schedule=>"every day at 10:00am", :class_name=>"SendSubsExpiringIn7DaysEmailJob", :arguments=>[]}, :"Send new posts email"=>{:schedule=>"every weekday at 5:00, 8:00, 11:00, 13:00, 15:00", :class_name=>"SendNewPostsEmailJob", :arguments=>[]}, :"Identify posts"=>{:schedule=>"every hour at minute 5", :class_name=>"IdentifyPostsJob", :arguments=>[]}, :"Add posts to Profiles"=>{:schedule=>"every 10 minutes", :class_name=>"ProcessNewPostsToProfilesJob", :arguments=>[]}, :"Destroy incomplete Posts"=>{:schedule=>"every day at 7:00am", :class_name=>"CleanUpPostsJob", :arguments=>[]}}})
[SolidQueue] Starting Worker(pid=76985, hostname=Mac-mini.local, metadata={:polling_interval=>1, :queues=>"*", :thread_pool_size=>3})

The only setting in my development.rb:
config.active_job.queue_adapter = :solid_queue

Solid Queue: 0.3.0
Rails: 7.1.3.2

@rosa
Copy link
Member

rosa commented Apr 10, 2024

Hey @carsoncole, sorry about that! Could you share a bit more about your DB setup? Also, could you try setting this:

config.solid_queue.silence_polling = false

And then starting solid_queue:start again? That might give us more clues about what's going on.

@ekampp
Copy link

ekampp commented Apr 10, 2024

I was seeing the same problem. Disabling polling silence seems to trigger it to work properly.

It turns out that it only works until a ruby file is changed.

@carsoncole
Copy link
Author

carsoncole commented Apr 11, 2024

@rosa Overnight, jobs stopped executing again. I set silence_polling = false. I've included the logs at the moment a job was created and enqueued. This is all running in development on my local Mac.

 SolidQueue::ScheduledExecution Pluck (3.0ms)  SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED  [["scheduled_at", "2024-04-11 15:19:59.498234"], ["LIMIT", 500]]

  TRANSACTION (0.1ms)  COMMIT
[ActiveJob]   TRANSACTION (0.2ms)  BEGIN
[ActiveJob]   SolidQueue::Job Create (2.5ms)  INSERT INTO "solid_queue_jobs" ("queue_name", "class_name", "arguments", "priority", "active_job_id", "scheduled_at", "finished_at", "concurrency_key", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING "id"  [["queue_name", "default"], ["class_name", "ProcessNewJobsToProfilesJob"], ["arguments", "{\"job_class\":\"ProcessNewJobsToProfilesJob\",\"job_id\":\"5fc05678-dbf5-45e6-81d6-f925734589f5\",\"provider_job_id\":null,\"queue_name\":\"default\",\"priority\":null,\"arguments\":[],\"executions\":0,\"exception_executions\":{},\"locale\":\"en\",\"timezone\":\"Pacific Time (US \\u0026 Canada)\",\"enqueued_at\":\"2024-04-11T15:20:00.008251000Z\",\"scheduled_at\":\"2024-04-11T15:20:00.008202000Z\"}"], ["priority", 0], ["active_job_id", "5fc05678-dbf5-45e6-81d6-f925734589f5"], ["scheduled_at", "2024-04-11 15:20:00.008202"], ["finished_at", nil], ["concurrency_key", "[FILTERED]"], ["created_at", "2024-04-11 15:20:00.008732"], ["updated_at", "2024-04-11 15:20:00.008732"]]
[ActiveJob]   TRANSACTION (0.2ms)  SAVEPOINT active_record_1
[ActiveJob]   SolidQueue::Job Load (0.2ms)  SELECT "solid_queue_jobs".* FROM "solid_queue_jobs" WHERE "solid_queue_jobs"."id" = $1 LIMIT $2  [["id", 907], ["LIMIT", 1]]
[ActiveJob]   SolidQueue::ReadyExecution Create (0.7ms)  INSERT INTO "solid_queue_ready_executions" ("job_id", "queue_name", "priority", "created_at") VALUES ($1, $2, $3, $4) RETURNING "id"  [["job_id", 907], ["queue_name", "default"], ["priority", 0], ["created_at", "2024-04-11 15:20:00.017665"]]
[ActiveJob]   TRANSACTION (0.1ms)  RELEASE SAVEPOINT active_record_1
[ActiveJob]   Log Create (0.4ms)  INSERT INTO "logs" ("status", "log_class", "class_id", "message", "request", "response_code", "source_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) RETURNING "id"  [["status", 0], ["log_class", "ProcessNewJobsToProfilesJob"], ["class_id", nil], ["message", "Job #5fc0567 enqueued"], ["request", nil], ["response_code", nil], ["source_id", nil], ["created_at", "2024-04-11 15:20:00.020842"], ["updated_at", "2024-04-11 15:20:00.020842"]]
[ActiveJob]    app/jobs/application_job.rb:9:in `block in <class:ApplicationJob>'
[ActiveJob] Enqueued ProcessNewJobsToProfilesJob (Job ID: 5fc05678-dbf5-45e6-81d6-f925734589f5) to SolidQueue(default)
  SolidQueue::Job Load (0.1ms)  SELECT "solid_queue_jobs".* FROM "solid_queue_jobs" WHERE "solid_queue_jobs"."id" = $1 LIMIT $2  [["id", 907], ["LIMIT", 1]]
  SolidQueue::RecurringExecution Create (0.6ms)  INSERT INTO "solid_queue_recurring_executions" ("job_id", "task_key", "run_at", "created_at") VALUES ($1, $2, $3, $4) RETURNING "id"  [["job_id", 907], ["task_key", "[FILTERED]"], ["run_at", "2024-04-11 15:20:00"], ["created_at", "2024-04-11 15:20:00.023454"]]
  TRANSACTION (1.5ms)  COMMIT
  Rendered admin/logs/_log.html.erb (Duration: 0.3ms | Allocations: 73)
[ActionCable] Broadcasting to logs: "<turbo-stream action=\"prepend\" target=\"logs\"><template><tr id=\"log_2893\" class=\"border-b\">\n  <td class=\"px-2 py-2 w-16 text-center dark:text-slate-700\">\n    <div class='rounded font-bold text-xs px-2 py-1 bg-blue-200'>INFO</div>\n  </td>\n  <td class=\"px-2 py-2\">\n    ProcessNewJob...
  TRANSACTION (0.2ms)  BEGIN
  SolidQueue::ScheduledExecution Pluck (1.9ms)  SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED  [["scheduled_at", "2024-04-11 15:20:00.507423"], ["LIMIT", 500]]
  TRANSACTION (0.1ms)  COMMIT
  TRANSACTION (0.3ms)  BEGIN
  SolidQueue::ScheduledExecution Pluck (3.2ms)  SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED  [["scheduled_at", "2024-04-11 15:20:01.515209"], ["LIMIT", 500]]

In the Solid Queue console here is the output since I launched it, which I notice undefined method queue_adapter_name' for nil:NilClass`...

carsoncole@Carsons-Mac-mini ahoyjob % rake solid_queue:start
[SolidQueue] Starting Dispatcher(pid=90937, hostname=Carsons-Mac-mini.local, metadata={:polling_interval=>1, :batch_size=>500, :concurrency_maintenance_interval=>600, :recurring_schedule=>{:"Expire subscriptions"=>{:schedule=>"every day at 00:01", :class_name=>"ExpireSubscriptionsJob", :arguments=>[]}, :send_notifications_subscriptions_expired=>{:schedule=>"every day at 00:15", :class_name=>"SendSubscriptionExpirationNotificationJob", :arguments=>[]}, :send_notifications_subscriptions_expiring=>{:schedule=>"every day at 10:00am", :class_name=>"SendSubscriptionExpiringIn7DaysEmailJob", :arguments=>[]}, :"Send new jobs email"=>{:schedule=>"every weekday at 5:00, 8:00, 11:00, 13:00, 15:00", :class_name=>"SendNewJobsEmailJob", :arguments=>[]}, :"Identify jobs"=>{:schedule=>"every 30 minutes at minute 5", :class_name=>"IdentifyJobsJob", :arguments=>[]}, :"Add jobs to Profiles"=>{:schedule=>"every 10 minutes", :class_name=>"ProcessNewJobsToProfilesJob", :arguments=>[]}, :"Destroy incomplete Jobs"=>{:schedule=>"every day at 7:00am", :class_name=>"CleanUpJobsJob", :arguments=>[]}, :"Clean up incomplete Jobs"=>{:schedule=>"every 4 hours at minute 45", :class_name=>"CleanUpIncompleteJobs", :arguments=>[]}}})
[SolidQueue] Starting Worker(pid=90938, hostname=Carsons-Mac-mini.local, metadata={:polling_interval=>1, :queues=>"*", :thread_pool_size=>3})
[SolidQueue] undefined method `queue_adapter_name' for nil:NilClass
[SolidQueue] Skipped recurring task Add jobs to Profiles at 2024-04-11 08:20:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Identify jobs at 2024-04-11 08:30:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Identify jobs at 2024-04-11 09:30:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Add jobs to Profiles at 2024-04-11 09:30:00 UTC — already dispatched
[SolidQueue] undefined method `queue_adapter_name' for nil:NilClass
[SolidQueue] undefined method `queue_adapter_name' for nil:NilClass
[SolidQueue] Skipped recurring task Send new jobs email at 2024-04-11 12:00:00 UTC  already dispatched
[SolidQueue] Skipped recurring task Destroy incomplete Jobs at 2024-04-11 14:00:00 UTC  already dispatched
[SolidQueue] Skipped recurring task Add jobs to Profiles at 2024-04-11 14:20:00 UTC  already dispatched
[SolidQueue] Skipped recurring task Identify jobs at 2024-04-11 14:30:00 UTC  already dispatched

Also, I'm seeing 8 records in the table solid_queue_ready_executions. Are these the jobs that are ready to be executed but aren't being executed?

@carsoncole
Copy link
Author

carsoncole commented Apr 11, 2024

Another recurring job just added. It's in the table solid_queue_recurring_executions and I see it as an entry in solid_queue_ready_executions, but in the log I only see the constant query SELECT "solid_queue_scheduled_executions" around the time the job is created.

@rosa
Copy link
Member

rosa commented Apr 11, 2024

Thanks for all this extra info! All the logs correspond to the dispatcher, it's as if the worker wasn't doing anything 🤔 The errors

[SolidQueue] undefined method `queue_adapter_name' for nil:NilClass

aren't related to jobs not being processed; they correspond to recurring tasks whose job class might not be defined. I need to improve this to show a better error in that case and don't try to dispatch the task, but you can ignore the errors for now.

Could you share a bit more about your DB setup?

@carsoncole
Copy link
Author

carsoncole commented Apr 11, 2024

database.yml

default: &default
  adapter: postgresql
  encoding: unicode
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>

development:
  <<: *default
  database: ahoyjob_development

This may be relevant, I'm using Postgresql for Action Cable

cable.yml

development:
  adapter: postgresql

test:
  adapter: test

production:
  adapter: postgresql

@carsoncole
Copy link
Author

Another datapoint is that I have a Job class in my app that represents jobs, ie. developer, bus driver etc.

@carsoncole
Copy link
Author

So through my test case below, I can trigger the undefined method `queue_adapter_name' for nil:NilClass with an incorrectly named class.

solid_queue.yml

  "Test Job":
    class: Testjob
    schedule: every minute

Job file was named testjob.rb and the class name was TestJob

I've fixed that now and running again to see if that has anything to do with not working jobs.

@carsoncole
Copy link
Author

It does appear that the SolidQueue::ReadyExecution just dies and does not give any indication of why. In my logs it is executing at its interval of 1 second, and then stops. Here in my logs you see it actually finds a record when doing

SolidQueue::ReadyExecution Pluck (2.5ms)  SELECT "solid_queue_ready_executions"."job_id" FROM "solid_queue_ready_executions" ORDER BY "solid_queue_ready_executions"."priority" ASC, "solid_queue_ready_executions"."job_id" ASC LIMIT $1 FOR UPDATE SKIP LOCKED  [["LIMIT", 3]]

But then stops. Here is the full log and after the last entry for SELECT solid_queue_ready_executions it no longer makes the call.

 SolidQueue::Job Load (0.2ms)  SELECT "solid_queue_jobs".* FROM "solid_queue_jobs" WHERE "solid_queue_jobs"."id" = $1 LIMIT $2  [["id", 930], ["LIMIT", 1]]
  SolidQueue::RecurringExecution Create (1.4ms)  INSERT INTO "solid_queue_recurring_executions" ("job_id", "task_key", "run_at", "created_at") VALUES ($1, $2, $3, $4) RETURNING "id"  [["job_id", 930], ["task_key", "[FILTERED]"], ["run_at", "2024-04-11 17:50:00"], ["created_at", "2024-04-11 17:50:00.141010"]]
  TRANSACTION (0.4ms)  COMMIT
  Rendered admin/logs/_log.html.erb (Duration: 0.6ms | Allocations: 576)
[ActionCable] Broadcasting to logs: "<turbo-stream action=\"prepend\" target=\"logs\"><template><tr id=\"log_2970\" class=\"border-b\">\n  <td class=\"px-2 py-2 w-16 text-center dark:text-slate-700\">\n    <div class='rounded font-bold text-xs px-2 py-1 bg-blue-200'>INFO</div>\n  </td>\n  <td class=\"px-2 py-2\">\n    ProcessNewJob...
  SolidQueue::Pause Pluck (0.9ms)  SELECT "solid_queue_pauses"."queue_name" FROM "solid_queue_pauses"
  TRANSACTION (0.2ms)  BEGIN
  SolidQueue::ReadyExecution Pluck (2.5ms)  SELECT "solid_queue_ready_executions"."job_id" FROM "solid_queue_ready_executions" ORDER BY "solid_queue_ready_executions"."priority" ASC, "solid_queue_ready_executions"."job_id" ASC LIMIT $1 FOR UPDATE SKIP LOCKED  [["LIMIT", 3]]
  SolidQueue::ClaimedExecution Insert (1.9ms)  INSERT INTO "solid_queue_claimed_executions" ("job_id","process_id","created_at") VALUES (930, 148, CURRENT_TIMESTAMP) RETURNING "id"
  SolidQueue::ClaimedExecution Load (0.4ms)  SELECT "solid_queue_claimed_executions".* FROM "solid_queue_claimed_executions" WHERE "solid_queue_claimed_executions"."job_id" = $1 AND "solid_queue_claimed_executions"."process_id" = $2  [["job_id", 930], ["process_id", 148]]
  SolidQueue::ReadyExecution Delete All (1.3ms)  DELETE FROM "solid_queue_ready_executions" WHERE "solid_queue_ready_executions"."job_id" = $1  [["job_id", 930]]
  TRANSACTION (0.4ms)  COMMIT
  SolidQueue::Pause Pluck (0.8ms)  SELECT "solid_queue_pauses"."queue_name" FROM "solid_queue_pauses"
  TRANSACTION (1.0ms)  BEGIN
  SolidQueue::ReadyExecution Pluck (2.7ms)  SELECT "solid_queue_ready_executions"."job_id" FROM "solid_queue_ready_executions" ORDER BY "solid_queue_ready_executions"."priority" ASC, "solid_queue_ready_executions"."job_id" ASC LIMIT $1 FOR UPDATE SKIP LOCKED  [["LIMIT", 2]]
  TRANSACTION (0.4ms)  COMMIT
  SolidQueue::Job Load (0.8ms)  SELECT "solid_queue_jobs".* FROM "solid_queue_jobs" WHERE "solid_queue_jobs"."id" = $1 LIMIT $2  [["id", 930], ["LIMIT", 1]]
  TRANSACTION (0.2ms)  BEGIN
  SolidQueue::ScheduledExecution Pluck (1.8ms)  SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED  [["scheduled_at", "2024-04-11 17:50:01.129034"], ["LIMIT", 500]]
  TRANSACTION (0.1ms)  COMMIT
  TRANSACTION (0.2ms)  BEGIN
  SolidQueue::ScheduledExecution Pluck (1.6ms)  SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED  [["scheduled_at", "2024-04-11 17:50:02.138170"], ["LIMIT", 500]]

@SjoerdL
Copy link

SjoerdL commented Apr 12, 2024

We noticed a similar issue in our setup. I couldn't reproduce it with a clean Rails app, so that makes it harder to debug. For us, we could reliably reproduce the issue in the following way:

  1. Checkout a branch
  2. Start solidqueue
  3. Run anything that enqueues a job
    => This works fine. Solidqueue claims the job and executes it
  4. Checkout a different branch (which diverged from 1)
    => we see a code reload in solidqueue
  5. Run the same job from 3
    => This doesn't work anymore. Solidqueue claims the job, but it doesn't get executed

We use a default configuration of solidqueue. I tried to dive into it a bit further. I found that solidqueue picked up the job and dispatches it to ActiveSupport. Here it gets stuck in executing before callbacks. It gets very hairy here for me to understand what is going on, but it seemed that one of the before callbacks tries to obtain a ShareLock, which is already held by a dead solidqueue worker. So the before callback never acquires the lock, and therefore keeps waiting indefinetely.

{#<Thread:0x000000010fc929e8@io-worker-1 /Users/myname/.asdf/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:345 dead>=>[nil, [:load, false]]}

This specific thread was started by launch_process_prune. I am thinking that a code reload somehow kills this worker thread, but I can't really figure out why or what should be done to prevent it.

@rosa
Copy link
Member

rosa commented Apr 12, 2024

@carsoncole thanks for all these extra details 🙏🏻 @SjoerdL, that's super helpful! I'll try to investigate a bit more today, and see if I can get to to bottom of this 🔍💪🏻

@carsoncole
Copy link
Author

I got the issue to show itself quickly by changing the threads to 1. To hide the issue longer, increase the number of threads.

  workers:
    - queues: "*"
      threads: 1
      processes: 1
      polling_interval: 1

@carsoncole carsoncole changed the title Enqueuing, but not Working unless restarted Workers eventually stop working Apr 14, 2024
@carldr
Copy link

carldr commented Apr 14, 2024

I'm experiencing a similar problem too, and is reliably caused in development mode by saving any Ruby file in my Rails project. As soon as I do, new jobs are picked up but never finish, and I have to ctrl+c solid queue and re-start it. If I don't save a file, Solid Queue will continue to work fine for hours.

I've not checked if they actually run or not before getting stuck, I came across this issue while investigating whether I'd setup something up incorrectly, but this seems very similar behaviour to what @SjoerdL is experiencing.

@ekampp
Copy link

ekampp commented Apr 16, 2024

Hey, @rosa, thanks for digging into this one!

@bebbs
Copy link

bebbs commented Apr 16, 2024

I can reproduce this in a personal project in development, presenting exactly the same way as @carldr describes above.

@chorstikus
Copy link

Having the save problem. Seems to work fine until I edit and save a file.

@nzifnab
Copy link

nzifnab commented May 1, 2024

I'm experiencing a similar problem too, and is reliably caused in development mode by saving any Ruby file in my Rails project. As soon as I do, new jobs are picked up but never finish, and I have to ctrl+c solid queue and re-start it. If I don't save a file, Solid Queue will continue to work fine for hours.

I've not checked if they actually run or not before getting stuck, I came across this issue while investigating whether I'd setup something up incorrectly, but this seems very similar behaviour to what @SjoerdL is experiencing.

I'm currently able to reproduce doing this exact step. Perform an action in the app that enqueues a job, it executes flawlessly. Save a couple files (I added comments to 3 files and saved them), now that exact same action done in the exact same manner enqueues a job, but it gets stuck in "In Progress Jobs" for some reason.

FWIW I'm running solid_queue using a foreman Procfile.dev:

web: bundle exec rdbg --command --nonstop --open=vscode --sock_path=tmp/sockets/rdbg.socket -- bundle exec bin/rails server -p 3000
css: bundle exec bin/rails tailwindcss:watch[debug]
worker: if [ "$SKIP_SOLID_QUEUE" != "true" ]; then bundle exec rake solid_queue:start; else sleep 2147483647; fi

SolidQueue: 0.3.0
Rails 7.1.3.2

I have mission control too? Not sure that's relevant but figured I'd include it, it's what I use to see that jobs got stuck in "In Progress"

@rosa
Copy link
Member

rosa commented May 1, 2024

Thanks everyone for the info, and so sorry I haven't been able to work on this one yet. It's the next one on deck once I get #208 out of the way 💪

@ibrahima
Copy link

ibrahima commented May 6, 2024

Hmm, just to make sure I'm on the same page - is SolidQueue supposed to auto-reload code in development? I remember back in the day Sidekiq did not, but I just checked and it looks like it has done so for several years so my knowledge is a bit out of date. But I imagine for a first-party solution it should work more seamlessly anyway.

I feel like I may have experienced something like this but I am not 100% sure.

@rosa
Copy link
Member

rosa commented May 8, 2024

But I imagine for a first-party solution it should work more seamlessly anyway.

Absolutely, I need to get this fixed 100% 😅 So yes, this is definitely a problem of Solid Queue right now.

@rosa rosa self-assigned this May 9, 2024
@rosa
Copy link
Member

rosa commented May 9, 2024

Thanks again everyone for the patience and the steps to reproduce this, it has been incredibly helpful 🙇‍♀️ ❤️ . I reproduced it without issues thanks to that, and I think I've tracked down the problem to a fix put in place for code reloading 😅 I think I know how to fix but need to do some more tests and it's getting a bit late over here, hopefully I'll be able to wrap this up tomorrow! 🤞

@nzifnab
Copy link

nzifnab commented May 9, 2024

Awesome! Thanks for jumping on this rosa, you're a rockstar (And this project is really sweet), it's been bothering me and my team for a little bit now so it's good to see some momentum on this fix!

@rosa
Copy link
Member

rosa commented May 10, 2024

I was wrong with the root cause, it wasn't related to this fix but to something else (basically @SjoerdL was spot on! 🎯 ). I think #223 should fix this problem. I can't reproduce it anymore with that fix, but if anyone else could test it as well, I'd be super grateful 🙏🏻

@carsoncole
Copy link
Author

carsoncole commented May 10, 2024

I've been running the "reload-dev-fix" #223 branch and jobs still fail to run after a while and are just enqueued. At one point, "Release claimed job" runs and the jobs are run, but I've got them enqueuing again, and 30 mins have passed so not sure how often the release happens.

SolidQueue-0.3.0 Prune dead processes (1.9ms)  size: 0
SolidQueue-0.3.0 Prune dead processes (1.6ms)  size: 0
SolidQueue-0.3.0 Release claimed job (11.7ms)  job_id: 1256, process_id: 225
SolidQueue-0.3.0 Release claimed job (6.7ms)  job_id: 1257, process_id: 225
SolidQueue-0.3.0 Release claimed job (4.0ms)  job_id: 1258, process_id: 225
SolidQueue-0.3.0 Release claimed job (3.8ms)  job_id: 1259, process_id: 225
SolidQueue-0.3.0 Release claimed job (3.5ms)  job_id: 1260, process_id: 225

For me, it's easy to see when the jobs stop running as I'm using hooks in ApplicationJob to log every step of the process. This may be helpful for others trying to debug to use.

class ApplicationJob < ActiveJob::Base
  after_enqueue do |job|
    Log.create(message: "Job ##{ job.job_id[0, 7] } enqueued", log_class: job.class, status: :info, queue: job.queue_name)
  end

  before_perform do |job|
    Log.create(message: "Job ##{ job.job_id[0, 7] } started", log_class: job.class, status: :info, queue: job.queue_name)
  end

  after_perform do |job|
    Log.create(message: "Job ##{ job.job_id[0, 7] } completed", log_class: job.class, status: :info, queue: job.queue_name)
  end
end

@rosa
Copy link
Member

rosa commented May 10, 2024

Huh, I see! @carsoncole, but does this happen when you modify any code? Or just out of the blue without you doing anything? Wondering if this issue is the same as the one I've attempted to fix or if it's something else 🤔 The bit that I can't explain is why your jobs are released:

SolidQueue-0.3.0 Release claimed job (11.7ms)  job_id: 1256, process_id: 225
SolidQueue-0.3.0 Release claimed job (6.7ms)  job_id: 1257, process_id: 225
SolidQueue-0.3.0 Release claimed job (4.0ms)  job_id: 1258, process_id: 225
SolidQueue-0.3.0 Release claimed job (3.8ms)  job_id: 1259, process_id: 225
SolidQueue-0.3.0 Release claimed job (3.5ms)  job_id: 1260, process_id: 225

This means something happened to process 225, but it's strange that you didn't get any logs about that, about the process being deregistered... 😕 Did you happen to see if any of the logs lines about

SolidQueue-0.3.0 Prune dead processes (1.6ms)  size: 0

had any other size than 0?

@ekampp
Copy link

ekampp commented May 10, 2024

I'm also running that branch, and I'm getting this a lot. Probably every 5 or six seconds after I start changing code.

SolidQueue-0.3.0 Prune dead processes (1.9ms) size: 0

@carsoncole
Copy link
Author

I've been running this for a few hours and I don't have any logs indicating a size other than 0.

This log entry is occurring ever 1-2 seconds.

SolidQueue-0.3.0 Prune dead processes (5.9ms)  size: 0

As for modifying code, I've made multiple code changes across the app during this timeframe.

@rosa
Copy link
Member

rosa commented May 10, 2024

So, to be clear, this is expected:

SolidQueue-0.3.0 Prune dead processes (1.9ms) size: 0

But I realised I left a debugging setting there and changed the frequency of the dead processes check to every 2 seconds instead of every SolidQueue.process_alive_threshold, which is 5 minutes by default. I'll push a change for that. But in any case, the logs

SolidQueue-0.3.0 Prune dead processes (1.9ms) size: 0

are expected and don't mean something is wrong. The workers hanging and not processing jobs do mean something is wrong.

Going to push the change to stop running the process prune every 2 seconds.

Another question: @carsoncole: is it possible something external to Solid Queue is killing your worker processes? What you describe is different from the error I've fixed 🤔 To be clear, the error I've addressed is:

I'm experiencing a similar problem too, and is reliably caused in development mode by saving any Ruby file in my Rails project. As soon as I do, new jobs are picked up but never finish, and I have to ctrl+c solid queue and re-start it. If I don't save a file, Solid Queue will continue to work fine for hours.

@carsoncole
Copy link
Author

I understand on the "Prune dead processes" log entry. Sounds good.

As for the issue of not saving files results in everything working fine, I'll test for that. But as of now, I am changing files and the issue is occurring.

@rosa
Copy link
Member

rosa commented May 10, 2024

Ohhh bummer! And so odd! I couldn't reproduce it anymore with that change, while I could reproduce it consistently without it. I'll keep digging. Thank you so much for your patience 🙏

@carsoncole
Copy link
Author

I can confirm that the following is correct.

I'm experiencing a similar problem too, and is reliably caused in development mode by saving any Ruby file in my Rails project. As soon as I do, new jobs are picked up but never finish, and I have to ctrl+c solid queue and re-start it. If I don't save a file, Solid Queue will continue to work fine for hours.

I can't see the pattern as to what file change affects what, as I have different queues, and it'll be just one queue that no longer executes enqueued jobs unless restarted.

@nzifnab
Copy link

nzifnab commented May 13, 2024

So far in my limited testing, that branch has fixed my issue of code reloading. I'll continue to experiment w/ it; I haven't been testing with it for very long.

@jmdfm
Copy link
Contributor

jmdfm commented May 18, 2024

This is the full dump of the stack trace that happens in my environment for @rafaelfranca:

  • ruby 3.3.1 (2024-04-23 revision c56cd86388) [arm64-darwin23]
  • Rails 7.1.3.2
  • PostgresQL (running in a docker container)

This particular crash happened because I did a db:reset. The dump is too large to paste here, so it's a gist:

Stack Dump

@andyjeffries
Copy link

@rosa I was experiencing the same thing, came here to see what the latest update is and can confirm reload-dev-fix appears to fix it for me (at least it's been over an hour since I updated and it hasn't died yet, but normally wouldn't last more than 15-30 minutes)

@rosa
Copy link
Member

rosa commented May 20, 2024

Thanks a lot for letting me know, @andyjeffries and @nzifnab 🙏 I'm going to merge that fix and ship a new version but will keep this one open since it still fails for some people, so I'll keep investigating.

@jmdfm
Copy link
Contributor

jmdfm commented May 23, 2024

FWIW I have also been using the reload-dev-fix branch for around 5 days now, and I haven't seen this issue crop up once!

@andyjeffries
Copy link

I switched to the newly released version that includes this merged fix since Tuesday May 21 08:50 BST and it's been absolutely fine for me since then too. Awesome fix @rosa, thanks !

@rosa
Copy link
Member

rosa commented Jul 18, 2024

Hey @carsoncole, are you still experiencing this? I haven't heard any more reports since this fix went live, so I'd like to dig more into your case, because it's very strange. You mentioned:

as I have different queues, and it'll be just one queue that no longer executes enqueued jobs unless restarted.

But are these queues run by different workers, all individually? Or multiple queues can be run by the same worker? Just wondering if this might be a priority kind of problem 🤔

@searls
Copy link

searls commented Jul 31, 2024

👋 Came to the repo to open a similar issue, which I've been experiencing in local development since January. Until reading this, I was sort of at a loss as to why solid_queue workers would pick stuff up fine at startup and then eventually stop grabbing work—I hadn't considered that it might be due to (some consequence of) class reloading.

Unfortunately, I'm currently on the released version (0.3.4) and still experiencing this daily in development. I was just about to switch over to running inline or async locally to unblock myself. It's possible that I'm experiencing a different issue leading to the same outcome, but I don't feel well-equipped to debug it (any tips would be appreciated)

@rosa
Copy link
Member

rosa commented Jul 31, 2024

Ouch, sorry to hear that @searls! I wonder if you could try running the branch in #269, not via the Puma plugin or anything, but just directly, and see if that changes anything 🤔

Do you think your issue happens also due to class reloading or could it be something else? What DB are you using?

@searls
Copy link

searls commented Aug 1, 2024

Hi @rosa thank you for replying. I have been running off the branch in #269 since this morning and haven't seen it fail yet, but I'll keep you posted. I was almost sure it was failing again but it turns out that I was seeing a separate apparent issue

@rosa
Copy link
Member

rosa commented Aug 1, 2024

Thanks a lot for letting me know, @searls 🙏 Huh, that issue is strange indeed 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests