-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workers eventually stop working #204
Comments
Hey @carsoncole, sorry about that! Could you share a bit more about your DB setup? Also, could you try setting this: config.solid_queue.silence_polling = false And then starting |
I was seeing the same problem. It turns out that it only works until a ruby file is changed. |
@rosa Overnight, jobs stopped executing again. I set SolidQueue::ScheduledExecution Pluck (3.0ms) SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED [["scheduled_at", "2024-04-11 15:19:59.498234"], ["LIMIT", 500]]
TRANSACTION (0.1ms) COMMIT
[ActiveJob] TRANSACTION (0.2ms) BEGIN
[ActiveJob] SolidQueue::Job Create (2.5ms) INSERT INTO "solid_queue_jobs" ("queue_name", "class_name", "arguments", "priority", "active_job_id", "scheduled_at", "finished_at", "concurrency_key", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING "id" [["queue_name", "default"], ["class_name", "ProcessNewJobsToProfilesJob"], ["arguments", "{\"job_class\":\"ProcessNewJobsToProfilesJob\",\"job_id\":\"5fc05678-dbf5-45e6-81d6-f925734589f5\",\"provider_job_id\":null,\"queue_name\":\"default\",\"priority\":null,\"arguments\":[],\"executions\":0,\"exception_executions\":{},\"locale\":\"en\",\"timezone\":\"Pacific Time (US \\u0026 Canada)\",\"enqueued_at\":\"2024-04-11T15:20:00.008251000Z\",\"scheduled_at\":\"2024-04-11T15:20:00.008202000Z\"}"], ["priority", 0], ["active_job_id", "5fc05678-dbf5-45e6-81d6-f925734589f5"], ["scheduled_at", "2024-04-11 15:20:00.008202"], ["finished_at", nil], ["concurrency_key", "[FILTERED]"], ["created_at", "2024-04-11 15:20:00.008732"], ["updated_at", "2024-04-11 15:20:00.008732"]]
[ActiveJob] TRANSACTION (0.2ms) SAVEPOINT active_record_1
[ActiveJob] SolidQueue::Job Load (0.2ms) SELECT "solid_queue_jobs".* FROM "solid_queue_jobs" WHERE "solid_queue_jobs"."id" = $1 LIMIT $2 [["id", 907], ["LIMIT", 1]]
[ActiveJob] SolidQueue::ReadyExecution Create (0.7ms) INSERT INTO "solid_queue_ready_executions" ("job_id", "queue_name", "priority", "created_at") VALUES ($1, $2, $3, $4) RETURNING "id" [["job_id", 907], ["queue_name", "default"], ["priority", 0], ["created_at", "2024-04-11 15:20:00.017665"]]
[ActiveJob] TRANSACTION (0.1ms) RELEASE SAVEPOINT active_record_1
[ActiveJob] Log Create (0.4ms) INSERT INTO "logs" ("status", "log_class", "class_id", "message", "request", "response_code", "source_id", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) RETURNING "id" [["status", 0], ["log_class", "ProcessNewJobsToProfilesJob"], ["class_id", nil], ["message", "Job #5fc0567 enqueued"], ["request", nil], ["response_code", nil], ["source_id", nil], ["created_at", "2024-04-11 15:20:00.020842"], ["updated_at", "2024-04-11 15:20:00.020842"]]
[ActiveJob] ↳ app/jobs/application_job.rb:9:in `block in <class:ApplicationJob>'
[ActiveJob] Enqueued ProcessNewJobsToProfilesJob (Job ID: 5fc05678-dbf5-45e6-81d6-f925734589f5) to SolidQueue(default)
SolidQueue::Job Load (0.1ms) SELECT "solid_queue_jobs".* FROM "solid_queue_jobs" WHERE "solid_queue_jobs"."id" = $1 LIMIT $2 [["id", 907], ["LIMIT", 1]]
SolidQueue::RecurringExecution Create (0.6ms) INSERT INTO "solid_queue_recurring_executions" ("job_id", "task_key", "run_at", "created_at") VALUES ($1, $2, $3, $4) RETURNING "id" [["job_id", 907], ["task_key", "[FILTERED]"], ["run_at", "2024-04-11 15:20:00"], ["created_at", "2024-04-11 15:20:00.023454"]]
TRANSACTION (1.5ms) COMMIT
Rendered admin/logs/_log.html.erb (Duration: 0.3ms | Allocations: 73)
[ActionCable] Broadcasting to logs: "<turbo-stream action=\"prepend\" target=\"logs\"><template><tr id=\"log_2893\" class=\"border-b\">\n <td class=\"px-2 py-2 w-16 text-center dark:text-slate-700\">\n <div class='rounded font-bold text-xs px-2 py-1 bg-blue-200'>INFO</div>\n </td>\n <td class=\"px-2 py-2\">\n ProcessNewJob...
TRANSACTION (0.2ms) BEGIN
SolidQueue::ScheduledExecution Pluck (1.9ms) SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED [["scheduled_at", "2024-04-11 15:20:00.507423"], ["LIMIT", 500]]
TRANSACTION (0.1ms) COMMIT
TRANSACTION (0.3ms) BEGIN
SolidQueue::ScheduledExecution Pluck (3.2ms) SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED [["scheduled_at", "2024-04-11 15:20:01.515209"], ["LIMIT", 500]] In the Solid Queue console here is the output since I launched it, which I notice carsoncole@Carsons-Mac-mini ahoyjob % rake solid_queue:start
[SolidQueue] Starting Dispatcher(pid=90937, hostname=Carsons-Mac-mini.local, metadata={:polling_interval=>1, :batch_size=>500, :concurrency_maintenance_interval=>600, :recurring_schedule=>{:"Expire subscriptions"=>{:schedule=>"every day at 00:01", :class_name=>"ExpireSubscriptionsJob", :arguments=>[]}, :send_notifications_subscriptions_expired=>{:schedule=>"every day at 00:15", :class_name=>"SendSubscriptionExpirationNotificationJob", :arguments=>[]}, :send_notifications_subscriptions_expiring=>{:schedule=>"every day at 10:00am", :class_name=>"SendSubscriptionExpiringIn7DaysEmailJob", :arguments=>[]}, :"Send new jobs email"=>{:schedule=>"every weekday at 5:00, 8:00, 11:00, 13:00, 15:00", :class_name=>"SendNewJobsEmailJob", :arguments=>[]}, :"Identify jobs"=>{:schedule=>"every 30 minutes at minute 5", :class_name=>"IdentifyJobsJob", :arguments=>[]}, :"Add jobs to Profiles"=>{:schedule=>"every 10 minutes", :class_name=>"ProcessNewJobsToProfilesJob", :arguments=>[]}, :"Destroy incomplete Jobs"=>{:schedule=>"every day at 7:00am", :class_name=>"CleanUpJobsJob", :arguments=>[]}, :"Clean up incomplete Jobs"=>{:schedule=>"every 4 hours at minute 45", :class_name=>"CleanUpIncompleteJobs", :arguments=>[]}}})
[SolidQueue] Starting Worker(pid=90938, hostname=Carsons-Mac-mini.local, metadata={:polling_interval=>1, :queues=>"*", :thread_pool_size=>3})
[SolidQueue] undefined method `queue_adapter_name' for nil:NilClass
[SolidQueue] Skipped recurring task Add jobs to Profiles at 2024-04-11 08:20:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Identify jobs at 2024-04-11 08:30:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Identify jobs at 2024-04-11 09:30:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Add jobs to Profiles at 2024-04-11 09:30:00 UTC — already dispatched
[SolidQueue] undefined method `queue_adapter_name' for nil:NilClass
[SolidQueue] undefined method `queue_adapter_name' for nil:NilClass
[SolidQueue] Skipped recurring task Send new jobs email at 2024-04-11 12:00:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Destroy incomplete Jobs at 2024-04-11 14:00:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Add jobs to Profiles at 2024-04-11 14:20:00 UTC — already dispatched
[SolidQueue] Skipped recurring task Identify jobs at 2024-04-11 14:30:00 UTC — already dispatched Also, I'm seeing 8 records in the table |
Another recurring job just added. It's in the table |
Thanks for all this extra info! All the logs correspond to the dispatcher, it's as if the worker wasn't doing anything 🤔 The errors
aren't related to jobs not being processed; they correspond to recurring tasks whose job class might not be defined. I need to improve this to show a better error in that case and don't try to dispatch the task, but you can ignore the errors for now. Could you share a bit more about your DB setup? |
database.yml default: &default
adapter: postgresql
encoding: unicode
pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
development:
<<: *default
database: ahoyjob_development This may be relevant, I'm using Postgresql for Action Cable cable.yml development:
adapter: postgresql
test:
adapter: test
production:
adapter: postgresql |
Another datapoint is that I have a |
So through my test case below, I can trigger the solid_queue.yml "Test Job":
class: Testjob
schedule: every minute Job file was named I've fixed that now and running again to see if that has anything to do with not working jobs. |
It does appear that the SolidQueue::ReadyExecution Pluck (2.5ms) SELECT "solid_queue_ready_executions"."job_id" FROM "solid_queue_ready_executions" ORDER BY "solid_queue_ready_executions"."priority" ASC, "solid_queue_ready_executions"."job_id" ASC LIMIT $1 FOR UPDATE SKIP LOCKED [["LIMIT", 3]] But then stops. Here is the full log and after the last entry for SolidQueue::Job Load (0.2ms) SELECT "solid_queue_jobs".* FROM "solid_queue_jobs" WHERE "solid_queue_jobs"."id" = $1 LIMIT $2 [["id", 930], ["LIMIT", 1]]
SolidQueue::RecurringExecution Create (1.4ms) INSERT INTO "solid_queue_recurring_executions" ("job_id", "task_key", "run_at", "created_at") VALUES ($1, $2, $3, $4) RETURNING "id" [["job_id", 930], ["task_key", "[FILTERED]"], ["run_at", "2024-04-11 17:50:00"], ["created_at", "2024-04-11 17:50:00.141010"]]
TRANSACTION (0.4ms) COMMIT
Rendered admin/logs/_log.html.erb (Duration: 0.6ms | Allocations: 576)
[ActionCable] Broadcasting to logs: "<turbo-stream action=\"prepend\" target=\"logs\"><template><tr id=\"log_2970\" class=\"border-b\">\n <td class=\"px-2 py-2 w-16 text-center dark:text-slate-700\">\n <div class='rounded font-bold text-xs px-2 py-1 bg-blue-200'>INFO</div>\n </td>\n <td class=\"px-2 py-2\">\n ProcessNewJob...
SolidQueue::Pause Pluck (0.9ms) SELECT "solid_queue_pauses"."queue_name" FROM "solid_queue_pauses"
TRANSACTION (0.2ms) BEGIN
SolidQueue::ReadyExecution Pluck (2.5ms) SELECT "solid_queue_ready_executions"."job_id" FROM "solid_queue_ready_executions" ORDER BY "solid_queue_ready_executions"."priority" ASC, "solid_queue_ready_executions"."job_id" ASC LIMIT $1 FOR UPDATE SKIP LOCKED [["LIMIT", 3]]
SolidQueue::ClaimedExecution Insert (1.9ms) INSERT INTO "solid_queue_claimed_executions" ("job_id","process_id","created_at") VALUES (930, 148, CURRENT_TIMESTAMP) RETURNING "id"
SolidQueue::ClaimedExecution Load (0.4ms) SELECT "solid_queue_claimed_executions".* FROM "solid_queue_claimed_executions" WHERE "solid_queue_claimed_executions"."job_id" = $1 AND "solid_queue_claimed_executions"."process_id" = $2 [["job_id", 930], ["process_id", 148]]
SolidQueue::ReadyExecution Delete All (1.3ms) DELETE FROM "solid_queue_ready_executions" WHERE "solid_queue_ready_executions"."job_id" = $1 [["job_id", 930]]
TRANSACTION (0.4ms) COMMIT
SolidQueue::Pause Pluck (0.8ms) SELECT "solid_queue_pauses"."queue_name" FROM "solid_queue_pauses"
TRANSACTION (1.0ms) BEGIN
SolidQueue::ReadyExecution Pluck (2.7ms) SELECT "solid_queue_ready_executions"."job_id" FROM "solid_queue_ready_executions" ORDER BY "solid_queue_ready_executions"."priority" ASC, "solid_queue_ready_executions"."job_id" ASC LIMIT $1 FOR UPDATE SKIP LOCKED [["LIMIT", 2]]
TRANSACTION (0.4ms) COMMIT
SolidQueue::Job Load (0.8ms) SELECT "solid_queue_jobs".* FROM "solid_queue_jobs" WHERE "solid_queue_jobs"."id" = $1 LIMIT $2 [["id", 930], ["LIMIT", 1]]
TRANSACTION (0.2ms) BEGIN
SolidQueue::ScheduledExecution Pluck (1.8ms) SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED [["scheduled_at", "2024-04-11 17:50:01.129034"], ["LIMIT", 500]]
TRANSACTION (0.1ms) COMMIT
TRANSACTION (0.2ms) BEGIN
SolidQueue::ScheduledExecution Pluck (1.6ms) SELECT "solid_queue_scheduled_executions"."job_id" FROM "solid_queue_scheduled_executions" WHERE "solid_queue_scheduled_executions"."scheduled_at" <= $1 ORDER BY "solid_queue_scheduled_executions"."scheduled_at" ASC, "solid_queue_scheduled_executions"."priority" ASC, "solid_queue_scheduled_executions"."job_id" ASC LIMIT $2 FOR UPDATE SKIP LOCKED [["scheduled_at", "2024-04-11 17:50:02.138170"], ["LIMIT", 500]] |
We noticed a similar issue in our setup. I couldn't reproduce it with a clean Rails app, so that makes it harder to debug. For us, we could reliably reproduce the issue in the following way:
We use a default configuration of solidqueue. I tried to dive into it a bit further. I found that solidqueue picked up the job and dispatches it to ActiveSupport. Here it gets stuck in executing
This specific thread was started by |
@carsoncole thanks for all these extra details 🙏🏻 @SjoerdL, that's super helpful! I'll try to investigate a bit more today, and see if I can get to to bottom of this 🔍💪🏻 |
I got the issue to show itself quickly by changing the threads to 1. To hide the issue longer, increase the number of threads. workers:
- queues: "*"
threads: 1
processes: 1
polling_interval: 1 |
I'm experiencing a similar problem too, and is reliably caused in development mode by saving any Ruby file in my Rails project. As soon as I do, new jobs are picked up but never finish, and I have to ctrl+c solid queue and re-start it. If I don't save a file, Solid Queue will continue to work fine for hours. I've not checked if they actually run or not before getting stuck, I came across this issue while investigating whether I'd setup something up incorrectly, but this seems very similar behaviour to what @SjoerdL is experiencing. |
Hey, @rosa, thanks for digging into this one! |
I can reproduce this in a personal project in development, presenting exactly the same way as @carldr describes above. |
Having the save problem. Seems to work fine until I edit and save a file. |
I'm currently able to reproduce doing this exact step. Perform an action in the app that enqueues a job, it executes flawlessly. Save a couple files (I added comments to 3 files and saved them), now that exact same action done in the exact same manner enqueues a job, but it gets stuck in "In Progress Jobs" for some reason. FWIW I'm running solid_queue using a foreman
SolidQueue: 0.3.0 I have mission control too? Not sure that's relevant but figured I'd include it, it's what I use to see that jobs got stuck in "In Progress" |
Thanks everyone for the info, and so sorry I haven't been able to work on this one yet. It's the next one on deck once I get #208 out of the way 💪 |
Hmm, just to make sure I'm on the same page - is SolidQueue supposed to auto-reload code in development? I remember back in the day Sidekiq did not, but I just checked and it looks like it has done so for several years so my knowledge is a bit out of date. But I imagine for a first-party solution it should work more seamlessly anyway. I feel like I may have experienced something like this but I am not 100% sure. |
Absolutely, I need to get this fixed 100% 😅 So yes, this is definitely a problem of Solid Queue right now. |
Thanks again everyone for the patience and the steps to reproduce this, it has been incredibly helpful 🙇♀️ ❤️ . I reproduced it without issues thanks to that, and I think I've tracked down the problem to a fix put in place for code reloading 😅 I think I know how to fix but need to do some more tests and it's getting a bit late over here, hopefully I'll be able to wrap this up tomorrow! 🤞 |
Awesome! Thanks for jumping on this rosa, you're a rockstar (And this project is really sweet), it's been bothering me and my team for a little bit now so it's good to see some momentum on this fix! |
I've been running the "reload-dev-fix" #223 branch and jobs still fail to run after a while and are just enqueued. At one point, "Release claimed job" runs and the jobs are run, but I've got them enqueuing again, and 30 mins have passed so not sure how often the release happens. SolidQueue-0.3.0 Prune dead processes (1.9ms) size: 0
SolidQueue-0.3.0 Prune dead processes (1.6ms) size: 0
SolidQueue-0.3.0 Release claimed job (11.7ms) job_id: 1256, process_id: 225
SolidQueue-0.3.0 Release claimed job (6.7ms) job_id: 1257, process_id: 225
SolidQueue-0.3.0 Release claimed job (4.0ms) job_id: 1258, process_id: 225
SolidQueue-0.3.0 Release claimed job (3.8ms) job_id: 1259, process_id: 225
SolidQueue-0.3.0 Release claimed job (3.5ms) job_id: 1260, process_id: 225 For me, it's easy to see when the jobs stop running as I'm using hooks in ApplicationJob to log every step of the process. This may be helpful for others trying to debug to use. class ApplicationJob < ActiveJob::Base
after_enqueue do |job|
Log.create(message: "Job ##{ job.job_id[0, 7] } enqueued", log_class: job.class, status: :info, queue: job.queue_name)
end
before_perform do |job|
Log.create(message: "Job ##{ job.job_id[0, 7] } started", log_class: job.class, status: :info, queue: job.queue_name)
end
after_perform do |job|
Log.create(message: "Job ##{ job.job_id[0, 7] } completed", log_class: job.class, status: :info, queue: job.queue_name)
end
end |
Huh, I see! @carsoncole, but does this happen when you modify any code? Or just out of the blue without you doing anything? Wondering if this issue is the same as the one I've attempted to fix or if it's something else 🤔 The bit that I can't explain is why your jobs are released:
This means something happened to process 225, but it's strange that you didn't get any logs about that, about the process being deregistered... 😕 Did you happen to see if any of the logs lines about
had any other |
I'm also running that branch, and I'm getting this a lot. Probably every 5 or six seconds after I start changing code.
|
I've been running this for a few hours and I don't have any logs indicating a size other than 0. This log entry is occurring ever 1-2 seconds. SolidQueue-0.3.0 Prune dead processes (5.9ms) size: 0 As for modifying code, I've made multiple code changes across the app during this timeframe. |
So, to be clear, this is expected:
But I realised I left a debugging setting there and changed the frequency of the dead processes check to every 2 seconds instead of every
are expected and don't mean something is wrong. The workers hanging and not processing jobs do mean something is wrong. Going to push the change to stop running the process prune every 2 seconds. Another question: @carsoncole: is it possible something external to Solid Queue is killing your worker processes? What you describe is different from the error I've fixed 🤔 To be clear, the error I've addressed is:
|
I understand on the "Prune dead processes" log entry. Sounds good. As for the issue of not saving files results in everything working fine, I'll test for that. But as of now, I am changing files and the issue is occurring. |
Ohhh bummer! And so odd! I couldn't reproduce it anymore with that change, while I could reproduce it consistently without it. I'll keep digging. Thank you so much for your patience 🙏 |
I can confirm that the following is correct.
I can't see the pattern as to what file change affects what, as I have different queues, and it'll be just one queue that no longer executes enqueued jobs unless restarted. |
So far in my limited testing, that branch has fixed my issue of code reloading. I'll continue to experiment w/ it; I haven't been testing with it for very long. |
This is the full dump of the stack trace that happens in my environment for @rafaelfranca:
This particular crash happened because I did a db:reset. The dump is too large to paste here, so it's a gist: |
@rosa I was experiencing the same thing, came here to see what the latest update is and can confirm |
Thanks a lot for letting me know, @andyjeffries and @nzifnab 🙏 I'm going to merge that fix and ship a new version but will keep this one open since it still fails for some people, so I'll keep investigating. |
FWIW I have also been using the |
I switched to the newly released version that includes this merged fix since Tuesday May 21 08:50 BST and it's been absolutely fine for me since then too. Awesome fix @rosa, thanks ! |
Hey @carsoncole, are you still experiencing this? I haven't heard any more reports since this fix went live, so I'd like to dig more into your case, because it's very strange. You mentioned:
But are these queues run by different workers, all individually? Or multiple queues can be run by the same worker? Just wondering if this might be a priority kind of problem 🤔 |
👋 Came to the repo to open a similar issue, which I've been experiencing in local development since January. Until reading this, I was sort of at a loss as to why solid_queue workers would pick stuff up fine at startup and then eventually stop grabbing work—I hadn't considered that it might be due to (some consequence of) class reloading. Unfortunately, I'm currently on the released version (0.3.4) and still experiencing this daily in development. I was just about to switch over to running inline or async locally to unblock myself. It's possible that I'm experiencing a different issue leading to the same outcome, but I don't feel well-equipped to debug it (any tips would be appreciated) |
Hi @rosa thank you for replying. I have been running off the branch in #269 since this morning and haven't seen it fail yet, but I'll keep you posted. I was almost sure it was failing again but it turns out that I was seeing a separate apparent issue |
Thanks a lot for letting me know, @searls 🙏 Huh, that issue is strange indeed 🤔 |
I seem to have a persistent problem of jobs being worked but new jobs just get enqueued and workers stop pulling jobs to be worked. If I restart Solid Queue (
rake solid_queue:start
) then the jobs are immediately worked.This happens with both recurring jobs as well as jobs that are "perform_later".
Starting the solid queue looks all good and there aren't any subsequent log items that indicate any issues.
The only setting in my
development.rb
:config.active_job.queue_adapter = :solid_queue
Solid Queue: 0.3.0
Rails: 7.1.3.2
The text was updated successfully, but these errors were encountered: