Some jobs failing due to ActiveRecord::Deadlocked when trying to create a ScheduledExecution #162
@andbar, could you share the
Here you go. Thanks for looking into it. Let me know if you need any more info.
@andbar, I've looked into this and I see why the deadlock is happening but it's not clear to me why the transaction (2) is locking 361 records in the
Thank you! 🙏
Yep, here they are: solid_queue (0.2.1)
Agh, @andbar, I was completely stumped because I thought the code you were running included this change a30c2cb that I got in about 3 weeks ago (GitHub shows last week because I had rebased), and that we're running in production because we're running the branch with support for recurring tasks. I noticed that one while I was working on recurring jobs because I hit a super similar deadlock, and fixed it there. Then, when you reported this one and I looked at your

I'm going to try to ship a new version with support for recurring jobs and that fix so you can test it. I'm on call this week and had a busy Monday, but hopefully will get to it tomorrow!
Ah! Haha, so easy to do. We'll be glad to test that fix with our situation. I'll watch for a new version. Thank you!
Thank you so much! I just shipped version 0.2.2 with this fix, as I didn't have time to wrap up the recurring jobs PR, so I decided to just extract that fix. Could you try this one and see if you still encounter the deadlock? 🙏 Thank you!
Hi, @rosa, we deployed that new version and thought it might have fixed it, but unfortunately we got some more deadlocks today. Here's the latest deadlock log from the db; hopefully it helps pinpoint what might be causing it.
Ouch 😞 Sorry about that, and thanks for the new debug info. I'll continue looking into it.
Thanks, @rosa. I haven't had a chance to look at it much myself before this, due to some other needed work in our project that's using this gem, but I'm hoping to be able to dig into it further. I don't have much experience with deadlocks, though, so I'm trying to brush up on that first 😬.

From a first glance, it appears that maybe the issue is lock contention because both transactions (the insert and the delete) need to lock this index: index_solid_queue_dispatch_all? Or, more accurately, the insert appears to lock the PRIMARY index first and then tries to acquire a lock on the index_solid_queue_dispatch_all index, while the delete goes in the opposite direction: it locks the index_solid_queue_dispatch_all index first and then tries to acquire a lock on the PRIMARY index. Does that sound right? Maybe that's why the delete transaction (transaction 2 in the logs) shows "5336 row lock(s)" even though it's only deleting something like 19 jobs, because of the index_solid_queue_dispatch_all index?
Hey @andbar, sorry for the delay! I haven't forgotten about this one, but I've been working on other stuff in the last few weeks. I'm back looking at this, and I wonder if #199 might help in your case. Would you mind trying that one out if you're still using Solid Queue and experiencing this problem?
Hi @rosa, I just realized I hadn't responded to you yet; I apologize for that. We ended up moving away from Solid Queue to something else that was just a better fit for our particular needs, due to the nature of the jobs we're running, so unfortunately I won't be able to test that fix. I'm sorry!
Hi @andbar! Oh, no worries at all! Thanks so much for letting me know. I really appreciate you taking the time to test and report this, and your patience through the troubleshooting 🤗
Hey @paulhtrott, ohhhh, thanks for letting me know! It's a bummer it didn't solve it completely, but I'll take the reduction. I'll merge that and start using it in HEY, as we're going to increase the number of scheduled jobs per day by ~4M tomorrow, so hopefully we'll hit this ourselves and that will give me more info to think of other solutions.
Hi @rosa, we have had zero luck diagnosing our deadlock issue. This is how our architecture is structured:
Our errors do not show much detail outside of the following, plus a stack trace (attached):
We have resorted to setting wait_until for most of our jobs; the delay seems to help on most occasions, but it is inconvenient in some cases. Are there any other details that might be helpful for you?
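For reference, the wait_until staggering mentioned above is done through Active Job's `set`; a minimal sketch, where the job class and the 30-second delay are made up rather than taken from this thread:

```ruby
# Staggering workaround sketch: enqueue with a delay instead of immediately.
# ProcessItemJob and the 30-second offset are illustrative, not from this thread.
ProcessItemJob.set(wait_until: 30.seconds.from_now).perform_later(item)

# Equivalent relative form:
ProcessItemJob.set(wait: 30.seconds).perform_later(item)
```

Since delayed jobs are still scheduled through Solid Queue, this presumably spreads the contended inserts out in time rather than avoiding them.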
Ohhh, @paulhtrott, that's super helpful! This looks like a different kind of deadlock than the one I tried to fix! Let me try to put together another fix for this one.
That deadlock is the same one as #229; going to tweak that one a bit and ship.
@paulhtrott, I just released version 0.3.3 with a possible fix for this one deadlock: #240.
Thank you @rosa, we will give that a try today. I'll report back after a couple of days 🎉
Hi @rosa, I'm back sooner than I wanted to be 😆. We are still having the issue after 0.3.3. Same two stack traces, basically.
Ohh, bummer 😞 I'll continue working on it. Thanks a lot for letting me know!
Oh, @paulhtrott, I realised something... The query done now to delete records from

Thank you so much again 🙏
Hey @rosa! Sure, here is the output
Hi @rosa! Just wanted to see if you have had a chance to look into the deadlock issue. Here is a new deadlock file to show that this is still happening in production. Our Setup
I've noticed that we are receiving deadlock errors occasionally, only in places where jobs are enqueued in loops, something like this:

```ruby
items.each do |item|
  # ...
  ItemProcessingJob.perform_later(item) # ...other params
end
```

In terms of best practices, it is clearly not the best code.
Recent refactoring of my code with

But another kind of deadlock is still present, and there are quite a lot of them. The pattern I see in the Sentry traces is always the same:
P.S. The DB adapter does not make a lot of difference; I've tried

Stacktrace part:
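A refactor commonly suggested for loop-enqueueing like the snippet above (and possibly what the refactoring mentioned here refers to, though that is an assumption) is to build the jobs first and enqueue them in a single call with Rails 7.1's `ActiveJob.perform_all_later`. A rough sketch, reusing the names from the snippet:

```ruby
# Rough sketch: bulk-enqueue instead of calling perform_later once per item.
# Requires Rails 7.1+; ItemProcessingJob and items mirror the snippet above.
jobs = items.map { |item| ItemProcessingJob.new(item) }
ActiveJob.perform_all_later(jobs)
```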
For those searching for a workaround (which at least reduces deadlocks for MariaDB in a Galera Cluster): reduce the transaction isolation level.
Ah, good point! If you're running Solid Queue in its own DB,
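For anyone wanting to try that workaround, one way to lower the isolation level for a dedicated Solid Queue database is the `variables` option in database.yml, which sets session variables on each connection. This is a sketch only; the adapter, database names, and exact variable name are assumptions (older MariaDB versions use tx_isolation instead of transaction_isolation):

```yaml
# config/database.yml sketch: a separate queue database with a lower isolation level.
# Adapter, database names, and the variable name are assumptions; adjust for your setup.
production:
  primary:
    adapter: mysql2
    database: my_app_production
  queue:
    adapter: mysql2
    database: my_app_queue
    variables:
      transaction_isolation: "READ-COMMITTED"
```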
We are seeing some failed jobs due to hitting a deadlock when Solid Queue is trying to create the ScheduledExecution for the job. This is usually happening for us on jobs that have to be retried due to a throttling constraint we are dealing with from an external API. Here is one example, with part of the backtrace. The job attempts to execute, gets the throttling constraint, so it tries to schedule a retry, and it looks like it's trying to do the `ScheduledExecution.create_or_find_by!` on line 40 of `app/models/solid_queue/job/schedulable.rb` when it hits the deadlock on the insert.

Backtrace:
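For context on how a throttled job ends up in that code path: a retry like the one described above is typically configured with Active Job's `retry_on`, which re-enqueues the job with a delay and therefore creates a ScheduledExecution row. A minimal sketch, where the error class, wait time, and attempt count are assumptions rather than details from this report:

```ruby
# Minimal sketch of a job that retries itself when the external API throttles it.
# ExternalApi::ThrottledError, the wait, and the attempt count are illustrative only.
class SyncExternalDataJob < ApplicationJob
  retry_on ExternalApi::ThrottledError, wait: 30.seconds, attempts: 10

  def perform(record)
    # Raises ExternalApi::ThrottledError when the API rate-limits us, which
    # schedules a retry in the future (the ScheduledExecution insert mentioned above).
    ExternalApi.sync(record)
  end
end
```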