Jobs Stuck in Enqueued State #380
Did you try without the SkipWhenPreviousJobIsRunningAttribute? Can you confirm it's the attribute that is cancelling the job? FYI, for job parameters Hangfire uses JSON serialization, e.g. you do not need the BsonElement attribute in the Route class, if you are only using it as a param to the jobs, that is. Thanks
Kind of. I added this because sometimes, the jobs take longer to run than the cron schedule. This was then causing another instance of the same job to be enqueued before the first instance even had a chance to run.
Yes, and this is on purpose, due to the above statement of why I added this attribute.
We are not only using it as a param; we are also storing this class in a mongo collection.
@gottscj Since I would like to keep this attribute, do you have any other suggestions? Any queries I can run? Any other stack traces I can get you? According to Sergey, "counters and actual contents should be consistent with each other." Thanks!! Matt
Maybe test whether the attribute is actually setting the state back to "No"? It seems like the issue is when it should run a second time?
But is it the attribute which is cancelling the job even though it has previously run successfully? E.g., is it cancelling the job because the running state is not reset somehow? I'm trying to figure out whether this is caused by the attribute, or if it's a bug in Hangfire.Mongo, or both.
Looking at the gist https://gist.github.com/odinserj/a6ad7ba6686076c9b9b2e03fcf6bf74e where the attribute is taken from, it also seems there are bugs, reading the comments. Are they still valid, or are they addressed in your copy? Just something to check :)
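For readers who don't want to follow the link, the gist's idea can be sketched roughly like this. This is a condensed, untested approximation (the state choices and quote-trimming are simplifications), not the gist's exact code; see the gist itself for the real implementation:

```csharp
using System;
using System.Collections.Generic;
using Hangfire.Client;
using Hangfire.Common;
using Hangfire.States;
using Hangfire.Storage;

// Keep a "Running" flag in the recurring job's hash and cancel creation of a
// new instance while the flag is "yes".
public class SkipWhenPreviousJobIsRunningAttribute : JobFilterAttribute, IClientFilter, IApplyStateFilter
{
    public void OnCreating(CreatingContext context)
    {
        if (!(context.Connection is JobStorageConnection connection)) return;
        if (!context.Parameters.TryGetValue("RecurringJobId", out var value)) return;
        var recurringJobId = value as string;
        if (string.IsNullOrWhiteSpace(recurringJobId)) return;

        var running = connection.GetValueFromHash($"recurring-job:{recurringJobId}", "Running");
        if ("yes".Equals(running, StringComparison.OrdinalIgnoreCase))
        {
            context.Canceled = true; // previous instance still flagged as running
        }
    }

    public void OnCreated(CreatedContext context) { }

    public void OnStateApplied(ApplyStateContext context, IWriteOnlyTransaction transaction)
    {
        var serialized = context.Connection.GetJobParameter(context.BackgroundJob.Id, "RecurringJobId");
        if (string.IsNullOrEmpty(serialized)) return;
        var recurringJobId = serialized.Trim('"'); // parameter is stored JSON-serialized

        // Set the flag when the job is enqueued; clear it on any final state.
        if (context.NewState is EnqueuedState)
            transaction.SetRangeInHash($"recurring-job:{recurringJobId}",
                new[] { new KeyValuePair<string, string>("Running", "yes") });
        else if (context.NewState.IsFinal)
            transaction.SetRangeInHash($"recurring-job:{recurringJobId}",
                new[] { new KeyValuePair<string, string>("Running", "no") });
    }

    public void OnStateUnapplied(ApplyStateContext context, IWriteOnlyTransaction transaction) { }
}
```

The failure mode discussed later in the thread follows directly from this design: if the "Running" flag is never reset to "no", every subsequent instance is cancelled at creation time.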
Is there a more private channel in which to share the "BSON - mongodump archive" files with you? And to provide some more context. In general, everything works just fine until "eventually" it doesn't. Sometimes it's days; sometimes it's merely hours.
Unfortunately I don't have any private channels for this purpose. :( If it happens again, you could try to find the specific job and sanitize the data so there is no confidential information; it could be that you just need to obfuscate the parameters given to the recurring jobs. You could also try to set up a test project to reproduce the issue outside the scope of your main application. Thanks
Yes, I think that would suffice. Do you have any suggestions on how I could do that? I am unfamiliar with where these things are stored and how to properly sanitize them. |
I only need entries from the ".jobGraph" collection. You could filter it like so (note: you need to input your recurring job id instead of "your-job-recurring-id"). This will give the entries which hold the "yes"/"no" state and all jobs currently in the Processing state.
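The query itself did not survive the copy into this thread. A plausible reconstruction of such a filter document (the `Key` and `StateName` field names and the `recurring-job:` key prefix are assumptions based on the rest of the thread, not the maintainer's exact query) might look like:

```json
{
  "$or": [
    { "Key": "recurring-job:your-job-recurring-id" },
    { "StateName": "Processing" }
  ]
}
```

The first branch matches the recurring-job hash holding the "yes"/"no" flag; the second matches every job currently in the Processing state.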
When this happened again the other day, I had exactly what my initial screenshot shows: Enqueued 63/0 with nothing processing. I then exported all hangfire.* collections before getting production up and running again by dropping the collections and restarting the app. I restored hangfire.jobGraph to my local instance of mongo, sanitized what I could find 🤞, and started writing queries.

When I try running a version of your suggested query, only the first part of the or-statement comes back (76 documents), because there aren't any documents where StateName is set to "Processing". When I remove StateName from the query, I get 19,024 documents.

I've exported the whole sanitized collection (not just the query results) and attached it here. I exported it as a .agz and then changed the extension to .zip so I could upload it: hangfire.jobGraph.sanitized.zip. I hope you can find something with all of this data!! 🤞 Thanks again!!
Thank you for the file. I'm struggling to consume it, though. Could you guide me on how to restore the given file? Thanks!
@gottscj Of course!! When I exported it, I used Studio3T's free version and chose their BSON mongodump archive export. Oh, and don't forget to manually change the extension from .zip back to .agz before restoring. I hope this helps!!
Looking at the jobs in the DB, it seems they were all "Processing" when you requeued them from the dashboard. I have found a bug in Hangfire.Mongo where changing the state (from Processing to Enqueued) in the UI may cause the job not to be requeued again, as it has lost its queue. I'm currently working on a fix.

In your case, it's probably more a visual thing, as enqueueing a "Processing" job would ultimately cause it to be cancelled anyway due to your filter. Requeuing a "Processing" job will not cause it to restart; it will just spawn a new job, which would run concurrently with the already running job.

For now, I recommend not requeuing "Processing" jobs using the dashboard. You could delete the job and then trigger it again from the recurring jobs page. This would most likely also be what you wanted in the first place? It would, however, bypass your SkipWhenPreviousJobIsRunning attribute. Thanks
Hmmm...I know I didn't requeue these, and even if one of my support team members did this from the UI, they certainly did not requeue 63 of them. I'm the only one really monitoring this issue at the moment and continuing to "fix it" as it happens. It did happen again on Groundhog Day (coincidence? 😜), so I have additional data I could sanitize and get to you if you'd like...? Anyway, I don't think it's just a visual thing, because when this happens, those 63 jobs never get enqueued again: the system "thinks" they're running, so they get cancelled. We then get (valid) alerts that certain jobs haven't run lately, so I go and "fix it" again.
All right, I will tell the support team not to touch it and to let me know if there's an issue so I can better review the status of the solution before taking any action. I will post any updates as I have them. Thanks so much for looking into this with/for me!! Matt |
Fix job not enqueued when requeued while processing (#380) (#381)
- Change strategy of how to determine whether a background job that a worker dequeued is still alive and being processed
- Use the same strategy as Hangfire.SqlServer, using SlidingInvisibilityTimeout
- Update to Hangfire v1.8.9
- Use server time for distributed lock heartbeats; add unit tests
- Update version and changelog
I have pushed a fix for jobs not being enqueued when requeued while processing, which is the issue you experienced. Bear in mind this will not change your system's behavior, as your attribute will cancel the newly enqueued job if it's already processing. I also changed the strategy for detecting stale jobs to use 'SlidingInvisibilityTimeout', the same as the original library, where jobs send heartbeats. This is a breaking change which, if you are using 'InvisibilityTimeout', will require some minor code changes. Did you talk to your team and get some insights into their use of the dashboard? Thanks!
Excellent, thank you for getting a fix in place so quickly!! Regarding the SlidingInvisibilityTimeout, the last time you and I spoke about this, here is what you told me: #370 (comment), and based on that, I set mine to:

var so = new MongoStorageOptions
{
    MigrationOptions = mo,
    CheckConnection = hangfireSettings.CheckConnection,
    InvisibilityTimeout = TimeSpan.FromMinutes(30)
};

I would like to implement your fix and get it into production today. What minor code changes will I need to make? I have not spoken to the support team about this yet; I have been way too busy with other things. Thanks again!! Matt
The InvisibilityTimeout has been replaced with SlidingInvisibilityTimeout, which is set to a default value of 5 minutes, so you can just remove the InvisibilityTimeout line.
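Applied to the options block quoted earlier in the thread, the change amounts to dropping that one property. A sketch (assuming the replacement property is named SlidingInvisibilityTimeout, as the maintainer describes; the default should not normally need to be set explicitly):

```csharp
var so = new MongoStorageOptions
{
    MigrationOptions = mo,
    CheckConnection = hangfireSettings.CheckConnection
    // InvisibilityTimeout removed; SlidingInvisibilityTimeout defaults to 5 minutes.
    // To override it explicitly, something like:
    // SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5)
};
```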
I just deployed our solution with the new update you pushed, and so far things are OK, but we are getting the below exception many times while jobs run. A quick Google shows me this SO answer, and it is true, we are using an older version of Mongo. Is there any way that you could use a different operator?

Hangfire:ftusbridge_mugshotftps_euus72_mugshot - Unable to update heartbeat on the resource. Details:
MongoDB.Driver.MongoCommandException: Command aggregate failed: Unrecognized pipeline stage name: '$set'.
at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ProcessResponse(ConnectionId connectionId, CommandMessage responseMessage)
at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.Execute(IConnection connection, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Servers.Server.ServerChannel.ExecuteProtocol[TResult](IWireProtocol`1 protocol, ICoreSession session, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.CommandOperationBase`1.ExecuteProtocol(IChannelHandle channel, ICoreSessionHandle session, ReadPreference readPreference, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.RetryableReadOperationExecutor.Execute[TResult](IRetryableReadOperation`1 operation, RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.ReadCommandOperation`1.Execute(RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.AggregateOperation`1.Execute(RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.AggregateOperation`1.Execute(IReadBinding binding, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperation[TResult](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperation[TResult](IClientSessionHandle session, IReadOperation`1 operation, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.Aggregate[TResult](IClientSessionHandle session, PipelineDefinition`2 pipeline, AggregateOptions options, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.<>c__DisplayClass22_0`1.<Aggregate>b__0(IClientSessionHandle session)
at MongoDB.Driver.MongoCollectionImpl`1.UsingImplicitSession[TResult](Func`2 func, CancellationToken cancellationToken)
at Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__16_0(Object _)
Yes definitely. I will patch it tonight. Thanks for the feedback! |
@gottscj Excellent, thank you so much!! Any insight into what harm it will do if I leave it like this in production until you get it patched?
Best not to. It will potentially mark locks stale and cause corrupt states. I'm on mobile. Sorry for the vague description. Else, I will be able to update it tonight. |
Thanks, waiting for the update sounds best from my end 😬 I'll release the previous version of our solution now. |
I have released v1.10.1 with the change of operator. However, I did not test it on an "old" mongo db. Could you tell me which version you are using? Thanks |
That was fast, thank you!! I can tell you as long as you don't make fun of me/us 😬 It's 4.0.13 |
Got it, thanks. I will spin up a Docker container and sanity-check it.
Got the patch deployed to production, checked a few jobs, and found 1 so far showing the below exception. Do you think that is also due to our old version of mongo? I noticed in the code the use of 2 dollar signs in $$NOW.

Hangfire:scpweb_orders_efii_euus38 - Unable to update heartbeat on the resource. Details:
MongoDB.Driver.MongoCommandException: Command aggregate failed: Use of undefined variable: NOW.
at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ProcessResponse(ConnectionId connectionId, CommandMessage responseMessage)
at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.Execute(IConnection connection, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Servers.Server.ServerChannel.ExecuteProtocol[TResult](IWireProtocol`1 protocol, ICoreSession session, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.CommandOperationBase`1.ExecuteProtocol(IChannelHandle channel, ICoreSessionHandle session, ReadPreference readPreference, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.RetryableReadOperationExecutor.Execute[TResult](IRetryableReadOperation`1 operation, RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.ReadCommandOperation`1.Execute(RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.AggregateOperation`1.Execute(RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.AggregateOperation`1.Execute(IReadBinding binding, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperation[TResult](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperation[TResult](IClientSessionHandle session, IReadOperation`1 operation, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.Aggregate[TResult](IClientSessionHandle session, PipelineDefinition`2 pipeline, AggregateOptions options, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.<>c__DisplayClass22_0`1.<Aggregate>b__0(IClientSessionHandle session)
at MongoDB.Driver.MongoCollectionImpl`1.UsingImplicitSession[TResult](Func`2 func, CancellationToken cancellationToken)
at MongoDB.Driver.MongoCollectionImpl`1.Aggregate[TResult](PipelineDefinition`2 pipeline, AggregateOptions options, CancellationToken cancellationToken)
at Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__16_0(Object _)
$$NOW is an aggregation pipeline variable, with double $. There might be something wrong with my tests, as I did test this against v4.0.13. :-/ Anyway, I will revert this part, as it is a little bit of a hack query-wise. At a later stage I will create a new release focused on using the mongodb server time instead of DateTime.UtcNow. Give me 20 minutes; then I will have a v1.10.2 with this part reverted.
Man, thank you so much!! 🙏
It's released; it should be indexed on nuget.org shortly.
Things are looking good so far, thanks!! I'll continue to monitor things and let you know if I see any more red. Thanks again!!
Great. Yes, please let me know! Thanks 👍 |
Well, it happened again 😢 Would you be willing to browse more data and see if you can figure out what's happening? The 16 "Enqueued" jobs match the ones getting canceled because (I think) it sees that they're already running even though they're clearly not. |
Damnit! I thought we had it. Of course I'll take a look! We need to figure this out! 🙂 Nothing in the log? Thanks
Me too!! Thank you!! In our app's log? I'll see what I can find for those 16 jobs. |
I am seeing that around 00:35:00Z it says that the processing job was triggered from the UI. I'm asking our support team right now. ...And they just responded that none of them triggered any Hangfire Jobs last night. Below is what I have in our logs for jobId 65c57340a23bc4b9d74f35a2: |
I believe I have found the error. If you look in the locks collection, there should be unresolved lock entries for the failing recurring jobs. I would advise you to remove the attribute; recurring jobs will always have one worker. Let me know what you think.
Interesting!! I didn't export the locks collection this time, only the jobGraph, since that's the only one we needed last time. I will remove the attribute.
I am able to reproduce the same behavior using the original SQL storage, so the issue is not specific to Hangfire.Mongo. I did find a bug in the attribute, though. I noticed you are also applying the DisableConcurrentExecution attribute. I hope this helps. :)
I'm glad you found more things. I'll definitely get rid of that attribute today and also pull down your latest update...thanks!! Yes, I'm most definitely aware of where to queue a job from 😜 I am considering changing the password to the dashboard so I can absolutely rule out the possibility of anyone requeuing a job incorrectly. Ah, yes, now that you mention it, I have the DisableConcurrentExecution attribute applied as well. It helps a bunch, thank you again!! I'll report back with any updates.
Thanks! Hopefully this will resolve your issue. 🙂 |
I've deployed to production so we'll see what happens 🤞 With any of your findings and updates to Hangfire.Mongo, do you think any changes should also be made to Hangfire.SqlServer? |
I did think so. I'm not sure if it makes sense to handle this case, though, as it will only happen if a user requeues a job from the dashboard even though it was processing.
So far so good!! Although, it ran all weekend without this issue popping up. Since I only pushed your suggestions yesterday, I'll be more convinced if/when we make it to the end of the week without an issue 🤞 |
Sounds good! The bug is triggered by requeuing an already processing job, so you can try that if you like. It should be cancelled by your attribute. Thanks 🙏 |
Everything's been looking really good this week, thank you!! The root cause was an InvisibilityTimeout bug along with someone on our end triggering the jobs from the "Jobs" page instead of the "Recurring Jobs" page, right? |
Great news! The root bug was that I did not properly set job states when a job instance was requeued while processing. It was triggered by someone on your end requeuing a job from the dashboard even though it was still running. This enqueued the same job again, but as I had not set the required fields, the job was stuck in the "Enqueued" state, which was also reflected in the dashboard. This, in combination with the DisableConcurrentExecution filter, caused a deadlock (only when requeuing an already running job).

Triggering an already running job is allowed by the Hangfire dashboard. The user will, however, lose the job history for the first running instance, as the second instance's state history will be written instead, because they both have the same JobId. Triggering from the "Recurring jobs" page will create a new instance (with a new JobId). This also happens in Hangfire.SqlStorage. When a running job is requeued, the first two states in the job history belong to the first job, which was started by Hangfire from its CRON schedule.

SlidingInvisibilityTimeout will, with default values, mark a job stale after 5 minutes without a heartbeat (Hangfire.Mongo sends heartbeats five times per timeout window), allowing Hangfire to start the job in a different worker. This obsoletes InvisibilityTimeout, which was a fixed value that ALL jobs were expected to run to completion within. With SlidingInvisibilityTimeout, a job can take as long as needed and will not be marked stale as long as it has a heartbeat. It's a little complex to explain, but I hope it makes sense. Thanks!
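The difference between the two timeout models described above can be illustrated with a small sketch (illustrative only, with invented function names; this is not the library's implementation):

```csharp
using System;

// Fixed InvisibilityTimeout: a job is considered stale once it has been
// running longer than the timeout, even if its worker is alive and well.
static bool IsStaleFixed(DateTime startedAt, DateTime now, TimeSpan timeout)
    => now - startedAt > timeout;

// SlidingInvisibilityTimeout: staleness is measured from the last heartbeat,
// so a long-running job stays "alive" as long as heartbeats keep arriving.
static bool IsStaleSliding(DateTime lastHeartbeat, DateTime now, TimeSpan timeout)
    => now - lastHeartbeat > timeout;

var now = DateTime.UtcNow;
var start = now.AddMinutes(-30);      // job has been running for 30 minutes
var heartbeat = now.AddMinutes(-1);   // last heartbeat arrived 1 minute ago
var timeout = TimeSpan.FromMinutes(5);

Console.WriteLine(IsStaleFixed(start, now, timeout));       // true: exceeded the fixed window
Console.WriteLine(IsStaleSliding(heartbeat, now, timeout)); // false: heartbeat is still fresh
```

This is why a 30-minute job needed InvisibilityTimeout raised to 30 minutes, but needs no tuning at all under the sliding model.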
It does make sense, thank you!! I'd say we should close this now, don't you? |
Let's close it for now. We will reopen it if something else comes up.
Thank you for all of your expedient assistance!! 🙏 |
First posted on GitHub/HangfireIO (#2355)
I can provide any additional information that you require to help troubleshoot this with me.
Issue
Jobs are staying stuck in the Enqueued state. The system thinks those jobs are still running and won't enqueue them again. So in the instance from the screenshot, we have 63 unique recurring jobs that never get enqueued again. The only way I can find to get them running again is to stop the app pool, drop all hangfire.* collections from mongo, and then start the app pool again. (We add the recurring jobs on startup.)
Solution State
ASP.NET Core .NET 8
Hangfire.AspNetCore Version="1.8.9"
Hangfire.Console Version="1.4.2"
Hangfire.Core Version="1.8.9"
Hangfire.Dashboard.BasicAuthorization Version="1.0.2"
Hangfire.Mongo Version="1.9.16"
The most recent stdump file: stdump_hangfire2.txt
Classes have this attribute applied: SkipWhenPreviousJobIsRunningAttribute.txt
Execute methods have

[DisableConcurrentExecution("{0}", 3)]

applied.
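Putting the two attributes from the issue together, a hypothetical job class (the class and method names here are invented for illustration) would look roughly like:

```csharp
// Hypothetical example: only the two attributes are taken from the issue;
// the class, method, and parameter names are invented.
[SkipWhenPreviousJobIsRunning]
public class RecurringRouteJob
{
    [DisableConcurrentExecution("{0}", 3)]
    public void Execute(string routeId)
    {
        // long-running work here
    }
}
```

Note the interaction discussed in the thread: the class-level attribute cancels a newly created instance while the previous one is flagged as running, while DisableConcurrentExecution takes a distributed lock per the "{0}" argument at execution time, and the combination contributed to the deadlock described above.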