
Jobs Stuck in Enqueued State #380

Closed
mwasson74 opened this issue Jan 26, 2024 · 51 comments

Comments

@mwasson74

First posted on GitHub/HangfireIO (#2355)

I can provide any additional information that you require to help troubleshoot this with me.

Issue

Jobs are staying/stuck in the Enqueued state. The system thinks those jobs are still running and won't enqueue them again. So in the instance from the screenshot, we have 63 unique recurring jobs that never get enqueued again. The only way I can find to get them running again is to stop the app pool, drop all hangfire.* collections from mongo, and then start the app pool again. (We add the recurring jobs on startup.)

[screenshot: dashboard showing Enqueued 63/0]

Solution State

ASP.NET Core .NET 8
Hangfire.AspNetCore Version="1.8.9"
Hangfire.Console Version="1.4.2"
Hangfire.Core Version="1.8.9"
Hangfire.Dashboard.BasicAuthorization Version="1.0.2"
Hangfire.Mongo Version="1.9.16"

The most recent stdump file: stdump_hangfire2.txt

Classes have this attribute applied: SkipWhenPreviousJobIsRunningAttribute.txt

Execute Methods have [DisableConcurrentExecution("{0}", 3)] applied
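For illustration, a minimal sketch of how these pieces could fit together; the class, method, job id, and cron values are hypothetical placeholders, not the actual solution's code:

using Hangfire;

// Hypothetical job class; names are placeholders.
[SkipWhenPreviousJobIsRunning]            // applied at class level, as noted above
public class FtpSyncJob
{
    // "{0}" is formatted with the first job argument; 3 is the lock timeout in seconds.
    [DisableConcurrentExecution("{0}", 3)]
    public void Execute(string routeName)
    {
        // ... job body ...
    }
}

public static class JobSetup
{
    // "We add the recurring jobs on startup": one registration per route.
    public static void RegisterRecurringJobs()
    {
        RecurringJob.AddOrUpdate<FtpSyncJob>(
            "ftp-sync-route-1",            // recurring job id (placeholder)
            job => job.Execute("route-1"), // the method guarded by the filters above
            "*/15 * * * *");               // cron (placeholder)
    }
}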


@gottscj
Owner

gottscj commented Jan 27, 2024

@mwasson74,

Did you try without the SkipWhenPreviousJobIsRunningAttribute? Can you confirm it's the attribute that is cancelling the job?

FYI: for job parameters, Hangfire uses JSON serialization. E.g. you do not need the BsonElement attribute in the Route class, if you are only using it as a parameter to the jobs, that is.

Thanks

@mwasson74
Author

@gottscj

Did you try without the SkipWhenPreviousJobIsRunningAttribute?

Kind of. I added this because sometimes, the jobs take longer to run than the cron schedule. This was then causing another instance of the same job to be enqueued before the first instance even had a chance to run.

Can you confirm it's the attribute that is cancelling the job?

Yes, and this is on purpose due to the above statement of why I added this attribute.

FYI: for job parameters, Hangfire uses JSON serialization. E.g. you do not need the BsonElement attribute in the Route class, if you are only using it as a parameter to the jobs, that is.

We are not only using it as a param; we are also storing this class in a mongo collection.
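For illustration, a hypothetical shape of such a dual-use class (the property and element names are assumptions, not the actual code):

using MongoDB.Bson.Serialization.Attributes;

// Hypothetical: used both as a Hangfire job argument (JSON-serialized by
// Hangfire, which ignores Bson attributes) and as a document stored in a
// Mongo collection (where BsonElement controls the stored field names).
public class Route
{
    [BsonElement("src")]
    public string Source { get; set; }

    [BsonElement("dst")]
    public string Destination { get; set; }
}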

@mwasson74
Author

@gottscj Since I would like to keep this attribute, do you have any other suggestions? Any queries I can run? Any other stack traces I can get you? According to Sergey, "counters and actual contents should be consistent with each other."

Thanks!!

Matt

@gottscj
Owner

gottscj commented Jan 30, 2024

@mwasson74,

Maybe test whether the attribute is actually setting the state back to "No"? It seems like the issue is when it should run a second time?
Else, if you could share the hangfire db with the offending recurring jobs, I could take a look?

Yes, and this is on purpose due to the above statement of why I added this attribute

But is it the attribute which is cancelling the job even though it has previously run successfully? E.g. is it cancelling the job because the running state is not reset somehow?
Does the attribute override the default behavior? E.g. is the job history/state set correctly when cancelling?
Does this only happen when a job has been cancelled? E.g. when the previous job exceeds the time before the next run?
Could you try without the attribute and exceed the CRON?

I'm trying to figure out if this is caused by the attribute, or if it's a bug in Hangfire.Mongo, or both.
Thanks

@gottscj
Owner

gottscj commented Jan 30, 2024

Looking at the gist the attribute is taken from, https://gist.github.com/odinserj/a6ad7ba6686076c9b9b2e03fcf6bf74e, it also seems there are bugs, judging by the comments. Are they still valid, or are they addressed in your copy?

Just something to check :)
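For reference, the core of that gist, condensed (see the link for the full version; serialization and edge-case details are omitted here):

using System;
using System.Collections.Generic;
using Hangfire.Client;
using Hangfire.Common;
using Hangfire.States;
using Hangfire.Storage;

public class SkipWhenPreviousJobIsRunningAttribute : JobFilterAttribute, IClientFilter, IApplyStateFilter
{
    public void OnCreating(CreatingContext context)
    {
        // Cancel creation while the recurring job's "Running" flag is "yes".
        if (context.Connection is JobStorageConnection connection &&
            context.Parameters.TryGetValue("RecurringJobId", out var value) &&
            value is string recurringJobId &&
            "yes".Equals(connection.GetValueFromHash($"recurring-job:{recurringJobId}", "Running"),
                StringComparison.OrdinalIgnoreCase))
        {
            context.Canceled = true;
        }
    }

    public void OnStateApplied(ApplyStateContext context, IWriteOnlyTransaction transaction)
    {
        // The gist deserializes the "RecurringJobId" job parameter here; simplified.
        var recurringJobId = context.Connection
            .GetJobParameter(context.BackgroundJob.Id, "RecurringJobId")?.Trim('"');
        if (string.IsNullOrEmpty(recurringJobId)) return;

        if (context.NewState is EnqueuedState)
        {
            SetRunning(transaction, recurringJobId, "yes");
        }
        else if (context.NewState.IsFinal || context.NewState is FailedState)
        {
            // If this "no" is never written (e.g. the final state is lost),
            // every later run is cancelled in OnCreating -- the symptom above.
            SetRunning(transaction, recurringJobId, "no");
        }
    }

    private static void SetRunning(IWriteOnlyTransaction transaction, string recurringJobId, string value)
        => transaction.SetRangeInHash($"recurring-job:{recurringJobId}",
            new[] { new KeyValuePair<string, string>("Running", value) });

    public void OnCreated(CreatedContext context) { }
    public void OnStateUnapplied(ApplyStateContext context, IWriteOnlyTransaction transaction) { }
}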

@mwasson74
Author

mwasson74 commented Jan 30, 2024

Else, if you could share the hangfire db with the offending recurring jobs, I could take a look?

Is there a more private channel in which to share the "BSON - mongodump archive" files with you?

And to provide some more context. In general, everything works just fine until "eventually" it doesn't. Sometimes it's days; sometimes it's merely hours.

@gottscj
Owner

gottscj commented Jan 30, 2024

@mwasson74,

Unfortunately I don't have any private channels for this purpose. :( If it happens again, you could try to find the specific job and sanitize the data so there is no confidential information. It could be that you just want to obfuscate the parameters given to the recurring jobs?

You could try to set up a test project to reproduce the issue outside the scope of your main application, or look into some of the other suggestions I have given? That would enable us to rule out some possibilities.

Thanks

@mwasson74
Author

it could be that you just want to obfuscate the parameters given to the recurring jobs

Yes, I think that would suffice. Do you have any suggestions on how I could do that? I am unfamiliar with where these things are stored and how to properly sanitize them.

@gottscj
Owner

gottscj commented Jan 31, 2024

@mwasson74,

I only need entries from the ".jobGraph" collection. You could filter it like so:
{$or: [{Key: "recurring-job:your-job-recurring-id"}, {_t:"JobDto", StateName: "Processing"}]}

Note: you need to put your recurring job id in place of "your-job-recurring-id".

This will return the entries which hold the "yes/no" state and all jobs currently in the Processing state.
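The same filter expressed with the C# driver, for anyone who prefers querying from code (assumes the default "hangfire" collection prefix):

using MongoDB.Bson;
using MongoDB.Driver;

// 'database' is an existing IMongoDatabase instance.
var jobGraph = database.GetCollection<BsonDocument>("hangfire.jobGraph");

var filter = Builders<BsonDocument>.Filter.Or(
    Builders<BsonDocument>.Filter.Eq("Key", "recurring-job:your-job-recurring-id"),
    Builders<BsonDocument>.Filter.And(
        Builders<BsonDocument>.Filter.Eq("_t", "JobDto"),
        Builders<BsonDocument>.Filter.Eq("StateName", "Processing")));

var entries = jobGraph.Find(filter).ToList();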

@mwasson74
Author

When this happened again the other day, I had exactly what my initial screenshot shows: Enqueued 63/0 with nothing processing. I then exported all hangfire.* collections before getting production up and running again by dropping the collections and restarting the app.

I restored hangfire.jobGraph to my local instance of mongo, sanitized what I could find 🤞 and started writing queries.

When I run a version of your suggested query, only the first part of the or-statement comes back (76 documents), because there aren't any documents where StateName is set to "Processing":

db.getCollection("hangfire.jobGraph").find({ $or: [{ Key: { $regex: /^recurring-job:/i } }, { _t: "JobDto", StateName: "Processing" }] })

When I remove StateName from the query, I get 19,024 documents:

db.getCollection("hangfire.jobGraph").find({ $or: [{ Key: { $regex: /^recurring-job:/i } }, { _t: "JobDto" }] })

I've exported the whole sanitized collection (not just the query results) and attached it here. I exported it as a .agz and then changed the extension to .zip so I could upload it.

hangfire.jobGraph.sanitized.zip

I hope you can find something with all of this data!! 🤞

Thanks again!!

@gottscj
Owner

gottscj commented Feb 3, 2024

@mwasson74,

Thank you for the file. I'm struggling to consume it. Could you guide me on how to restore it?

Thanks!

@mwasson74
Author

@gottscj of course!!

When I exported it, I used Studio3T's free version and chose their "BSON - mongodump archive" option, which it says is created using MongoDB's command-line option mongodump --archive. Then, to import it, I chose Studio3T's "BSON - mongodump archive" option, which it says uses (or can use) the command-line mongorestore.

Oh, and don't forget to manually change the extension from .zip back to .agz.

Studio3T's export tooltip: [screenshot]

Studio3T's import info: [screenshot]

I hope this helps!!

@gottscj
Owner

gottscj commented Feb 4, 2024

@mwasson74,

Looking at the jobs in the DB, it seems they were all "Processing" when you requeued them from the dashboard.

[screenshot of the job documents]

I have found a bug in Hangfire.Mongo where changing the state (from Processing to Enqueued) in the UI will cause the job not to be enqueued again, as it has lost its queue.

I'm currently working on a fix.

In your case, it's probably more of a visual thing, as enqueueing a "Processing" job would ultimately cause it to be cancelled anyway due to your filter.

Requeuing a "Processing" job will not cause it to restart; it will just spawn a new job, which would ultimately run concurrently with the already running job.
There is currently no way of stopping or deleting an already running job. Deleting it will only set the job state; the actual worker will still run to completion.
It does look like the SqlStorage provided with Hangfire would suffer from the same behavior; I will try to reproduce it using the SqlStorage.

For now, I recommend not re-queueing "Processing" jobs using the dashboard. You could delete the job and then trigger it again from the recurring jobs page. This would most likely also be what you wanted in the first place? Note that this would bypass your SkipWhenPreviousJobIsRunningAttribute.
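For reference, the programmatic equivalents of those two dashboard actions (the ids are placeholders):

using Hangfire;

// Delete only sets the Deleted state; a worker already executing the job
// still runs to completion, as noted above.
BackgroundJob.Delete("enqueued-or-processing-job-id");

// Triggering the recurring job creates a brand-new job instance.
RecurringJob.Trigger("your-job-recurring-id");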

Thanks
Jonas

@mwasson74
Author

Hmmm...I know I didn't requeue these, and even if one of my support team members did this from the UI, they certainly did not requeue 63 of them. I'm the only one really monitoring this issue at the moment and continuing to "fix it" as it happens. It did happen again on Groundhog Day (coincidence? 😜), so I have additional data I could sanitize again and get to you if you'd like...?

Anyway, I don't think it's just a visual thing because when this happens, those 63 jobs never get enqueued again because the system "thinks" they're running so they get cancelled. We then get (valid) alerts that certain jobs haven't run lately so I then go and "fix it" again.

@gottscj
Owner

gottscj commented Feb 4, 2024

You can requeue multiple jobs at a time. I can say with very high confidence that neither Hangfire nor Hangfire.Mongo triggers jobs via the dashboard UI on their own.
[screenshot]

The jobs will have run to completion; however, they will count as "Enqueued" due to the bug in Hangfire.Mongo. Hangfire should schedule a new instance according to the cron specified for the recurring job. That will be a new job with a different instance id. The bug only occurs if you requeue the same job from the dashboard UI.

There has been interaction from the UI on all 63 jobs. And all of them were "Processing" when they were re-queued. Some of them have been requeued multiple times from the dashboard, like so:
[screenshot]

As stated, triggering a running job is, in your case, not desirable, as you only want one instance to run at a time. So whoever is requeuing them should be made aware of this attribute and the desired behavior.

It is not the SkipWhenPreviousJobIsRunningAttribute which is causing the jobs not to be processed again; it's the discovered bug in Hangfire.Mongo, as the job state in the db is corrupt, which keeps the scheduler from running the job.
However, this only happens in Hangfire.Mongo if you requeue a job which is "Processing".

Thanks

@mwasson74
Author

All right, I will tell the support team not to touch it and to let me know if there's an issue so I can better review the status of the solution before taking any action.

I will post any updates as I have them.

Thanks so much for looking into this with/for me!!

Matt

gottscj added a commit that referenced this issue Feb 6, 2024
- Change strategy of how to determine whether a background job that a worker dequeued is still alive and being processed
- Using same strategy as Hangfire.SqlServer using SlidingInvisibilityTimeout
- Update to Hangfire v1.8.9
gottscj added a commit that referenced this issue Feb 7, 2024
…ing (#380) (#381)

* Fix job not enqueued when requeued while processing #380

- Change strategy of how to determine whether a background job that a worker dequeued is still alive and being processed
- Using same strategy as Hangfire.SqlServer using SlidingInvisibilityTimeout
- Update to Hangfire v1.8.9

* use server time for distributedlock heartbeats

add unit tests

* update version and changelog

* minor visual update

* update comment

* update comment yet again
@gottscj
Owner

gottscj commented Feb 7, 2024

@mwasson74,

I have pushed a fix for jobs not being enqueued when requeued while processing, which is the issue you experienced. Bear in mind this will not change your system's behavior, as your attribute will cancel the newly enqueued job if it's already processing.
Also, consider triggering the job from the recurring jobs page instead of requeuing the same job, as requeuing will bloat the job history.

I also changed the strategy for detecting stale jobs to use 'SlidingInvisibilityTimeout', the same as the original library, where jobs send heartbeats. This is a breaking change which, if you are using 'InvisibilityTimeout', will require some minor code changes on your side.

Did you talk to your team and get some insights into their use of the dashboard?

Thanks!

@mwasson74
Author

@gottscj

Excellent, thank you for getting a fix in place so quickly!!

Regarding the SlidingInvisibilityTimeout, the last time you and I spoke about this, here is what you told me: #370 (comment), and based on that, I set mine to:

var so = new MongoStorageOptions
{
  MigrationOptions = mo,
  CheckConnection = hangfireSettings.CheckConnection,
  InvisibilityTimeout = TimeSpan.FromMinutes(30)
};

I would like to implement your fix and get it into production today. What minor code changes will I need to make?

I have not spoken to the support team about this yet; I have been way too busy with other things.

Thanks again!!

Matt

@gottscj
Owner

gottscj commented Feb 7, 2024

@mwasson74,

The InvisibilityTimeout has been replaced with SlidingInvisibilityTimeout, which has a default value of 5 minutes. So you can just remove the line

InvisibilityTimeout = TimeSpan.FromMinutes(30)
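Applied to the snippet above, that would look something like this (the override shown in the comment is optional and only needed to change the 5-minute default):

var so = new MongoStorageOptions
{
  MigrationOptions = mo,
  CheckConnection = hangfireSettings.CheckConnection
  // InvisibilityTimeout removed; SlidingInvisibilityTimeout now applies
  // and defaults to 5 minutes. To override:
  // SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5)
};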

@mwasson74
Author

@gottscj

I just deployed our solution with the new update you pushed, and so far things are OK, but we are getting the below exception many times while jobs run. A quick Google search turns up this SO answer, and it is true: we are using an older version of Mongo. Is there any way you could use $addFields instead of $set in the code here (I assume), since it's supposed to work with older versions? MongoDistributedLock.cs#L295

Hangfire:ftusbridge_mugshotftps_euus72_mugshot - Unable to update heartbeat on the resource. Details:
MongoDB.Driver.MongoCommandException: Command aggregate failed: Unrecognized pipeline stage name: '$set'.
   at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ProcessResponse(ConnectionId connectionId, CommandMessage responseMessage)
   at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.Execute(IConnection connection, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Servers.Server.ServerChannel.ExecuteProtocol[TResult](IWireProtocol`1 protocol, ICoreSession session, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.CommandOperationBase`1.ExecuteProtocol(IChannelHandle channel, ICoreSessionHandle session, ReadPreference readPreference, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadOperationExecutor.Execute[TResult](IRetryableReadOperation`1 operation, RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.ReadCommandOperation`1.Execute(RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.AggregateOperation`1.Execute(RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.AggregateOperation`1.Execute(IReadBinding binding, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperation[TResult](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperation[TResult](IClientSessionHandle session, IReadOperation`1 operation, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.Aggregate[TResult](IClientSessionHandle session, PipelineDefinition`2 pipeline, AggregateOptions options, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.<>c__DisplayClass22_0`1.<Aggregate>b__0(IClientSessionHandle session)
   at MongoDB.Driver.MongoCollectionImpl`1.UsingImplicitSession[TResult](Func`2 func, CancellationToken cancellationToken)
   at Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__16_0(Object _)
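For background: as an aggregation pipeline stage, $set is just an alias for $addFields, and the alias only exists on MongoDB 4.2+, which is why a pre-4.2 server rejects it. A minimal sketch of the spelling difference (the stage contents are hypothetical, not the library's actual pipeline):

using MongoDB.Bson;

// Rejected by MongoDB < 4.2: "$set" as an aggregation stage is a 4.2+ alias.
var setStage = new BsonDocument("$set", new BsonDocument("HeartbeatCount", 1));

// Accepted since MongoDB 3.4: "$addFields" is the original name for the same stage.
var addFieldsStage = new BsonDocument("$addFields", new BsonDocument("HeartbeatCount", 1));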

@gottscj
Owner

gottscj commented Feb 7, 2024

@mwasson74,

Yes definitely. I will patch it tonight.

Thanks for the feedback!

@mwasson74
Author

@gottscj Excellent, thank you so much!!

Any insight into what harm it will do if I leave it like this in production until you get it patched?

@gottscj
Owner

gottscj commented Feb 7, 2024

Best not to. It will potentially mark locks stale and cause corrupt states.
You can override the callback by creating your own lock implementation which inherits from the class. You also need to override the MongoFactory class and return your own lock implementation. Almost all methods are virtual to allow users to override behaviour.

I'm on mobile. Sorry for the vague description.

Else, I will be able to update it tonight.

@mwasson74
Author

Thanks, waiting for the update sounds best from my end 😬 I'll release the previous version of our solution now.

@gottscj
Owner

gottscj commented Feb 7, 2024

@mwasson74,

I have released v1.10.1 with the operator changed. However, I did not test it on an "old" mongo db. Could you tell me which version you are using?

Thanks

@mwasson74
Author

@gottscj

That was fast, thank you!! I can tell you as long as you don't make fun of me/us 😬 It's 4.0.13

@gottscj
Owner

gottscj commented Feb 7, 2024

Got it, thanks. I will spin up a Docker container and just sanity check it.

@mwasson74
Author

@gottscj

I got the patch deployed to production, checked a few jobs, and found one so far showing the below exception. Do you think that is also due to our old version of mongo? I noticed the code uses two dollar signs in $$NOW; should it just be one dollar sign?

Hangfire:scpweb_orders_efii_euus38 - Unable to update heartbeat on the resource. Details:
MongoDB.Driver.MongoCommandException: Command aggregate failed: Use of undefined variable: NOW.
   at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ProcessResponse(ConnectionId connectionId, CommandMessage responseMessage)
   at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.Execute(IConnection connection, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Servers.Server.ServerChannel.ExecuteProtocol[TResult](IWireProtocol`1 protocol, ICoreSession session, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.CommandOperationBase`1.ExecuteProtocol(IChannelHandle channel, ICoreSessionHandle session, ReadPreference readPreference, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadOperationExecutor.Execute[TResult](IRetryableReadOperation`1 operation, RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.ReadCommandOperation`1.Execute(RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.AggregateOperation`1.Execute(RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.AggregateOperation`1.Execute(IReadBinding binding, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperation[TResult](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperation[TResult](IClientSessionHandle session, IReadOperation`1 operation, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.Aggregate[TResult](IClientSessionHandle session, PipelineDefinition`2 pipeline, AggregateOptions options, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.<>c__DisplayClass22_0`1.<Aggregate>b__0(IClientSessionHandle session)
   at MongoDB.Driver.MongoCollectionImpl`1.UsingImplicitSession[TResult](Func`2 func, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.Aggregate[TResult](PipelineDefinition`2 pipeline, AggregateOptions options, CancellationToken cancellationToken)
   at Hangfire.Mongo.DistributedLock.MongoDistributedLock.<StartHeartBeat>b__16_0(Object _)

@gottscj
Owner

gottscj commented Feb 8, 2024

@mwasson74,

$$NOW is an aggregation pipeline variable, with a double $. There might be something wrong with my tests, as I did test this against v4.0.13. :-/

Anyway, I will revert this part, as it is a little bit of a hack query-wise. At a later stage I will create a new release focused on using mongodb server time instead of DateTime.UtcNow.

Give me 20m, then I will have a v1.10.2 with this part reverted.
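The error is consistent with the server version: the $$NOW system variable was only introduced in MongoDB 4.2, so a 4.0.x server cannot resolve it no matter how the stage is spelled. The reverted approach of using the client clock avoids the variable entirely; a hypothetical sketch (the field name is assumed):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

// Works on any server version, at the cost of trusting the client clock
// instead of server time ($$NOW requires MongoDB 4.2+).
var update = Builders<BsonDocument>.Update.Set("Heartbeat", DateTime.UtcNow);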

@mwasson74
Author

Man, thank you so much!! 🙏

@gottscj
Owner

gottscj commented Feb 8, 2024

It's released; it should be indexed on nuget.org shortly.

@mwasson74
Author

Things are looking good so far, thanks!! I'll continue to monitor things and let you know if I see any more red.

Thanks again!!

@gottscj
Owner

gottscj commented Feb 8, 2024

Great. Yes, please let me know!

Thanks 👍

@mwasson74
Author

@gottscj

Well, it happened again 😢 Would you be willing to browse more data and see if you can figure out what's happening?

The 16 "Enqueued" jobs match the ones getting canceled because (I think) it sees that they're already running even though they're clearly not.

hangfire.jobGraph.sanitized.zip

[screenshots: Dashboard, CanceledRecurringJobs]

@gottscj
Owner

gottscj commented Feb 9, 2024

Damnit! I thought we had it.

Of course I'll take a look! We need to figure this out! 🙂

Nothing in the log?

Thanks

@mwasson74
Author

Me too!!

Thank you!!

In our app's log? I'll see what I can find for those 16 jobs.

@mwasson74
Author

@gottscj

I am seeing that around 00:35:00Z it says the processing job was triggered from the UI. I'm asking our support team right now. ...And they just responded that none of them triggered any Hangfire jobs last night.

Below is what I have in our logs for jobId 65c57340a23bc4b9d74f35a2:

[screenshot of the log entries]

@gottscj
Owner

gottscj commented Feb 9, 2024

@mwasson74,

I believe I have found the error.

The DisableConcurrentExecution filter will acquire a lock for the job in the "OnPerforming" event and release it in the "OnPerformed" event (see the sketch after the list below).
This is what I think is happening:

  1. The job is enqueued by the scheduler.
    • The "OnPerforming" event is triggered and a lock is acquired for the specific recurring job.
  2. The job is re-queued in the dashboard.
    • As the job goes from "Processing" -> "Enqueued", the "OnPerforming" event is triggered again, and it tries to acquire the lock that was already taken when the job was enqueued by the scheduler. It will never get it, as "OnPerformed" has not been raised.
  3. When the lock acquisition in step 2 times out, it fails the job, and the AutomaticRetry attribute deletes it, removing it from the queue. The lock from step 1 is still not released.
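A simplified sketch of that filter's lock lifecycle, modeled on Hangfire's DisableConcurrentExecution (resource naming and error handling omitted):

using System;
using Hangfire.Common;
using Hangfire.Server;

public class DisableConcurrentExecutionSketchAttribute : JobFilterAttribute, IServerFilter
{
    private readonly int _timeoutInSeconds;

    public DisableConcurrentExecutionSketchAttribute(int timeoutInSeconds)
        => _timeoutInSeconds = timeoutInSeconds;

    public void OnPerforming(PerformingContext context)
    {
        // Step 1 above: the lock is taken as soon as a worker starts the job.
        var distributedLock = context.Connection.AcquireDistributedLock(
            "job-resource", TimeSpan.FromSeconds(_timeoutInSeconds));
        context.Items["DistributedLock"] = distributedLock;
    }

    public void OnPerformed(PerformedContext context)
    {
        // The lock is only released here; if a second "OnPerforming" fires for
        // the same resource first (the requeue in step 2), it times out waiting.
        ((IDisposable)context.Items["DistributedLock"]).Dispose();
    }
}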

If you look in the locks collection, there should be unresolved lock entries for the failing recurring jobs.
I have a hard time reproducing the states the jobs are in. Can you elaborate on how you have configured the MongoStorageOptions?

I would advise you to remove the DisableConcurrentExecution attribute. I believe this, in combination with the jobs being requeued while processing, creates a deadlock which causes the locks to time out and fail the jobs.
I believe SkipWhenPreviousJobIsRunning would suffice, unless you trigger these jobs outside the recurring context?

Recurring jobs will always have one worker.
If your team wants to run a job, it's best to trigger it from the "Recurring Jobs" page.

Let me know what you think.

@mwasson74
Author

@gottscj

Interesting!! I didn't export the locks collection this time, only the jobGraph, since that's the only one we needed last time. I will remove the DisableConcurrentExecution attribute on Monday (I need to leave soon). I wish I could find out who is triggering these manually. I may have to temporarily change the dashboard password so I'm the only one who can log in, so we can rule that out. They swear they're not triggering them... 🤔

Are these the MongoStorageOptions you're wondering about?
[screenshots of the configuration]

@gottscj
Owner

gottscj commented Feb 11, 2024

I am able to reproduce the same behavior using the original SQL storage, so the DisableConcurrentExecution attribute definitely needs to go.

I did find a bug in the SlidingInvisibilityTimeout feature where, if a running job is requeued, the heartbeat keeps running for the first instance when it should not, so I will create a new release with a fix for this.

With the SlidingInvisibilityTimeout feature, the job will be enqueued automatically if the worker process dies, so there should be no need for users to requeue it from the dashboard. If they need to run it outside the configured cron schedule, they should trigger it from the "Recurring Jobs" page, not the "Jobs" page. Requeuing an in-process job will work, but you will lose the state history for the first scheduled job. I'm sure you already know this, but it seems your support team may be unaware?

[screenshot]

I noticed you are also applying the AutomaticRetry attribute twice. This doesn't seem to have any effect, though.

I hope this helps. :)

@mwasson74
Author

@gottscj

I'm glad you found more things, I'll definitely get rid of that attribute today and also pull down your latest update...thanks!!

Yes, I'm most definitely aware of where to queue a job from 😜 I am considering changing the password to the dashboard so I can absolutely rule out the possibility of anyone requeuing a job incorrectly.

Ah, yes, now that you mention it, I have AutomaticRetry set in a .UseFilter() on the GlobalConfiguration and applied to the jobs themselves, too. I'll keep only one.
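A sketch of keeping the filter in the global configuration only; the Attempts and delete-on-exceeded values are assumptions, not the solution's actual settings:

using Hangfire;

// Registered once at startup; the per-class [AutomaticRetry] attributes
// can then be removed so the filter is applied exactly once.
GlobalConfiguration.Configuration
    .UseFilter(new AutomaticRetryAttribute
    {
        Attempts = 3,
        OnAttemptsExceeded = AttemptsExceededAction.Delete
    });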

It helps a bunch, thank you again!! I'll report back with any updates.

@gottscj
Owner

gottscj commented Feb 12, 2024

@mwasson74,

Thanks! Hopefully this will resolve your issue. 🙂

@mwasson74
Author

I've deployed to production so we'll see what happens 🤞

With any of your findings and updates to Hangfire.Mongo, do you think any changes should also be made to Hangfire.SqlServer?

@gottscj
Owner

gottscj commented Feb 12, 2024

I don't think so.
Hangfire.Mongo now behaves exactly like Hangfire.SqlStorage.
One thing which applies to both libraries is that if the job is in fact processing when requeued, and not stale, the state of the first job will be lost, as the second enqueued job will overwrite the states.

I'm not sure it makes sense to handle this case, though, as it will only happen if a user requeues a job from the dashboard even though it is processing.

@mwasson74
Author

@gottscj

So far so good!! Although it ran all weekend without this issue popping up, I only pushed your suggestions yesterday, so I'll be more convinced if/when we make it to the end of the week without an issue 🤞

@gottscj
Owner

gottscj commented Feb 14, 2024

Sounds good! The bug is triggered by requeuing an already processing job, so you can try that if you like. It should be cancelled by your attribute.

Thanks 🙏

@mwasson74
Author

@gottscj

Everything's been looking really good this week, thank you!! The root cause was an InvisibilityTimeout bug along with someone on our end triggering the jobs from the "Jobs" page instead of the "Recurring Jobs" page, right?

@gottscj
Owner

gottscj commented Feb 15, 2024

Great news!

The root bug was that I did not properly set job states when a job instance was requeued while processing. It was triggered by someone on your end requeuing a job from the dashboard even though it was still running. This enqueued the same job again, but as I had not set the required fields, the job was stuck in the "Enqueued" state, which was also reflected in the dashboard. This, in combination with the DisableConcurrentExecution filter, caused a deadlock (only when requeuing an already running job).

Triggering an already running job is allowed by the Hangfire dashboard; the user will, however, lose the job history for the first running instance, as the second's state history will be written instead (they both have the same JobId). Triggering from the "Recurring Jobs" page will create a new instance (with a new JobId). This also happens in Hangfire.SqlStorage.

When a running job is requeued, the job history will look like:

  1. "Enqueued"
  2. "Processing"
  3. "Enqueued" (by dashboard user)
  4. "Processing"
  5. "Succeeded"

The first two states belong to the first job, which was started by Hangfire per its CRON; the last three belong to the second job, which was started by the user. Whatever happens to the first job (succeeds, fails) is lost.

SlidingInvisibilityTimeout would, with default values, mark a job stale after 5 minutes without a heartbeat (Hangfire.Mongo sends heartbeats five times per timeout window, so every minute by default), allowing Hangfire to start the job in a different worker. This obsoletes InvisibilityTimeout, which was a fixed value that ALL jobs were expected to run to completion within. With SlidingInvisibilityTimeout, the job can take as long as needed and will not be marked stale as long as it has a heartbeat.

It's a little complex to explain, but I hope it makes sense.

Thanks!

@mwasson74
Author

It does make sense, thank you!! I'd say we should close this now, don't you?

@gottscj
Owner

gottscj commented Feb 15, 2024

@mwasson74,

Let's close it for now. We will reopen if something else comes up.
Thank you for the generous contribution!

@gottscj gottscj closed this as completed Feb 15, 2024
@mwasson74
Author

Thank you for all of your expedient assistance!! 🙏
