AWS RDS Failover hangs hangfire for 3 days and it cannot start again 10K+ pending jobs #81

sseyalioglu · 2021-04-20T16:08:32Z

Hi,

I recently upgraded my Aurora DB (MySQL) to the latest version, and during the process the DB rebooted, failed over, etc. One thing I failed to notice was that the Hangfire process got into a partially working state on all 4 servers.
It was able to queue jobs but not able to process anything at all.
As a result, I restarted the instances so that processing would kick in again.
Here are the problems that happened, as far as I understand:

1. All servers sent so many requests to the DB that it was pushed to 100% and kept running the commands, with a good number of queries in a somewhat deadlocked state.
2. This caused the DB to sit at 100% CPU.
3. This was followed by timeouts and related exceptions, as can be seen here:
4. I suspect 1 & 2 started the issue in general, causing so many bulk hits to the DB.
5. Because so many processes got stuck in the DB, timeouts happen just like in "e".

To handle the case, I created another database so the systems would use a fresh, empty DB, and that worked fine. The problem is that I still want to process the 10K+ locked-up jobs, so I pointed one of the servers back to the old DB connection. Now this single server gets locked up and stuck just like in the screenshot above. I think it uses so much memory that I cannot even connect to the server via SSH. The only option I have is to reboot the server and, once it is up, stop the service so I can modify the connection string and the server responds and processes properly again.

I am looking for the following:

1. A way to prevent this condition for anyone in the future.
2. A way to process my 10K+ jobs.
3. A good suggestion on alerts for Hangfire, so I will know when it is not working.

For #1, a DB failover should not stop processing completely (as described above: accepting newly queued jobs but not processing any), see below

Again for #1: once the DB is available again, Hangfire should not bombard it to the point where the systems lock up completely and cannot process any DB operations, as can be seen below:

Environment information:

Ubuntu servers, running .NET Core 3.1 and Hangfire.MySqlStorage 2.0.2 with Hangfire.Core 1.7.20.
I cannot use the latest version, 2.0.3, because it requires MySqlConnector >= 1.0.0, and we use Pomelo, which has not yet released a version that supports it.

sseyalioglu · 2021-04-20T18:38:55Z

After the DB upgrade to Aurora 5.7.mysql_aurora.2.09.2, I also started to get the following frequently:

Hangfire.MySql.MySqlDistributedLockException: cannot acquire lock
   at Hangfire.MySql.MySqlDistributedLock.Acquire()
   at Hangfire.MySql.MySqlStorageConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func`2 action)
   at Hangfire.Server.RecurringJobScheduler.EnqueueNextRecurringJobs(BackgroundProcessContext context)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

sseyalioglu · 2021-04-20T18:44:17Z

@arnoldasgudas any idea? I am having serious issues. This is also blocking my server.

sseyalioglu · 2021-04-21T10:47:56Z

I think something was wrong with the tables: one record could not be deleted, so I dumped and rebuilt the tables with the right data. Now the above issues don't happen; however, I keep getting the following:

MySql.Data.MySqlClient.MySqlException (0x80004005): Lock wait timeout exceeded; try restarting transaction
 ---> MySql.Data.MySqlClient.MySqlException (0x80004005): Lock wait timeout exceeded; try restarting transaction
   at MySqlConnector.Core.ResultSet.ReadResultSetHeaderAsync(IOBehavior ioBehavior) in /_/src/MySqlConnector/Core/ResultSet.cs:line 51
   at MySql.Data.MySqlClient.MySqlDataReader.ActivateResultSet() in /_/src/MySqlConnector/MySql.Data.MySqlClient/MySqlDataReader.cs:line 116
   at MySql.Data.MySqlClient.MySqlDataReader.CreateAsync(CommandListPosition commandListPosition, ICommandPayloadCreator payloadCreator, IDictionary`2 cachedProcedures, IMySqlCommand command, CommandBehavior behavior, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySql.Data.MySqlClient/MySqlDataReader.cs:line 391
   at MySqlConnector.Core.CommandExecutor.ExecuteReaderAsync(IReadOnlyList`1 commands, ICommandPayloadCreator payloadCreator, CommandBehavior behavior, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/Core/CommandExecutor.cs:line 62
   at MySql.Data.MySqlClient.MySqlCommand.ExecuteNonQueryAsync(IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySql.Data.MySqlClient/MySqlCommand.cs:line 218
   at MySql.Data.MySqlClient.MySqlCommand.ExecuteNonQuery() in /_/src/MySqlConnector/MySql.Data.MySqlClient/MySqlCommand.cs:line 68
   at Dapper.SqlMapper.ExecuteCommand(IDbConnection cnn, CommandDefinition& command, Action`2 paramReader) in /_/Dapper/SqlMapper.cs:line 2822
   at Dapper.SqlMapper.ExecuteImpl(IDbConnection cnn, CommandDefinition& command) in /_/Dapper/SqlMapper.cs:line 572
   at Dapper.SqlMapper.Execute(IDbConnection cnn, String sql, Object param, IDbTransaction transaction, Nullable`1 commandTimeout, Nullable`1 commandType) in /_/Dapper/SqlMapper.cs:line 443
   at Hangfire.MySql.CountersAggregator.<>c__DisplayClass6_0.<Execute>b__0(MySqlConnection connection)
   at Hangfire.MySql.MySqlStorage.<>c__DisplayClass20_0.<UseConnection>b__0(MySqlConnection connection)
   at Hangfire.MySql.MySqlStorage.UseConnection[T](Func`2 func)
   at Hangfire.MySql.MySqlStorage.UseConnection(Action`1 action)
   at Hangfire.MySql.CountersAggregator.Execute(CancellationToken cancellationToken)
   at Hangfire.Server.ServerProcessDispatcherBuilder.ExecuteComponent(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

Any idea?
I strongly suspect it has something to do with "5.7.mysql_aurora.2.09.2".

There is no pressure on the DB and only one Hangfire instance is running, so why is this happening?
Hangfire.MySqlStorage team, please respond, I am clueless.
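
For anyone debugging the same "Lock wait timeout exceeded" error, a rough sketch of how the blocking transaction might be located while the exception is occurring. This assumes a MySQL 5.7-compatible server (which Aurora 2.x is) and an account allowed to read `information_schema`; it only inspects state and does not change anything.

```sql
-- Lock wait timeout currently in effect (the InnoDB default is 50 seconds).
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

-- Transactions that are open right now, oldest first.
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id,
       trx_rows_locked, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started;

-- Which transaction is waiting on which.
-- INNODB_LOCK_WAITS exists in MySQL 5.7 but was removed in 8.0.
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;
```

If the last query returns no rows while the error keeps appearing, the wait is probably short-lived and the queries need to be run at the moment the aggregator is executing.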

sseyalioglu · 2021-04-21T11:55:46Z

Figured out the issue completely.
During the Aurora upgrade from 1.22-something to 2.09, the row format changed from DYNAMIC to COMPACT.
Changing it back to DYNAMIC solved the issue, but I still have lots of whys in my mind.
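
A minimal sketch of how the row format can be checked and changed back, in case it helps someone else. The schema name `hangfire` and the table name `HangfireCounter` are placeholders, not taken from this setup; substitute your own schema and repeat the `ALTER TABLE` for each affected table. Note that changing the row format rebuilds the table, so it may take a while on large tables.

```sql
-- List the row format of every table in the Hangfire schema.
-- 'hangfire' is a placeholder schema name.
SELECT TABLE_NAME, ROW_FORMAT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'hangfire';

-- Switch one table back to DYNAMIC.
-- `HangfireCounter` is an assumed table name used for illustration.
ALTER TABLE `hangfire`.`HangfireCounter` ROW_FORMAT=DYNAMIC;

-- Row format that newly created InnoDB tables get by default
-- (variable available in MySQL 5.7.9 and later).
SHOW VARIABLES LIKE 'innodb_default_row_format';
```

Running the first query before and after an engine upgrade is a cheap way to spot this kind of silent change.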

sseyalioglu mentioned this issue Apr 20, 2021

AWS RDS Failover hangs hangfire for 3 days and it cannot start again 10K+ pending jobs HangfireIO/Hangfire#1849

Closed

sseyalioglu closed this as completed Apr 20, 2021

sseyalioglu reopened this Apr 20, 2021
