AWS RDS Failover hangs hangfire for 3 days and it cannot start again 10K+ pending jobs #81

Open
sseyalioglu opened this issue Apr 20, 2021 · 4 comments

Comments

@sseyalioglu

sseyalioglu commented Apr 20, 2021

Hi,

I recently upgraded my Aurora DB (MySQL) to the latest version, and during the process the DB rebooted, failed over, etc. One thing I failed to notice was that the Hangfire process got into a partially working state on all 4 servers.
It was able to queue jobs but not able to process anything at all.
As a result, I restarted the instances to get processing going again.
Here are the problems that happened, as far as I understand:

  1. All servers sent so many requests to the DB that it hit 100% and never finished running the commands, with a good number of queries in a somewhat deadlocked state.
  2. This caused the DB to sit at 100% CPU.
  3. Timeouts and related exceptions followed, as can be seen here:
    image
  4. I suspect 1 & 2 started the issue in general, causing so many bulk hits to the DB.
  5. Because so many processes got stuck in the DB, timeouts happen just like in "e".
To handle this, I created another database so the systems would use a fresh, empty DB, and that worked fine. The problem is that I still want to process the 10K+ locked-up jobs, so I pointed one of the servers back at the old DB. Now this single server gets locked up and stuck just like in the screenshot above. I think it uses so much memory that I cannot even connect to the server via SSH. The only option I have is to reboot the server and, once it is up, stop the service so I can modify the connection string; after that the server responds and processes properly.

I am looking for the following:

  1. A way to prevent this condition for anyone in the future.
  2. A way to process my 10K+ jobs.
  3. A good suggestion on alerts for Hangfire, so I will know when it is not working.
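
For point 3, one possible alert is a check that fires when the backlog keeps growing while nothing completes, which matches the "queuing but not processing" state described above. The sketch below is only an illustration in Python: `QueueSnapshot` and `looks_stalled` are hypothetical names, and how the counts are collected (for example from Hangfire's monitoring statistics or from the storage tables) is an assumption left to the caller.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QueueSnapshot:
    """One periodic reading of job counts (hypothetical shape)."""
    enqueued: int    # jobs waiting in queues
    succeeded: int   # cumulative count of completed jobs


def looks_stalled(snapshots: Sequence[QueueSnapshot], min_samples: int = 3) -> bool:
    """Return True when the last `min_samples` readings show a backlog
    that never shrinks while the succeeded counter does not move."""
    if len(snapshots) < min_samples:
        return False  # not enough history to judge
    recent = snapshots[-min_samples:]
    backlog_present = all(s.enqueued > 0 for s in recent)
    backlog_not_shrinking = all(
        later.enqueued >= earlier.enqueued
        for earlier, later in zip(recent, recent[1:])
    )
    no_progress = recent[-1].succeeded == recent[0].succeeded
    return backlog_present and backlog_not_shrinking and no_progress
```

A scheduler outside the Hangfire process (so that it keeps running when Hangfire is stuck) could take a reading every few minutes and raise an alert when `looks_stalled` returns True. The sampling interval and `min_samples` would need tuning to the normal workload.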

For #1, a DB failover should not stop processing completely (as described above: accepting newly queued jobs but not processing any); see below:
image

Again for #1: once the DB is available again, do not bombard it with whatever is causing the systems to lock up completely and be unable to process any DB operations, as can be seen below:
image
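
A common way to avoid this kind of reconnect storm is exponential backoff with random jitter, so that several servers do not retry in lockstep once the DB returns. This is a generic sketch in Python, not a description of how Hangfire or Hangfire.MySqlStorage behaves; `backoff_delay` and its parameters are hypothetical.

```python
import random
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the actual delay
    is a random fraction of that ceiling ("full jitter"), which spreads
    retries from many clients over time.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

Each worker would sleep for `backoff_delay(n)` seconds after its n-th consecutive failed DB call and reset `n` to zero after a success. The `rng` parameter exists only so the function can be tested deterministically.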

Environment information:

Ubuntu servers running .NET Core 3.1 and Hangfire.MySqlStorage 2.0.2 with Hangfire.Core 1.7.20.
I cannot use the latest version, 2.0.3, because it requires MySqlConnector >= 1.0.0, and we use Pomelo, which has not yet released a version that works with it.

@sseyalioglu
Author

sseyalioglu commented Apr 20, 2021

I also started to get the following frequently after the DB upgrade to Aurora 5.7.mysql_aurora.2.09.2:

Hangfire.MySql.MySqlDistributedLockException: cannot acquire lock
   at Hangfire.MySql.MySqlDistributedLock.Acquire()
   at Hangfire.MySql.MySqlStorageConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func`2 action)
   at Hangfire.Server.RecurringJobScheduler.EnqueueNextRecurringJobs(BackgroundProcessContext context)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

@sseyalioglu
Author

@arnoldasgudas any idea? I am having serious issues. This is also blocking my server.

@sseyalioglu sseyalioglu reopened this Apr 20, 2021
@sseyalioglu
Author

sseyalioglu commented Apr 21, 2021

I think something was wrong with the tables: one record could not be deleted, so I dumped and rebuilt the tables with the right data. Now the above issues don't happen; however, I keep getting the following:

MySql.Data.MySqlClient.MySqlException (0x80004005): Lock wait timeout exceeded; try restarting transaction
 ---> MySql.Data.MySqlClient.MySqlException (0x80004005): Lock wait timeout exceeded; try restarting transaction
   at MySqlConnector.Core.ResultSet.ReadResultSetHeaderAsync(IOBehavior ioBehavior) in /_/src/MySqlConnector/Core/ResultSet.cs:line 51
   at MySql.Data.MySqlClient.MySqlDataReader.ActivateResultSet() in /_/src/MySqlConnector/MySql.Data.MySqlClient/MySqlDataReader.cs:line 116
   at MySql.Data.MySqlClient.MySqlDataReader.CreateAsync(CommandListPosition commandListPosition, ICommandPayloadCreator payloadCreator, IDictionary`2 cachedProcedures, IMySqlCommand command, CommandBehavior behavior, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySql.Data.MySqlClient/MySqlDataReader.cs:line 391
   at MySqlConnector.Core.CommandExecutor.ExecuteReaderAsync(IReadOnlyList`1 commands, ICommandPayloadCreator payloadCreator, CommandBehavior behavior, IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/Core/CommandExecutor.cs:line 62
   at MySql.Data.MySqlClient.MySqlCommand.ExecuteNonQueryAsync(IOBehavior ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySql.Data.MySqlClient/MySqlCommand.cs:line 218
   at MySql.Data.MySqlClient.MySqlCommand.ExecuteNonQuery() in /_/src/MySqlConnector/MySql.Data.MySqlClient/MySqlCommand.cs:line 68
   at Dapper.SqlMapper.ExecuteCommand(IDbConnection cnn, CommandDefinition& command, Action`2 paramReader) in /_/Dapper/SqlMapper.cs:line 2822
   at Dapper.SqlMapper.ExecuteImpl(IDbConnection cnn, CommandDefinition& command) in /_/Dapper/SqlMapper.cs:line 572
   at Dapper.SqlMapper.Execute(IDbConnection cnn, String sql, Object param, IDbTransaction transaction, Nullable`1 commandTimeout, Nullable`1 commandType) in /_/Dapper/SqlMapper.cs:line 443
   at Hangfire.MySql.CountersAggregator.<>c__DisplayClass6_0.<Execute>b__0(MySqlConnection connection)
   at Hangfire.MySql.MySqlStorage.<>c__DisplayClass20_0.<UseConnection>b__0(MySqlConnection connection)
   at Hangfire.MySql.MySqlStorage.UseConnection[T](Func`2 func)
   at Hangfire.MySql.MySqlStorage.UseConnection(Action`1 action)
   at Hangfire.MySql.CountersAggregator.Execute(CancellationToken cancellationToken)
   at Hangfire.Server.ServerProcessDispatcherBuilder.ExecuteComponent(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

Any idea?
I strongly suspect this has something to do with "5.7.mysql_aurora.2.09.2".

There is no pressure on the DB and only one Hangfire instance, nothing else, so why is this happening?
Hangfire.MySqlStorage team, please respond, I am clueless.

@sseyalioglu
Author

Figured out the issue completely.
During the Aurora upgrade from 1.22-something to 2.09, the row format changed to COMPACT from DYNAMIC.
Changing it back to DYNAMIC solved the issue, but I have lots of whys in my mind.
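
For anyone checking for the same condition, the row format of each table is exposed in MySQL's `information_schema.TABLES`, and `ALTER TABLE ... ROW_FORMAT=DYNAMIC` changes it. The Python sketch below only builds the SQL text; it does not connect to anything. The helper name `alter_statements`, the `%s` placeholder style (which depends on the driver), and the table names in the usage example are assumptions for illustration.

```python
from typing import Iterable, List, Optional, Tuple

# Lists every table in one schema together with its current row format.
ROW_FORMAT_QUERY = (
    "SELECT TABLE_NAME, ROW_FORMAT "
    "FROM information_schema.TABLES "
    "WHERE TABLE_SCHEMA = %s"
)


def alter_statements(
    rows: Iterable[Tuple[str, Optional[str]]],
    target: str = "DYNAMIC",
) -> List[str]:
    """Build one ALTER TABLE statement per table whose row format
    differs from `target`. `rows` are (table_name, row_format) pairs
    as returned by ROW_FORMAT_QUERY; entries with no row format
    (for example views) are skipped."""
    statements = []
    for table_name, row_format in rows:
        if row_format is None or row_format.upper() == target.upper():
            continue
        quoted = table_name.replace("`", "``")  # escape backticks in identifiers
        statements.append(f"ALTER TABLE `{quoted}` ROW_FORMAT={target};")
    return statements


# Hypothetical query result, for illustration only.
example_rows = [("Job", "Compact"), ("Counter", "Dynamic"), ("SomeView", None)]
for statement in alter_statements(example_rows):
    print(statement)
```

Changing the row format rebuilds the table, so on large tables it is safer to review the generated statements and run them during a quiet period.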
