Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race condition for batches callback #832

Closed
v2kovac opened this issue Feb 7, 2023 · 10 comments
Closed

race condition for batches callback #832

v2kovac opened this issue Feb 7, 2023 · 10 comments

Comments

@v2kovac
Copy link

v2kovac commented Feb 7, 2023

i have a TestJob

def perform(id, test)
    sleep(10)
    raise 'test error (random)' if rand(10) == 1
    raise 'test error (id = 2)' if id == 2
  end

I run like 20 of these at once in the enqueue and i get almost as many callback notifications from slack 10 seconds later (the callback pings slack). Works for a small amount of jobs too. This is only in production where we've got 2 processes with 25 max_threads each, running aurora with a multiple database setup with mysql. Might be a config issue on our part i need to look into the devops side more.

@bensheldon
Copy link
Owner

@v2kovac could you share how you're enqueuing them?

Also, I see "Aurora" and "MySQL" which might be the cause. GoodJob is only compatible with Postgres. I'm surprised it works at all, but the Advisory Locks might not function the same.

@v2kovac
Copy link
Author

v2kovac commented Feb 7, 2023

Aurora PostgreSQL sorry, also it's an on_finish callback and i guess it's just being called like a ton of times all at the same time. I think this is just a real race condition somehow happening, very consistent btw with the setup here.

i'm still on 3.10.0, enqueuing like so:

batch = GoodJob::Batch.enqueue(properties) do
        arguments.each do |arg_row|
          job_klass.constantize.perform_later(*arg_row)
        end
      end

@v2kovac
Copy link
Author

v2kovac commented Feb 7, 2023

Btw i just ran a massive batch job 1 hour long 15k plus jobs at full throttle and the callback only got called once. This might just be because of the sleep here, something weird.

@bensheldon
Copy link
Owner

@v2kovac hmm, I'm at a loss for how that would be happening 🤔

I put together a more aggressive async test for Batches in #833, if that maybe helps you to reproduce it.

@v2kovac
Copy link
Author

v2kovac commented Feb 13, 2023

hmm yeah i had it happen again for a real job this time:

 a.callback_active_jobs.map(&:enqueued_at)
=> ["2023-02-12T17:00:20Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:19Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:19Z",
 "2023-02-12T17:00:19Z",
...

enqueued the callback 70 times, the only thing i can think of is sometimes i get these errors:

ActiveRecord::StatementInvalid
PG::ConnectionBad: PQconsumeInput() SSL SYSCALL error: Success

we switched off serverless aurora to r6g.larges because of this error, but this multiple callback issue still just happened. It doesn't happen every time though pretty rare so far, but enough to be concerning.

@bensheldon
Copy link
Owner

Hmm. Have you upgraded to 3.10.1 or later? I think this PR might help because it has the reload: #824

@v2kovac
Copy link
Author

v2kovac commented Feb 13, 2023

i'll upgrade to latest and stress test again

@v2kovac
Copy link
Author

v2kovac commented Feb 13, 2023

looks good! haven't had it happen after a dozen stress tests

@v2kovac
Copy link
Author

v2kovac commented Feb 24, 2023

this still looks good now i consider it solved, thanks

@v2kovac v2kovac closed this as completed Feb 24, 2023
@bensheldon
Copy link
Owner

@v2kovac Yay! Thanks for working through this with me 🙌🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

2 participants