race condition for batches callback #832

v2kovac · 2023-02-07T20:26:37Z

i have a TestJob

def perform(id, test)
    sleep(10)
    raise 'test error (random)' if rand(10) == 1
    raise 'test error (id = 2)' if id == 2
  end

I run like 20 of these at once in the enqueue and i get almost as many callback notifications from slack 10 seconds later (the callback pings slack). Works for a small amount of jobs too. This is only in production where we've got 2 processes with 25 max_threads each, running aurora with a multiple database setup with mysql. Might be a config issue on our part i need to look into the devops side more.

The text was updated successfully, but these errors were encountered:

bensheldon · 2023-02-07T21:03:37Z

@v2kovac could you share how you're enqueuing them?

Also, I see "Aurora" and "MySQL" which might be the cause. GoodJob is only compatible with Postgres. I'm surprised it works at all, but the Advisory Locks might not function the same.

v2kovac · 2023-02-07T21:59:15Z

Aurora PostgreSQL sorry, also it's an on_finish callback and i guess it's just being called like a ton of times all at the same time. I think this is just a real race condition somehow happening, very consistent btw with the setup here.

i'm still on 3.10.0, enqueuing like so:

batch = GoodJob::Batch.enqueue(properties) do
        arguments.each do |arg_row|
          job_klass.constantize.perform_later(*arg_row)
        end
      end

v2kovac · 2023-02-07T22:01:34Z

Btw i just ran a massive batch job 1 hour long 15k plus jobs at full throttle and the callback only got called once. This might just be because of the sleep here, something weird.

bensheldon · 2023-02-08T01:40:41Z

@v2kovac hmm, I'm at a loss for how that would be happening 🤔

I put together a more aggressive async test for Batches in #833, if that maybe helps you to reproduce it.

v2kovac · 2023-02-13T05:15:17Z

hmm yeah i had it happen again for a real job this time:

 a.callback_active_jobs.map(&:enqueued_at)
=> ["2023-02-12T17:00:20Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:19Z",
 "2023-02-12T17:00:18Z",
 "2023-02-12T17:00:19Z",
 "2023-02-12T17:00:19Z",
...

enqueued the callback 70 times, the only thing i can think of is sometimes i get these errors:

ActiveRecord::StatementInvalid
PG::ConnectionBad: PQconsumeInput() SSL SYSCALL error: Success

we switched off serverless aurora to r6g.larges because of this error, but this multiple callback issue still just happened. It doesn't happen every time though pretty rare so far, but enough to be concerning.

bensheldon · 2023-02-13T05:33:47Z

Hmm. Have you upgraded to 3.10.1 or later? I think this PR might help because it has the reload: #824

v2kovac · 2023-02-13T05:36:37Z

i'll upgrade to latest and stress test again

v2kovac · 2023-02-13T07:24:36Z

looks good! haven't had it happen after a dozen stress tests

v2kovac · 2023-02-24T02:49:38Z

this still looks good now i consider it solved, thanks

bensheldon · 2023-02-24T18:21:29Z

@v2kovac Yay! Thanks for working through this with me 🙌🏻

bensheldon mentioned this issue Feb 8, 2023

Add aggressive test for async Batches #833

Merged

v2kovac closed this as completed Feb 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

race condition for batches callback #832

race condition for batches callback #832

v2kovac commented Feb 7, 2023

bensheldon commented Feb 7, 2023

v2kovac commented Feb 7, 2023

v2kovac commented Feb 7, 2023

bensheldon commented Feb 8, 2023

v2kovac commented Feb 13, 2023

bensheldon commented Feb 13, 2023

v2kovac commented Feb 13, 2023

v2kovac commented Feb 13, 2023

v2kovac commented Feb 24, 2023

bensheldon commented Feb 24, 2023

race condition for batches callback #832

race condition for batches callback #832

Comments

v2kovac commented Feb 7, 2023

bensheldon commented Feb 7, 2023

v2kovac commented Feb 7, 2023

v2kovac commented Feb 7, 2023

bensheldon commented Feb 8, 2023

v2kovac commented Feb 13, 2023

bensheldon commented Feb 13, 2023

v2kovac commented Feb 13, 2023

v2kovac commented Feb 13, 2023

v2kovac commented Feb 24, 2023

bensheldon commented Feb 24, 2023