Many mutex fails after #49000a30fd34ac21cabdc7d #57

Closed
Saicheg opened this issue Jul 16, 2018 · 9 comments
Comments

Saicheg commented Jul 16, 2018

Hey @pokonski !

We started seeing many mutex failures after your commit #49000a30fd34ac21cabdc7d.

The problem is that when you have thousands of really fast jobs, they all check whether the final job should be enqueued at the same time and conflict with each other too much.

I think the mutex is actually wrapping the wrong part; `client.enqueue_job(workflow_id, out)` feels like the right thing to lock:

def enqueue_outgoing_jobs
  job.outgoing.each do |job_name|
    out = client.find_job(workflow_id, job_name)

    if out.ready_to_start?
      RedisMutex.with_lock("gush_enqueue_outgoing_jobs_#{workflow_id}-#{job_name}", sleep: 0.3, block: 2) do
        client.enqueue_job(workflow_id, out)
      end
    end
  end
end

What do you think? @pokonski

@Saicheg Saicheg changed the title Many mutex-fails after #49000a30fd34ac21cabdc7d Many mutex fails after #49000a30fd34ac21cabdc7d Jul 16, 2018
pokonski commented Jul 16, 2018

Hey @Saicheg!

I'm not sure this will prevent two quick jobs from duplicating, though: if they check without blocking, they might have outdated info and still enqueue the same child twice. What do you think?

Saicheg commented Jul 16, 2018

@pokonski I think we should actually increase the blocking timeout (up to 5-10 seconds), and instead of waiting for the lock we should fail all the other jobs that are trying to enter that section.

So even if they all reach that point, only one job will enqueue the final one and the others will fail.
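
For context on the options being discussed: in the redis-mutex gem, `block:` is how long `with_lock` keeps retrying before it raises `RedisMutex::LockError`, and `sleep:` is the polling interval between attempts. A minimal illustration of the proposal (the values here are just examples, not Gush's actual settings):

# Wait up to 5 seconds, polling every 0.3s. If the lock is still held after that,
# RedisMutex::LockError is raised and this sibling gives up instead of blocking longer.
RedisMutex.with_lock("gush_enqueue_outgoing_jobs_#{workflow_id}-#{job_name}", block: 5, sleep: 0.3) do
  client.enqueue_job(workflow_id, out)
end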

Saicheg commented Jul 18, 2018

@pokonski have you had a chance to think this through?

@pokonski

Yeah quietly failing makes sense, since the job will be enqueued anyway.

schorsch commented Nov 6, 2018

I also got those mutex errors with 1200 jobs in the queue. I tried the proposed fix, but when the last workers ran, the job that should run after the batch got enqueued multiple times, once for each running worker.

treyperkins commented Jun 26, 2019

I was able to fix this by setting `block` to 0 and rescuing the resulting error. Since the jobs that couldn't get the mutex lock don't enqueue anything, the one that did acquire the lock enqueued or skipped the child as appropriate.

However, I did notice that in my case this only occurs on the first batch of jobs run on a Sidekiq server, and only if they finish too close to each other. Subsequent jobs on the same server do not seem to have this lock conflict issue.
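
A minimal sketch of that fix, reusing the `enqueue_outgoing_jobs` shape from above (treat the exact structure as an assumption, not merged code): `block: 0` makes redis-mutex raise `RedisMutex::LockError` immediately instead of waiting, and the rescue swallows it because whichever sibling holds the lock will enqueue or skip the child itself.

def enqueue_outgoing_jobs
  job.outgoing.each do |job_name|
    out = client.find_job(workflow_id, job_name)

    begin
      # Try the lock exactly once; losers raise instead of sleeping.
      RedisMutex.with_lock("gush_enqueue_outgoing_jobs_#{workflow_id}-#{job_name}", block: 0) do
        # Re-check readiness while holding the lock so only one sibling enqueues the child.
        client.enqueue_job(workflow_id, out) if out.ready_to_start?
      end
    rescue RedisMutex::LockError
      # Another finished sibling holds the lock and will enqueue this child if it's ready.
    end
  end
end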

@anirbanmu

Was a resolution reached here on what the proper fix is?

suonlight added a commit to suonlight/gush that referenced this issue Oct 10, 2019
# Summary

This commit tries to fix the RedisMutex::LockError issue on
chaps-io#57.
Whenever we hit the error, it simply enqueues another worker.
At the beginning of that worker, if the job has already succeeded, we simply
call `enqueue_outgoing_jobs` again.
@suonlight

Can anyone help me review my PR?

pokonski pushed a commit that referenced this issue Oct 10, 2019
* Try to enqueue outgoing jobs in another worker

# Summary

This commit tries to fix the RedisMutex::LockError issue on
#57.
Whenever we hit the error, it simply enqueues another worker.
At the beginning of that worker, if the job has already succeeded, we simply
call `enqueue_outgoing_jobs` again.

* Delay next job's execution to avoid hammering the lock
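
A rough sketch of what those two commits describe (the class and method names loosely follow Gush's async worker, but the exact code here is an assumption for illustration, not the merged implementation): when the fan-out hits `RedisMutex::LockError`, the worker schedules a fresh copy of itself a couple of seconds later instead of failing, and on re-entry an already-succeeded job only retries `enqueue_outgoing_jobs`.

class AsyncWorker
  include Sidekiq::Worker

  def perform(workflow_id, job_name)
    # ... load the workflow and job; if the job already succeeded on a previous
    # attempt, skip straight to the fan-out, otherwise run it first ...
    enqueue_outgoing_jobs
  rescue RedisMutex::LockError
    # Couldn't get the lock: retry the fan-out in a new worker after a short delay
    # so the finishing siblings don't keep hammering the same lock.
    self.class.perform_in(2, workflow_id, job_name)
  end
end
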
@pokonski

The PR was merged back then, so I am closing this issue :)
