
When concurrency is set to forbid and dkron exits before the job is done, the job will never run again #349

Closed
zacharya19 opened this issue Apr 20, 2018 · 5 comments

zacharya19 commented Apr 20, 2018

Quick reproduction:

  1. docker-compose up -d.
  2. Create a job with this JSON:

{
    "name": "test_job_1",
    "command": "sleep 50",
    "schedule": "@every 10s",
    "concurrency": "forbid"
}

  3. docker-compose restart dkron.

Now dkron thinks the job is running, but it isn't, and we will forever see scheduler: Skipping execution.

EDIT: this also happens when restarting the dkron service (not only on SIGKILL).

@vcastellm vcastellm added this to the v0.10.0 milestone Apr 25, 2018
@vcastellm vcastellm added bug and removed bug labels Apr 25, 2018
@vcastellm vcastellm removed this from the v0.10.0 milestone Apr 25, 2018

koolay commented Apr 27, 2018

I have the same problem.

This code has a concurrency bug in this scenario; the check should be protected by a distributed lock.

func (j *Job) isRunnable() bool {
	status := j.Status()

	if status == Running {
		if j.Concurrency == ConcurrencyAllow {
			return true
		} else if j.Concurrency == ConcurrencyForbid {
			log.WithFields(logrus.Fields{
				"job":         j.Name,
				"concurrency": j.Concurrency,
				"job_status":  status,
			}).Debug("scheduler: Skipping execution")
			return false
		}
	}

	return true
}
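
To illustrate koolay's distributed-lock suggestion (this is not dkron's actual implementation), here is a minimal Go sketch using Consul's session-based lock API. The key prefix dkron/locks/ and the runWithLock helper are made up for illustration; the relevant property is that the lock is tied to a Consul session with a TTL, so a node killed mid-execution loses the lock automatically and the job is not stuck as "running" forever.

// Sketch only: serialize executions of one job with a Consul session lock.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func runWithLock(jobName string, run func()) error {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return err
	}

	// LockTryOnce gives "forbid" semantics: if another node holds the lock,
	// skip this run instead of queueing behind it.
	lock, err := client.LockOpts(&api.LockOptions{
		Key:         "dkron/locks/" + jobName, // illustrative key, not dkron's schema
		LockTryOnce: true,
	})
	if err != nil {
		return err
	}

	held, err := lock.Lock(nil)
	if err != nil {
		return err
	}
	if held == nil {
		log.Printf("scheduler: %s already running elsewhere, skipping", jobName)
		return nil
	}
	// The lock is bound to a session with a TTL, so if this process is killed
	// the session expires and the lock is released automatically.
	defer lock.Unlock()

	run()
	return nil
}

func main() {
	_ = runWithLock("test_job_1", func() { log.Println("sleep 50 placeholder") })
}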

zacharya19 (Author) commented

I believe each execution should save the process id so dkron would be able to check the status.
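
As a small Unix-only sketch of that suggestion (not dkron code): if each execution persisted its child PID, the agent could test it on startup with signal 0 and clear a stale "running" status when the process is gone. isProcessAlive is a made-up helper name.

// Sketch: check whether a previously recorded execution PID is still alive
// (Unix only; signal 0 performs error checking without delivering a signal).
package main

import (
	"fmt"
	"os"
	"syscall"
)

func isProcessAlive(pid int) bool {
	proc, err := os.FindProcess(pid) // on Unix this always succeeds
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	// On agent start: if the stored execution PID is dead, the job's
	// "running" status could be reset so the scheduler stops skipping it.
	fmt.Println(isProcessAlive(os.Getpid())) // true for our own PID
}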

@vcastellm vcastellm added the bug label Apr 30, 2018
@vcastellm vcastellm added this to the v0.10.0 milestone Apr 30, 2018
vcastellm pushed a commit that referenced this issue May 3, 2018
As #349 describes, a job with forbidden concurrency doesn't execute again if the target node is restarted.

This PR tries to solve it by implementing a mechanism that asks the running nodes about the job status before checking whether the job finished and before running the job.
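
The actual change is in the PR itself; the sketch below only illustrates the idea on top of the isRunnable snippet quoted earlier. queryRunningNodes is a hypothetical stand-in for whatever query the nodes answer; the point is that a "running" status is re-verified before the scheduler decides to skip.

// Illustrative only: re-verify a "running" status with the cluster before
// honouring ConcurrencyForbid. queryRunningNodes is a hypothetical helper.
func (j *Job) isRunnable() bool {
	if j.Status() != Running {
		return true
	}
	if j.Concurrency == ConcurrencyAllow {
		return true
	}

	// Concurrency is "forbid": ask the nodes whether the job is really running.
	stillRunning, err := queryRunningNodes(j.Name)
	if err != nil || stillRunning {
		// Could not verify, or it is genuinely running: keep skipping.
		return false
	}

	// No node reports the job as running, so the stored status is stale
	// (e.g. the node was restarted mid-execution): allow this run.
	return true
}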
vcastellm (Member) commented

Fixed in #359

zacharya19 (Author) commented

@Victorcoder I think the issue still exists.
I added GetStatus to the log and I see this:
concurrency=forbid get_status_function=failed job=test_job_1 job_status=running node=39d4b2198b72

I think it's because of this check:
if j.Status == StatusNotSet {

vcastellm (Member) commented

Should be fixed in #383
