
Add workflow invocation grabbing with db-skip-locked #10177

Merged: 3 commits merged into galaxyproject:dev on Sep 22, 2020

Conversation

@mvdbeek (Member) commented Sep 2, 2020

Adds workflow invocation grabbing with db-skip-locked and db-transaction-isolation.
Closes #8209.

Needs some tests and the grabbing logic should be its own class that can be shared with the job grabber.
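
For context: the db-skip-locked method relies on the database's SELECT … FOR UPDATE SKIP LOCKED support, letting each handler atomically claim unassigned invocations without blocking on rows another handler's transaction has already locked. A minimal illustrative sketch of such a grab in SQL follows; the column names and values are assumptions based on the workflow_invocation table shown later in this thread, not the actual Galaxy query:

BEGIN;
-- Claim up to 10 unassigned invocations for this handler. Rows locked by a
-- concurrent handler's identical query are skipped rather than waited on,
-- so multiple handlers can grab work safely in parallel.
UPDATE workflow_invocation
SET handler = 'handler0'              -- assumed: this handler's server name
WHERE id IN (
    SELECT id
    FROM workflow_invocation
    WHERE handler = '_default_'       -- assumed: not yet claimed
      AND state = 'new'
    ORDER BY id
    LIMIT 10
    FOR UPDATE SKIP LOCKED
);
COMMIT;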

@galaxybot added this to the 20.09 milestone on Sep 2, 2020
@mvdbeek force-pushed the db_skip_locked branch 6 times, most recently from e142050 to 7f95da9, on September 2, 2020 18:19
@mvdbeek marked this pull request as ready for review on September 2, 2020 18:25
@mvdbeek requested a review from natefoo on September 2, 2020 18:25
@natefoo (Member) commented Sep 3, 2020

I had grabbable workflow scheduler assignment working back when I added it for jobs, and was told by @jmchilton not to enable it because workflow invocations would then interleave outputs when run in a single history. My recollection is that you could enable it by explicit configuration in workflow_schedulers_conf.xml; was that not working?

@mvdbeek (Member, Author) commented Sep 3, 2020

I don't think it was ever fully implemented; that's what #8209 is about.

history_local_serial_workflow_scheduling is optional and not on by default; I don't think that should prevent deployers from using db-skip-locked.
I also think history_local_serial_workflow_scheduling probably still works?
The logic for this is at https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/workflow/scheduling_manager.py#L306

That said, I think the history_local_serial_workflow_scheduling logic may be incompatible with subworkflows, where an intermediate invocation output is required to finish scheduling of an outer step, but I guess I am getting off-topic there.

@natefoo (Member) commented Sep 3, 2020

There is a bunch of effort to override db-skip-locked in a variety of handler assignment scenarios unless the admin explicitly sets it in the workflow schedulers config; it looks like you didn't touch that (although changing incompatible methods might have an effect on it)?

@mvdbeek (Member, Author) commented Sep 3, 2020

Yeah, I haven't checked whether just setting db-skip-locked in the job handlers works ... I guess not.

@natefoo (Member) commented Sep 3, 2020

I guess that override should probably follow the value of history_local_serial_workflow_scheduling? Not sure if I missed that option or if it didn't exist at the time.

@mvdbeek (Member, Author) commented Sep 18, 2020

> There is a bunch of effort to override db-skip-locked in a variety of handler assignment scenarios unless the admin explicitly sets it in the workflow schedulers config; it looks like you didn't touch that (although changing incompatible methods might have an effect on it)?

> Yeah, I haven't checked whether just setting db-skip-locked in the job handlers works ... I guess not.

Works fine; I've added a test case for this.

> I guess that override should probably follow the value of history_local_serial_workflow_scheduling? Not sure if I missed that option or if it didn't exist at the time.

Not sure I understand this. Do you agree with me that history_local_serial_workflow_scheduling works regardless of the workflow scheduling method? I'd add a test, but it seems a bit tricky to make sure everything happens serially (not impossible, though, if you insist). What probably doesn't work is parallelize_workflow_scheduling_within_histories: false (which is the default) ... I am still trying to wrap my head around this, but we could update the grabbing query to filter out invocations within histories that already have another active invocation scheduled by another handler (see the sketch after this comment).

But all these concerns also apply to standalone workflow schedulers, so maybe that can be a follow-up?
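
A rough sketch of what that filtered grabbing query could look like in SQL; the column names, states, and overall shape are assumptions for illustration, not the actual implementation:

-- Hypothetical: only grab invocations whose history has no other active
-- invocation already claimed by a (possibly different) handler.
SELECT id
FROM workflow_invocation wi
WHERE wi.handler = '_default_'                  -- assumed: not yet claimed
  AND wi.state = 'new'
  AND NOT EXISTS (
      SELECT 1
      FROM workflow_invocation other
      WHERE other.history_id = wi.history_id
        AND other.id != wi.id
        AND other.handler != '_default_'        -- already grabbed
        AND other.state IN ('new', 'ready')     -- assumed "active" states
  )
FOR UPDATE SKIP LOCKED;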

@natefoo (Member) commented Sep 21, 2020

👍

I think this is all ok and that in-history serialization is being addressed via other means?

@mvdbeek (Member, Author) commented Sep 21, 2020

That's my thinking!

@innovate-invent (Contributor) commented:

Can this be backported to 20.05?

@dannon (Member) commented Sep 22, 2020

@innovate-invent I'll leave that up to Marius; going to go ahead and get this into the dev branch, though.

@dannon merged commit 1a4052b into galaxyproject:dev on Sep 22, 2020
@mvdbeek (Member, Author) commented Sep 22, 2020

I don't think we'd want to make these large-ish changes with a couple of different consequences (see the discussion about serial workflow scheduling above) to an existing release. We're hoping to get 20.09 out in 2 weeks though, so this shouldn't be far away from appearing in a stable release.

@innovate-invent (Contributor) commented:

2 weeks sounds great! Thanks!

@innovate-invent (Contributor) commented:

This PR does not seem to work. Invocations are not being grabbed.

<handlers assign_with="db-skip-locked"></handlers>
galaxy.jobs DEBUG 2020-10-06 16:53:19,747 Loaded job runner 'galaxy.jobs.runners.kubernetes:KubernetesJobRunner' as 'k8s'
galaxy.jobs.handler DEBUG 2020-10-06 16:53:19,748 Loaded job runners plugins: local:k8s
galaxy.jobs.handler INFO 2020-10-06 16:53:19,753 Handler job grabber initialized with 'db-skip-locked' assignment method for handler 'galaxy-worker-99cd6f84d-drpwg', tag(s): _default_
galaxy.jobs.handler INFO 2020-10-06 16:53:19,757 job handler stop queue started
galaxy.jobs.handler DEBUG 2020-10-06 16:53:19,758 Handler queue starting for jobs assigned to handler: galaxy-worker-99cd6f84d-drpwg
galaxy.web_stack.message DEBUG 2020-10-06 16:53:19,812 Bound default message handler 'JobHandlerMessage.default_handler' to <bound method TaskMessage.default_handler of 
galaxy.jobs.handler INFO 2020-10-06 16:53:19,812 job handler queue started
galaxy.jobs.handler INFO 2020-10-06 16:53:19,812 job handler stop queue started
galaxy.web_stack DEBUG 2020-10-06 16:53:19,886 WorkflowSchedulingManager: No job handler assignment methods were configured but this server is configured to attach to the 'job-handlers' pool, automatically enabling the 'db-skip-locked' assignment method
galaxy.web_stack DEBUG 2020-10-06 16:53:19,887 WorkflowSchedulingManager: Removed 'db-self' from handler assignment methods due to use of mules
galaxy.web_stack DEBUG 2020-10-06 16:53:19,887 WorkflowSchedulingManager: handler assignment methods updated to: db-skip-locked
galaxy.web_stack.handlers INFO 2020-10-06 16:53:19,887 WorkflowSchedulingManager: No job handler assignment method is set, defaulting to 'db-skip-locked', set the `assign_with` attribute on <handlers> to override the default
galaxy.workflow.scheduling_manager INFO 2020-10-06 16:53:19,887 Workflow scheduling handler assignment method(s): db-skip-locked
galaxy.workflow.scheduling_manager INFO 2020-10-06 16:53:19,887 Tag [_default_] handlers: galaxy-worker-99cd6f84d-drpwg
galaxy.workflow.scheduling_manager DEBUG 2020-10-06 16:53:19,887 Starting workflow schedulers
galaxy.queue_worker INFO 2020-10-06 16:53:19,914 Binding and starting galaxy control worker for galaxy-worker-99cd6f84d-drpwg
galaxy.queue_worker INFO 2020-10-06 16:53:19,935 Queuing async task rebuild_toolbox_search_index for galaxy-worker-99cd6f84d-drpwg.
galaxy.app INFO 2020-10-06 16:53:20,069 Galaxy app startup finished (13154.525 ms)
galaxy.web_stack INFO 2020-10-06 16:53:20,070 Galaxy server instance 'galaxy-worker-99cd6f84d-drpwg' is running
galaxy.queue_worker INFO 2020-10-06 16:53:20,080 Instance 'galaxy-worker-99cd6f84d-drpwg' received 'rebuild_toolbox_search_index' task, executing now.
galaxy.queue_worker DEBUG 2020-10-06 16:53:20,081 App is not a webapp, not building a search index
galaxy.web_stack.handlers INFO 2020-10-06 17:08:43,250 [p:14,w:1,m:0] [uWSGIWorker1Core1] (WorkflowInvocation[unflushed]) Handler '_default_' assigned using 'db-skip-locked' assignment method
select * from workflow_invocation order by create_time desc limit 20;
 id  |        create_time         |        update_time         | workflow_id |   state   | scheduler |       handler       |               uuid               | history_id 
-----+----------------------------+----------------------------+-------------+-----------+-----------+---------------------+----------------------------------+------------
 332 | 2020-10-06 17:08:43.278402 | 2020-10-06 17:08:43.278406 |         109 |           |           |                     | 9672f30407f611eb98d5a25ce9e9badb |         97
 331 | 2020-10-06 17:08:43.277707 | 2020-10-06 17:08:43.27771  |         107 |           |           |                     | 967140a407f611eb98d5a25ce9e9badb |         97
 330 | 2020-10-06 17:08:43.276959 | 2020-10-06 17:08:43.276963 |         106 |           |           |                     | 966f5ba407f611eb98d5a25ce9e9badb |         97
 329 | 2020-10-06 17:08:43.275917 | 2020-10-06 17:08:43.275924 |         105 |           |           |                     | 966e482c07f611eb98d5a25ce9e9badb |         97
 328 | 2020-10-06 17:08:43.266406 | 2020-10-06 17:08:43.266412 |         114 | new       | core      | _default_           | 966d183007f611eb98d5a25ce9e9badb |         97

Tried adding --attach-to-pool=workflow-schedulers, to no effect.
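
For reference, the explicit per-scheduler configuration mentioned earlier in this thread would live in workflow_schedulers_conf.xml. A minimal sketch follows, assuming the assign_with attribute on <handlers> named in the galaxy.web_stack.handlers log line above; the surrounding element structure is an assumption, so verify it against the sample config shipped with Galaxy:

<?xml version="1.0"?>
<workflow_schedulers default="core">
    <core id="core" />
    <!-- assumed structure: assign_with as referenced in the
         log message quoted above -->
    <handlers assign_with="db-skip-locked" />
</workflow_schedulers>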

@pcm32 (Member) commented Feb 19, 2021

Did this work for you in the end, @innovate-invent? I think I use it in a similar way to you (not specifying handlers in the job conf and just joining them to the pool). I currently have to use a trick with gxadmin to assign those workflows to handlers, but I was hoping to get away from that trick.

@innovate-invent (Contributor) commented Feb 20, 2021

My job handlers run with --attach-to-pool=job-handlers and I just use the default workflow scheduler configs otherwise. The job handlers are configured to use db-skip-locked. There was an issue with separating the job handlers and workflow invocation handlers related to the maximum_workflow_jobs_per_scheduling_iteration config. I don't know if that was ever resolved.

Edit: Going back through the issues, it looks like I got this to work with #10371 but never left it enabled for some reason.

@mvdbeek deleted the db_skip_locked branch on March 1, 2021 08:44
Linked issue: Workflows not being scheduled when workflow handlers set to db-skip-locked