Workflows not being scheduled when workflow handlers set to db-skip-locked #8209

Closed
afgane opened this issue Jun 20, 2019 · 19 comments · Fixed by #10177

@afgane
Contributor

afgane commented Jun 20, 2019

Running web (calling uwsgi directly) and job handlers (using scripts/galaxy-main) separately and having the following config/workflow_schedulers.xml (or not having that file at all), leads to workflow invocations never being scheduled (they remain in new state):

<?xml version="1.0"?>
<workflow_schedulers default="core">
    <core id="core" />
    <handlers assign_with="db-skip-locked" />
</workflow_schedulers>

Changing the handlers assignment method as follows triggers job scheduling.

   <handlers assign_with="db-self" />

Poring over the logs with @natefoo, everything looks like it should, including the database values in the workflow_invocation table (the handler is _default_), so the missing link is somewhere deeper.

@hexylena
Member

Ok, so, same issue I saw @natefoo. Cool, glad it's a bug and not just our weird setup.

@hexylena
Member

@afgane for an interim solution I just have the following bash script running which makes things work well enough.

#!/bin/bash
while true; do
        psql -c "update workflow_invocation set handler = 'handler_main_' || (random() * 10)::integer where state = 'new' and handler = '_default_';" | grep -v 'UPDATE 0'
        sleep 1;
done
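
One detail of the query above is worth checking before reusing it: in PostgreSQL, casting a float to integer rounds rather than truncates, so `(random() * 10)::integer` can yield any suffix from 0 through 10, which is eleven values. The sketch below mirrors that arithmetic in Python. The function names are hypothetical, and the assumption that only `handler_main_0` to `handler_main_9` exist is mine, not something stated in this thread.

```python
import math

HANDLER_PREFIX = "handler_main_"


def suffix_rounded(r: float, n: int = 10) -> int:
    """Mimic (r * n)::integer, which rounds to nearest: range 0..n inclusive."""
    return round(r * n)


def suffix_floored(r: float, n: int = 10) -> int:
    """Mimic floor(r * n)::integer: range 0..n-1 for r in [0, 1)."""
    return math.floor(r * n)


if __name__ == "__main__":
    # Compare both variants for a few sample values of random().
    for r in (0.0, 0.04, 0.5, 0.96):
        rounded = HANDLER_PREFIX + str(suffix_rounded(r))
        floored = HANDLER_PREFIX + str(suffix_floored(r))
        print(f"random()={r}: rounded -> {rounded}, floored -> {floored}")
```

If only ten handlers are defined, writing `floor(random() * 10)::integer` in the SQL would keep the suffix in the 0 to 9 range.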

@hexylena
Member

I've added this to gxadmin

@bgruening
Member

Bumping this issue again. It seems to be a severe bug, or we need to pull this option from the documentation.

@pcm32
Member

pcm32 commented Nov 3, 2019

Is it safe to use db-self for workflows while using db-skip-locked for normal job handlers in a multi-master webless setup with dynamic handlers? I came across this issue on the current tip of release_19.05. Thanks

@pcm32
Member

pcm32 commented Nov 3, 2019

Apparently the same happens when using db-transaction-isolation instead of db-skip-locked :-(.

@pcm32
Member

pcm32 commented Nov 3, 2019

Ok, I'm using the gxadmin call. However, I wonder: if this is being called from multiple hosts at the same time (because handler prefixes are host dependent), so that the workflows are balanced across handlers on different hosts, is this transactionally safe from the database point of view? Thanks!

@natefoo
Member

natefoo commented Nov 5, 2019

I thought I did a better job documenting the deal with workflow schedulers and assignment methods but the only thing I see is what's in the sample config.

db-skip-locked and db-transaction-isolation are both supposed to work but are discouraged because they can't guarantee serial workflow execution in a single history. Using either mules or db-preassign with statically configured <handlers> is preferred for that reason. If you can run a single static workflow scheduler with --server-name=whatever and <handlers><handler id="whatever"/></handlers> in your workflow schedulers config, that should solve the issue.

That said, this bug ought to be addressed, and I'll try to find the time this week to look at it.
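
To illustrate the trade-off described above, here is a toy in-memory model. It is not Galaxy's code. `grab` stands in for handlers polling for unassigned invocations, as with db-skip-locked, and `preassign` stands in for choosing a handler when the invocation is created. Keying the choice on the history id is an assumption made for this sketch, to show how pre-assignment could keep one history on one handler. All names are hypothetical.

```python
from dataclasses import dataclass

UNASSIGNED = "_default_"


@dataclass
class Invocation:
    id: int
    history_id: int
    handler: str = UNASSIGNED


def grab(invocations, handler_id, max_grab=1):
    """Polling model: a handler claims up to max_grab unassigned invocations."""
    claimed = [inv for inv in invocations if inv.handler == UNASSIGNED][:max_grab]
    for inv in claimed:
        inv.handler = handler_id
    return claimed


def preassign(invocation, handler_ids):
    """Pre-assignment model: pick the handler from the history id (assumption)."""
    invocation.handler = handler_ids[invocation.history_id % len(handler_ids)]
    return invocation.handler


def handlers_per_history(invocations):
    """Map each history id to the set of handlers that own its invocations."""
    result = {}
    for inv in invocations:
        result.setdefault(inv.history_id, set()).add(inv.handler)
    return result


if __name__ == "__main__":
    handler_ids = ["scheduler_0", "scheduler_1"]

    # Two invocations in the same history, claimed by whichever handler polls next.
    polled = [Invocation(1, history_id=7), Invocation(2, history_id=7)]
    grab(polled, "scheduler_0")
    grab(polled, "scheduler_1")
    print("polling:", handlers_per_history(polled))

    # The same two invocations, assigned at creation time.
    pre = [Invocation(1, history_id=7), Invocation(2, history_id=7)]
    for inv in pre:
        preassign(inv, handler_ids)
    print("preassign:", handlers_per_history(pre))
```

In the polling model the two invocations of history 7 end up on different handlers, so nothing orders them relative to each other. In the pre-assignment model they share one handler.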

@scholtalbers
Contributor

I walked into this trap when doing the update to 20.01 and following the documentation ☹️

The preferred method depends on your deployment strategy:

    uWSGI + Mules - uWSGI Mule Messaging is preferred.
    uWSGI + Webless - Either Database SKIP LOCKED or Database Transaction Isolation is preferred.
    uWSGI + Hybrid - Either Database SKIP LOCKED or Database Transaction Isolation is preferred. If your mule and webless handlers are in non-overlapping pools (i.e. tags, or untagged), you can alternatively use both uWSGI Mule Messaging followed by either Database SKIP LOCKED or Database Transaction Isolation. If pools overlap, using uWSGI Mule Messaging would prevent any non-mule handlers in that pool from being assigned jobs.

@hexylena
Member

hexylena commented May 4, 2020

@natefoo

So then with a job conf like this:

        <handlers assign_with="db-skip-locked" max_grab="8">
                <handler id="handler_main_0"/>
                <handler id="handler_main_1"/>
                <handler id="handler_main_2"/>
                <handler id="handler_main_3"/>
                <handler id="handler_main_4"/>
                <handler id="handler_main_5"/>
                <handler id="handler_main_6"/>
                <handler id="handler_main_7"/>
        </handlers>

is this wrong? Should there be only a single workflow scheduler? Then it works?

<?xml version="1.0"?>
<workflow_schedulers default="core">
    <core id="core" />
    <handlers default="schedulers">
        <handler id="workflow_scheduler_main_0" tags="schedulers"/>
        <handler id="workflow_scheduler_main_1" tags="schedulers"/>
    </handlers>
</workflow_schedulers>

still an issue for EU

@bgruening
Member

@natefoo do you have any ideas here?

Is the following a valid and recommended config?

<?xml version="1.0"?>

<workflow_schedulers default="core">
    <core id="core" />
    <handlers assign_with="db-self" default="schedulers">
        <handler id="workflow_scheduler_main_0" tags="schedulers"/>
        <handler id="workflow_scheduler_main_1" tags="schedulers"/>
    </handlers>
</workflow_schedulers>

@natefoo
Member

natefoo commented May 26, 2020

Use assign_with="db-preassign" rather than db-self. You can use multiple workflow schedulers (.org does).

@hexylena we figured out in Barcelona what the issue was but I am not sure if we recorded that revelation - do you recall? Is the issue that a db-skip-locked job conf without a workflow scheduler conf is broken?

@natefoo
Member

natefoo commented May 26, 2020

Here is .org's workflow scheduler conf, job conf handlers section (individual handlers are only defined here for plugin loading restrictions), and the workflow scheduler and handler supervisor configs.

@hexylena
Member

Oh gosh, that revelation is lost to the 11 weeks of quarantine I've been in since Barcelona, sorry @natefoo.

So your workflow schedulers conf matches ours, i.e. we have not specified db-self. But it works for you? Likewise, we're using db-skip-locked in our job conf handlers section. So our configuration matches yours currently. Do we need to set it explicitly, like @bgruening did in usegalaxy-eu/infrastructure-playbook#187? Are the workflow handlers detecting that the job handlers are db-skip-locked and choosing to do the same? Which wouldn't make sense, given your configuration.

@natefoo
Member

natefoo commented May 27, 2020

The code should default it to db-preassign. It doesn't hurt to be explicit, but I only saw that my workflow scheduler assignment method wasn't set after I made that suggestion, which I made because .eu's was set to db-self.

One thing I mentioned to Björn on Gitter yesterday - the web workers (uwsgi) must have the same workflow schedulers conf as the workflow schedulers and job handlers. Just as they do with jobs, the web workers create the invocation and set the handler column according to the assignment method and handler definitions, which they can only do properly if they have the workflow scheduler config.

@innovate-invent
Contributor

innovate-invent commented Sep 2, 2020

I am not sure I understand how everything is working but I get the following exception:

galaxy.workflow.run_request INFO 2020-09-02 06:25:38,301 [p:9,w:1,m:0] [uWSGIWorker1Core1] Creating a step_state for step.id 953
galaxy.workflow.run_request INFO 2020-09-02 06:25:38,302 [p:9,w:1,m:0] [uWSGIWorker1Core1] Creating a step_state for step.id 954
galaxy.workflow.run_request INFO 2020-09-02 06:25:38,302 [p:9,w:1,m:0] [uWSGIWorker1Core1] Creating a step_state for step.id 955
galaxy.web_stack.handlers ERROR 2020-09-02 06:25:38,302 [p:9,w:1,m:0] [uWSGIWorker1Core1] Caught exception in handler assignment method: db-preassign
Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/web_stack/handlers.py", line 447, in assign_handler
    handler = self._handler_assignment_method_methods[method](
  File "/srv/galaxy/lib/galaxy/web_stack/handlers.py", line 370, in _assign_db_preassign_handler
    handler_id = self._get_single_item(self.handlers[handler], index=index)
KeyError: '_default_'
galaxy.web_stack.handlers ERROR 2020-09-02 06:25:38,303 [p:9,w:1,m:0] [uWSGIWorker1Core1] (WorkflowInvocation[unflushed]) Failed to select handler

Workflow schedulers config:

<?xml version="1.0"?>
<workflow_schedulers default="core">
  <core id="core" />
  <handlers assign_with="db-preassign" />
</workflow_schedulers>

job_conf.xml:

...
<handlers assign_with="db-skip-locked" />
...

I have a uwsgi + webless setup.

@bgruening
Member

Use only db-preassign like:

<workflow_schedulers default="core">
    <core id="core" />
    <handlers assign_with="db-preassign" default="schedulers">
        <handler id="workflow_scheduler_main_0" tags="schedulers"/>
        <handler id="workflow_scheduler_main_1" tags="schedulers"/>
        <handler id="workflow_scheduler_main_2" tags="schedulers"/>
        <handler id="workflow_scheduler_main_3" tags="schedulers"/>
    </handlers>
</workflow_schedulers>
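
The KeyError in the traceback above is consistent with pre-assignment looking up a pool of handlers by tag and finding none defined. The following is a simplified sketch of that lookup, not the actual `galaxy.web_stack.handlers` code. The function name, the dictionary layout, and the index-modulo selection are assumptions made for illustration.

```python
import random

DEFAULT_TAG = "_default_"


def assign_preassign(handlers_by_tag, tag=DEFAULT_TAG, index=None):
    """Pick one handler id from the pool configured under the given tag."""
    # Raises KeyError when no handler is configured under this tag.
    pool = handlers_by_tag[tag]
    if index is not None:
        # Deterministic choice, e.g. to keep related work on one handler.
        return pool[index % len(pool)]
    return random.choice(pool)


if __name__ == "__main__":
    # Models <handlers assign_with="db-preassign" /> with no <handler> children.
    empty = {}
    try:
        assign_preassign(empty)
    except KeyError as exc:
        print("Failed to select handler, missing pool:", exc)

    # Models the explicit handler list shown above.
    configured = {
        "schedulers": [
            "workflow_scheduler_main_0",
            "workflow_scheduler_main_1",
            "workflow_scheduler_main_2",
            "workflow_scheduler_main_3",
        ]
    }
    print(assign_preassign(configured, tag="schedulers", index=5))
```

The point of the sketch is only that pre-assignment needs a non-empty, known list of handlers at the moment the invocation is created, which is why an empty `<handlers>` element cannot work with this method.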

@innovate-invent
Contributor

innovate-invent commented Sep 2, 2020

How do I get this to work without listing the handlers? My handlers autoscale and I can't explicitly declare them.
Do workflow schedulers also handle sending the individual jobs to their destinations? Or do they simply manipulate the database to populate the workflow invocation?
This is a significant issue if I can't scale the workers.

@mvdbeek
Member

mvdbeek commented Sep 2, 2020

I don't think that's a scenario we support (#8209 (comment)). The only mode(s) that would work without knowing the available workers are db-skip-locked and db-transaction-isolation (and uwsgi-mule-messaging if running on the same host, in theory, but with autoscaling I guess that won't help you). In db-skip-locked and db-transaction-isolation mode handlers poll for new invocations. But as you discovered that doesn't work for workflow schedulers at this moment. I know @natefoo explained why that is earlier this year, but I have forgotten again.

If you can run a single static workflow scheduler with --server-name=whatever and <handlers><handler id="whatever"/></handlers> in your workflow schedulers config, that should solve the issue.

That would be one solution that should allow you to scale workflow handlers between 0 and 1.

mvdbeek added a commit to mvdbeek/galaxy that referenced this issue Sep 2, 2020
and db-transaction-isolation.
Closes galaxyproject#8209.

Needs some tests and the grabbing logic should be its own class.