Add support for abort notification #1172

onlyann · 2024-08-25T14:46:20Z

Successful PR Checklist:

Tests
- (not applicable?)
Documentation
- (not applicable?)

Context

So far, the only use of LISTEN/NOTIFY has been to let the worker know that a new job is ready to be processed.

This extends listen_notify to handle different types of notification, distinguished by the payload:

- job_inserted: the existing notification, sent when a new job is ready to be processed.
- abort_job_requested: sent when a running job is requested to abort.

Instead of querying the database every time a job calls should_abort, the worker uses the same mechanism as for detecting new jobs: a combination of Postgres notifications and polling.
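As a rough illustration of dispatching on the payload, here is a minimal sketch. It assumes the payload is a JSON object with a `type` field and, for abort requests, a `job_id` field; that shape, the function name `dispatch_notification` and both callbacks are hypothetical and not taken from the PR.

```python
import json
from typing import Callable


def dispatch_notification(
    payload: str,
    on_job_inserted: Callable[[], None],
    on_abort_requested: Callable[[int], None],
) -> str:
    """Route one notification to the matching handler and return its type.

    Assumed payload shape (not confirmed by the PR):
    {"type": "job_inserted"} or {"type": "abort_job_requested", "job_id": 42}
    """
    notification = json.loads(payload)
    kind = notification["type"]
    if kind == "job_inserted":
        # A new job is available: wake up the job-fetching loop.
        on_job_inserted()
    elif kind == "abort_job_requested":
        # A running job was asked to abort: record its id.
        on_abort_requested(notification["job_id"])
    else:
        raise ValueError(f"Unknown notification type: {kind!r}")
    return kind
```

Keeping the parsing in one place means the listening connection only has to forward the raw payload string.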

Other changes

The edge case of retrying a job that has a pending abort request is now handled at the database level instead of in the worker.

Next steps

Once this change lands in v3, it will help with cancellation during the shutdown event and #1084

github-actions · 2024-08-25T14:49:16Z

Coverage report

Files covered: procrastinate/connector.py, procrastinate/job_context.py, procrastinate/jobs.py, procrastinate/manager.py, procrastinate/psycopg_connector.py, procrastinate/testing.py, procrastinate/utils.py, procrastinate/worker.py, procrastinate/contrib/aiopg/aiopg_connector.py.
Lines missing: procrastinate/worker.py 363-364.

This report was generated by python-coverage-comment-action

medihack · 2024-08-26T13:38:58Z

Excellent. I will have a look at it tomorrow 🙂. I think it's a good idea to handle more stuff at the database level (as long as we can still test it).

onlyann · 2024-08-26T20:29:52Z

> Excellent. I will have a look at it tomorrow 🙂. I think it's a good idea to handle more stuff at the database level (as long as we can still test it).

It also removes a race condition present in the previous approach.

Glad you already had an acceptance test written against it, or I would surely have missed it.

procrastinate/sql/schema.sql

medihack · 2024-08-27T20:01:41Z

> > Excellent. I will have a look at it tomorrow 🙂. I think it's a good idea to handle more stuff at the database level (as long as we can still test it).
>
> It also removes a race condition present in the previous approach.
>
> Glad you already had an acceptance test written against it, or I would surely have missed it.

Cool, let me know when it's ready for review (it is still a draft currently, and some connectors are excluded from tests/acceptance/test_async.py). But it looks pretty good at first glance.

medihack · 2024-08-30T17:05:01Z

docs/howto/advanced/cancellation.md

-database and might flood the database. Ensure you do it only sometimes and
-not from too many parallel tasks.
-:::
+The worker receives a Postgres notification every time a running job is requested to abort, unless `listen_notify=False`.


How about adding something like "Internally, the worker receives ..."? It would make it clearer that this is nothing the user has to handle but something that works behind the curtain.

We should also update the docstrings of listen_notify and polling_interval to say that they are not just for requesting jobs.

So, should_abort() is now used for both async and sync tasks. Correct? Didn't you originally plan to cancel the async tasks automatically on abort? (The thing I was a bit ambivalent about.) Just wondering.

> So, should_abort() is now used for both async and sync tasks. Correct?

Correct.

> Didn't you originally plan to cancel the async tasks automatically on abort?

Yes. I aim to keep this PR focused on notification support and open another one about issuing an asyncio cancel on async tasks.

I amended the documentation.
I also noticed I need to rewrite some of the discussion section around the worker. I will however leave that part for another PR as it is not related to abort notification.

Sure, but maybe we can have updated docstrings for listen_notify and polling_interval.

Currently, it says for polling_interval:

polling_interval : ``float`` Indicates the maximum duration (in seconds) the worker waits between each database job poll. Raising this parameter can lower the rate at which the worker makes queries to the database for requesting jobs. (defaults to 5.0)

How about:

polling_interval : ``float`` Indicates the maximum duration (in seconds) the worker waits between each database job poll. Raising this parameter can lower the rate at which the worker makes queries to the database for requesting jobs or checking for job abortion requests. (defaults to 5.0)

And for listen_notify it says:

listen_notify : ``bool`` If ``True``, the worker will dedicate a connection from the pool to listening to database events, notifying of newly available jobs. If ``False``, the worker will just poll the database periodically (see ``polling_interval``). (defaults to ``True``)

How about:

listen_notify : ``bool`` If ``True``, the worker will dedicate a connection from the pool to listening to database events, notifying of newly available jobs, or job abortion requests. If ``False``, the worker will just poll the database periodically (see ``polling_interval``). (defaults to ``True``)

(While we're at it,

> If ``False``, the worker will just poll the database periodically

might be misleading, as it implies that polling doesn't happen with ``True``.
Maybe:

If False, the worker won't listen to database events, so it will only
know about new jobs & abortion requests through the polling mechanism

?)
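To make the interplay between the two settings concrete, here is a minimal sketch of a wait step that wakes up either on a notification or when the polling interval elapses. The function name `wait_for_activity` and the use of an `asyncio.Event` are assumptions for illustration, not the worker's actual code.

```python
import asyncio


async def wait_for_activity(notified: asyncio.Event, polling_interval: float) -> bool:
    """Wait until a notification arrives or the polling interval elapses.

    Returns True when woken by a notification, False on timeout. With
    listen_notify disabled, nothing ever sets the event, so every wake-up
    is a timeout, i.e. pure polling.
    """
    try:
        await asyncio.wait_for(notified.wait(), timeout=polling_interval)
    except asyncio.TimeoutError:
        return False
    # Reset the event so the next call waits for a fresh notification.
    notified.clear()
    return True
```

In both cases the caller queries the database afterwards; the notification only shortens the wait, which is why polling still happens when listen_notify is True.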

I pushed an update based on your feedback.
Let me know what you think.

What does the bullet point "Fetching updates for running jobs" mean? Does the polling do more than check for new jobs and abortion requests?

procrastinate/sql/schema.sql

procrastinate/worker.py

ewjoachim

Well, that was awesome :) Still plenty of comments, but definitely going the right way. Thank you!

procrastinate/job_context.py

ewjoachim · 2024-08-31T13:34:02Z

procrastinate/job_context.py

-    async def should_abort_async(self) -> bool:
-        assert self.job.id
-        job_id = self.job.id
-        return await self.app.job_manager.get_job_abort_requested_async(job_id)


I think it makes sense that should_abort is able to trigger something, especially if listen_notify is False. Consequently, I think it makes sense if we keep this function, even if it currently only returns static info that doesn't need async. I'd rather the async users keep using the async method rather than have them change to the sync method, to later re-change to the async method if we change something.

(EDIT: well, the polling_interval actually handles the listen_notify=False scenario but... I still think it's worth using a dedicated async func for now. Probably worth a comment too)

What about extending the Job class to have an abort_requested attribute instead? This way, the existing list_job operation covers it, and there is no need for a distinct operation for every piece of state.

Seems to make sense. I agree!

Ah, I might not have understood your proposal.

I think it's worth keeping a should_abort_async on the context, simply because we advertised it before and it's really not going to get in the way of maintenance, independently of how we make the info available on the job. It's literally:

async def should_abort_async(self):
    # As it stands, `should_abort` doesn't do I/O.
    return self.should_abort()

I don't see a case where this gets in our way.

Given how short lived that function is, how simple it is to migrate to the non async version and the opportunity to have breaking changes as part of v3, it seems reasonable to remove it.

Ok, won't fight you on this.
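For readers following the thread, here is a minimal sketch of why should_abort no longer needs I/O: the worker keeps a set of job ids to abort up to date in the background, and the context only reads from it. `AbortRegistry` and `SketchJobContext` are hypothetical names; the async wrapper is the thin shim discussed above, shown for comparison even though the thread settles on removing it.

```python
from dataclasses import dataclass, field


@dataclass
class AbortRegistry:
    """Hypothetical holder for the ids the worker has learned should abort."""

    job_ids_to_abort: set[int] = field(default_factory=set)


@dataclass
class SketchJobContext:
    """Hypothetical stand-in for the job context handed to a task."""

    job_id: int
    registry: AbortRegistry

    def should_abort(self) -> bool:
        # Pure in-memory lookup: notifications and polling keep the set fresh.
        return self.job_id in self.registry.job_ids_to_abort

    async def should_abort_async(self) -> bool:
        # Thin wrapper kept only for API compatibility; it performs no I/O.
        return self.should_abort()
```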

procrastinate/jobs.py

ewjoachim · 2024-08-31T13:38:54Z

procrastinate/manager.py

@@ -12,6 +12,12 @@
 QUEUEING_LOCK_CONSTRAINT = "procrastinate_jobs_queueing_lock_idx"


+class Notify(Protocol):


You define 2 class Notify(Protocol): (the other one is in connector.py). Is this intended? It looks like a remnant of refactoring.

If it's intended, they should not be named the same, and potentially be defined next to each other so it's easier to tell the difference (though circular imports will be annoying)

Also, I believe that a more descriptive name than Notify could help understand the code better. OnNotificationCallback ?

They are not exactly the same.
The one in connector has payload: string because the connector doesn't concern itself with how to parse the payload while the manager handles parsing the payload to expose a typed notification.

I'll change the name of the manager one to NotificationCallback.
FYI, Notify is the name used by psycopg
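The split described above could look roughly like the sketch below: a connector-level callback that receives the raw payload string, and a manager-level callback that receives a parsed, typed notification. The names `ConnectorNotify`, `Notification` and `adapt`, the keyword arguments, and the JSON payload shape are all assumptions for illustration; only `NotificationCallback` is a name mentioned in the thread.

```python
import json
from dataclasses import dataclass
from typing import Awaitable, Protocol


class ConnectorNotify(Protocol):
    """Connector-level callback: the payload stays an opaque string."""

    def __call__(self, *, channel: str, payload: str) -> Awaitable[None]: ...


@dataclass
class Notification:
    """Typed notification exposed by the manager (assumed fields)."""

    type: str
    job_id: int


class NotificationCallback(Protocol):
    """Manager-level callback: receives an already parsed notification."""

    def __call__(
        self, *, channel: str, notification: Notification
    ) -> Awaitable[None]: ...


def adapt(callback: NotificationCallback) -> ConnectorNotify:
    """Wrap a typed callback so it can be handed to the connector."""

    async def on_notify(*, channel: str, payload: str) -> None:
        data = json.loads(payload)
        notification = Notification(type=data["type"], job_id=data["job_id"])
        await callback(channel=channel, notification=notification)

    return on_notify
```

With this layering the connector never needs to know the payload format, which matches the reasoning given for keeping two distinct protocols.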

ewjoachim · 2024-08-31T13:49:32Z

procrastinate/sql/schema.sql

-        priority = COALESCE(new_priority, priority),
-        queue_name = COALESCE(new_queue_name, queue_name),
-        lock = COALESCE(new_lock, lock)
+    SET status = CASE


This reads like it would make sense to turn the "CASE" around:

CASE abort_requested
    WHEN TRUE THEN UPDATE procrastinate_jobs SET ...
    WHEN FALSE THEN UPDATE procrastinate_jobs SET ...
END

I am not following the example.
Where does abort_requested come from when it is outside the UPDATE statement?

Hm you're right. Though maybe doing a quick SELECT to get the abort_requested might be enough. It's not very important but the current code is quite hard to read and I believe will be harder to maintain.

I don't know if something like the below is much better. I am happy to change it if you feel it is more maintainable, or maybe this could be simplified even further?

CREATE OR REPLACE FUNCTION procrastinate_retry_job(
    job_id bigint,
    retry_at timestamp with time zone,
    new_priority integer,
    new_queue_name varchar,
    new_lock varchar
)
    RETURNS void
    LANGUAGE plpgsql
AS $$
DECLARE
    _job_id bigint;
    _abort_requested boolean;
BEGIN
    -- First, check if the job exists and get its abort_requested status
    SELECT id, abort_requested INTO _job_id, _abort_requested
    FROM procrastinate_jobs
    WHERE id = job_id AND status = 'doing';

    IF _job_id IS NULL THEN
        RAISE EXCEPTION 'Job was not found or not in "doing" status (job id: %)', job_id;
    END IF;

    -- Update the job based on abort_requested status
    UPDATE procrastinate_jobs
    SET (status, attempts, scheduled_at, priority, queue_name, lock) =
        CASE
            WHEN NOT _abort_requested THEN
                ('todo'::procrastinate_job_status, attempts + 1, retry_at, COALESCE(new_priority, priority), COALESCE(new_queue_name, queue_name), COALESCE(new_lock, lock))
            ELSE
                ('failed'::procrastinate_job_status, attempts, scheduled_at, priority, queue_name, lock)
        END
    WHERE id = job_id;
END;
$$;
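The rule being discussed can be modelled in a few lines of Python. This is only a sketch of the intended behaviour (a retry turns into a failure when an abort was requested); `JobRow` and `retry_job` are hypothetical names, and the priority, queue and lock overrides are left out for brevity.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class JobRow:
    """Hypothetical in-memory stand-in for a row of procrastinate_jobs."""

    status: str
    attempts: int
    scheduled_at: Optional[datetime]
    abort_requested: bool


def retry_job(job: JobRow, retry_at: datetime) -> JobRow:
    """Return the row as it should look after a retry request."""
    if job.status != "doing":
        raise ValueError('Job was not found or not in "doing" status')
    if job.abort_requested:
        # An abort was requested while the job ran: fail it instead of retrying.
        return replace(job, status="failed")
    # Normal retry: back to "todo", one more attempt, new schedule.
    return replace(
        job, status="todo", attempts=job.attempts + 1, scheduled_at=retry_at
    )
```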

ewjoachim · 2024-08-31T15:01:37Z

procrastinate/worker.py

+                running_job_ids = {
+                    c.job.id for c in self._running_jobs.values() if c.job.id
+                }
+
+                self._job_ids_to_abort.update(running_job_ids.intersection(job_ids))


I'm very slightly worried that there are 2 independent places where we react to changes in the abort job list. I think I'd feel slightly better if there was a single function (well, coroutine) handling this.

That said, while the work we do on self._job_ids_to_abort is seemingly identical (replace a set vs update a set), it will be different when we actually abort the tasks, because we'll need to identify what jobs should be aborted which we haven't aborted already.

So all in all... Maybe we'll see how to refactor when we add the abortion itself. But I'd prefer if we don't have 2 very related pieces of code that need to do roughly the same thing sitting hundreds of lines apart.

Another solution could be having a queue and a coroutine awaiting on that queue, but that doesn't really solve the issue that we're going to do slightly different things on the notification vs the poll.

Trying to put a brain cell on that. It's not blocking for merging this PR.

I refactored it to consolidate the logic.
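One possible shape for such a consolidation, as a sketch only: both the notification path and the polling path go through a single helper, differing only in whether the set is updated or replaced. `AbortTracker` and its method names are hypothetical and not the code that was merged.

```python
from typing import Iterable, Set


class AbortTracker:
    """Hypothetical single place where abort requests are recorded."""

    def __init__(self) -> None:
        self.running_job_ids: Set[int] = set()
        self.job_ids_to_abort: Set[int] = set()

    def on_notification(self, job_id: int) -> None:
        # A notification concerns one job: add it to what is already known.
        self._merge([job_id], replace=False)

    def on_poll(self, job_ids: Iterable[int]) -> None:
        # A poll returns the full list: it supersedes what is already known.
        self._merge(job_ids, replace=True)

    def _merge(self, job_ids: Iterable[int], *, replace: bool) -> None:
        # Only jobs currently running on this worker are relevant.
        relevant = self.running_job_ids.intersection(job_ids)
        if replace:
            self.job_ids_to_abort = relevant
        else:
            self.job_ids_to_abort.update(relevant)
```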

ewjoachim · 2024-08-31T15:02:44Z

procrastinate/worker.py

-        side_tasks = [asyncio.create_task(self.periodic_deferrer(), name="deferrer")]
-        if self.wait and self.listen_notify:
+        side_tasks = [
+            asyncio.create_task(self.periodic_deferrer(), name="deferrer"),


Should periodic_deferrer start with a _ too? Is there a rationale for what gets a _ and what doesn't, on a class that's itself internal?

Not sure about Python. Coming from a different background, I see _ being used when it is not meant to be used outside the class, whether the class is internal or not.
Changing to _periodic_deferrer but happy to go with whatever convention you prefer here.

ewjoachim · 2024-08-31T15:07:58Z

procrastinate/worker.py

-        if self.wait and self.listen_notify:
+        side_tasks = [
+            asyncio.create_task(self.periodic_deferrer(), name="deferrer"),
+            asyncio.create_task(self._poll_jobs_to_abort(), name="poll_jobs_to_abort"),


Given the 2 polling operations are independent, should the intervals be controlled by the same variable ? (can be handled in a different PR)

I thought about it too. We can expose 2 different settings. Happy for it to be a separate PR.
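If the two intervals were split, the side tasks might be wired up roughly as follows. This is a sketch under assumptions: `poll_forever`, `run_side_tasks` and the `fetch_interval` / `abort_interval` parameters are hypothetical and do not exist in this PR, which uses a single polling_interval for both.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_forever(action: Callable[[], Awaitable[None]], interval: float) -> None:
    """Run `action` repeatedly, sleeping `interval` seconds between runs."""
    while True:
        await action()
        await asyncio.sleep(interval)


async def run_side_tasks(
    fetch_jobs: Callable[[], Awaitable[None]],
    poll_aborts: Callable[[], Awaitable[None]],
    *,
    fetch_interval: float,
    abort_interval: float,
    duration: float,
) -> None:
    """Run both pollers with independent intervals for `duration` seconds."""
    tasks = [
        asyncio.create_task(poll_forever(fetch_jobs, fetch_interval), name="fetch_jobs"),
        asyncio.create_task(
            poll_forever(poll_aborts, abort_interval), name="poll_jobs_to_abort"
        ),
    ]
    await asyncio.sleep(duration)
    # Cancel the pollers and wait for them to finish unwinding.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
```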

tests/acceptance/test_async.py

ewjoachim

We're almost there. Once this should_abort_async question is settled, and unless I forgot something, we're good!

ewjoachim

One last suggestion on your last commit, and I'm good.

procrastinate/worker.py

ewjoachim · 2024-09-03T13:16:37Z

coverage is down but it's not your fault. I'll fix it and you'll need to rebase.

ewjoachim · 2024-09-04T08:51:28Z

#1179
PR should be merged quickly. Sorry for the mess (it's all GitHub's fault :D )

onlyann · 2024-09-05T11:33:02Z

> #1179 PR should be merged quickly. Sorry for the mess (it's all GitHub's fault :D )

No worries. I can rebase once it lands in the v3 branch.

ewjoachim · 2024-09-05T23:15:37Z

v3 is up to date :)

onlyann · 2024-09-06T11:15:09Z

Rebased and squashed
All 🟢 🥳

ewjoachim · 2024-09-06T11:57:10Z

And Merged !

github-actions bot added PR type: breaking 💥 Contains breaking changes PR type: feature ⭐️ Contains new features labels Aug 25, 2024

medihack reviewed Aug 27, 2024

onlyann marked this pull request as ready for review August 28, 2024 08:23

onlyann requested a review from a team as a code owner August 28, 2024 08:23

medihack reviewed Aug 30, 2024

medihack reviewed Aug 30, 2024

onlyann requested a review from medihack August 31, 2024 12:25

ewjoachim reviewed Aug 31, 2024

ewjoachim reviewed Sep 1, 2024

onlyann requested a review from ewjoachim September 3, 2024 11:57

ewjoachim approved these changes Sep 3, 2024

onlyann force-pushed the cancel-notification branch 5 times, most recently from a5f7619 to 1511f31 Compare September 6, 2024 11:10

add support for abort notification

2fd31bb

onlyann force-pushed the cancel-notification branch from 1511f31 to 2fd31bb Compare September 6, 2024 11:10

ewjoachim merged commit 5719510 into procrastinate-org:v3 Sep 6, 2024
11 checks passed

onlyann deleted the cancel-notification branch September 6, 2024 11:59
