Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use queue to exchange events between scheduler and evaluator #8424

Merged
merged 1 commit into from
Aug 20, 2024

Conversation

xjules
Copy link
Contributor

@xjules xjules commented Aug 8, 2024

Issue
Resolves #7721
Resolves #8485

Approach
This PR removes websockets based communication between scheduler and ensemble evaluator. It is replaced by two message queues, which are provided in LegacyEnsemble.evaluate function by the ensemble evaluator and passed over to scheduler:

  • scheduler_queue: responsible for providing CloudEvent (representing realization and driver events) for evaluator
  • manifest_queue: responsible for providing CloudEvent (representing notification manifest checksum Event) for scheduler

(Screenshot of new behavior in GUI if applicable)

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure tests pass locally (after every commit!)

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@xjules xjules self-assigned this Aug 8, 2024
@xjules xjules changed the title POC: WIP: use queue to send events to ee POC: use queue to exchange events between scheduler and evaluator Aug 9, 2024
@codecov-commenter
Copy link

codecov-commenter commented Aug 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.61%. Comparing base (7f0d9b8) to head (78a12ec).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8424   +/-   ##
=======================================
  Coverage   90.60%   90.61%           
=======================================
  Files         342      342           
  Lines       20904    20849   -55     
=======================================
- Hits        18941    18892   -49     
+ Misses       1963     1957    -6     
Flag Coverage Δ
gui-tests 75.84% <88.88%> (-0.04%) ⬇️
integration-tests 53.95% <92.59%> (-0.08%) ⬇️
unit-tests 68.62% <100.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@xjules xjules marked this pull request as ready for review August 14, 2024 12:07
@xjules xjules force-pushed the queue_msg branch 4 times, most recently from f922eac to f5e7c68 Compare August 15, 2024 13:57
@xjules xjules changed the title POC: use queue to exchange events between scheduler and evaluator Use queue to exchange events between scheduler and evaluator Aug 15, 2024
@xjules xjules added improvement Something nice to have, that will make life easier for developers or users or both. scheduler labels Aug 15, 2024
@xjules xjules force-pushed the queue_msg branch 7 times, most recently from 1482331 to ca4ea6a Compare August 15, 2024 15:19
Comment on lines 424 to 426
# setup message queues
self._ensemble.scheduler_queue = self._events
self._ensemble.manifest_queue = self._manifest_queue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved to the constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it can't as these queues are passed by the evaluator.

Comment on lines 123 to 124
self.scheduler_queue: Any = None
self.manifest_queue: Any = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need two queues, or can we have a general one and specify cloudevent types instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to two queues yes, but can update to use Queue[CloudEvent] instead 👍

event = await self._events.get()
await self._ee_queue.put(event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are only moving the events from one queue to the next one. Can we pass self._ee_queue to the job/driver instead of self._events? Then we could get rid of the _publisher task completely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it is required, since _ee_queue is a consumer a does not need to be set.

@jonathan-eq
Copy link
Contributor

So is the scheduler_queue for communication scheduler->evaluator, and manifest_queue for evaluator -> scheduler?

@xjules
Copy link
Contributor Author

xjules commented Aug 16, 2024

So is the scheduler_queue for communication scheduler->evaluator, and manifest_queue for evaluator -> scheduler?

Yes, it is described in the PR 😸

@@ -1,6 +1,7 @@
import asyncio
import logging
import pickle
import traceback
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug leftover

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

traceback is now used to log more info in case something crashed in the evaluator tasks.

@@ -311,9 +313,11 @@ async def forward_checksum(self, event: CloudEvent) -> None:
},
{event["run_path"]: event.data},
)
# currently clients still need to receive events via ws
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(what clients?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good one! Currently the only client using the EVTYPE_FORWARD_MODEL_CHECKSUM is in the test:
tests/unit_tests/ensemble_evaluator/test_scheduler.py::test_scheduler_receives_checksum_and_waits_for_disk_sync but both me and Dan don't know how to rewrite this without the websockets since monitor is still based on ws.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the forward_model_runner is still intending to use websocket for the communication of checksum events. Then it sounds odd to use the word "currently"?

Copy link
Contributor Author

@xjules xjules Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, particularly, it has nothing to do with job_dispatch, but it is kept only for the sake of test. As this one only informs the test (described above) that the checksum has been received, since scheduler has now only the manifest_queue to trigger verification of checksum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but not pretty 🤯

@@ -417,6 +421,9 @@ async def _start_running(self) -> None:
]
# now we wait for the server to actually start
await self._server_started.wait()
# setup message queues
self._ensemble.scheduler_queue = self._events
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename self._events to something more precise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I've kept the self._events but made Ensemble stateless wrt. message queues. They are sent as attribute in ensemble.evaluate function.

f"Exception in evaluator task {task.get_name()}: {task_exception}\n"
f"Traceback: {exc_traceback}"
)
)
raise task_exception
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tre traceback is not printed by this latter raise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no. Have been testing it (crashing tasks) and traceback needs to be there.

@@ -164,7 +163,7 @@ async def run(self, sem: asyncio.BoundedSemaphore, max_submit: int = 1) -> None:
break

if self.returncode.result() == 0:
if self._scheduler.wait_for_checksum():
if self._scheduler._manifest_queue:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at first sight, this looks like there can be a bug introduced when the manifest queue is not yet populated, then it will skip it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how the bug can be introduced?

Copy link
Contributor

@berland berland Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misread the code. Should we write if self._scheduler._manifest_queue is not None to be more explicit? (I read it like we were testing whether there is anything in the queue)

@@ -92,6 +85,9 @@ def __init__(
ee_token: Optional[str] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These ee_* class variables can be removed I suppose.

Copy link
Contributor Author

@xjules xjules Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but _update_jobs_json needs to be moved to LegacyEnsemble, which I thought could be a different PR?

f"{self._events.qsize()} items left unprocessed in the queue!"
)
if self._ee_queue:
# only join queue if there is a consumer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find "join" to be a strange word for what is happening here, so in a comment I would write something about waiting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do 👍

@xjules xjules force-pushed the queue_msg branch 2 times, most recently from 3c306b3 to f5ff029 Compare August 19, 2024 14:41
@xjules xjules requested a review from berland August 19, 2024 14:58
@@ -417,10 +421,14 @@ async def _start_running(self) -> None:
]
# now we wait for the server to actually start
await self._server_started.wait()
# let's run

# let's run with message queues set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that comment should be redundant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right 😄 that was more emotional comment than helpful really

@@ -208,79 +188,21 @@ def count_states(self) -> Dict[JobState, int]:
return counts

async def _checksum_consumer(self) -> None:
if not self._ee_uri:
if not self._manifest_queue:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if self._manifest_queue is None


async def _publisher(self) -> None:
if not self._ee_uri:
if not self._ensemble_evaluator_queue:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check for None-ness

logger.error(
f"{self._events.qsize()} items left unprocessed in the queue!"
)
if self._ensemble_evaluator_queue:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is not None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked and queue, even created but empty will be fine with this if.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for readability

scheduling_tasks = [
asyncio.create_task(self._publisher(), name="publisher_task"),
asyncio.create_task(
self._process_event_queue(), name="process_event_queue_task"
),
asyncio.create_task(self.driver.poll(), name="poll_task"),
asyncio.create_task(self._checksum_consumer(), name="consumer_task"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name="checksum_consumer_task"

It is replaced by two message queues, which are provided in LegacyEnsemble.evaluate function
by the ensemble evaluator and passed over to scheduler:
- ensemble_evaluator_queue: responsible for providing CloudEvent (representing realization and driver events) for evaluator
- manifest_queue: responsible for providing CloudEvent (representing notification manifest checksum Event) for scheduler
Copy link
Contributor

@berland berland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay!

@xjules xjules merged commit e79cd87 into equinor:main Aug 20, 2024
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Something nice to have, that will make life easier for developers or users or both.
Projects
None yet
4 participants