Terminate Experiment does not work as intended #7700

Closed
andreas-el opened this issue Apr 18, 2024 · 6 comments

andreas-el commented Apr 18, 2024

Running Poly-ert ES_MDA and clicking Terminate Experiment seems to do little.
The cursor changes to busy-state, but the realizations will all complete in the background.

I clicked Terminate as soon as I saw the first realization complete, so there should have been plenty of time to stop the other jobs.

This was run using bleeding, on a Mac with Python 3.11 and the local queue.


Testing this on RGS yielded:

Exception ignored in: <coroutine object Scheduler._process_event_queue at 0x7fb79b7c99c0>
Traceback (most recent call last):
  File "/prog/res/komodo/bleeding-py38-rhel7/root/lib64/python3.8/site-packages/ert/scheduler/scheduler.py", line 302, in _process_event_queue
    event = await self.driver.event_queue.get()
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/queues.py", line 165, in get
    getter.cancel()  # Just in case getter is not done yet.
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/base_events.py", line 719, in call_soon
    self._check_closed()
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

Something does seem to happen, but some realizations keep going regardless; see the attached screenshots.

Screenshot 2024-04-18 at 14 41 15

Screenshot 2024-04-18 at 14 41 42

andreas-el added the bug label Apr 18, 2024
andreas-el added this to SCOUT Apr 18, 2024
andreas-el changed the title from "Terminate Experiment on local queue seems to do little" to "Terminate Experiment does not work as intended" Apr 18, 2024
berland moved this to Todo in SCOUT Apr 18, 2024

berland commented Apr 19, 2024

I was able to reproduce this (it needs 100 realizations and QUEUE_OPTION MAX_RUNNING at 50). It might be related to #7704.

For this poly case it is probably difficult to cancel the experiment in time, as it evaluates very fast. After initiating termination the GUI seemed to hang, but it eventually caught up, and by then all four iterations had completed.

berland self-assigned this Apr 19, 2024
berland moved this from Todo to In Progress in SCOUT Apr 19, 2024

berland commented Apr 19, 2024

Even with the fix for #7704 included, the next iteration is still started after termination has been requested:

$ grep "requested" logs/*
(38) [havb@be-linrgsn001:~/projects/ert/test-data/poly_example] fix_local_driver_timed_kill$ grep -e "requested term" -e "requesting term" logs/*
logs/ee-log-poly-ert-2024-04-19T1302.txt:2024-04-19 13:03:46,504 - ert.ensemble_evaluator.tracker - MainThread - DEBUG - requesting termination...
logs/ee-log-poly-ert-2024-04-19T1302.txt:2024-04-19 13:03:46,614 - ert.ensemble_evaluator.tracker - MainThread - DEBUG - requested termination

Then in logs/jobqueue* we can find entries around the time of termination:

2024-04-19 13:03:47,955 - ert.scheduler.local_driver - LegacyEnsemble - INFO - Killing realization 94
2024-04-19 13:03:47,956 - ert.scheduler.local_driver - LegacyEnsemble - INFO - Killing realization 95
2024-04-19 13:03:47,957 - ert.scheduler.local_driver - LegacyEnsemble - INFO - Killing realization 96
2024-04-19 13:03:47,957 - ert.scheduler.local_driver - LegacyEnsemble - INFO - Killing realization 97
2024-04-19 13:03:48,139 - ert.scheduler.local_driver - LegacyEnsemble - INFO - All realization tasks finished
2024-04-19 13:03:48,140 - ert.scheduler.scheduler - LegacyEnsemble - DEBUG - scheduler cancelled, stopping jobs...
2024-04-19 13:03:48,140 - ert.scheduler - LegacyEnsemble - INFO - Experiment ran on QUEUESYSTEM: LOCAL
2024-04-19 13:03:53,432 - ert.scheduler - LegacyEnsemble - INFO - Experiment ran on ORCHESTRATOR: scheduler
2024-04-19 13:03:54,053 - ert.scheduler.local_driver - LegacyEnsemble - DEBUG - Submitting realization 0 as command '/private/havb/venv/38/bin/job_dispatch.py /private/havb/projects/ert/test-data/poly_example/poly_out/realization-0/iter-1'
2024-04-19 13:03:54,100 - ert.scheduler.local_driver - LegacyEnsemble - DEBUG - Submitting realization 1 as command '/private/havb/venv/38/bin/job_dispatch.py /private/havb/projects/ert/test-data/poly_example/poly_out/realization-1/iter-1'
2024-04-19 13:03:54,135 - ert.scheduler.local_driver - LegacyEnsemble - DEBUG - Submitting realization 2 as command '/private/havb/venv/38/bin/job_dispatch.py /private/havb/projects/ert/test-data/poly_example/poly_out/realization-2/iter-1'
2024-04-19 13:03:54,171 - ert.scheduler.local_driver - LegacyEnsemble - DEBUG - Submitting realization 3 as command '/private/havb/venv/38/bin/job_dispatch.py /private/havb/projects/ert/test-data/poly_example/poly_out/realization-3/iter-1'
2024-04-19 13:03:54,203 - ert.scheduler.local_driver - LegacyEnsemble - DEBUG - Submitting realization 4 as command '/private/havb/venv/38/bin/job_dispatch.py /private/havb/projects/ert/test-data/poly_example/poly_out/realization-4/iter-1'
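
The log above shows iter-1 being submitted a few seconds after the scheduler was cancelled. A minimal sketch of the kind of guard that would prevent this is given below. It is not ERT's actual code: `run_iteration`, `run_experiment` and the shared `cancel` event are hypothetical names chosen for illustration, and the timings are arbitrary.

```python
import asyncio


async def run_iteration(iteration: int, cancel: asyncio.Event, started: list) -> None:
    """Hypothetical stand-in for running one ensemble iteration."""
    started.append(iteration)
    try:
        # Pretend the realizations take up to a second; return early if
        # termination is requested in the meantime.
        await asyncio.wait_for(cancel.wait(), timeout=1.0)
    except asyncio.TimeoutError:
        pass  # the iteration ran to completion without being cancelled


async def run_experiment(num_iterations: int, cancel: asyncio.Event) -> list:
    started = []
    for iteration in range(num_iterations):
        # Guard at the iteration boundary: once termination has been
        # requested, do not submit the next iteration.
        if cancel.is_set():
            break
        await run_iteration(iteration, cancel, started)
    return started


async def main() -> list:
    cancel = asyncio.Event()
    experiment = asyncio.create_task(run_experiment(4, cancel))
    await asyncio.sleep(0.01)  # the user clicks "Terminate" during iter-0
    cancel.set()
    return await experiment


started_iterations = asyncio.run(main())
print(started_iterations)  # only iteration 0 was started
```

The point of the sketch is that the cancellation state outlives any single iteration, so the check still works when the request arrives just as one iteration finishes.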

berland commented Apr 19, 2024

Same problem with LSF and --enable-scheduler. Not reproducible with LSF and legacy jobqueue.

berland commented Apr 22, 2024

Adding a sleep to poly_eval.py makes the problem "go away". Seemingly we depend on the scheduler to be active for a "terminate experiment" message to go through.
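
One way to picture this dependency is the toy model below. It is an assumption about the mechanism, not a description of ERT internals: the `Experiment` and `Scheduler` classes and all method names are hypothetical. A request that is only forwarded to the currently running scheduler is lost when it arrives between iterations, whereas a flag stored on the experiment itself survives the gap.

```python
from typing import Optional


class Scheduler:
    """Hypothetical scheduler that only exists while an iteration runs."""

    def __init__(self) -> None:
        self.cancelled = False

    def cancel(self) -> None:
        self.cancelled = True


class Experiment:
    def __init__(self) -> None:
        self.active_scheduler: Optional[Scheduler] = None
        self.terminate_requested = False  # survives between iterations

    def terminate_fragile(self) -> bool:
        """Forward the request only to a running scheduler.

        Returns whether the request was actually delivered.
        """
        if self.active_scheduler is None:
            return False  # nothing is listening, the request is dropped
        self.active_scheduler.cancel()
        return True

    def terminate_sticky(self) -> None:
        """Record the request first, then cancel a running scheduler if any."""
        self.terminate_requested = True
        if self.active_scheduler is not None:
            self.active_scheduler.cancel()

    def start_iteration(self) -> Optional[Scheduler]:
        """Start the next iteration unless termination was recorded."""
        if self.terminate_requested:
            return None
        self.active_scheduler = Scheduler()
        return self.active_scheduler


# Termination arrives between iterations, while no scheduler is active.
fragile = Experiment()
delivered = fragile.terminate_fragile()
next_started_fragile = fragile.start_iteration() is not None

sticky = Experiment()
sticky.terminate_sticky()
next_started_sticky = sticky.start_iteration() is not None

print(delivered, next_started_fragile, next_started_sticky)
```

In this model, a sleep in the forward model simply widens the window in which a scheduler is active, which would explain why the fragile path appears to work once the jobs are slower.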

berland commented Apr 23, 2024

The issue that remains after #7704 is merged is covered by #1250.

Technically, the Scheduler does not perform worse than job_queue here, but the problem is amplified because the Scheduler is faster, which makes it hard to click 'Terminate' in the short time window in which it is actually running.

This is not a problem for users in practice, as they do not run the poly case.

berland commented Apr 23, 2024

#7710 is merged; closing this as a duplicate of #1250.

berland closed this as completed Apr 23, 2024
github-project-automation bot moved this from In Progress to Done in SCOUT Apr 23, 2024
berland moved this from Done to Done-Done in SCOUT May 24, 2024