Everest not running on compute cluster #9447

@oyvindeide

Description

What happened? (You can include a screenshot if it helps explain)

Running Everest does not seem to work on a compute cluster. The server starts on a node, but no jobs are submitted. After a while, an error showed up in the terminal:

ERROR:everest_main:Everest run failed with: Traceback (most recent call last):
  File "/path/to/lib64/python3.11/site-packages/everest/detached/jobs/everserver.py", line 327, in main
    status, message = _get_optimization_status(run_model.exit_code, shared_data)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib64/python3.11/site-packages/everest/detached/jobs/everserver.py", line 391, in _get_optimization_status
    messages = _failed_realizations_messages(shared_data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib64/python3.11/site-packages/everest/detached/jobs/everserver.py", line 401, in _failed_realizations_messages
    failed = shared_data[SIM_PROGRESS_ENDPOINT]["status"]["failed"]
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'status'

Traceback (most recent call last):
  File "/path/to/lib64/python3.11/site-packages/everest/detached/jobs/everserver.py", line 327, in main
    status, message = _get_optimization_status(run_model.exit_code, shared_data)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib64/python3.11/site-packages/everest/detached/jobs/everserver.py", line 391, in _get_optimization_status
    messages = _failed_realizations_messages(shared_data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib64/python3.11/site-packages/everest/detached/jobs/everserver.py", line 401, in _failed_realizations_messages
    failed = shared_data[SIM_PROGRESS_ENDPOINT]["status"]["failed"]
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'status'

This happened on LSF, and the same problem was also reported for Slurm. I am not sure whether the error message was the same on Slurm. Everest ran fine with the same config when using the local queue.
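The `KeyError` in the traceback comes from indexing `shared_data[SIM_PROGRESS_ENDPOINT]["status"]` when no `"status"` entry exists, which is plausible if no simulation progress was ever reported. A minimal sketch of a defensive lookup follows. The value of `SIM_PROGRESS_ENDPOINT`, the shape of `shared_data`, and the helper name `failed_realizations_count` are assumptions for illustration, not the actual Everest code.

```python
# Hypothetical stand-in for the constant named in the traceback;
# its real value is not shown in this report.
SIM_PROGRESS_ENDPOINT = "sim_progress"


def failed_realizations_count(shared_data: dict) -> int:
    """Return the number of failed realizations, or 0 if no progress was recorded."""
    # Fall back to empty dicts so a missing endpoint or missing "status"
    # entry does not raise KeyError.
    progress = shared_data.get(SIM_PROGRESS_ENDPOINT) or {}
    status = progress.get("status") or {}
    return status.get("failed", 0)


# Case from this report (assumed shape): progress entry exists but has no "status".
print(failed_realizations_count({SIM_PROGRESS_ENDPOINT: {}}))
# Progress was reported with two failed realizations.
print(failed_realizations_count({SIM_PROGRESS_ENDPOINT: {"status": {"failed": 2}}}))
```

A guard like this would only stop the secondary crash while building the status message. It would not explain why no jobs are submitted on the cluster queue.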

What did you expect to happen?

No response

Steps to reproduce

Run math_func with:

simulator:
  queue_system: lsf

Environment where bug has been observed

  • python 3.11
  • python 3.12
  • macosx
  • rhel7
  • rhel8
  • local queue
  • lsf queue
  • slurm queue
  • openPBS queue
