Feature: System Wisteria GPU version #204

Merged
merged 26 commits into from
Mar 28, 2024
14743cf
finished wisteria system now passes TestFlow,
bch0w Aug 25, 2023
7f2d202
Merge branch 'master' into feature-wisteria_gpu
bch0w Feb 19, 2024
8603911
bump version number
bch0w Feb 19, 2024
99b4fbe
bump version number docs
bch0w Feb 19, 2024
a60ed1e
updating custom gpu wisteria runscript to include latest cuda version
bch0w Feb 20, 2024
f639347
bugfix prepare data for solver was not properly evaluating data paths
bch0w Feb 20, 2024
d403404
adding missing import statement
bch0w Feb 20, 2024
514f2c8
merging conflicts from devel, removing unncessary parameter in forwar…
bch0w Feb 20, 2024
40361e1
removing last merge conflict
bch0w Feb 20, 2024
35fcdb4
api function name change update for fujitsu/wisteria system modules
bch0w Feb 20, 2024
3f19ba5
Merge branch 'devel' into feature-wisteria_gpu
bch0w Mar 4, 2024
b161575
bugfix job tasktime was using walltime value
bch0w Mar 5, 2024
59f65fb
updates log message for assertion warning to include actual missing path
bch0w Mar 6, 2024
53d48c1
bugfix missing environ input in wisteria system
bch0w Mar 7, 2024
a48e421
Adding more comments to custom wisteria scripts
bch0w Mar 7, 2024
63eed6a
removing env variable from custom wisteria run scripts for now becaus…
bch0w Mar 8, 2024
96fc996
fixing environ implementation in wisteria
bch0w Mar 9, 2024
b1d2808
updating resource group numbers based on users guide
bch0w Mar 11, 2024
f553bc5
shifts wisteria functionality into fujitsu to keep wisteria overwrite…
bch0w Mar 11, 2024
84e9aef
removing unncessary imports and unused variables
bch0w Mar 11, 2024
3e5f742
fixing up the new custom run script functionality
bch0w Mar 11, 2024
bed15f0
fixing bash syntax errors in custon run wisteria
bch0w Mar 11, 2024
6708f66
updating cuda version in loaded modules
bch0w Mar 11, 2024
7cfea60
okay now it works, does not like backslash newlines in bash commands
bch0w Mar 11, 2024
e62dc28
fixing merge conflicts
bch0w Mar 28, 2024
fb11ab8
update changelog
bch0w Mar 28, 2024
12 changes: 11 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,16 @@
# CHANGELOG

## v3.0.1 (#203)
## v3.0.2 (#204)
System Wisteria GPU Upgrades

- Bugfix: Fujitsu tasktime for individual job submission was using `walltime` value, not `tasktime` value
- Moved the main System functionality from the Wisteria child class into the Fujitsu system. Previously the Wisteria child class overrode most of the Fujitsu functionality, which defeats the purpose of inheritance
- New custom run and submit scripts for Wisteria GPU, with better comments and slightly easier for others to modify
- Added new `rscgrps` to include GPU partitions on Wisteria
- Improved run call header for easier switching between CPU and GPU nodes

## v3.0.1 (#203)
Quality of Life Updates

- Solver now automatically generates VTK files for Models and Gradients at the end of each iteration
- New function `solver.specfem.make_output_vtk_files` that generates .vtk files for all files in the output/ directory
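
The `tasktime` bugfix listed under v3.0.2 concerns the `-L elapse=` directive, which expects `[[hour:]minute:]second`. Below is a minimal sketch of formatting separate walltime and tasktime values for that directive, assuming both are given in minutes. The helper name `format_elapse` and the example values are hypothetical and are not part of SeisFlows.

```python
from datetime import timedelta


def format_elapse(minutes):
    """Format a duration in minutes as H:MM:SS for a `-L elapse=` directive."""
    total_seconds = int(timedelta(minutes=minutes).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    mins, secs = divmod(remainder, 60)
    # Hours are allowed to exceed 24 so that multi-day limits stay in H:MM:SS
    return f"{hours}:{mins:02d}:{secs:02d}"


walltime = 600  # hypothetical: limit for the whole workflow, in minutes
tasktime = 15   # hypothetical: limit for a single task, in minutes

submit_directive = f"-L elapse={format_elapse(walltime)}"  # main job
run_directive = f"-L elapse={format_elapse(tasktime)}"     # individual tasks
```

Computing the hours by hand avoids the `"2 days, 0:00:00"` form that `str(timedelta(...))` produces for durations of a day or longer.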
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "seisflows"
version = "3.0.1"
version = "3.0.2"
description = "An automated workflow tool for full waveform inversion"
readme = "README.md"
requires-python = ">=3.7"
92 changes: 53 additions & 39 deletions seisflows/system/fujitsu.py
@@ -4,6 +4,7 @@
Computing Suite).

.. note::

The nickname `PJM`, based on the batch job script directives, may be used
as a shorthand to refer to the Fujitsu job management system.

@@ -18,10 +19,10 @@
import subprocess

from datetime import timedelta
from seisflows import ROOT_DIR, logger
from seisflows import logger
from seisflows.system.cluster import Cluster
from seisflows.tools import msg
from seisflows.tools.config import pickle_function_list
from seisflows.tools.config import pickle_function_list, import_seisflows


class Fujitsu(Cluster):
@@ -75,7 +76,6 @@ def __init__(self, ntask_max=100, pjm_args="", **kwargs):
self._failed_states = ["CANCEL", "HOLD", "ERROR"]
self._pending_states = ["QUEUED", "RUNNING"]


def check(self):
"""
Checks parameters and paths
@@ -146,42 +146,51 @@ def run_call_header(self):
f"-N {self.title}", # job name
f"-o {os.path.join(self.path.log_files, '%j')}",
f"-j", # merge stderr with stdout
f"-L elapse={self._walltime}", # [[hour:]minute:]second
f"-L elapse={self._tasktime}", # [[hour:]minute:]second
f"-L node={self.nodes}",
f"--mpi proc={self.nproc}",
])
return _call

def submit(self, workdir=None, parameter_file="parameters.yaml",
direct=True):
"""
Submit main workflow to the System. Two options are available,
submitting a Python job directly to the system, or submitting a
subprocess.

def submit(self, workdir=None, parameter_file="parameters.yaml"):
"""
Submits the main workflow job as a separate job submitted directly to
the system that is running the master job.

.. note::

Fujitsu scheduler doesn't allow command line arguments
(e.g., --workdir), so these are assumed to be default values where
the workdir is ${pwd} and the parameter file is called
'parameters.yaml'

:type workdir: str
:param workdir: path to the current working directory
:type parameter_file: str
:param parameter_file: parameter file name used to instantiate
the SeisFlows package
"""
# e.g., submit -w ./ -p parameters.yaml
submit_call = " ".join([
f"{self.submit_call_header}",
f"{self.submit_workflow}",
])

logger.debug(submit_call)
try:
subprocess.run(submit_call, shell=True)
except subprocess.CalledProcessError as e:
logger.critical(f"SeisFlows master job has failed with: {e}")
sys.exit(-1)
the SeisFlows package
:type direct: bool
:param direct: (used for overriding system modules) submits the main
workflow job directly to the login node as a Python process
(default). If False, submits the main job as a separate subprocess.
Note that this is Fujitsu specific and main jobs should be run from
interactive jobs run on compute nodes to avoid running jobs on
shared login resources
"""
if direct:
workflow = import_seisflows(workdir=workdir or self.path.workdir,
parameter_file=parameter_file)
workflow.check()
workflow.setup()
workflow.run()
else:
# e.g., submit -w ./ -p parameters.yaml
submit_call = " ".join([
f"{self.submit_call_header}",
f"{self.submit_workflow}",
])

logger.debug(submit_call)
try:
subprocess.run(submit_call, shell=True)
except subprocess.CalledProcessError as e:
logger.critical(f"SeisFlows master job has failed with: {e}")
sys.exit(-1)

@staticmethod
def _stdout_to_job_id(stdout):
@@ -215,11 +224,8 @@ def run(self, funcs, single=False, **kwargs):
cluster. Executes the list of functions (`funcs`) NTASK times with each
task occupying NPROC cores.

.. warning::
This has not been tested generally on Fujitsu systems, see system
Wisteria for a working application of the Fujitsu module

.. note::

Completely overwrites the `Cluster.run()` command

:type funcs: list of methods
@@ -244,15 +250,23 @@ def run(self, funcs, single=False, **kwargs):
f"system {self.ntask} times")
_ntask = self.ntask

# If environs are set, prepend a comma so they can be appended to the
# comma-separated '-x' list; if empty, no dangling comma is added
if self.environs:
self.environs = f",{self.environs}"

# Default Fujitsu command line input, can be overloaded by subclasses
# Copy-paste this default run_call and adjust accordingly for subclass
job_ids = []
for taskid in range(_ntask):
run_call = " ".join([
f"{self.run_call_header}",
f"--funcs {funcs_fid}",
f"--kwargs {kwargs_fid}",
f"--environment SEISFLOWS_TASKID={{task_id}},{self.environs}"
# -x in 'pjsub' sets environment variables that are passed on to the
# run script; see the custom run scripts for an example of their use.
# Ensure that these are comma-separated, not space-separated
f"-x SEISFLOWS_FUNCS={funcs_fid},"
f"SEISFLOWS_KWARGS={kwargs_fid},"
f"SEISFLOWS_TASKID={taskid}{self.environs},"
f"GPU_MODE={int(bool(self.gpu))}", # 0 if False, 1 if True
f"{self.run_functions}",
])

@@ -266,7 +280,7 @@ def run(self, funcs, single=False, **kwargs):

# Monitor the job queue until all jobs have completed, or any one fails
try:
status = self.check_job_status(job_id)
status = self.monitor_job_status(job_ids)
except FileNotFoundError:
logger.critical(f"cannot access job information through 'pjstat', "
f"waited 50s with no return, please check job "
@@ -351,7 +365,7 @@ def query_job_states(self, job_id, timeout_s=300, wait_time_s=30,
if _recheck > (timeout_s // wait_time_s):
raise TimeoutError(f"cannot access job ID {job_id}")
time.sleep(wait_time_s)
query_job_states(job_id, _recheck=_recheck)
self.query_job_states(job_id, _recheck=_recheck)

return job_ids, job_states
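
The new `run()` builds a single comma-separated `-x` argument and, as shown in the diff, reassigns `self.environs` with a leading comma on each call, so a second call to `run()` would appear to prepend another comma. Below is a minimal sketch of assembling the same string without mutating state. The function name `build_env_flag` and its arguments are hypothetical and are not part of the SeisFlows API.

```python
def build_env_flag(funcs_fid, kwargs_fid, taskid, environs="", gpu=None):
    """Return a `-x` argument made of comma-separated KEY=VALUE pairs."""
    pairs = [
        f"SEISFLOWS_FUNCS={funcs_fid}",
        f"SEISFLOWS_KWARGS={kwargs_fid}",
        f"SEISFLOWS_TASKID={taskid}",
    ]
    # Split user-supplied variables and drop empty entries so that an empty
    # or comma-terminated string never leaves a dangling comma behind
    pairs += [item.strip() for item in environs.split(",") if item.strip()]
    pairs.append(f"GPU_MODE={int(bool(gpu))}")  # 0 if falsy, 1 otherwise
    return "-x " + ",".join(pairs)


# Hypothetical file names and GPU label, for illustration only
flag = build_env_flag("funcs.p", "kwargs.p", taskid=0,
                      environs="OMP_NUM_THREADS=4", gpu="A100")
```

Because the pairs are joined in one place, the separator is guaranteed to be a comma regardless of whether `environs` is empty.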

34 changes: 28 additions & 6 deletions seisflows/system/runscripts/custom_run-wisteria
@@ -2,20 +2,41 @@
# ==============================================================================
# This is a Wisteria (UTokyo HPC) specific run script that is required
# because the compute node does not inherit the login node's Conda environment.
# Instead we need to load the module and environment manually, before run
# Instead we need to load the module and environment manually, before run.
#
# User needs to set the following paths:
# WORK_DIR: path to the directory where the Conda environment is stored, and
# where the SeisFlows repository has been cloned
# CONDA_ENV: name of the Conda environment to be used
# GPU_MODE: set by the calling function via the '-x' flag; if GPU_MODE is 1,
# the script loads the GPU-specific modules, otherwise the CPU modules
# ==============================================================================

# Defines where our Conda environment is saved and what its name is
WORK_DIR=/work/01/gr58/share/adjtomo
CONDA_ENV=adjtomo
echo "work environment set as: '$WORK_DIR'"

# Load MPI and activate Conda environment
module load intel
module load impi
if [ "${GPU_MODE:-0}" -eq 1 ]; then # Use GPUs (falls back to CPU if GPU_MODE is unset)
echo "loading GPU modules on compute node"
module load cuda/12.2
module load gcc
module load ompi-cuda
else
echo "loading CPU modules on compute node"
module load intel
module load impi
fi

# Conda will be common to GPU or CPU versions
echo "loading Conda environment: $CONDA_ENV"
module load miniconda/py38_4.9.2
source $MINICONDA_DIR/etc/profile.d/conda.sh
conda activate $WORK_DIR/conda/envs/adjtomo
conda activate $WORK_DIR/conda/envs/$CONDA_ENV

# Run Functions: ensure that we are using the correct Python version

# Run Functions: ensure that we are using the correct Python version
# The following environment variables must be set by the '-x' flag in the
# corresponding system.run() function:
# ---
@@ -24,4 +45,5 @@ conda activate $WORK_DIR/conda/envs/adjtomo
# SEISFLOWS_ENV: any additional environment variables
# SEISFLOWS_TASKID: assigned processor number for given task
# ---
$WORK_DIR/conda/envs/adjtomo/bin/python $WORK_DIR/REPOSITORIES/seisflows/seisflows/system/runscripts/run --funcs $SEISFLOWS_FUNCS --kwargs $SEISFLOWS_KWARGS --environment SEISFLOWS_TASKID=$SEISFLOWS_TASKID,$SEISFLOWS_ENV
$WORK_DIR/conda/envs/$CONDA_ENV/bin/python $WORK_DIR/REPOSITORIES/seisflows/seisflows/system/runscripts/run --funcs $SEISFLOWS_FUNCS --kwargs $SEISFLOWS_KWARGS --environment SEISFLOWS_TASKID=$SEISFLOWS_TASKID,$SEISFLOWS_ENV
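
The final command passes `--environment SEISFLOWS_TASKID=$SEISFLOWS_TASKID,$SEISFLOWS_ENV`, so the string ends in a trailing comma whenever `SEISFLOWS_ENV` is unset. How the `run` script parses this argument is not shown in the diff. The following is a sketch of a parser that tolerates empty entries, with the hypothetical name `parse_environment`.

```python
def parse_environment(env_string):
    """Parse 'KEY=VALUE,KEY=VALUE' into a dict, ignoring empty entries."""
    env = {}
    for item in env_string.split(","):
        item = item.strip()
        if not item:
            continue  # e.g., trailing comma left by an empty $SEISFLOWS_ENV
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got '{item}'")
        env[key] = value
    return env


# Values stay strings, as they would when exported with os.environ.update()
task_env = parse_environment("SEISFLOWS_TASKID=3,")
```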

1 change: 1 addition & 0 deletions seisflows/system/runscripts/custom_submit-wisteria
@@ -5,6 +5,7 @@
# Instead we need to load the module and environment manually, before submit
# ==============================================================================

# Defines where our Conda environment is stored
WORK_DIR=/work/01/gr58/share/adjtomo

# Load MPI and activate Conda environment