Added initial script to bootstrap Flux on CORAL/IBM machine #4

Merged: 18 commits, Sep 7, 2023

lpottier (Collaborator) commented Aug 10, 2023

Added a script that bootstraps Flux on IBM machines (CORAL systems such as Lassen). The script takes two inputs: the number of nodes and a JSON configuration file. It starts Flux, creates three sub-allocations, and adds their respective Flux URIs to the JSON file.

The input JSON file must contain certain fields, at minimum the number of nodes for each sub-allocation (ML, physics, and containers):

{
  "ams_app": {
    "executable": "/build_mpi/examples/ams_example",
    "modelpath": "/trained_models/debug_model.pt",
    "mpi_ranks": 2,
    "use_gpu": true,
    "use_db": true,
    "dbtype": "rmq"
  },
  "physics": {
    "nodes": 1,
    "cores": 20,
    "gpus": 2
  },
  "ml": {
    "nodes": 1,
    "cores": 10,
    "gpus": 2
  },
  "containers": {
    "nodes": 1,
    "cores": 1,
    "gpus": 0
  },
  "daemon": {
    "queue-training-init": "training-init",
    "queue-training-start": "training-start",
    "queue-training-done": "training-done"
  },
  "rabbitmq": {
    "rabbitmq-name": "testamsrabbitmq",
    "rabbitmq-password": "passwd",
    "rabbitmq-user": "lpottier",
    "rabbitmq-vhost": "/",
    "service-port": 1234,
    "service-host": "url-for-rabbitmq",
    "rabbitmq-cert": "rmq-pds.crt",
    "rabbitmq-inbound-queue": "test4",
    "rabbitmq-outbound-queue": "test3"
  }
}
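
As a rough illustration of the requirement above, the following Python sketch checks that each sub-allocation declares an integer `nodes` field. It is not part of the bootstrap script; the function name `missing_node_fields` is hypothetical, and the section names are taken from the example configuration.

```python
import json

# Sub-allocation section names, as they appear in the example configuration.
REQUIRED_SECTIONS = ("physics", "ml", "containers")


def missing_node_fields(config: dict) -> list:
    """Return the "<section>.nodes" entries that are absent or not integers."""
    missing = []
    for name in REQUIRED_SECTIONS:
        section = config.get(name)
        nodes = section.get("nodes") if isinstance(section, dict) else None
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(nodes, int) or isinstance(nodes, bool):
            missing.append(f"{name}.nodes")
    return missing


if __name__ == "__main__":
    example = json.loads(
        '{"physics": {"nodes": 1, "cores": 20, "gpus": 2}, "ml": {"nodes": 1}}'
    )
    print(missing_node_fields(example))  # ['containers.nodes']
```

An empty list means the minimum fields are present; other fields such as `cores` and `gpus` are not checked here.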

The script updates the JSON file to add a flux block:

{
  "flux": {
    "uid": "12345",
    "global_uri": "ssh://...",
    "physics_uri": "ssh://...",
    "ml_uri": "ssh://...",
    "container_uri": "ssh://..."
  }
}
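
The update step can be sketched in Python as follows. This is only an illustration of merging a `flux` block into an existing configuration file, not the shell script's actual implementation; the function name `add_flux_block` and the write-then-rename strategy are assumptions made for the example.

```python
import json
import os
import tempfile


def add_flux_block(config_path: str, flux_block: dict) -> dict:
    """Insert or overwrite the "flux" key in the JSON file at config_path."""
    with open(config_path, "r", encoding="utf-8") as fh:
        config = json.load(fh)

    config["flux"] = flux_block

    # Write to a temporary file in the same directory, then rename it over
    # the original, so a failed write does not leave a truncated config.
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(config, fh, indent=2)
            fh.write("\n")
        os.replace(tmp_path, config_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return config
```

Calling `add_flux_block("config.json", {"uid": "12345", "global_uri": "ssh://..."})` would leave the existing sections untouched and add the new block alongside them.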

To test these scripts with three nodes:

./bootstrap_flux.sh 3 config.json
./launch_flux.sh config.json

Script tested on Ruby (Slurm) with flux-core==0.50 and on Lassen (LSF) with flux-core==0.45.
Warning: on Lassen only flux-core<=0.45 works; newer Flux versions fail for unknown reasons.
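
A guard for the version constraint above could look like the following Python sketch. It assumes the flux-core version is already available as a string such as "0.45.0" (how it is obtained is left out), and it treats any 0.45 patch release as supported, which is an assumption rather than something tested in this PR. The function names are hypothetical.

```python
import re
from typing import Tuple

# Highest flux-core (major, minor) reported to work on Lassen in this PR.
LASSEN_MAX_FLUX = (0, 45)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as "0.45.0"."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supported_on_lassen(version: str) -> bool:
    """Return True if the version is at or below the known-good limit."""
    return parse_major_minor(version) <= LASSEN_MAX_FLUX
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically, where "0.9" would sort after "0.45".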

Updates:

  • Added a general script, launch_ams.sh, that runs everything
  • You first need to export the location of AMS with export AMS_ROOT=<AMS repo path>

Important:

  • Note that this script does not load any Flux installation from Spack. This must be done beforehand (or added to this script).
  • The physics and ML allocations are exclusive, so you need at least 3 nodes to run.
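
The node requirement can be checked before launching anything. The Python sketch below assumes that the nodes requested by the three sub-allocations simply add up (1 + 1 + 1 = 3 for the example configuration); the function name `check_node_count` is hypothetical and the check is not part of the scripts in this PR.

```python
# Sub-allocation section names, as they appear in the example configuration.
SUB_ALLOCATIONS = ("physics", "ml", "containers")


def check_node_count(requested: int, config: dict) -> int:
    """Return the nodes needed, or raise ValueError if too few are requested.

    Assumes each sub-allocation gets its own nodes, so the per-section
    "nodes" values are summed.
    """
    needed = sum(config[name]["nodes"] for name in SUB_ALLOCATIONS)
    if requested < needed:
        raise ValueError(
            f"{requested} node(s) requested, but the sub-allocations need {needed}"
        )
    return needed
```

With the example configuration, `check_node_count(3, config)` returns 3, while passing 2 raises a `ValueError`.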

Signed-off-by: Loic Pottier <pottier1@llnl.gov>
@lpottier lpottier requested review from koparasy and milroy August 10, 2023 02:54
@lpottier lpottier self-assigned this Aug 10, 2023
@lpottier lpottier marked this pull request as draft August 12, 2023 00:12
@lpottier lpottier marked this pull request as ready for review August 23, 2023 00:22
@koparasy koparasy changed the base branch from develop to feature/restructure August 30, 2023 19:06
@lpottier lpottier merged commit 2d93aec into feature/restructure Sep 7, 2023
koparasy pushed a commit that referenced this pull request Nov 7, 2023
* Added initial script to bootstrap Flux on CORAL/IBM machine
* Added support for Slurm-based systems, tested with flux-core 0.49 on Ruby/Lassen
* Added script to launch AMS miniapp with Flux
* Reverted script to support older versions of Flux, Lassen bootstrap only works with Flux<=0.45 (tested with 0.45)
* Added scripts to add secrets on OC
* Added new scripts to launch the entire AMS workflow
* Upgrade all scripts, they are now fully functional (main script communicates with AMS daemon via RMQ)

---------

Signed-off-by: Loic Pottier <pottier1@llnl.gov>
koparasy pushed a commit that referenced this pull request Nov 9, 2023
koparasy pushed a commit that referenced this pull request Nov 10, 2023