installing flux view an on demand volume is almost working! #68

Merged: 9 commits, Sep 27, 2023
5 changes: 5 additions & 0 deletions docs/_static/data/addons.json
@@ -43,5 +43,10 @@
    "name": "volume-secret",
    "description": "secret volume type",
    "family": "volume"
  },
  {
    "name": "workload-flux",
    "description": "hierarchical graph-based scheduler and resource manager",
    "family": "workload"
  }
]
57 changes: 57 additions & 0 deletions docs/getting_started/addons.md
@@ -157,6 +157,63 @@ spec:

**Note that we have support for a custom application container, but haven't written any good examples yet!**

## Workload

### workload-flux

If you need to "throw in" Flux Framework to use as a scheduler in your container, you can do that with an addon!

> Yes, it's astounding. 🦩️

This works by way of the same trick that we use for other addons that have a complex (and/or large) install setup:

- We build the software into an isolated spack "copy" view
  - The software is then (generally) at some `/opt/view` and `/opt/software`
- The flux container is added as a sidecar container to your pod for your replicated job
  - Additional setup / configuration is done here
- We create an empty volume that is shared with your metric or scaled application
  - The entire tree is copied over into the empty volume
- When the copy is done, indicated by the final touch of a file, the updated container entrypoint is run
  - This typically means we have taken your metric command and wrapped it in a Flux submit

It's really cool because it means you can run a metric / application with Flux without needing
to install it into your container to begin with. The one important detail is that the general
operating system of your container needs to match the one the view was built on. The current
view uses rocky; however, the image is customizable (and we can provide other bases if/when
requested). Here are the arguments you can customize under metric -> addons -> options.

| Name | Description | Type | Default |
|-----|-------------|------------|------|
| mount | Path to mount flux view in application container | string | /opt/share |
| tasks | Number of tasks `-n` to give to flux (not provided if not set) | string | unset |
| image | Customize the container image | string | `ghcr.io/rse-ops/spack-flux-rocky-view:tag-8` |
| fluxUser | The flux user (currently not used, but TBA) | string | flux |
| fluxUid | The flux user ID (currently not used, but TBA) | string | 1004 |
| interactive | Run flux in interactive mode | string | "false" |
| connectTimeout | How long ZeroMQ should wait to retry | string | "5s" |
| quorum | The number of brokers to require before starting the cluster | string | (total brokers or pods) |
| debugZeroMQ | Turn on ZeroMQ debugging | string | "false" |
| logLevel | Customize the flux log level | string | "6" |
| queuePolicy | Queue policy for flux to use | string | fcfs |
| workerLetter | The letter that the worker job is expected to have | string | w |
| launcherLetter | The letter that the launcher job is expected to have | string | w |
| workerIndex | The index of the replicated job for the worker | string | 0 |
| launcherIndex | The index of the replicated job for the launcher | string | 0 |
| preCommand | Pre-command logic to run in launcher/workers before flux is started (after setup in flux container) | string | unset |

Note that the number of pods for flux defaults to the number in your MetricSet, as do
the namespace and service name.
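
To make the defaulting concrete, here is a minimal Go sketch of how user options might be overlaid on the defaults from the table. It is an assumption-laden illustration: the `resolveOptions` helper and the plain `map[string]string` representation are hypothetical (the operator itself works with `intstr.IntOrString` values), and only a subset of the options is shown.

```go
package main

import "fmt"

// fluxDefaults mirrors a subset of the defaults in the table above.
// The map-based representation is a simplification for illustration.
var fluxDefaults = map[string]string{
	"mount":          "/opt/share",
	"interactive":    "false",
	"connectTimeout": "5s",
	"logLevel":       "6",
	"queuePolicy":    "fcfs",
}

// resolveOptions is a hypothetical helper that overlays user-provided
// options on the defaults. The quorum falls back to the number of pods
// in the MetricSet when it is not set explicitly.
func resolveOptions(user map[string]string, pods int) map[string]string {
	resolved := make(map[string]string, len(fluxDefaults)+1)
	for key, value := range fluxDefaults {
		resolved[key] = value
	}
	for key, value := range user {
		resolved[key] = value
	}
	if _, ok := resolved["quorum"]; !ok {
		resolved["quorum"] = fmt.Sprintf("%d", pods)
	}
	return resolved
}

func main() {
	opts := resolveOptions(map[string]string{"logLevel": "7"}, 4)
	fmt.Println(opts["logLevel"], opts["quorum"], opts["mount"])
}
```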

**Important:** the flux addon is currently supported for metric types that:

1. Have the launcher / worker design (so the `hostlist.txt` is present in the PWD)
2. Have scp installed, as the shared certificate needs to be copied from the lead broker to all followers
3. Ideally have munge installed; we do try to install it, but it is better for it to already be there
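
If you want to check a container image against this list before using the addon, a small pre-flight check along these lines could help. This is a hypothetical sketch and not part of the operator: the function names are made up, and it assumes the munge client binary is named `munge`. The lookups are passed in as functions so the logic can be exercised without touching the host.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// fileExists reports whether a path is present on disk.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// onPath reports whether an executable can be found on the PATH.
func onPath(tool string) bool {
	_, err := exec.LookPath(tool)
	return err == nil
}

// missing formats a message for a tool that was not found.
func missing(tool, why string) string {
	return fmt.Sprintf("%s not found on PATH: %s", tool, why)
}

// checkRequirements is a hypothetical pre-flight check for the three
// requirements listed above. Hard requirements are returned as problems,
// the optional munge install as a warning.
func checkRequirements(hasFile, hasTool func(string) bool) (problems, warnings []string) {
	if !hasFile("hostlist.txt") {
		problems = append(problems, "hostlist.txt is not in the working directory")
	}
	if !hasTool("scp") {
		problems = append(problems, missing("scp", "needed to copy the shared certificate"))
	}
	if !hasTool("munge") {
		warnings = append(warnings, missing("munge", "the addon will try to install it"))
	}
	return problems, warnings
}

func main() {
	problems, warnings := checkRequirements(fileExists, onPath)
	fmt.Println("problems:", problems)
	fmt.Println("warnings:", warnings)
}
```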

We also currently run flux as root. This is considered bad practice, but is probably OK
for this early development work. We don't see a need to have shared namespace / operator
environments at this point, which is why we haven't added support for a non-root flux user yet.

## Performance

### perf-hpctoolkit
32 changes: 32 additions & 0 deletions examples/addons/flux-lammps/metrics.yaml
@@ -0,0 +1,32 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # Number of pods for lammps (one launcher, the rest workers)
  pods: 4
  logging:
    interactive: true

  metrics:

    # Running more scaled lammps is our main goal
    - name: app-lammps

      # This flux addon is built on rocky, and we can provide additional os bases
      image: ghcr.io/converged-computing/metric-lammps-intel-mpi:rocky

      options:
        command: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
        workdir: /opt/lammps/examples/reaxff/HNS

      # Add on flux, will mount a volume and wrap lammps
      addons:
        - name: workload-flux
          options:
            # Ensure intel environment is set up
            preCommand: . /opt/intel/mpi/latest/env/vars.sh
            workdir: /opt/lammps/examples/reaxff/HNS
9 changes: 5 additions & 4 deletions pkg/addons/addons.go
@@ -26,6 +26,7 @@ var (
     AddonFamilyPerformance = "performance"
     AddonFamilyVolume = "volume"
     AddonFamilyApplication = "application"
+    AddonFamilyWorkload = "workload"
 )

 // A general metric is a container added to a JobSet
@@ -37,7 +38,7 @@ type Addon interface {
     Description() string

     // Options and exportable attributes
-    SetOptions(*api.MetricAddon)
+    SetOptions(*api.MetricAddon, *api.MetricSet)
     Options() map[string]intstr.IntOrString
     ListOptions() map[string][]intstr.IntOrString
     MapOptions() map[string]map[string]intstr.IntOrString
@@ -65,7 +66,7 @@ type AddonBase struct {
     mapOptions map[string]map[string]intstr.IntOrString
 }

-func (b *AddonBase) SetOptions(metric *api.MetricAddon) {}
+func (b *AddonBase) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {}
 func (b *AddonBase) CustomizeEntrypoints([]*specs.ContainerSpec, []*jobset.ReplicatedJob) {}

 func (b *AddonBase) Validate() bool {
@@ -97,7 +98,7 @@ func (b *AddonBase) MapOptions() map[string]map[string]intstr.IntOrString {
 }

 // GetAddon looks up and validates an addon
-func GetAddon(a *api.MetricAddon) (Addon, error) {
+func GetAddon(a *api.MetricAddon, set *api.MetricSet) (Addon, error) {

     // We don't want to change the addon interface/struct itself
     template, ok := Registry[a.Name]
@@ -111,7 +112,7 @@ func GetAddon(a *api.MetricAddon) (Addon, error) {
     addon := reflect.New(templateType.Type()).Interface().(Addon)

     // Set options before validation
-    addon.SetOptions(a)
+    addon.SetOptions(a, set)

     // Validate the addon
     if !addon.Validate() {
8 changes: 4 additions & 4 deletions pkg/addons/commands.go
@@ -42,9 +42,9 @@ func (a *PerfAddon) CustomizeEntrypoints(
     }
 }

-func (a *PerfAddon) SetOptions(metric *api.MetricAddon) {
+func (a *PerfAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
     a.Identifier = perfCommandsName
-    a.SetSharedCommandOptions(metric)
+    a.SetSharedCommandOptions(addon)
 }

 // addContainerCaps adds capabilities to a container spec
@@ -102,9 +102,9 @@ func (m CommandAddon) Family() string {
     return AddonFamilyApplication
 }

-func (a *CommandAddon) SetOptions(metric *api.MetricAddon) {
+func (a *CommandAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
     a.Identifier = commandsName
-    a.SetSharedCommandOptions(metric)
+    a.SetSharedCommandOptions(addon)
 }

 // Set custom options / attributes for the metric
4 changes: 2 additions & 2 deletions pkg/addons/containers.go
@@ -139,8 +139,8 @@ func (a *ApplicationAddon) setDefaultEntrypoint() {
 }

 // Calling the default allows a custom application that uses this to do the same
-func (a *ApplicationAddon) SetOptions(metric *api.MetricAddon) {
-    a.SetDefaultOptions(metric)
+func (a *ApplicationAddon) SetOptions(addon *api.MetricAddon, metric *api.MetricSet) {
+    a.SetDefaultOptions(addon)
 }

 // Underlying function that can be shared
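
The common thread in these hunks is that `SetOptions` now receives the parent `MetricSet` alongside the addon. The Go sketch below shows what that makes possible, presumably the motivation given that the flux docs say the pod count and namespace default to those of the MetricSet. It is a simplified illustration, not the code from this PR: `MetricAddon` and `MetricSet` are flattened stand-ins for the real `api` types, and `FluxAddon` with its fields is hypothetical.

```go
package main

import "fmt"

// MetricAddon and MetricSet are simplified stand-ins for the api types.
// The real structs have more fields and a nested spec.
type MetricAddon struct {
	Name    string
	Options map[string]string
}

type MetricSet struct {
	Name      string
	Namespace string
	Pods      int32
}

// FluxAddon is a hypothetical addon that needs values from the MetricSet.
type FluxAddon struct {
	pods      int32
	namespace string
	tasks     string
}

// SetOptions follows the new two-argument signature: options come from the
// addon, while fallbacks such as the pod count come from the MetricSet.
func (a *FluxAddon) SetOptions(addon *MetricAddon, set *MetricSet) {
	a.pods = set.Pods
	a.namespace = set.Namespace
	if tasks, ok := addon.Options["tasks"]; ok {
		a.tasks = tasks
	}
}

// Summary renders the resolved settings for display.
func (a *FluxAddon) Summary() string {
	return fmt.Sprintf("pods=%d namespace=%s tasks=%q", a.pods, a.namespace, a.tasks)
}

func main() {
	set := &MetricSet{Name: "metricset-sample", Namespace: "default", Pods: 4}
	addon := &MetricAddon{Name: "workload-flux", Options: map[string]string{"tasks": "8"}}
	flux := &FluxAddon{}
	flux.SetOptions(addon, set)
	fmt.Println(flux.Summary())
}
```

Addons that do not need the MetricSet, like the ones changed above, can simply ignore the second argument and keep passing the addon through to their existing helpers.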