wip to add resources (#27)
* wip to add resources
* add support to ask for thread level detail for the perf-sysstat metric

this is not tested yet - I am going to test on Google Cloud with >1 node

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch authored Aug 9, 2023
1 parent b179b4a commit e830a70
Showing 20 changed files with 421 additions and 50 deletions.
2 changes: 0 additions & 2 deletions README.md
@@ -13,9 +13,7 @@ To learn more:
## Dinosaur TODO

- We need a way for the entrypoint command to monitor (based on the container) to differ (potentially)
- add resource limits / requests
- make flux operator command generator
- Find better logging library for logging outside of controller (go 1.21 has a logging library!)
- For larger metric collections, we should have a log streaming mode (and not wait for Completed/Successful)
- For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application
- Python function to save entire spec to yaml (for MetricSet and JobSet)?
20 changes: 20 additions & 0 deletions api/v1alpha1/metric_types.go
@@ -66,6 +66,10 @@ type MetricSetSpec struct {
// +optional
Pods int32 `json:"pods"`

// Resources include limits and requests for each pod (that include a JobSet)
// +optional
Resources ContainerResource `json:"resources"`

// Single pod completion, meaning the jobspec completions is unset
// and we only require one main completion
// +optional
@@ -98,11 +102,27 @@ type Application struct {
//+optional
PullSecret string `json:"pullSecret"`

// Resources include limits and requests for the application
// +optional
Resources ContainerResources `json:"resources"`

// Existing Volumes for the application
// +optional
Volumes map[string]Volume `json:"volumes"`
}

// ContainerResources include limits and requests
type ContainerResources struct {

// +optional
Limits ContainerResource `json:"limits"`

// +optional
Requests ContainerResource `json:"requests"`
}

type ContainerResource map[string]intstr.IntOrString

// A Volume should correspond with an existing volume, either:
// config map, secret, or claim name. This will be added soon.
type Volume struct {
58 changes: 58 additions & 0 deletions api/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

27 changes: 27 additions & 0 deletions config/crd/bases/flux-framework.org_metricsets.yaml
@@ -51,6 +51,24 @@ spec:
pullSecret:
description: A pull secret for the application container
type: string
resources:
description: Resources include limits and requests for the application
properties:
limits:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
type: object
requests:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
type: object
type: object
volumes:
additionalProperties:
description: 'A Volume should correspond with an existing volume,
@@ -172,6 +190,15 @@ spec:
description: Parallelism (e.g., pods)
format: int32
type: integer
resources:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
description: Resources include limits and requests for each pod (that
include a JobSet)
type: object
serviceName:
default: ms
description: Service name for the JobSet (MetricsSet) cluster network
14 changes: 7 additions & 7 deletions docs/_static/data/metrics.json
@@ -1,4 +1,11 @@
[
{
"name": "io-sysstat",
"description": "statistics for Linux tasks (processes) : I/O, CPU, memory, etc.",
"type": "storage",
"image": "ghcr.io/converged-computing/metric-sysstat:latest",
"url": "https://github.com/sysstat/sysstat"
},
{
"name": "network-netmark",
"description": "point to point networking tool",
@@ -19,12 +26,5 @@
"type": "application",
"image": "ghcr.io/converged-computing/metric-sysstat:latest",
"url": "https://github.com/sysstat/sysstat"
},
{
"name": "io-sysstat",
"description": "statistics for Linux tasks (processes) : I/O, CPU, memory, etc.",
"type": "storage",
"image": "ghcr.io/converged-computing/metric-sysstat:latest",
"url": "https://github.com/sysstat/sysstat"
}
]
48 changes: 46 additions & 2 deletions docs/getting_started/custom-resource-definition.md
@@ -91,6 +91,18 @@ spec:
An application is allowed to have one or more existing volumes. An existing volume can be any of the types described in [existing volumes](#existing-volumes)
#### resources
Resource lists for an application container go under [Overhead](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/). Known keys include "memory" and "cpu" (should be provided in some
string format that can be parsed) and all others are considered some kind of quantity request.
```yaml
pod:
resources:
memory: 500M
cpu: 4
```
### storage
When you want to measure some storage performance, you'll want to add a "storage" section to your MetricSet. This will typically just be a reference to some existing storage (see [existing volumes](#existing-volumes)) that we want to measure, and can
@@ -120,7 +132,7 @@ spec:
rate: 20
```

-### completions
+#### completions

Completions for a metric are relevant if you are assessing storage (which doesn't have an application runtime) or a service application that will continue to run forever. When this value is set to 0, it essentially indicates no set number of completions (meaning we run forever). Any non-zero value will ensure the metric
runs for that many completions before exiting.
@@ -134,7 +146,7 @@ spec:

This is usually suggested to provide for a storage metric.

-### options
+#### options

Metrics can take custom options, which are key value pairs of a string key and either string or integer value. These come in three types:

@@ -164,6 +176,38 @@ Presence of absence of an option type depends on the metric. Metrics are free to
options as they see fit.


## resources

Resources for an entire spec are given to the Pod template of the Job. They can include limits and requests. Known keys include "memory" and "cpu" (should be provided in some
string format that can be parsed) and all others are considered some kind of quantity request.

```yaml
resources:
limits:
memory: 500M
cpu: 4
```

If you wanted to, for example, request a GPU, that might look like:

```yaml
resources:
limits:
gpu-vendor.example/example-gpu: 1
```

Or for a particular type of networking fabric:

```yaml
resources:
limits:
vpc.amazonaws.com/efa: 1
```

Both limits and requests accept either a string or an integer value, and you'll get an error if you
provide something else. If you need something else, [let us know](https://github.com/converged-computing/metrics-operator/issues).
If you are requesting a GPU, [this documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is helpful.
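
The "string or integer" rule above can be illustrated with a small, standard-library-only Go sketch. It is not the operator's code: the commit types these maps as `map[string]intstr.IntOrString`, while the `IntOrString`, `ResourceList`, `Resources`, and `parseResources` names below are hypothetical stand-ins so the example runs without Kubernetes modules. JSON is used instead of YAML only because the Go standard library has no YAML parser.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// IntOrString is a hypothetical, simplified stand-in for
// intstr.IntOrString from k8s.io/apimachinery.
type IntOrString struct {
	IsString bool
	IntVal   int64
	StrVal   string
}

// UnmarshalJSON accepts a JSON string or a JSON integer and rejects
// everything else (floats, booleans, null, objects, arrays).
func (v *IntOrString) UnmarshalJSON(data []byte) error {
	if len(data) > 0 && data[0] == '"' {
		v.IsString = true
		return json.Unmarshal(data, &v.StrVal)
	}
	i, err := strconv.ParseInt(string(data), 10, 64)
	if err != nil {
		return fmt.Errorf("resource value %s must be a string or an integer", data)
	}
	v.IsString, v.IntVal = false, i
	return nil
}

// String renders the value the way it was written in the spec.
func (v IntOrString) String() string {
	if v.IsString {
		return v.StrVal
	}
	return strconv.FormatInt(v.IntVal, 10)
}

// ResourceList mirrors the shape of ContainerResource in the diff.
type ResourceList map[string]IntOrString

// Resources mirrors the shape of ContainerResources (limits and requests).
type Resources struct {
	Limits   ResourceList `json:"limits"`
	Requests ResourceList `json:"requests"`
}

// parseResources decodes a resources document (hypothetical helper).
func parseResources(doc []byte) (Resources, error) {
	var r Resources
	err := json.Unmarshal(doc, &r)
	return r, err
}

func main() {
	r, err := parseResources([]byte(`{"limits": {"memory": "500M", "cpu": 4}}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.Limits["memory"], r.Limits["cpu"])

	// A float is neither a string nor an integer, so decoding fails.
	_, err = parseResources([]byte(`{"limits": {"cpu": 1.5}}`))
	fmt.Println("float rejected:", err != nil)
}
```

In a real cluster the API server enforces this through the `x-kubernetes-int-or-string` marker in the CRD, so the check happens before the operator ever sees the object.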

## Existing Volumes

An existing volume can be provided to support an application (multiple) or one can be provided for assessing its performance (single).
8 changes: 6 additions & 2 deletions docs/getting_started/metrics.md
@@ -31,7 +31,8 @@ This metric provides the "pidstat" executable of the sysstat library. The follow
|Name | Description | Type | Default |
|-----|-------------|------------|------|
| color | Set to turn on color parsing | Anything set | unset |
-| pids | For debugging, show consistent output of ps aux | Anything set | Unset |
+| pids | For debugging, show consistent output of ps aux | Anything set | unset |
+| threads | Add `-t` to each pidstat command to request thread-level output | Anything set | unset |

By default color and pids are set to false anticipating log parsing.
And we also provide the option to see "commands" or specific commands based on a job index to the metric.
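
To make the `threads` option in the table above concrete, here is a minimal Go sketch of how an "anything set" option could toggle a flag. It is an illustration under assumptions, not the operator's implementation: `pidstatArgs` and the `map[string]string` option type are hypothetical, and only the `-p` and `-t` flags (both documented in the pidstat man page) are used.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pidstatArgs builds the argument list for one pidstat call against a
// process ID. The "threads" option counts as enabled when the key is
// present, whatever its value ("Anything set" in the table).
func pidstatArgs(pid int, options map[string]string) []string {
	args := []string{"pidstat", "-p", strconv.Itoa(pid)}
	if _, ok := options["threads"]; ok {
		// -t asks pidstat for thread-level statistics.
		args = append(args, "-t")
	}
	return args
}

func main() {
	fmt.Println(strings.Join(pidstatArgs(42, nil), " "))
	fmt.Println(strings.Join(pidstatArgs(42, map[string]string{"threads": "true"}), " "))
}
```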
@@ -51,8 +52,11 @@ and the rest (workers).
"all": /usr/libexec/flux/cmd/flux-broker --config /etc/flux/config -Scron.directory=/etc/flux/system/cron.d -Stbon.fanout
"0": /usr/bin/python3.8 /usr/libexec/flux/cmd/flux-submit.py -n 2 --quiet --watch lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
```
In the map above, order matters, as the command for all indices is first set to be the flux-broker one, and then
-after the index at 0 gets a custom command.
+after the index at 0 gets a custom command. See [pidstat](https://man7.org/linux/man-pages/man1/pidstat.1.html) for
+more information on this command, and [this file](https://github.com/converged-computing/metrics-operator/blob/main/pkg/metrics/perf/sysstat.go)
+for how we use them. If there is an option or command that is not exposed that you would like, please [open an issue](https://github.com/converged-computing/metrics-operator/issues).
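
The "all first, then index-specific override" rule described above can be sketched in Go as a lookup with a fallback. This is an assumption about one way to get that behavior, not the code in `sysstat.go`: `resolveCommand` is a hypothetical helper, and the command strings are shortened placeholders for the full commands shown in the map.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveCommand returns the command to monitor for one job index.
// An entry keyed by the index itself wins; otherwise the "all" entry
// is used. The boolean is false when neither key exists.
func resolveCommand(commands map[string]string, index int) (string, bool) {
	if cmd, ok := commands[strconv.Itoa(index)]; ok {
		return cmd, true
	}
	cmd, ok := commands["all"]
	return cmd, ok
}

func main() {
	// Shortened placeholders for the flux-broker and flux-submit commands.
	commands := map[string]string{
		"all": "flux-broker",
		"0":   "flux-submit.py",
	}
	for index := 0; index < 3; index++ {
		cmd, _ := resolveCommand(commands, index)
		fmt.Printf("index %d monitors %s\n", index, cmd)
	}
}
```

Expressing the rule as a precedence lookup gives the same result as applying "all" first and the per-index entry second, without depending on the order in which the map is read.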
### Storage
26 changes: 26 additions & 0 deletions examples/dist/metrics-operator-arm.yaml
@@ -57,6 +57,24 @@ spec:
pullSecret:
description: A pull secret for the application container
type: string
resources:
description: Resources include limits and requests for the application
properties:
limits:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
type: object
requests:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
type: object
type: object
volumes:
additionalProperties:
description: 'A Volume should correspond with an existing volume, either: config map, secret, or claim name. This will be added soon.'
@@ -164,6 +182,14 @@ spec:
description: Parallelism (e.g., pods)
format: int32
type: integer
resources:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
description: Resources include limits and requests for each pod (that include a JobSet)
type: object
serviceName:
default: ms
description: Service name for the JobSet (MetricsSet) cluster network
26 changes: 26 additions & 0 deletions examples/dist/metrics-operator.yaml
@@ -57,6 +57,24 @@ spec:
pullSecret:
description: A pull secret for the application container
type: string
resources:
description: Resources include limits and requests for the application
properties:
limits:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
type: object
requests:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
type: object
type: object
volumes:
additionalProperties:
description: 'A Volume should correspond with an existing volume, either: config map, secret, or claim name. This will be added soon.'
@@ -164,6 +182,14 @@ spec:
description: Parallelism (e.g., pods)
format: int32
type: integer
resources:
additionalProperties:
anyOf:
- type: integer
- type: string
x-kubernetes-int-or-string: true
description: Resources include limits and requests for each pod (that include a JobSet)
type: object
serviceName:
default: ms
description: Service name for the JobSet (MetricsSet) cluster network