[wip] second design for metrics operator #63

Merged 40 commits on Sep 24, 2023
f23da0b
WIP to refactor
vsoch Sep 19, 2023
f525b8a
definitely making bad life decisions
vsoch Sep 20, 2023
8084830
very satisfying deletion of things.
vsoch Sep 20, 2023
b0f94c2
lammps ran!
vsoch Sep 20, 2023
9cc1769
amg is back
vsoch Sep 20, 2023
9d47cf1
bdas is back
vsoch Sep 20, 2023
d79ecbc
add back hpl
vsoch Sep 20, 2023
412217a
add back kripke
vsoch Sep 20, 2023
c5e0938
laghos
vsoch Sep 20, 2023
92d93ff
test signing again
vsoch Sep 20, 2023
fd242e9
add back nekbone
vsoch Sep 20, 2023
46ca402
add back pennant
vsoch Sep 20, 2023
6ebacea
add back quicksilver
vsoch Sep 20, 2023
7151697
workflow format bug
vsoch Sep 20, 2023
e9c2b0a
add back fio
vsoch Sep 21, 2023
25385cb
add back host volume example
vsoch Sep 21, 2023
635fa47
add back ior
vsoch Sep 21, 2023
8656acd
add back osu benchmarks!
vsoch Sep 21, 2023
7764578
add back chatterbug
vsoch Sep 21, 2023
8123c7d
add back netmark
vsoch Sep 21, 2023
f144bb3
systat and lammps working again
vsoch Sep 21, 2023
dfdc79b
hpctoolkit design at least works
vsoch Sep 21, 2023
f79f9bf
clean up docs a little bit
vsoch Sep 21, 2023
3ad7902
addon documentation is good
vsoch Sep 21, 2023
36603ee
hopefully fix bug
vsoch Sep 21, 2023
fd5e6db
fixing workingdir bug!
vsoch Sep 21, 2023
c6e5fa7
update to v1alpha2
vsoch Sep 21, 2023
1c6c15a
bugfix
vsoch Sep 21, 2023
1049084
update versions
vsoch Sep 21, 2023
940b26a
samples removed
vsoch Sep 21, 2023
f808bfd
updates to docs
vsoch Sep 22, 2023
5b322f5
typos
vsoch Sep 22, 2023
03c3649
a single touch marker at the end of the copy is more reliable than a …
vsoch Sep 22, 2023
76f050c
support to customize container for any metric, and for hpctoolkit to …
vsoch Sep 23, 2023
27f9f42
support for custom container
vsoch Sep 23, 2023
ec06d4b
small tweak
vsoch Sep 23, 2023
0d492ef
update metrics file
vsoch Sep 23, 2023
7b7b7bb
add print at end of post analysis for hpctoolkit
vsoch Sep 23, 2023
57279d5
fixing bug with internal crd state
vsoch Sep 23, 2023
d713bef
typo with laghos and kriple
vsoch Sep 23, 2023
28 changes: 14 additions & 14 deletions .github/workflows/main.yaml
@@ -16,7 +16,7 @@ jobs:
       - name: Check Spelling
         uses: crate-ci/typos@7ad296c72fa8265059cc03d1eda562fbdfcd6df2 # v1.9.0
         with:
-          files: ./README.md ./config/samples ./docs/*.md ./docs/*/*.md
+          files: ./README.md ./docs/*.md ./docs/*/*.md ./docs/*/*/*.md

       - name: Lint and format Python code
         run: |
@@ -66,19 +66,19 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        test: [["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # performance test
-               ["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # storage test
-               ["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120], # storage test
-               ["io-ior", "ghcr.io/converged-computing/metric-ior:latest", 120], # storage test
-               # ["network-chatterbug", "ghcr.io/converged-computing/metric-chatterbug:latest", 120], # network app test
-               ["app-nekbone", "ghcr.io/converged-computing/metric-nekbone:latest", 120], # standalone app test
-               # ["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120], # standalone app test
-               ["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120], # standalone app test
-               ["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120], # standalone app test
-               ["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120], # standalone app test
-               ["app-bdas", "ghcr.io/converged-computing/metric-bdas:latest", 120], # standalone app test
-               ["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120], # standalone app test
-               ["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120]] # standalone app test
+        test: [["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120],
+               ["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60],
+               ["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60],
+               ["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120],
+               ["io-ior", "ghcr.io/converged-computing/metric-ior:latest", 120],
+               ## ["network-chatterbug", "ghcr.io/converged-computing/metric-chatterbug:latest", 120],
+               ["app-nekbone", "ghcr.io/converged-computing/metric-nekbone:latest", 120],
+               ["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120],
+               ["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120],
+               ["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120],
+               ["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120],
+               ["app-bdas", "ghcr.io/converged-computing/metric-bdas:latest", 120],
+               ["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120]]

     steps:
       - name: Clone the code
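Each entry in the `test` matrix is a positional triple: a test name, a container image, and a number (60 or 120). The sketch below shows one way such triples could be decoded into named fields. The `testCase` type, its field names, and the `parseMatrix` helper are hypothetical, and reading the third value as a timeout in seconds is an assumption; the workflow itself does not label it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// testCase mirrors one matrix entry. The field names are hypothetical;
// the workflow only provides positional values.
type testCase struct {
	Name    string
	Image   string
	Seconds int // assumed to be a timeout in seconds
}

// parseMatrix decodes a JSON-compatible list of [name, image, number] triples.
func parseMatrix(raw string) ([]testCase, error) {
	var rows [][]interface{}
	if err := json.Unmarshal([]byte(raw), &rows); err != nil {
		return nil, err
	}
	cases := make([]testCase, 0, len(rows))
	for i, row := range rows {
		if len(row) != 3 {
			return nil, fmt.Errorf("entry %d: want 3 fields, got %d", i, len(row))
		}
		name, okName := row[0].(string)
		image, okImage := row[1].(string)
		secs, okSecs := row[2].(float64) // encoding/json decodes numbers as float64
		if !okName || !okImage || !okSecs {
			return nil, fmt.Errorf("entry %d: unexpected field types", i)
		}
		cases = append(cases, testCase{Name: name, Image: image, Seconds: int(secs)})
	}
	return cases, nil
}

func main() {
	raw := `[["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120],
	         ["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60]]`
	cases, err := parseMatrix(raw)
	if err != nil {
		panic(err)
	}
	for _, c := range cases {
		fmt.Printf("%s -> %s (%d)\n", c.Name, c.Image, c.Seconds)
	}
}
```

Note that the commented-out `network-chatterbug` line would have to be dropped before decoding, since JSON has no comment syntax.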
2 changes: 1 addition & 1 deletion .github/workflows/python.yaml
@@ -27,7 +27,7 @@ jobs:
         run: |
           export PATH="/usr/share/miniconda/bin:$PATH"
           source activate mo
-          cd sdk/python/v1alpha1
+          cd sdk/python/v1alpha2
           pip install .
           pip install seaborn pandas

2 changes: 1 addition & 1 deletion .github/workflows/release.yaml
@@ -106,7 +106,7 @@ jobs:
         run: |
           export PATH="/usr/share/miniconda/bin:$PATH"
           source activate mo
-          cd sdk/python/v1alpha1/
+          cd sdk/python/v1alpha2/
           pip install -e .
           python setup.py sdist bdist_wheel
           cd dist
3 changes: 2 additions & 1 deletion Makefile
@@ -323,7 +323,8 @@ helm: manifests kustomize helmify

 .PHONY: docs-data
 docs-data:
-	go run hack/docs-gen/main.go docs/_static/data/metrics.json
+	go run hack/metrics-gen/main.go docs/_static/data/metrics.json
+	go run hack/addons-gen/main.go docs/_static/data/addons.json

 .PHONY: pre-push
 pre-push: generate build-config-arm build-config docs-data
4 changes: 2 additions & 2 deletions PROJECT
@@ -17,6 +17,6 @@ resources:
   controller: true
   domain: flux-framework.org
   kind: MetricSet
-  path: github.com/converged-computing/metrics-operator/api/v1alpha1
-  version: v1alpha1
+  path: github.com/converged-computing/metrics-operator/api/v1alpha2
+  version: v1alpha2
 version: "3"
1 change: 1 addition & 0 deletions README.md
@@ -12,6 +12,7 @@ To learn more:

## Dinosaur TODO

- Figure out issue with errors.IsNotFound not working...
- We need a way for the entrypoint command to monitor (based on the container) to differ (potentially)
- For larger metric collections, we should have a log streaming mode (and not wait for Completed/Successful)
- For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application
@@ -14,10 +14,10 @@ See the License for the specific language governing permissions and
 limitations under the License.
 */

-// Package v1alpha1 contains API Schema definitions for the v1alpha1 API group
+// Package v1alpha2 contains API Schema definitions for the v1alpha2 API group
 // +kubebuilder:object:generate=true
 // +groupName=flux-framework.org
-package v1alpha1
+package v1alpha2

 import (
 	"k8s.io/apimachinery/pkg/runtime/schema"
@@ -26,7 +26,7 @@ import (

 var (
 	// GroupVersion is group version used to register these objects
-	GroupVersion = schema.GroupVersion{Group: "flux-framework.org", Version: "v1alpha1"}
+	GroupVersion = schema.GroupVersion{Group: "flux-framework.org", Version: "v1alpha2"}

 	// SchemeBuilder is used to add go types to the GroupVersionKind scheme
 	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}
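The `GroupVersion` bump changes the `apiVersion` string that resources in this group are registered under. The sketch below shows how group and version combine into that string. It uses a local stand-in type, `groupVersion`, rather than the real `schema.GroupVersion` from `k8s.io/apimachinery`, so that it runs without dependencies; the `apiVersion` method name is hypothetical.

```go
package main

import "fmt"

// groupVersion is a local stand-in for schema.GroupVersion from
// k8s.io/apimachinery, defined here so the sketch has no dependencies.
type groupVersion struct {
	Group   string
	Version string
}

// apiVersion joins group and version the way a manifest's apiVersion
// field is written; an empty group yields the bare version.
func (gv groupVersion) apiVersion() string {
	if gv.Group == "" {
		return gv.Version
	}
	return gv.Group + "/" + gv.Version
}

func main() {
	before := groupVersion{Group: "flux-framework.org", Version: "v1alpha1"}
	after := groupVersion{Group: "flux-framework.org", Version: "v1alpha2"}
	// prints "flux-framework.org/v1alpha1 => flux-framework.org/v1alpha2"
	fmt.Println(before.apiVersion(), "=>", after.apiVersion())
}
```

Under this reading, a MetricSet manifest written against `flux-framework.org/v1alpha1` would presumably need its `apiVersion` updated to `flux-framework.org/v1alpha2` after this change.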