[wip] second design for metrics operator #63

Merged
merged 40 commits into from
Sep 24, 2023

Conversation

vsoch
Member

@vsoch vsoch commented Sep 20, 2023

Opening a separate PR since I'm mucking with my environment / setup (and didn't want to bork it in case I messed up!) This is a continuation of #62. See there for verbose description.

Requirements before this can be merged:

  • All metrics re-implemented and re-tested in this new design
  • My KubeCon experiments also verified to function the same (OSU looks ok, need to test storage)
  • A web UI of addons
  • A larger version bump here (likely alpha 1 to alpha 2)

And probably something else I didn't think of. I'm giving myself to the end of the week to complete this and prototype hpctoolkit as an addon with the lammps app. This probably could be enough work to spread out over a few weeks to a month... no pressure! But also, I think I'm going to try my damn best anyway.

Crap, commits aren't signed! Need to fix that, but going back to sleep for a bit :)

This is going to be a huge refactor that removes the "hard coded" application/storage
legos and replaces them with a more flexible setup: we have one base metric set (no
subtypes), metrics generate the replicated jobs (as many as they like, however
they please), and addons are then provided to them, which can range from additional
volumes to containers (that provide volumes) to any kind of customization. This
is not ready for any kind of testing, but I am mostly concerned about my computer
blowing up and losing the work, so I am saving for good measure :) Also, yay today! :D

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
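
To make the design described in this commit message concrete, here is a minimal sketch in Go. It is not the operator's actual code: `JobSpec`, `Addon`, `Metric`, `VolumeAddon`, `SidecarAddon`, `LauncherWorkerMetric`, and `BuildJobs` are all hypothetical names, and `JobSpec` is a simplified stand-in for a real replicated job. The idea it illustrates is only the one stated above: a metric decides which replicated jobs exist, and addons then customize them.

```go
package main

import "fmt"

// JobSpec is a hypothetical, simplified stand-in for one replicated job.
type JobSpec struct {
	Name       string
	Containers []string
	Volumes    []string
}

// Addon (hypothetical) customizes a job that a metric has generated.
type Addon interface {
	Apply(job *JobSpec)
}

// Metric (hypothetical) generates as many replicated jobs as it likes.
type Metric interface {
	ReplicatedJobs() []JobSpec
}

// VolumeAddon adds one extra volume to every job.
type VolumeAddon struct{ Volume string }

func (a VolumeAddon) Apply(job *JobSpec) {
	job.Volumes = append(job.Volumes, a.Volume)
}

// SidecarAddon adds a container together with the volume it provides.
type SidecarAddon struct{ Container, Volume string }

func (a SidecarAddon) Apply(job *JobSpec) {
	job.Containers = append(job.Containers, a.Container)
	job.Volumes = append(job.Volumes, a.Volume)
}

// LauncherWorkerMetric generates two jobs, a launcher and a worker.
type LauncherWorkerMetric struct{ App string }

func (m LauncherWorkerMetric) ReplicatedJobs() []JobSpec {
	return []JobSpec{
		{Name: m.App + "-launcher", Containers: []string{m.App}},
		{Name: m.App + "-worker", Containers: []string{m.App}},
	}
}

// BuildJobs asks the metric for its jobs, then hands each job to every addon.
func BuildJobs(m Metric, addons []Addon) []JobSpec {
	jobs := m.ReplicatedJobs()
	for i := range jobs {
		for _, a := range addons {
			a.Apply(&jobs[i])
		}
	}
	return jobs
}

func main() {
	jobs := BuildJobs(LauncherWorkerMetric{App: "lammps"}, []Addon{
		VolumeAddon{Volume: "shared-data"},
		SidecarAddon{Container: "hpctoolkit", Volume: "hpctoolkit-view"},
	})
	for _, j := range jobs {
		fmt.Printf("%s containers=%v volumes=%v\n", j.Name, j.Containers, j.Volumes)
	}
}
```

In this sketch the metric knows nothing about addons, so adding a new kind of customization would only require a new `Addon` implementation.
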
but might as well save the state of them!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
we did not get this completely working before (likely
due to the spack mpi install, as a basic hostname does not work),
so a basic conversion is sufficient

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the test-refactor-design branch from 7edaa43 to 92d93ff Compare September 20, 2023 13:35
also simplify logic of applications - the launcher worker
pattern is generic and can be shared

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
it is accepted that this does not fully work; we need to
come back to it.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
but shared libraries are failing to load. HPCToolkit
you are a jerk. I am laughing. And crying. And mostly
crying.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
…file that is part of it!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
…run post commands

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
if we do not make a copy (reflect) of the interface,
the state seems to change (and persist) between runs. While
I am still worried about this design, this at least seems
to fix that bug. I am also wondering about garbage collection
(e.g., if making the copies means they stay around and the
operator will use increasing memory) but that is still to be
explored.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
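
The state bug described in this commit message can be sketched in Go as follows. This is an illustration under assumptions, not the operator's code: the `Metric` interface, `perfMetric`, `registry`, `GetShared`, and `GetFresh` are hypothetical names. It also allocates a fresh zero value with `reflect.New` rather than copying an existing one, which is one possible reading of "make a copy (reflect) of the interface".

```go
package main

import (
	"fmt"
	"reflect"
)

// Metric is a hypothetical interface for something configured per run.
type Metric interface {
	SetOption(key, value string)
	Options() map[string]string
}

// perfMetric is a hypothetical metric that keeps per-run state.
type perfMetric struct {
	options map[string]string
}

func (m *perfMetric) SetOption(key, value string) {
	if m.options == nil {
		m.options = map[string]string{}
	}
	m.options[key] = value
}

func (m *perfMetric) Options() map[string]string { return m.options }

// registry holds one long-lived instance per metric name.
var registry = map[string]Metric{"perf": &perfMetric{}}

// GetShared returns the registered instance itself, so anything set
// during one run is still there on the next run.
func GetShared(name string) Metric {
	return registry[name]
}

// GetFresh allocates a new zero value of the registered concrete type.
// The registry stores pointers, so Elem() yields the struct type and
// reflect.New returns a pointer to a new, empty struct.
func GetFresh(name string) Metric {
	t := reflect.TypeOf(registry[name]).Elem()
	return reflect.New(t).Interface().(Metric)
}

func main() {
	GetShared("perf").SetOption("rate", "10")
	fmt.Println(len(GetShared("perf").Options())) // option set earlier persists

	GetFresh("perf").SetOption("rate", "10")
	fmt.Println(len(GetFresh("perf").Options())) // each fresh value starts empty
}
```

On the garbage collection question: in Go, a value that nothing references after a run becomes eligible for collection, so in a sketch like this the fresh values would only accumulate if something (a cache or a long-lived slice, say) kept pointers to them. Whether that holds for the operator itself would need to be measured.
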
@vsoch vsoch merged commit 67ad62f into main Sep 24, 2023
21 checks passed
@vsoch vsoch deleted the test-refactor-design branch September 24, 2023 00:03