
Generalize System Status app for use at other sites #92

Open · msquee opened this issue Sep 25, 2020 · 10 comments
msquee commented Sep 25, 2020

More sites have expressed interest in the System Status app [1], but there is still OSC-specific code in the latest version. Let's generalize the status app so it can be dropped in and deployed at other sites.

The ideal scenario would be to support all of the adapters that Open OnDemand supports, but I think focusing on Slurm clusters is a good place to start.

Todo (WIP):

  • Merge gpu_cluster_status.rb into moab_showq_client.rb and rename it to torque_moab_client.rb, omitting any OSC-specific items.
  • Add a Torque-only adapter. Several interested sites have Torque but might not have Moab; we could get the aggregate job information by parsing info_all (which is slower, but not by much) or by calling qstat directly with arguments that display server status (a parsing sketch follows this list):
    Displaying Server Status

    If batch server status is being displayed and the -f option is not specified, the following items are
    displayed on a single line, in the specified order, separated by white space:

         -      the server name
         -      the maximum number of jobs that the server may run concurrently
         -      the total number of jobs currently managed by the server
         -      the status of the server
         -      for each job state, the name of the state and the number of jobs in the server in that state
    
  • Support specifying partitions to display in system status. Perhaps the cluster config could have a custom: systemstatus: partitions: [ serial, parallel ] entry; if it exists, we create a separate graph for each partition and constrain the sinfo/squeue calls to those partitions (a config sketch follows this list).
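
For the Torque-only adapter, here is a minimal sketch of parsing the qstat server status, assuming the usual columnar qstat -B output (a header row, a dashed separator, then one line per server); the TorqueClient name and return shape are illustrative, not existing app code:

    # Hypothetical Torque-only client: shell out to `qstat -B` and turn the
    # columnar server-status line into a hash keyed by the header names.
    require 'open3'

    class TorqueClient
      # Returns a hash such as { "Server" => "torque01", "Run" => "140", ... }
      def server_status
        out, status = Open3.capture2('qstat', '-B')
        raise 'qstat -B failed' unless status.success?

        lines  = out.lines.map(&:strip).reject(&:empty?)
        header = lines[0].split  # e.g. Server Max Tot Que Run Hld Wat Trn Ext Status
        data   = lines[2].split  # first data row, after the dashed separator
        header.zip(data).to_h
      end
    end

The per-state counts in that hash could then feed the same aggregate job numbers the Moab client provides today.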
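And a rough sketch of how the app could pick up the proposed partitions entry, assuming it lives under the cluster config's v2: custom: section; the keys, file name, and sinfo format string are illustrative only:

    # Read the hypothetical custom/systemstatus/partitions list from a
    # cluster config and constrain sinfo to those partitions.
    require 'yaml'

    cluster    = YAML.load_file('/etc/ood/config/clusters.d/mycluster.yml')
    partitions = cluster.dig('v2', 'custom', 'systemstatus', 'partitions') || []

    if partitions.empty?
      # No partitions configured: one graph for the whole cluster, as today.
      puts `sinfo -h -o '%P %t %D'`
    else
      # One graph per configured partition; sinfo is constrained with -p.
      partitions.each do |partition|
        puts "== #{partition} =="
        puts `sinfo -p #{partition} -h -o '%t %D'`
      end
    end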

[1] https://discourse.osc.edu/t/system-status-app/1129

msquee self-assigned this Sep 25, 2020

msquee commented Sep 25, 2020

Would it be best to remove Ganglia and focus on supporting Grafana?


achalker commented Sep 25, 2020 via email

ericfranz commented:

> Make https://github.com/OSC/osc-systemstatus/blob/master/views/layout.erb#L164 configurable via ENV.

Just remove that.

ericfranz commented:

> Support setting custom colors on graphs

What did you have in mind here? I think the MVP might not require this.


msquee commented Sep 25, 2020

@ericfranz It's not required. Since there's support for customization on the Dashboard, we could bring that functionality here eventually.


mcuma commented Sep 25, 2020

We do have Ganglia, but I know next to nothing about it, though we could probably plug it in. I would vote for making it optional, if possible.

What I was mostly after is output from sinfo to see what resources are available at a given time, so that people can decide which cluster and partition to use to get their job running ASAP. For example, for our simplest cluster, sinfo gives this:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lonepeak* up 3-00:00:00 1 drain* lp203
lonepeak* up 3-00:00:00 210 alloc lp[001-089,091-112,133-202,204-232]
lonepeak* up 3-00:00:00 1 idle lp090
lonepeak-shared up 3-00:00:00 1 drain* lp203
lonepeak-shared up 3-00:00:00 210 alloc lp[001-089,091-112,133-202,204-232]
lonepeak-shared up 3-00:00:00 1 idle lp090
lonepeak-guest up 3-00:00:00 21 alloc lp[113-132,233]
lonepeak-shared-guest up 3-00:00:00 21 alloc lp[113-132,233]
liu-lp up 14-00:00:0 20 alloc lp[113-132]
liu-shared-lp up 14-00:00:0 20 alloc lp[113-132]
fischer-lp up 14-00:00:0 1 alloc lp233
fischer-shared-lp up 14-00:00:0 1 alloc lp233

We have two main partitions: "lonepeak", with the synonym "lonepeak-shared" for jobs that can share a node, and the "owner" partition, which consists of the liu-lp and fischer-lp nodes, their shared synonyms, guest access to the owner nodes (lonepeak-guest), and its shared synonym.

So, in the simplest case, we could report the status of the "lonepeak" and "lonepeak-guest" partitions (alloc, idle, drain, mix = partially occupied) and, potentially, how busy each owner partition is, since guests sometimes target specific owner nodes for a smaller chance of preemption (owner jobs preempt guest jobs).
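
For what it's worth, here is a rough sketch (in Ruby, since that is what the app uses) of the kind of per-partition summary I mean, using standard sinfo options (-p to pick a partition, -h to drop the header, -o '%t %D' for state and node count); the partition names are just the two from the listing above:

    # Count nodes by state (alloc, idle, drain*, mix, ...) for each partition.
    partitions = %w[lonepeak lonepeak-guest]

    partitions.each do |partition|
      counts = Hash.new(0)
      `sinfo -p #{partition} -h -o '%t %D'`.each_line do |line|
        state, nodes = line.split
        counts[state] += nodes.to_i
      end
      puts "#{partition}: " + counts.map { |state, n| "#{n} #{state}" }.join(', ')
    end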

I hope this helps you with the generalization strategy, or possibly with making some plugins for site-specific stuff like ours. Feel free to let me know if I can help with anything.

ericfranz commented:

@mcuma we made a few quick changes to the app so that it now at least runs at CHPC. Here is a screenshot:

[screenshot: the System Status app running at CHPC, 2020-09-25 12:36 PM]

If you just get the latest code from the master branch and touch tmp/restart.txt, it should run. If you are updating a previously cloned version, you will need to rm -rf .bundle and rm -rf vendor/bundle.

Now, that said, as you can see from the screenshot, it just builds these graphs for each cluster, not for partitions of a specific cluster. It seems like what you are looking for might be best served by a custom widget, once we are able to easily support that type of thing in OnDemand.

ericfranz commented:

Or maybe we are talking about the same graphs as above, but with the ability to make graphs per partition instead of per cluster, or to pick which cluster and partitions to graph?


mcuma commented Sep 25, 2020

Great, let me try that and let you know how it goes. It looks good enough for now from the screenshot. I may hack around with it to get the two partitions separate (lonepeak, lonepeak-guest) if I get a chance; I should be able to do it from skimming the code.


mcuma commented Sep 25, 2020

I can confirm that System Status works on both our test and production servers. Thanks for getting this fixed so quickly.
