slurmmon web page #5

Open
henkela opened this issue Sep 8, 2014 · 6 comments

Comments

@henkela

henkela commented Sep 8, 2014

Yet another thing...
I like your plugin, which is why I really want to use it, but some things remain unclear to me.
This is about the web page. I installed it to the default path /var/www/html/slurmmon. The PHP files for the graphs are located at /var/www/html/ganglia/graph.d.
When I look at my ganglia web page I see the slurmmond_* reports. But when I open the slurmmon page I only see text and broken images, although the graph.php calls look the same as on my ganglia page.
The whitespace report works, at least for the first 3 columns. The rest is not shown (CPU efficiency and so on). Unfortunately, it is not clear to me how to set the whole thing up so that I can look at the graphs in ganglia and also use the slurmmon web page. Or is it either ganglia OR slurmmon?
I'm thankful for any advice.

@jabrcx
Contributor

jabrcx commented Sep 9, 2014

I appreciate the interest and I want to help make this work for you, too.

The main overview/stats/diagnostics page and the whitespace report are currently very disjoint. Let's start with the main page.

That page only requires the single slurmmond service running somewhere (like the slurm master). The raw metrics it's sending should be showing up on that host's normal ganglia page. Sounds like this is the case, right? There's not much that can go wrong so far, as it's just calling gmetric with no config -- if ganglia monitoring is already working on the host, this should be, too.

Next, there are slurmmon's custom reports, which you have installed to /var/www/html/ganglia/graph.d. That is the same directory as the other built-in reports for your ganglia cluster (cpu_report.php, mem_report.php, etc.), right? It's not unusual for this default location to be wrong -- e.g. we have many ganglia clusters, so we have to install our rpms with a --prefix in order for these custom reports to land in the right place. But it sounds like you can construct direct graph.php calls that work, right?

Lastly, there is the /var/www/html/slurmmon/index.psp page that just links to all those custom reports. It's pretty simple in that it just dynamically constructs and organizes these graph.php links. If the images are broken, something must differ between the links it generates and the direct links you construct manually.

Can you check each one of these steps and let me know where it goes wrong? If you want to compare and contrast, our main slurmmon page is public at https://portal.rc.fas.harvard.edu/slurmmon/.

Then there's the whitespace report:

That is not integrated with ganglia at all and just does a bunch of direct, live querying of slurm with sacct, scontrol, etc. It's strange that the first three columns, which include CPU days wasted, would be working but not CPU efficiency, cores allocated, etc.; they're all basically using the same one or two numbers. When you run the script, does it give any errors? Can you view the html source and see if there is some unclosed tag hiding further info?

It's normal if the job script preview is blank, since slurm doesn't actually provide that. We have a custom job prolog that saves it in a db, and we implemented the corresponding function in the site-specific config.py:

https://github.com/fasrc/slurmmon/blob/master/lib/python/site-packages/slurmmon/config.py#L72

Our whitespace reports are not public, but I'll update the repo README with a link to a redacted screenshot for comparison.

Hope that helps.

@henkela
Author

henkela commented Sep 9, 2014

I've been trying a few things and found that even a direct call:
fionn1/ganglia/graph.php?g=slurm_probejob_pendtime&c=Cluster&h=fionn1&z=default&r=hour
doesn't work. It just shows a blank page.
When I call the report directly (like it is done on the ganglia page) it works:
http://fionn1/ganglia/graph.php?c=Cluster&h=fionn1&m=slurmmond_probejob_pendtime_compute,IB&z=default&r=hour

Unfortunately, I couldn't find the problem, yet.
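The two calls above differ in one query parameter: the first requests a custom report with `g=`, the second a single metric with `m=`. A minimal Python sketch for generating both forms side by side while testing; `graph_url` is a hypothetical helper, and the values are simply copied from the URLs above.

```python
from urllib.parse import urlencode


def graph_url(base, cluster, host, report=None, metric=None,
              size="default", time_range="hour"):
    """Build a ganglia graph.php URL for a report (g=) and/or a metric (m=)."""
    params = {"c": cluster, "h": host}
    if report is not None:
        params["g"] = report
    if metric is not None:
        params["m"] = metric
    params["z"] = size
    params["r"] = time_range
    # Keep commas literal so the output matches the hand-written URLs.
    return base + "/graph.php?" + urlencode(params, safe=",")


base = "http://fionn1/ganglia"
print(graph_url(base, "Cluster", "fionn1", report="slurm_probejob_pendtime"))
print(graph_url(base, "Cluster", "fionn1",
                metric="slurmmond_probejob_pendtime_compute,IB"))
```

Opening both generated URLs in a browser makes it easy to see whether the report form fails while the metric form works.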

@henkela
Author

henkela commented Sep 9, 2014

Ok, finally I found it. I was checking my httpd-configs just to be sure. And it was a misconfig in my ganglia.conf.
Now the slurmmon page works :-)
PS: just the slurm_probejob_pendtime remains broken...

@henkela
Author

henkela commented Sep 9, 2014

I found the "error": when I have two partitions (compute,IB) it doesn't work, and I see "cannot parse vname from 'DEF.....'" in the httpd error log. When I remove the second partition everything is fine. I haven't yet checked whether my "strange" partition name 'IB' is the cause.

@jabrcx
Contributor

jabrcx commented Sep 12, 2014

Hmm, I don't know why the 'IB' name would be a problem. We don't have caps in our names, but other ganglia metrics do. It seems to work fine in slurmmon (no errors in the logs?) but then ganglia/rrdtool can't work with it? /etc/slurmmon has

"probejob_partitions": ["compute", "IB"]

on both the node running slurmmond and on the webserver running the report page?
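One thing that may be worth ruling out: as I understand the rrdgraph documentation, a vname may only contain the characters A-Z, a-z, 0-9, `_` and `-`, up to 255 characters. The metric name in the working URL earlier in this thread contains a comma (`compute,IB`), so if the report uses the metric name as a vname, that alone could trigger the parse error. That the report builds its vnames this way is an assumption, not something confirmed here. A minimal Python sketch of the check, with a hypothetical helper name:

```python
import re

# Character set and length limit for rrdtool vnames, per the rrdgraph docs.
VNAME_RE = re.compile(r"[A-Za-z0-9_-]{1,255}")


def is_valid_vname(name):
    """Return True if name would be acceptable as an rrdtool vname."""
    return VNAME_RE.fullmatch(name) is not None


candidates = [
    "slurmmond_probejob_pendtime_compute",
    "slurmmond_probejob_pendtime_IB",
    "slurmmond_probejob_pendtime_compute,IB",
]
for metric in candidates:
    print(metric, is_valid_vname(metric))
```

Under this check the capitals in 'IB' are fine; only the comma-joined name is rejected.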

@jabrcx
Contributor

jabrcx commented Sep 22, 2014

Any progress on this one? Anything I can help with to get this closer to closed for you?
