powermand: segfault #197
was eventually able to find all the right debuginfo rpms, they were hidden in a different place than i was expecting
so arg is NULL for some reason.
So presumably Initial assumption, I'm wondering if some part of powerman parses nodes of the form Edit: scratch that
|
grabbing the original powerman.conf and running it under simulation mode (remove So the assumption is something "real networking" ( |
running powermand under valgrind with a small-ish config I did see this
so perhaps there is an unhappy thing going on somewhere deep inside powermand or perhaps the hostlist library |
hmmmm
My initial feeling was the if statement above should be I instead tried |
FWIW I added a unit test that used that hostname format with an iterator and didn't get a hit from valgrind. |
The valgrind warnings go away if I do ...
I'm wondering if the setting of Edit: also added value of nranges and got
|
That "fix" seems reasonable to me, do the tests all pass with that? I was also unable to reproduce this locally strangely (I think because in most cases I wonder if in your workaround you set Or alternately, ensure |
Do you have a standalone hostlist test that reproduces this? I could add it to the testsuite and run it under valgrind (also would like to test against libflux-hostlist just in case). |
I don't understand this, can you explain what |
Sorry, that was just my debugging printf output. 399 was |
The tests now pass. I may have had some other debug stuff lingering that made it fail earlier. What's interesting is that the warning from valgrind only appears once, after the very first power query. After that it disappears.
Still trying to get to bottom of this, then I'll try and create one. |
Did you check |
That confirms the suspicion that this occurs when |
Ahhh the warning also goes away with this.
I'm guessing there is a copied hostlist somewhere in powerman that is used just once (or a few times) leading to the warning. If we go with the rule that size should always be greater than nranges, then some fixes probably have to be done elsewhere. I notice
in some places. |
@grondo this appears to hit the corner case in valgrind
basically pushing 16 chunks on the hostlist, equaling This one hits along w/ hostlist_copy.
key w/ the second test is to have > 16 hostranges, so that the copy makes the internal array exactly 17. |
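For anyone following along, here's a minimal sketch of the "size should always be greater than nranges" rule discussed above. This is not the hostlist source: `struct rangelist`, `rangelist_copy()` and `rangelist_peek_next()` are made-up names, and the `int` array is just a stand-in for the internal array of ranges. The assumption is that some iterator-style code peeks at slot `nranges`, so the copy keeps one zeroed spare slot to keep that read in bounds.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy list; NOT the real hostlist struct. */
struct rangelist {
    int size;     /* slots allocated in hr[] */
    int nranges;  /* slots in use */
    int *hr;      /* stand-in for the array of range pointers */
};

/* Copy src, always allocating one spare zeroed slot so that
 * size > nranges holds and a read of hr[nranges] stays in bounds. */
struct rangelist *rangelist_copy(const struct rangelist *src)
{
    struct rangelist *dst;

    assert(src != NULL);
    dst = malloc(sizeof(*dst));
    if (dst == NULL)
        return NULL;
    dst->nranges = src->nranges;
    dst->size = src->nranges + 1;
    dst->hr = calloc((size_t)dst->size, sizeof(*dst->hr));
    if (dst->hr == NULL) {
        free(dst);
        return NULL;
    }
    if (src->nranges > 0)
        memcpy(dst->hr, src->hr, (size_t)src->nranges * sizeof(*dst->hr));
    return dst;
}

/* Iterator-style step that peeks one slot past the current index.
 * With idx == nranges - 1 this reads hr[nranges]. */
int rangelist_peek_next(const struct rangelist *rl, int idx)
{
    assert(idx >= 0 && idx < rl->nranges);
    return rl->hr[idx + 1];
}
```

If the copy instead allocated exactly `nranges` slots (say 17 slots for 17 ranges), the peek on the last index would read one element past the allocation, which is the kind of invalid read valgrind reports even when the value read is never used.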
this appears to fix things using your
So I think this or the prior
both work. do you have any preference to a fix? |
Second one seems simpler 😅 |
Suggestion: pull in the unit tests from lsd-tools. Alternatively, convert to the hostlist in flux and bring in its unit tests. Powerman doesn't have any hostlist unit tests that I can see. |
As an aside, there is a chance that this fix doesn't fix the segfault that happened with this original issue. They are in the same region, which is promising (the hostlist within an arglist). The only thing I can't figure out is that the valgrind warnings are basically about out-of-bounds array reads, but AFAICT the read is not used. Nothing indicates to me that an out-of-bounds array write occurs, leading to some corruption. But there's always the chance the out-of-bounds access causes a side effect that I can't see b/c the |
Good idea, although we have to dig it up. I don't know if it's in any repo we currently have online. The belief is that pdsh's version is the newest / best.
Was talking to @grondo about this, perhaps this is a longer term todo for not just powerman but many chaos projects. The API is different, so it would require changes everywhere. |
So the segfault was hit again, but the newer hostlist wasn't installed. I was able to get more debug information this time:
(just memory refresher, here's offending code area)
it appears that the search for To solve the segfault, I could easily put all of the second branch code inside the first one when we know arg != NULL. Or just check for perhaps there is a subtlety in this area of code I don't understand. I notice in
but it does not use |
ok, i think I have an idea what's happening. I now understand that
I'm betting there is a severe error of some sort on the parent of the pelcap node. That error is being output, so powerman parses it and wants to "populate" that result. The user didn't input that host, so it doesn't get found in the arglist, hence the segfault. Sure enough ...
So I think just checking for |
Problem: In _process_setresult() if the plug name is not found in the inputted arglist, it can result in a segfault. Check that arg is non-NULL before trying to dereference it. Fixes chaos#197
Problem: In _process_setresult() if the plug name is not found in the inputted arglist, it can result in a segfault. This can occur when power operations on dependent targets (e.g. the parent of node needs to be powered on) have errors, leading to an unexpected "power result". Check that arg is non-NULL before trying to dereference it. Fixes chaos#197
Problem: In _process_setresult() if the plug name is not found in the inputted arglist, it can result in a segfault. This can occur when power operations on dependent targets (e.g. the parent of node needs to be powered on) have errors, leading to an unexpected host having a "power result". Check that arg is non-NULL before trying to dereference it. Fixes chaos#197
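A minimal sketch of the guard described in the commit message above, under stated assumptions: `struct arg`, `find_arg()` and `set_result()` are hypothetical stand-ins, not powerman's actual arglist API or the body of `_process_setresult()`. The only point is that a lookup which can miss returns NULL, and the caller has to check before dereferencing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-node entry; not powerman's real Arg type. */
struct arg {
    const char *node;  /* node name the user asked about */
    int state;         /* last power result recorded for it */
};

/* Linear lookup; returns NULL when the node was not requested. */
struct arg *find_arg(struct arg *args, size_t n, const char *node)
{
    size_t i;

    assert(node != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(args[i].node, node) == 0)
            return &args[i];
    }
    return NULL;
}

/* Record a result for node. Returns 0 on success, or -1 if the
 * node is not in the list (e.g. a result reported for a parent
 * the user never asked about) instead of dereferencing NULL. */
int set_result(struct arg *args, size_t n, const char *node, int state)
{
    struct arg *a = find_arg(args, n, node);

    if (a == NULL)
        return -1;
    a->state = state;
    return 0;
}
```

With a list holding only `node1` and `node2`, a result arriving for some other name returns -1 and leaves the existing entries untouched, rather than crashing.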
A crash of powermand recently occurred w/ either of the following
admins were able to get a core dump but for the life of me I cannot get debug symbols from powermand to get a trace. Not clear if it wasn't built with debug symbols or rpmbuild messed up and didn't build debuginfo rpm.
The problem disappeared when the hostnames for this machine happened to be changed.
Anyway, perhaps the best thing to try is to emulate that machine's powerman.conf and create a temp /etc/hosts file with the old hostnames, then see if we can reproduce under the development branch. The assumption is that there's something unique to the hostname formatting.