Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

powermand: segfault #197

Closed
chu11 opened this issue Aug 21, 2024 · 22 comments · Fixed by #201
Closed

powermand: segfault #197

chu11 opened this issue Aug 21, 2024 · 22 comments · Fixed by #201
Assignees

Comments

@chu11
Copy link
Member

chu11 commented Aug 21, 2024

A crash of powermand recently occurred w/ either of the following

  pm -0 "pclstr[201-896]-p"
  pm -0 "peclstr[201-896]-s"

admins were able to get a core dump but for the life of me I cannot get debug symbols from powermand to get a trace. Not clear if it wasn't built with debug symbols or rpmbuild messed up and didn't build debuginfo rpm.

The problem disappeared when the hostnames for this machine happened to be changed.

Anyways, perhaps best thing to try is to try and emulate that machine's powerman.conf and create a temp /etc/hosts file with the old hostnames and see if we can reproduce under development branch. The assumption is there's something unique to the hostname formatting.

@chu11 chu11 changed the title powerman: segfault powermand: segfault Aug 21, 2024
@chu11
Copy link
Member Author

chu11 commented Aug 22, 2024

was eventually able to find all the right debuginfo rpms, they were hidden in a different place than i was expecting

(gdb) bt
#0  snprintf (__fmt=0x555555572235 "%s", __n=1024, __s=0x7fffffffe700 "") at /usr/include/bits/stdio2.h:68
#1  _process_setresult (act=act@entry=0x55557785ccb0, e=<optimized out>, dev=<optimized out>, dev=<optimized out>) at device.c:1183
#2  0x000055555555b8cc in _process_stmt (timeout=0x7fffffffec10, e=0x555577d128a0, act=0x55557785ccb0, dev=0x555559460dd0) at device.c:945
#3  _process_action (timeout=<optimized out>, dev=<optimized out>) at device.c:871
#4  dev_post_poll (pfd=0x5555557b31e0, timeout=0x7fffffffec10) at device.c:1618
#5  0x00005555555572cb in _select_loop () at powermand.c:189
#6  main (argc=<optimized out>, argv=<optimized out>) at powermand.c:129
(gdb) up
#1  _process_setresult (act=act@entry=0x55557785ccb0, e=<optimized out>, dev=<optimized out>, dev=<optimized out>) at device.c:1183
1183	                snprintf(strbuf, sizeof(strbuf), "%s", arg->val);
(gdb) p arg
$2 = (Arg *) 0x0

so arg is NULL for some reason.

            if ((arg = arglist_find(act->arglist, plug->node))) {
                arg->result = result;
                xfree(arg->val);
                arg->val = xstrdup(str);
            }

            if (result != RT_SUCCESS) {
                char strbuf[1024];
                snprintf(strbuf, sizeof(strbuf), "%s", arg->val);
                /* remove trailing carriage return or newline */
                strbuf[strcspn(strbuf, "\r\n")] = '\0';
                act->dpf_fun(act->client_id, "%s: %s", arg->node, strbuf);
            }

So presumably arglist_find() didn't find what was intended.

Initial assumption, I'm wondering if some part of powerman parses nodes of the form pclstrXXX-p incorrectly, like the hyphen messes things up. And thus messes up this parsing.

Edit:

scratch that

(gdb) p *plug
$3 = {name = 0x5555594978f0 "Enclosure", node = 0x5555594b2c60 "clustr-rak569"}

@chu11
Copy link
Member Author

chu11 commented Aug 22, 2024

grabbing the original powerman.conf and running it under simulation mode (remove --resolve-hosts and add --test-mode to each redifshpower line) did not reproduce the issue.

So the assumption is something "real networking" (--resolve-hosts + libcurl, etc.) contributed to this issue.

@chu11
Copy link
Member Author

chu11 commented Aug 22, 2024

running powermand under valgrind with a small-ish config I did see this

==2241718== Invalid read of size 8
==2241718==    at 0x40FCC4: _iterator_advance (hostlist.c:2281)
==2241718==    by 0x40FCC4: hostlist_next (hostlist.c:2315)
==2241718==    by 0x40240D: arglist_create (arglist.c:87)
==2241718==    by 0x402AEB: _create_command (client.c:533)
==2241718==    by 0x4032AF: _parse_input (client.c:617)
==2241718==    by 0x4032AF: _handle_input (client.c:1085)
==2241718==    by 0x403FD8: cli_post_poll (client.c:1176)
==2241718==    by 0x4020BF: _select_loop (powermand.c:188)
==2241718==    by 0x4020BF: main (powermand.c:129)
==2241718==  Address 0xbb5c430 is 0 bytes after a block of size 3,200 alloc'd
==2241718==    at 0x4C40B77: realloc (vg_replace_malloc.c:1690)
==2241718==    by 0x40CC51: hostlist_resize (hostlist.c:1105)
==2241718==    by 0x40DA2F: hostlist_copy (hostlist.c:1573)
==2241718==    by 0x402B1C: _create_command (client.c:519)
==2241718==    by 0x4032AF: _parse_input (client.c:617)
==2241718==    by 0x4032AF: _handle_input (client.c:1085)
==2241718==    by 0x403FD8: cli_post_poll (client.c:1176)
==2241718==    by 0x4020BF: _select_loop (powermand.c:188)
==2241718==    by 0x4020BF: main (powermand.c:129)
==2241718== 
==2241718== Invalid read of size 8
==2241718==    at 0x40FCC4: _iterator_advance (hostlist.c:2281)
==2241718==    by 0x40FCC4: hostlist_next (hostlist.c:2315)
==2241718==    by 0x402530: arglist_next (arglist.c:146)
==2241718==    by 0x40397F: _client_query_status_reply (client.c:424)
==2241718==    by 0x40397F: _act_finish (client.c:757)
==2241718==    by 0x406627: _process_action (device.c:895)
==2241718==    by 0x406627: dev_post_poll (device.c:1618)
==2241718==    by 0x4020CA: _select_loop (powermand.c:189)
==2241718==    by 0x4020CA: main (powermand.c:129)
==2241718==  Address 0xbc5f060 is 0 bytes after a block of size 3,200 alloc'd
==2241718==    at 0x4C40B77: realloc (vg_replace_malloc.c:1690)
==2241718==    by 0x40CC51: hostlist_resize (hostlist.c:1105)
==2241718==    by 0x40DA2F: hostlist_copy (hostlist.c:1573)
==2241718==    by 0x402425: arglist_create (arglist.c:94)
==2241718==    by 0x402AEB: _create_command (client.c:533)
==2241718==    by 0x4032AF: _parse_input (client.c:617)
==2241718==    by 0x4032AF: _handle_input (client.c:1085)
==2241718==    by 0x403FD8: cli_post_poll (client.c:1176)
==2241718==    by 0x4020BF: _select_loop (powermand.c:188)
==2241718==    by 0x4020BF: main (powermand.c:129)
==2241718== 

so perhaps there is a unhappy thing going on somewhere deep inside powermand or perhaps the hostlist library

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

hmmmm

static void _iterator_advance(hostlist_iterator_t i)
{
    assert(i != NULL);
    assert(i->magic == HOSTLIST_MAGIC);
    if (i->idx > i->hl->nranges - 1)
        return;
    if (++(i->depth) > (i->hr->hi - i->hr->lo)) {
        i->depth = 0;
        i->hr = i->hl->hr[++i->idx];  // this line Invalid read of size 8
    }
}

My initial feeling was the if statement above should be if (i->idx >= (i->hl->nranges - 1)) to ensure ++i->idx does not go off the end of the array, but that lead to a bunch of test failures, likely b/c I don't understand the depth/hi/lo logic.

I instead tried assert (i->idx < (i->hl->size - 1)); right before the offending line. Presumably that should ensure we never go off the end of the array. But I also get test errors, suggesting the assert is hit. Might need to get some help here.

@garlick
Copy link
Member

garlick commented Aug 23, 2024

FWIW I added a unit test that used that hostname format with an iterator and didn't get a hit from valgrind.
(Test is on my issue#197 branch, commit a1438ef)

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

The valgrind warnings go away if I do ...

diff --git a/src/liblsd/hostlist.c b/src/liblsd/hostlist.c
index 2ab4c045..f5e163a5 100644
--- a/src/liblsd/hostlist.c
+++ b/src/liblsd/hostlist.c
@@ -2278,6 +2278,11 @@ static void _iterator_advance(hostlist_iterator_t i)
         return;
     if (++(i->depth) > (i->hr->hi - i->hr->lo)) {
         i->depth = 0;
+        if (i->idx >= (i->hl->size - 1)) {
+            printf ("i think i'll array out of bounds %d %d\n", i->idx, i->hl->size);
+            ++i->idx;
+            return;
+        }
         i->hr = i->hl->hr[++i->idx];
     }
 }

I'm wondering if the setting of i->hr to some out of bounds value has some side effect later on in the code.

Edit:

also added value of nranges and got

i think i'll array out of bounds 399 400 400 outputting. So original code appears to have tried to go out of range. but some powerman tests are failing, so there is some side effect.

@grondo
Copy link
Member

grondo commented Aug 23, 2024

That "fix" seems reasonable to me, do the tests all pass with that?

I was also unable to reproduce this locally strangely (I think because in most cases hl->size (the number of entries allocated in the hl->hr array) is much larger than hl->nranges (the number of active hostranges)) so hl->hr == NULL in the common case and no out of bounds is used.

I wonder if in your workaround you set i->hr = NULL if that will allow your tests to pass?

Or alternately, ensure hl->size > hl->nranges when reallocing hl->hr.

@grondo
Copy link
Member

grondo commented Aug 23, 2024

Do you have a standalone hostlist test that reproduces this? I could add it to the testsuite and run it under valgrind (also would like to test against libflux-hostlist just in case).

@grondo
Copy link
Member

grondo commented Aug 23, 2024

also added value of nranges and got

i think i'll array out of bounds 399 400 400 outputting. So original code appears to have tried to go out of range. but some powerman tests are failing, so there is some side effect.

I don't understand this, can you explain what i think i'll array out of bounds ... means?

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

I don't understand this, can you explain what i think i'll array out of bounds ... means?

Sorry, that was just my debugging printf otuput. 399 was i->idx, 400 was i->hl->size and 400 was i->hl->nranges. The output was right before the ++i->idx.

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

That "fix" seems reasonable to me, do the tests all pass with that?

The tests now pass. I may have had some other debug stuff lingering that made it fail earlier.

what's interesting is that the warning from valgrind only appears once after the very first power query. after that it disappears.
One would think it should happen a lot, making think there is some side effect cause elsewhere. Or perhaps it only happens on the "copied" hostlist once but there is no issue on the original (or vice versa).

Do you have a standalone hostlist test that reproduces this? I could add it to the testsuite and run it under valgrind (also would like to test against libflux-hostlist just in case).

Still trying to get to bottom of this, then I'll try and create one.

@grondo
Copy link
Member

grondo commented Aug 23, 2024

what's interesting is that the warning from valgrind only appears once after the very first power query. after that it disappears.
One would think it should happen a lot, making think there is some side effect cause elsewhere. Or perhaps it only happens on the "copied" hostlist once but there is no issue on the original (or vice versa).

Did you check hl->size vs hl->nranges as discussed above? I think it is rare for the number of ranges to exactly match the size such that the invalid read occurs.

@grondo
Copy link
Member

grondo commented Aug 23, 2024

Sorry, that was just my debugging printf otuput. 399 was i->idx, 400 was i->hl->size and 400 was i->hl->nranges. The output was right before the ++i->idx.

That confirms the suspicion that this occurs when hl->size == hl->nranges. Another solution would be always ensure size > nranges, then i->hr will be set to NULL when going beyond nranges. This was probably the (incorrect) assumption in the code.

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

That confirms the suspicion that this occurs when hl->size == hl->nranges. Another solution would be always ensure size > nranges, then i->hr will be set to NULL when going beyond nranges. This was probably the (incorrect) assumption in the code.

Ahhh the warning also goes away with this.

--- a/src/liblsd/hostlist.c
+++ b/src/liblsd/hostlist.c
@@ -1570,7 +1570,7 @@ hostlist_t hostlist_copy(const hostlist_t hl)
     new->nranges = hl->nranges;
     new->nhosts = hl->nhosts;
     if (new->nranges > new->size)
-        hostlist_resize(new, new->nranges);
+        hostlist_resize(new, new->nranges + HOSTLIST_CHUNK);
 
     for (i = 0; i < hl->nranges; i++)
         new->hr[i] = hostrange_copy(hl->hr[i]);

I'm guessing there is a copied hostlist somewhere in powerman that is used just once (or a few times) leading to the warning.

If we go with the rule that size should always be greater than nranges, then some fixes probably have to be done elsewhere. I notice

    if (hl->size == hl->nranges && !hostlist_expand(hl))                                                                                    
        goto error; 

in some places.

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

@grondo this appears to hit the corner case in valgrind

#include <stdio.h>
#include <stdlib.h>

#include "hostlist.h"

int
main()
{
  hostlist_t hl = hostlist_create(NULL);
  hostlist_iterator_t itr;
  char *h;
  hostlist_push(hl, "fooA[0-80]");
  hostlist_push(hl, "fooB[0-80]");
  hostlist_push(hl, "fooC[0-80]");
  hostlist_push(hl, "fooD[0-80]");
  hostlist_push(hl, "fooE[0-80]");
  hostlist_push(hl, "fooF[0-80]");
  hostlist_push(hl, "fooG[0-80]");
  hostlist_push(hl, "fooH[0-80]");
  hostlist_push(hl, "fooI[0-80]");
  hostlist_push(hl, "fooJ[0-80]");
  hostlist_push(hl, "fooK[0-80]");
  hostlist_push(hl, "fooL[0-80]");
  hostlist_push(hl, "fooM[0-80]");
  hostlist_push(hl, "fooN[0-80]");
  hostlist_push(hl, "fooO[0-80]");
  hostlist_push(hl, "fooP[0-80]");
  itr = hostlist_iterator_create (hl);
  while ((h = hostlist_next (itr)))
      ;
  hostlist_iterator_destroy (itr);
  hostlist_destroy (hl);
}

basically pushing 16 chunks on the hostlist, equaling HOSTLIST_CHUNK internally.

This one hits along w/ hostlist_copy.

#include <stdio.h>
#include <stdlib.h>

#include "hostlist.h"

int
main()
{
  hostlist_t hl = hostlist_create(NULL);
  hostlist_t cpy;
  hostlist_iterator_t itr;
  char *h;
  hostlist_push(hl, "fooA[0-80]");
  hostlist_push(hl, "fooB[0-80]");
  hostlist_push(hl, "fooC[0-80]");
  hostlist_push(hl, "fooD[0-80]");
  hostlist_push(hl, "fooE[0-80]");
  hostlist_push(hl, "fooF[0-80]");
  hostlist_push(hl, "fooG[0-80]");
  hostlist_push(hl, "fooH[0-80]");
  hostlist_push(hl, "fooI[0-80]");
  hostlist_push(hl, "fooJ[0-80]");
  hostlist_push(hl, "fooK[0-80]");
  hostlist_push(hl, "fooL[0-80]");
  hostlist_push(hl, "fooM[0-80]");
  hostlist_push(hl, "fooN[0-80]");
  hostlist_push(hl, "fooO[0-80]");
  hostlist_push(hl, "fooP[0-80]");
  hostlist_push(hl, "fooQ[0-80]");
  cpy = hostlist_copy (hl);
  itr = hostlist_iterator_create (cpy);
  while ((h = hostlist_next (itr)))
      ;
  hostlist_iterator_destroy (itr);
  hostlist_destroy (hl);
}

key w/ the second test is to have > 16 hostranges, so that the copy makes the internal array exactly 17.

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

@grondo

this appears to fix things using your hl->size > hl->nranges requirement.

diff --git a/src/liblsd/hostlist.c b/src/liblsd/hostlist.c
index 4324831c..aa4fbd00 100644
--- a/src/liblsd/hostlist.c
+++ b/src/liblsd/hostlist.c
@@ -1137,7 +1137,7 @@ static int hostlist_push_range(hostlist_t hl, hostrange_t hr)
 
     tail = (hl->nranges > 0) ? hl->hr[hl->nranges-1] : hl->hr[0];
 
-    if (hl->size == hl->nranges && !hostlist_expand(hl))
+    if ((hl->size - 1) == hl->nranges && !hostlist_expand(hl))
         goto error;
 
     if (hl->nranges > 0
@@ -1192,7 +1192,7 @@ static int hostlist_insert_range(hostlist_t hl, hostrange_t hr, int n)
     if (n > hl->nranges)
         return 0;
 
-    if (hl->size == hl->nranges && !hostlist_expand(hl))
+    if ((hl->size - 1) == hl->nranges && !hostlist_expand(hl))
         return 0;
 
     /* copy new hostrange into slot "n" in array */
@@ -1570,7 +1570,7 @@ hostlist_t hostlist_copy(const hostlist_t hl)
     new->nranges = hl->nranges;
     new->nhosts = hl->nhosts;
     if (new->nranges > new->size)
-        hostlist_resize(new, new->nranges);
+        hostlist_resize(new, new->nranges + 1);
 
     for (i = 0; i < hl->nranges; i++)
         new->hr[i] = hostrange_copy(hl->hr[i]);
@@ -2442,7 +2442,7 @@ static int hostset_insert_range(hostset_t set, hostrange_t hr)
 
     hl = set->hl;
 
-    if (hl->size == hl->nranges && !hostlist_expand(hl))
+    if ((hl->size - 1) == hl->nranges && !hostlist_expand(hl))
         return 0;
 
     nhosts = hostrange_count(hr);

So I think this or the prior

diff --git a/src/liblsd/hostlist.c b/src/liblsd/hostlist.c
index 2ab4c045..f5e163a5 100644
--- a/src/liblsd/hostlist.c
+++ b/src/liblsd/hostlist.c
@@ -2278,6 +2278,11 @@ static void _iterator_advance(hostlist_iterator_t i)
         return;
     if (++(i->depth) > (i->hr->hi - i->hr->lo)) {
         i->depth = 0;
+        if (i->idx >= (i->hl->size - 1)) {
+            ++i->idx;
+            i->hr = NULL;
+            return;
+        }
         i->hr = i->hl->hr[++i->idx];
     }
 }

both work.

do you have any preference to a fix?

@grondo
Copy link
Member

grondo commented Aug 23, 2024

Second one seems simpler 😅

@garlick
Copy link
Member

garlick commented Aug 23, 2024

Suggestion: pull in the unit tests from lsd-tools.

Alternatively, convert to the hostlist in flux and bring in its unit tests.

Powerman doesn't have any hostlist unit tests that I can see.

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

As an aside, there is a chance that this fix doesn't fix the segfault that happened with this original issue. They are in the same region, which is promsing (the hostlist within an arglist).

The only thing I can't figure out is that the valgrind warnings are basically on array out of bounds read errors occurring, but AFAICT the read is not used. Nothing indicates to me that an out of array bounds write occurs, leading to some corruption. But there's always the chance there's a side effect of the out of bounds access causes that I can't see b/c the i->hr is left non-NULL and pointing to something bad.

@chu11
Copy link
Member Author

chu11 commented Aug 23, 2024

Suggestion: pull in the unit tests from lsd-tools.

Good idea, although we have to dig it up. I don't know if it's in any repo we currently have online. The belief is that pdsh's version is the newest / best.

Alternatively, convert to the hostlist in flux and bring in its unit tests.

Was talking to @grondo about this, perhaps this is a longer term todo for not just powerman but many chaos projects. The API is different, so it would require changes everywhere.

@chu11
Copy link
Member Author

chu11 commented Sep 9, 2024

So the segfault was hit again, but the newer hostlist wasn't installed. I was able to get more debug information this time:

(gdb) bt
#0  snprintf (__fmt=0x555555572235 "%s", __n=1024, __s=0x7fffffffe700 "") at /usr/include/bits/stdio2.h:68
#1  _process_setresult (act=act@entry=0x555575a31710, e=<optimized out>, dev=<optimized out>, dev=<optimized out>) at device.c:1183
#2  0x000055555555b8cc in _process_stmt (timeout=0x7fffffffec10, e=0x5555781d55d0, act=0x555575a31710, dev=0x555558123a20) at device.c:945
#3  _process_action (timeout=<optimized out>, dev=<optimized out>) at device.c:871
#4  dev_post_poll (pfd=0x5555557b31e0, timeout=0x7fffffffec10) at device.c:1618
#5  0x00005555555572cb in _select_loop () at powermand.c:189
#6  main (argc=<optimized out>, argv=<optimized out>) at powermand.c:129
(gdb) up
#1  _process_setresult (act=act@entry=0x555575a31710, e=<optimized out>, dev=<optimized out>, dev=<optimized out>) at device.c:1183
1183	                snprintf(strbuf, sizeof(strbuf), "%s", arg->val);
(gdb) p arg
$1 = (Arg *) 0x0
(gdb) p *plug
$3 = {name = 0x5555581236b0 "Enclosure", node = 0x555558175740 "elcap-cmm390"}
(gdb) p *act->arglist
$6 = {args = 0x555577a937d0, hl = 0x5555557ab8d0, refcount = 9}
(gdb) p *act->arglist->hl
$7 = {magic = 57005, size = 16, nranges = 5, nhosts = 559, hr = 0x5555557b25e0, ilist = 0x0}
(gdb) p *(act->arglist->hl->hr[0])
$12 = {prefix = 0x5555765b99e0 "pelcap", lo = 201, hi = 416, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[1])
$13 = {prefix = 0x555577130020 "pelcap", lo = 425, hi = 764, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[2])
$14 = {prefix = 0x5555778c3c40 "pelcap", lo = 767, hi = 767, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[3])
$15 = {prefix = 0x5555774f3520 "pelcap", lo = 781, hi = 781, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[4])
$16 = {prefix = 0x5555769d5dc0 "pelcap", lo = 837, hi = 837, width = 3, singlehost = 0}
(gdb) p *(act->arglist->hl->hr[5])
Cannot access memory at address 0x0

(just memory refresher, here's offending code area)

            if ((arg = arglist_find(act->arglist, plug->node))) {
                arg->result = result;
                xfree(arg->val);
                arg->val = xstrdup(str);
            }

            if (result != RT_SUCCESS) {
                char strbuf[1024];
                snprintf(strbuf, sizeof(strbuf), "%s", arg->val);
                /* remove trailing carriage return or newline */
                strbuf[strcspn(strbuf, "\r\n")] = '\0';
                act->dpf_fun(act->client_id, "%s: %s", arg->node, strbuf);
            }

it appears that the search for elcap-cmm390 is happening in the wrong argument list (all the hosts are pelcap hots). Thus arg is NULL, leading to segfault in the snprintf() line.

To solve the segfault, I could easily put all of the second branch code inside the first one when we know arg != NULL. Or just check for arg != NULL && arg->val != NULL. However, I don't understand the consequences at the moment.

perhaps there is a subtlety in this area of code I don't understand. I notice in _process_setplugstate() is similarly does

            if ((arg = arglist_find(act->arglist, plug->node))) {
                arg->state = state;
                xfree(arg->val);
                arg->val = xstrdup(str);
            }

but it does not use arg afterwards. Under what scenarios would arglist_find() fail and it be "normal / ok".

@chu11
Copy link
Member Author

chu11 commented Sep 9, 2024

ok, i think I have an idea what's happening. I now understand that act->arglist is the list of inputted nodes that the user passed in on the command line, which is a mountain of pelcap nodes in this case.

elcap-cmm390 wasn't input on the command line, it is a "parent" node.

I'm betting there is a severe error of some sort on the parent of the pelcap node. That error is being output, thus powerman parses it, wants to "populate" that result. The user didn't input that host, thus didn't get found, thus segfault. Sure enough ...

(gdb) p str
$1 = 0x555579a8d970 "Couldn't connect to server\n"

So I think just checking for arg != NULL is more than enough to avoid the segfault. Should investigate other situations in redfishpower where these "severe error" outputs occur and should consider removing them.

@chu11 chu11 self-assigned this Sep 9, 2024
chu11 added a commit to chu11/powerman that referenced this issue Sep 9, 2024
Problem: In _process_setresult() if the plug name is not found
in the inputted arglist, it can result in a segfault.

Check that arg is non-NULL before trying to dereference it.

Fixes chaos#197
chu11 added a commit to chu11/powerman that referenced this issue Sep 9, 2024
Problem: In _process_setresult() if the plug name is not found
in the inputted arglist, it can result in a segfault.  This
can occur when power operations on dependent targets (e.g. the
parent of node needs to be powered on) have errors, leading to
an unexpected "power result".

Check that arg is non-NULL before trying to dereference it.

Fixes chaos#197
chu11 added a commit to chu11/powerman that referenced this issue Sep 9, 2024
Problem: In _process_setresult() if the plug name is not found
in the inputted arglist, it can result in a segfault.  This
can occur when power operations on dependent targets (e.g. the
parent of node needs to be powered on) have errors, leading to
an unexpected host having a "power result".

Check that arg is non-NULL before trying to dereference it.

Fixes chaos#197
@mergify mergify bot closed this as completed in #201 Sep 10, 2024
@mergify mergify bot closed this as completed in dae192c Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants