-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-10622 pool: Fix ds_pool_get_version for NULL sp_map #9277
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
ms resilience: DAOS-11073 |
Makito and Samir observed the following assertion failure after restarting engines. #0 raise () from /lib64/libc.so.6 #1 abort () from /lib64/libc.so.6 #2 __assert_fail_base () from /lib64/libc.so.6 #3 __assert_fail () from /lib64/libc.so.6 #4 pool_map_get_version (map=0x0) at src/common/pool_map.c:2852 #5 ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at src/include/daos_srv/pool.h:296 #6 pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at src/rebuild/srv.c:2101 #7 rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at src/rebuild/scan.c:954 #8 crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654 #9 ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at arch/abtd_ythread.c:21 #10 make_fcontext () from /usr/lib64/libabt.so.1 #11 ?? () The ds_pool_get_version call passed a NULL map argument to pool_map_get_version. The ds_pool.sp_map field may be NULL after the pool is started but before the pool receives the initial pool map from the pool service. This patch fixes ds_pool_get_version to return 0, which is less than all valid pool map versions, when sp_map is NULL, resulting in rebuild retries like this: Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device or resource busy') Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device or resource busy') Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...] Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...] Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...] Target[2] (rank 2 idx 0 status 16 ver 1) is excluded. Also, this patch removes some rebuild code that handles NULL ds_pool.sp_group fields. Those can not happen as we always initialize sp_group (as well as sp_iv_ns) before putting a ds_pool object into the LRU. Signed-off-by: Li Wei <wei.g.li@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
|
||
ABT_rwlock_rdlock(pool->sp_lock); | ||
ver = pool_map_get_version(pool->sp_map); | ||
if (pool->sp_map != NULL) | ||
ver = pool_map_get_version(pool->sp_map); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Note] I'm considering to redesign the management of ds_pool objects with a new "status" field, so that we'll be able to hide a ds_pool object who hasn't been fully ready yet. Then most of the code will be able to assume that sp_map is always valid, like sp_group. Hence, the current patch serves as a short-term fix, for this particular segfault seems easy to reproduce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return 0 is a bit obscure, probably let the caller check sp_map explicitly is clearer? Hmm, given the initial pool map version is 1, probably return 0 is ok, though probably worth D_INFO here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wangdi1, my original design of the "remove orterun dependency" was to let every user of sp_map to check if it is NULL. Because of bugs like this, now I think that was probably bad decision. Hence the [Note] comment above.
We already use 0 to indicate an uninitialized map elsewhere in the code, I think. Since 0 < any valid map version, it seems to work consistently.
Looks like Fan Yong and Wang Di are both unavailable at the moment; change reviewers. |
Ah, Wang Di is available. Sorry for the noise. |
Makito and Samir observed the following assertion failure after restarting engines. #0 raise () from /lib64/libc.so.6 #1 abort () from /lib64/libc.so.6 #2 __assert_fail_base () from /lib64/libc.so.6 #3 __assert_fail () from /lib64/libc.so.6 #4 pool_map_get_version (map=0x0) at src/common/pool_map.c:2852 #5 ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at src/include/daos_srv/pool.h:296 #6 pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at src/rebuild/srv.c:2101 #7 rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at src/rebuild/scan.c:954 #8 crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654 #9 ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at arch/abtd_ythread.c:21 #10 make_fcontext () from /usr/lib64/libabt.so.1 #11 ?? () The ds_pool_get_version call passed a NULL map argument to pool_map_get_version. The ds_pool.sp_map field may be NULL after the pool is started but before the pool receives the initial pool map from the pool service. This patch fixes ds_pool_get_version to return 0, which is less than all valid pool map versions, when sp_map is NULL, resulting in rebuild retries like this: Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device or resource busy') Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device or resource busy') Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...] Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...] Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...] Target[2] (rank 2 idx 0 status 16 ver 1) is excluded. Also, this patch removes some rebuild code that handles NULL ds_pool.sp_group fields. Those can not happen as we always initialize sp_group (as well as sp_iv_ns) before putting a ds_pool object into the LRU. Signed-off-by: Li Wei <wei.g.li@intel.com>
Makito and Samir observed the following assertion failure after restarting engines. #0 raise () from /lib64/libc.so.6 #1 abort () from /lib64/libc.so.6 #2 __assert_fail_base () from /lib64/libc.so.6 #3 __assert_fail () from /lib64/libc.so.6 #4 pool_map_get_version (map=0x0) at src/common/pool_map.c:2852 #5 ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at src/include/daos_srv/pool.h:296 #6 pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at src/rebuild/srv.c:2101 #7 rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at src/rebuild/scan.c:954 #8 crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654 #9 ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at arch/abtd_ythread.c:21 #10 make_fcontext () from /usr/lib64/libabt.so.1 #11 ?? () The ds_pool_get_version call passed a NULL map argument to pool_map_get_version. The ds_pool.sp_map field may be NULL after the pool is started but before the pool receives the initial pool map from the pool service. This patch fixes ds_pool_get_version to return 0, which is less than all valid pool map versions, when sp_map is NULL, resulting in rebuild retries like this: Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device or resource busy') Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device or resource busy') Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2 Rebuild [started] (pool 3bf68c9c ver=2) Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...] Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...] Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...] Target[2] (rank 2 idx 0 status 16 ver 1) is excluded. Also, this patch removes some rebuild code that handles NULL ds_pool.sp_group fields. Those can not happen as we always initialize sp_group (as well as sp_iv_ns) before putting a ds_pool object into the LRU. Signed-off-by: Li Wei <wei.g.li@intel.com>
Makito and Samir observed the following assertion failure after
restarting engines.
The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service. This patch fixes ds_pool_get_version to return 0,
which is less than all valid pool map versions, when sp_map is NULL,
resulting in rebuild retries like this:
Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. Those can not happen as we always initialize
sp_group (as well as sp_iv_ns) before putting a ds_pool object into the
LRU.
Signed-off-by: Li Wei wei.g.li@intel.com