DAOS-10622 pool: Fix ds_pool_get_version for NULL sp_map (#9277) #9753
Makito and Samir observed the following assertion failure after
restarting engines.
#0 raise () from /lib64/libc.so.6
#1 abort () from /lib64/libc.so.6
#2 __assert_fail_base () from /lib64/libc.so.6
#3 __assert_fail () from /lib64/libc.so.6
#4 pool_map_get_version (map=0x0) at src/common/pool_map.c:2852
#5 ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at
src/include/daos_srv/pool.h:296
#6 pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at
src/rebuild/srv.c:2101
#7 rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at
src/rebuild/scan.c:954
#8 crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654
#9 ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at
arch/abtd_ythread.c:21
#10 make_fcontext () from /usr/lib64/libabt.so.1
#11 ?? ()
The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service. This patch fixes ds_pool_get_version to return 0,
which is less than all valid pool map versions, when sp_map is NULL,
resulting in rebuild retries like this:
Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
Rebuild [started] (pool 3bf68c9c ver=2)
Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
or resource busy')
Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
Rebuild [started] (pool 3bf68c9c ver=2)
Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
or resource busy')
Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
Rebuild [started] (pool 3bf68c9c ver=2)
Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...]
Target[2] (rank 2 idx 0 status 16 ver 1) is excluded.
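The NULL-map handling described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the struct layouts are simplified stand-ins, and rebuild_check_version and SKETCH_DER_BUSY are hypothetical names introduced only to show how a version of 0 could lead a caller to request a retry. The real rebuild code path may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real DAOS types (hypothetical layouts). */
struct pool_map {
	uint32_t version;
};

struct ds_pool {
	struct pool_map *sp_map; /* NULL until the first pool map arrives */
};

/* Hypothetical stand-in for the DER_BUSY value shown in the log above. */
#define SKETCH_DER_BUSY 1012

/* Like the function in the backtrace, this asserts on a NULL map. */
static uint32_t
pool_map_get_version(struct pool_map *map)
{
	assert(map != NULL);
	return map->version;
}

/* Return 0, which is below every valid map version, when no map is set. */
static uint32_t
ds_pool_get_version(struct ds_pool *pool)
{
	if (pool->sp_map == NULL)
		return 0;
	return pool_map_get_version(pool->sp_map);
}

/* Hypothetical caller: ask for a retry while the local map is too old. */
static int
rebuild_check_version(struct ds_pool *pool, uint32_t rebuild_ver)
{
	if (ds_pool_get_version(pool) < rebuild_ver)
		return -SKETCH_DER_BUSY;
	return 0;
}
```

Under these assumptions, a pool that has not yet received its map reports version 0, so a rebuild at ver=2 is told to retry rather than tripping the assertion; once the map arrives, the same check passes.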
Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. Such NULL fields cannot occur, because we
always initialize sp_group (as well as sp_iv_ns) before putting a
ds_pool object into the LRU.
Signed-off-by: Li Wei <wei.g.li@intel.com>