DAOS-10622 pool: Fix ds_pool_get_version for NULL sp_map (#9277) #9753

Merged: @jolivier23 merged 1 commit into release/2.2 from liw/early-scan-2.2 on Jul 21, 2022

Conversation

@liw (Contributor) commented Jul 20, 2022

Makito and Samir observed the following assertion failure after
restarting engines.

  #0  raise () from /lib64/libc.so.6
  #1  abort () from /lib64/libc.so.6
  #2  __assert_fail_base () from /lib64/libc.so.6
  #3  __assert_fail () from /lib64/libc.so.6
  #4  pool_map_get_version (map=0x0) at src/common/pool_map.c:2852
  #5  ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at
      src/include/daos_srv/pool.h:296
  #6  pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at
      src/rebuild/srv.c:2101
  #7  rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at
      src/rebuild/scan.c:954
  #8  crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654
  #9  ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at
      arch/abtd_ythread.c:21
  #10 make_fcontext () from /usr/lib64/libabt.so.1
  #11 ?? ()

The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service. This patch fixes ds_pool_get_version to return 0,
which is less than all valid pool map versions, when sp_map is NULL,
resulting in rebuild retries like this:

  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...]
  Target[2] (rank 2 idx 0 status 16 ver 1) is excluded.
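
The behavior described above can be sketched roughly as follows. This is a
minimal illustration, not the DAOS code: the struct layouts, the `_sketch`
names, and the version comparison in `rebuild_check_version_sketch` are all
assumptions made for the example. Only the "return 0 when sp_map is NULL"
rule and the DER_BUSY(-1012) value come from the description and log above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Error value as printed in the rebuild log above. */
#define SKETCH_DER_BUSY (-1012)

/* Hypothetical stand-ins for the real pool map and ds_pool types. */
struct pool_map_sketch {
	uint32_t version;
};

struct ds_pool_sketch {
	/* NULL until the initial pool map arrives from the pool service. */
	struct pool_map_sketch *sp_map;
};

/*
 * Return the local pool map version, or 0 if no map has been received yet.
 * 0 is below every valid pool map version, so callers treat it as "stale".
 */
static uint32_t
ds_pool_get_version_sketch(const struct ds_pool_sketch *pool)
{
	assert(pool != NULL);
	if (pool->sp_map == NULL)
		return 0;
	return pool->sp_map->version;
}

/*
 * Assumed caller-side check: if the local map is older than the version the
 * rebuild refers to, report "busy" so that the rebuild is retried later.
 */
static int
rebuild_check_version_sketch(const struct ds_pool_sketch *pool,
			     uint32_t rebuild_ver)
{
	if (ds_pool_get_version_sketch(pool) < rebuild_ver)
		return SKETCH_DER_BUSY;
	return 0;
}
```

With sp_map still NULL, the check in this sketch yields the busy status
instead of tripping an assertion, which matches the [failed] then [queued]
pattern in the log; once a map with version 2 is installed, it returns 0.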

Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. That case cannot happen, as we always
initialize sp_group (as well as sp_iv_ns) before putting a ds_pool
object into the LRU.

Signed-off-by: Li Wei <wei.g.li@intel.com>
@github-actions

Bug-tracker data:
Unable to load ticket data for 'DAOS-10622'
https://daosio.atlassian.net/browse/DAOS-10622

@liw liw added the clean-cherry-pick Cherry-pick from another branch that did not require additional edits label Jul 20, 2022
@daosbuild1 (Collaborator) left a comment

LGTM. No errors found by checkpatch.

@liw liw marked this pull request as ready for review July 21, 2022 00:39
@liw liw added the release-2.2 PR is eventually targeted for 2.2 label Jul 21, 2022
@liw liw requested a review from a team July 21, 2022 00:40
@jolivier23 jolivier23 merged commit d01b8a6 into release/2.2 Jul 21, 2022
@jolivier23 jolivier23 deleted the liw/early-scan-2.2 branch July 21, 2022 20:15
Labels
clean-cherry-pick Cherry-pick from another branch that did not require additional edits release-2.2 PR is eventually targeted for 2.2
3 participants