Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-10622 pool: Fix ds_pool_get_version for NULL sp_map #9277

Merged
merged 1 commit into from
Jul 19, 2022

Conversation

liw
Copy link
Contributor

@liw liw commented Jun 7, 2022

Makito and Samir observed the following assertion failure after
restarting engines.

#0  raise () from /lib64/libc.so.6
#1  abort () from /lib64/libc.so.6
#2  __assert_fail_base () from /lib64/libc.so.6
#3  __assert_fail () from /lib64/libc.so.6
#4  pool_map_get_version (map=0x0) at src/common/pool_map.c:2852
#5  ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at
    src/include/daos_srv/pool.h:296
#6  pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at
    src/rebuild/srv.c:2101
#7  rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at
    src/rebuild/scan.c:954
#8  crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654
#9  ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at
    arch/abtd_ythread.c:21
#10 make_fcontext () from /usr/lib64/libabt.so.1
#11 ?? ()

The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service. This patch fixes ds_pool_get_version to return 0,
which is less than all valid pool map versions, when sp_map is NULL,
resulting in rebuild retries like this:

Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
Rebuild [started] (pool 3bf68c9c ver=2)
Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
  or resource busy')
Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
Rebuild [started] (pool 3bf68c9c ver=2)
Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
  or resource busy')
Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
Rebuild [started] (pool 3bf68c9c ver=2)
Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...]
Target[2] (rank 2 idx 0 status 16 ver 1) is excluded.

Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. Those can not happen as we always initialize
sp_group (as well as sp_iv_ns) before putting a ds_pool object into the
LRU.

Signed-off-by: Li Wei wei.g.li@intel.com

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@liw liw force-pushed the liw/early-scan branch from 4f53ddd to 7e36835 Compare July 5, 2022 06:41
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@liw
Copy link
Contributor Author

liw commented Jul 6, 2022

ms resilience: DAOS-11073

Makito and Samir observed the following assertion failure after
restarting engines.

  #0  raise () from /lib64/libc.so.6
  #1  abort () from /lib64/libc.so.6
  #2  __assert_fail_base () from /lib64/libc.so.6
  #3  __assert_fail () from /lib64/libc.so.6
  #4  pool_map_get_version (map=0x0) at src/common/pool_map.c:2852
  #5  ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at
      src/include/daos_srv/pool.h:296
  #6  pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at
      src/rebuild/srv.c:2101
  #7  rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at
      src/rebuild/scan.c:954
  #8  crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654
  #9  ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at
      arch/abtd_ythread.c:21
  #10 make_fcontext () from /usr/lib64/libabt.so.1
  #11 ?? ()

The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service. This patch fixes ds_pool_get_version to return 0,
which is less than all valid pool map versions, when sp_map is NULL,
resulting in rebuild retries like this:

  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...]
  Target[2] (rank 2 idx 0 status 16 ver 1) is excluded.

Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. Those can not happen as we always initialize
sp_group (as well as sp_iv_ns) before putting a ds_pool object into the
LRU.

Signed-off-by: Li Wei <wei.g.li@intel.com>
@liw liw force-pushed the liw/early-scan branch from 7e36835 to f72f2f7 Compare July 6, 2022 08:18
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.


ABT_rwlock_rdlock(pool->sp_lock);
ver = pool_map_get_version(pool->sp_map);
if (pool->sp_map != NULL)
ver = pool_map_get_version(pool->sp_map);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Note] I'm considering to redesign the management of ds_pool objects with a new "status" field, so that we'll be able to hide a ds_pool object who hasn't been fully ready yet. Then most of the code will be able to assume that sp_map is always valid, like sp_group. Hence, the current patch serves as a short-term fix, for this particular segfault seems easy to reproduce.

Copy link
Contributor

@wangdi1 wangdi1 Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return 0 is a bit obscure, probably let the caller check sp_map explicitly is clearer? Hmm, given the initial pool map version is 1, probably return 0 is ok, though probably worth D_INFO here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangdi1, my original design of the "remove orterun dependency" was to let every user of sp_map to check if it is NULL. Because of bugs like this, now I think that was probably bad decision. Hence the [Note] comment above.

We already use 0 to indicate an uninitialized map elsewhere in the code, I think. Since 0 < any valid map version, it seems to work consistently.

@liw liw marked this pull request as ready for review July 18, 2022 00:58
@liw liw requested review from Nasf-Fan, wangdi1, kccain and jolivier23 and removed request for Nasf-Fan and wangdi1 July 18, 2022 00:58
@liw
Copy link
Contributor Author

liw commented Jul 18, 2022

Looks like Fan Yong and Wang Di are both unavailable at the moment; change reviewers.

@liw liw requested a review from wangdi1 July 18, 2022 08:50
@liw
Copy link
Contributor Author

liw commented Jul 18, 2022

Ah, Wang Di is available. Sorry for the noise.

@liw liw requested a review from a team July 19, 2022 01:16
@jolivier23 jolivier23 merged commit 535febf into master Jul 19, 2022
@jolivier23 jolivier23 deleted the liw/early-scan branch July 19, 2022 14:46
liw added a commit that referenced this pull request Jul 20, 2022
Makito and Samir observed the following assertion failure after
restarting engines.

  #0  raise () from /lib64/libc.so.6
  #1  abort () from /lib64/libc.so.6
  #2  __assert_fail_base () from /lib64/libc.so.6
  #3  __assert_fail () from /lib64/libc.so.6
  #4  pool_map_get_version (map=0x0) at src/common/pool_map.c:2852
  #5  ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at
      src/include/daos_srv/pool.h:296
  #6  pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at
      src/rebuild/srv.c:2101
  #7  rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at
      src/rebuild/scan.c:954
  #8  crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654
  #9  ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at
      arch/abtd_ythread.c:21
  #10 make_fcontext () from /usr/lib64/libabt.so.1
  #11 ?? ()

The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service. This patch fixes ds_pool_get_version to return 0,
which is less than all valid pool map versions, when sp_map is NULL,
resulting in rebuild retries like this:

  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...]
  Target[2] (rank 2 idx 0 status 16 ver 1) is excluded.

Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. Those can not happen as we always initialize
sp_group (as well as sp_iv_ns) before putting a ds_pool object into the
LRU.

Signed-off-by: Li Wei <wei.g.li@intel.com>
jolivier23 pushed a commit that referenced this pull request Jul 21, 2022
Makito and Samir observed the following assertion failure after
restarting engines.

  #0  raise () from /lib64/libc.so.6
  #1  abort () from /lib64/libc.so.6
  #2  __assert_fail_base () from /lib64/libc.so.6
  #3  __assert_fail () from /lib64/libc.so.6
  #4  pool_map_get_version (map=0x0) at src/common/pool_map.c:2852
  #5  ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at
      src/include/daos_srv/pool.h:296
  #6  pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at
      src/rebuild/srv.c:2101
  #7  rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at
      src/rebuild/scan.c:954
  #8  crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654
  #9  ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at
      arch/abtd_ythread.c:21
  #10 make_fcontext () from /usr/lib64/libabt.so.1
  #11 ?? ()

The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service. This patch fixes ds_pool_get_version to return 0,
which is less than all valid pool map versions, when sp_map is NULL,
resulting in rebuild retries like this:

  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...]
  Target[2] (rank 2 idx 0 status 16 ver 1) is excluded.

Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. Those can not happen as we always initialize
sp_group (as well as sp_iv_ns) before putting a ds_pool object into the
LRU.

Signed-off-by: Li Wei <wei.g.li@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants