Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

CORTX-33564: Fixed panic in rpc_session_cancelled() due to session state #1969

Merged
merged 4 commits into from
Jul 18, 2022

Conversation

yatin-mahajan
Copy link
Contributor

@yatin-mahajan yatin-mahajan commented Jul 7, 2022

Problem:
While session was getting canceled some other thread is trying to terminate
that session and moves session state to M0_RPC_SESSION_TERMINATING, which
lead to hit assert M0_POST(session->s_sm.sm_state == M0_RPC_SESSION_IDLE)
in rpc_session_cancel().

#5 in m0_arch_panic (c=c@entry=0x7f7276a91b40 <__pctx.14289>, ap=ap@entry=0x7f7268c05ce0) at lib/user_space/uassert.c:131
#6 in m0_panic (ctx=ctx@entry=0x7f7276a91b40 <__pctx.14289>) at lib/assert.c:52
#7 in m0_rpc_session_cancel (session=session@entry=0x56283c7c13d8) at rpc/session.c:863
#8 in m0_rpc_conn_sessions_cancel (conn=conn@entry=0x56283c7c1030) at rpc/conn.c:1333
#9 in rpc_conn__on_service_event_cb (clink=) at rpc/conn.c:1364
#10 in clink_signal (clink=clink@entry=0x56283c7c12c0) at lib/chan.c:135
#11 in chan_signal_nr (chan=chan@entry=0x56283c6a8768, nr=1) at lib/chan.c:154
#12 in m0_chan_broadcast (chan=chan@entry=0x56283c6a8768) at lib/chan.c:174
#13 in ha_state_accept (ignore_same_state=1, note=0x7f7268c06060, confc=0x56283816b028) at ha/note.c:18

Solution:
It is possible that some other thread is trying to terminate the same session
while session is getting cancelled, only the IDLE/BUSY sessions are allowed to
cancel. Updated pre check to return from m0_rpc_cancel instead of panic/assert.
Also replaced M0_POST()/assert with proper debug log.

Signed-off-by: Yatin Mahajan yatin.mahajan@seagate.com

Problem Statement

  • Problem statement

Design

  • For Bug, Describe the fix here.
  • For Feature, Post the link for design

Coding

Checklist for Author

  • Coding conventions are followed and code is consistent

Testing

Checklist for Author

  • Unit and System Tests are added
  • Test Cases cover Happy Path, Non-Happy Path and Scalability
  • Testing was performed with RPM

Impact Analysis

Checklist for Author/Reviewer/GateKeeper

  • Interface change (if any) are documented
  • Side effects on other features (deployment/upgrade)
  • Dependencies on other component(s)

Review Checklist

Checklist for Author

  • JIRA number/GitHub Issue added to PR
  • PR is self reviewed
  • Jira and state/status is updated and JIRA is updated with PR link
  • Check if the description is clear and explained

Documentation

Checklist for Author

  • Changes done to WIKI / Confluence page / Quick Start Guide

rpc/session.c Outdated Show resolved Hide resolved
Copy link
Contributor

@ivan-alekhin ivan-alekhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch does not address the root cause: unprotected access to sm state. The patch, essentially, disables the preconditions. It means the situation described in the PR description is still possible: session is getting canceled while another thread is trying to terminate it. However, with this patch we won't see it because preconditions were removed. We will have a hidden race condition instead.
Please take a look at the comments.

Also, please include the backtraces of the issue into the commit message.

rpc/session.c Show resolved Hide resolved
rpc/session.c Outdated Show resolved Hide resolved
Problem:
  While session was getting canceled some other thread is trying to terminate
  that session and moves session state to M0_RPC_SESSION_TERMINATING, which
  lead to hit assert M0_POST(session->s_sm.sm_state == M0_RPC_SESSION_IDLE)
  in rpc_session_cancel().

Solution:
  It is possible that some other thread is trying to terminate the same session
  while session is getting cancelled, only the IDLE/BUSY sessions are allowed to
  cancel. Updated pre check to return from m0_rpc_cancel instead of panic/assert.
  Also replaced M0_POST()/assert with  proper debug log.

Signed-off-by: Yatin Mahajan <yatin.mahajan@seagate.com>
Problem:
  While session was getting canceled some other thread is trying to terminate
  that session and moves session state to M0_RPC_SESSION_TERMINATING, which
  lead to hit assert M0_POST(session->s_sm.sm_state == M0_RPC_SESSION_IDLE)
  in rpc_session_cancel().

  Seagate#5  in m0_arch_panic (c=c@entry=0x7f7276a91b40 <__pctx.14289>, ap=ap@entry=0x7f7268c05ce0) at lib/user_space/uassert.c:131
  Seagate#6  in m0_panic (ctx=ctx@entry=0x7f7276a91b40 <__pctx.14289>) at lib/assert.c:52
  Seagate#7  in m0_rpc_session_cancel (session=session@entry=0x56283c7c13d8) at rpc/session.c:863
  Seagate#8  in m0_rpc_conn_sessions_cancel (conn=conn@entry=0x56283c7c1030) at rpc/conn.c:1333
  Seagate#9  in rpc_conn__on_service_event_cb (clink=<optimized out>) at rpc/conn.c:1364
  Seagate#10 in clink_signal (clink=clink@entry=0x56283c7c12c0) at lib/chan.c:135
  Seagate#11 in chan_signal_nr (chan=chan@entry=0x56283c6a8768, nr=1) at lib/chan.c:154
  Seagate#12 in m0_chan_broadcast (chan=chan@entry=0x56283c6a8768) at lib/chan.c:174
  Seagate#13 in ha_state_accept (ignore_same_state=1, note=0x7f7268c06060, confc=0x56283816b028) at ha/note.c:18

Solution:
  It is possible that some other thread is trying to terminate the same session
  while session is getting cancelled, only the IDLE/BUSY sessions are allowed to
  cancel. Updated pre check to return from m0_rpc_cancel instead of panic/assert.
  Also replaced M0_POST()/assert with  proper debug log.

Signed-off-by: Yatin Mahajan <yatin.mahajan@seagate.com>
@yatin-mahajan yatin-mahajan changed the base branch from CORTX-33564 to main July 12, 2022 06:46
@madhavemuri
Copy link
Contributor

@rkothiya rkothiya merged commit 773eea4 into Seagate:main Jul 18, 2022
@rkothiya
Copy link
Contributor

Due to rate limit the ci/cd results did not get updated in the pr, but I have checked it had know failures :
image

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants