Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ycabled] fix exception-handling logic for ycabled #306

Merged
merged 24 commits into from
Nov 11, 2022

Conversation

vdahiya12
Copy link
Contributor

@vdahiya12 vdahiya12 commented Oct 19, 2022

Signed-off-by: vaibhav-dahiya vdahiya@microsoft.com
This PR is changing how exception handling is handled in python threading.
In python3.7 if child thread gets an exception, there is no way of knowing an exception occured, unless a join is called. The problem is resolved in this PR
Also task classes are structured this way

class YcableInfoUpdateTask(threading.Thread):

    def __init__(self):
        threading.Thread.__init__(self)
        self.exc = None
        
   def run(self):
        if self.task_stopping_event.is_set():
            return

        self.task_thread = threading.Thread(target=self.task_worker, args=(y_cable_presence,))
        self.task_thread.start()
        try:
            self.task_worker(self.y_cable_presence)
        except Exception as e:
            helper_logger.log_error("Exception occured at child thread YcableInfoUpdateTask due to {} {}".format(repr(e), traceback.format_exc()))

            self.exc = e

    def join(self):
        threading.Thread.join(self)

        if self.exc:
            raise self.exc

This allows helping catching the exception thrown by a child thread get caught along with traceback.
Main thread has a while loop, which monitors each thread, if not running will kill the whole process and supervisord will bring back the ycabled up.

Description

Motivation and Context

How Has This Been Tested?

Unit-tests and deploying changes on testbed

Additional Information (Optional)

sonic-mgmt tests which pass with this change

/var/src/sonic-mgmt-int/tests/logs/dualtor/test_ipinip.py::test_decap_active_tor
/var/src/sonic-mgmt-int/tests/logs/dualtor/test_ipinip.py::test_decap_standby_tor
/var/src/sonic-mgmt-int/tests/logs/dualtor/test_toggle_mux
/var/src/sonic-mgmt-int/dualtor_io/test_link_failure.py::test_standby_tor_downlink_down_downstream_active
/var/src/sonic-mgmt-int/tests/logs/platform_tests/api/test_chassis.py
In addition ran this script for testing the change on some Gemini devices
as well as switchover regression
time cost distribution

key: time cost

value: counts

i.e. 10:272 means 272 toggles took 0~10ms

{10: 272, 20: 217, 30: 3, 50: 3, 100: 3, 101: 3}

average (ms)

11.204035928143714

sudo cp <fw> /usr/share/sonic/firmware/.

declare -a PORTS='Ethernet0 Ethernet4 Ethernet8 Ethernet12 Ethernet16 Ethernet20 Ethernet40 Ethernet44 Ethernet48 Ethernet52 Ethernet56 Ethernet60 Ethernet64 Ethernet68 Ethernet72 Ethernet76 Ethernet80 Ethernet84 Ethernet104 Ethernet108 Ethernet112 Ethernet116 Ethernet120 Ethernet124'

declare -a PORTS1='Ethernet4'

for port in ${PORTS[@]}
do
  echo -e "\n\nname: $port"
  sudo show muxcable firmware version $port
  sleep 1
  sudo config muxcable firmware download <fw> $port
  sleep 10
  sudo show muxcable firmware version $port

  sudo config muxcable firmware activate $port
  sleep 10
  sudo show muxcable firmware version $port

  show mux status
  sleep 1
  sudo config mux mode auto $port
  sudo config mux mode active $port
  sleep 1
  show mux metrics $port
  sudo config mux mode auto $port
  sudo config mux mode standby $port
  sleep 1
  show mux metrics $port
  sleep 1
  show mux cableinfo $port
  sleep 1
  show mux alivecablestatus $port
  sleep 1
  show mux eyeinfo $port NIC
  show mux eyeinfo $port TORA
  show mux eyeinfo $port TORB
  sleep 1
  show mux pcsstatistics $port NIC
  sleep 1
  show mux pcsstatistics $port TORA
  sleep 1
  show mux pcsstatistics $port TORB
  sleep 1
  show mux fecstatistics $port NIC
  sleep 1
  show mux fecstatistics $port TORA
  sleep 1
  show mux fecstatistics $port TORB
  sleep 1
  show mux get-fec-anlt-speed $port
  sleep 1
  show mux hwmode mux $port
  sleep 1
  show mux hwmode switch $port
  sleep 1
  redis-cli -n 6 hgetall "HW_MUX_CABLE_TABLE|$port"
  sleep 1
  redis-cli -n 6 hgetall "MUX_CABLE_INFO|$port"
  sleep 1
  redis-cli -n 6 hgetall "MUX_CABLE_STATIC_INFO|$port"
  sleep 1
  redis-cli -n 6 hset "TRANSCEIVER_STATUS|$port" "status" "0"
  sleep 1
  sudo config mux mode auto $port
  sudo config mux mode active $port
  sleep 1
  show mux metrics $port
  sudo config mux mode auto $port
  sudo config mux mode standby $port
  sleep 1
  show mux metrics $port
  sleep 1
  show mux cableinfo $port
  sleep 1
  show mux alivecablestatus $port
  sleep 1
  show mux eyeinfo $port NIC
  show mux eyeinfo $port TORA
  show mux eyeinfo $port TORB
  sleep 1
  show mux pcsstatistics $port NIC
  sleep 1
  show mux pcsstatistics $port TORA
  sleep 1
  show mux fecstatistics $port NIC
  sleep 1
  show mux fecstatistics $port TORA
  sleep 1
  show mux fecstatistics $port TORB
  sleep 1
  show mux get-fec-anlt-speed $port
  sleep 1
  show mux hwmode mux $port
  sleep 1
  show mux hwmode switch $port
  sleep 1
  redis-cli -n 6 hgetall "HW_MUX_CABLE_TABLE|$port"
  sleep 1
  redis-cli -n 6 hgetall "MUX_CABLE_INFO|$port"
  sleep 1
  redis-cli -n 6 hgetall "MUX_CABLE_STATIC_INFO|$port"
  sleep 1
  redis-cli -n 6 hset "TRANSCEIVER_STATUS|$port" "status" "1"
  sleep 1
  show mux status
  sleep 1
  sudo config mux mode auto $port
  sudo config mux mode active $port
  sleep 1
  show mux metrics $port
  sudo config mux mode auto $port
  sudo config mux mode standby $port
  sleep 1
  show mux metrics $port
  sleep 1
  show mux cableinfo $port
  sleep 1
  show mux alivecablestatus $port
  sleep 1
  show mux eyeinfo $port NIC
  show mux eyeinfo $port TORA
  show mux eyeinfo $port TORB
  sleep 1
  show mux pcsstatistics $port NIC
  sleep 1
  show mux pcsstatistics $port TORA
  sleep 1
  show mux pcsstatistics $port TORB
  sleep 1
  show mux fecstatistics $port NIC
  sleep 1
  show mux fecstatistics $port TORA
  sleep 1
  show mux fecstatistics $port TORB

done

No exceptions observed, ycabled seems healthy with true thread count

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Oct 19, 2022

This pull request introduces 2 alerts when merging e3cd1a0 into 23b6970 - view on LGTM.com

new alerts:

  • 1 for Too few arguments in formatting call
  • 1 for Syntax error

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Oct 19, 2022

This pull request introduces 1 alert when merging 68e9412 into 23b6970 - view on LGTM.com

new alerts:

  • 1 for Too few arguments in formatting call

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Oct 20, 2022

This pull request introduces 2 alerts when merging cc96144 into 4ea12cf - view on LGTM.com

new alerts:

  • 2 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2022

This pull request introduces 3 alerts when merging 81cf7ca into 4ea12cf - view on LGTM.com

new alerts:

  • 3 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2022

This pull request introduces 4 alerts when merging b898ffa into 4ea12cf - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2022

This pull request introduces 4 alerts when merging 2807278 into 4ea12cf - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2022

This pull request introduces 4 alerts when merging 3cfc033 into 4ea12cf - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2022

This pull request introduces 4 alerts when merging 2c14791 into 4ea12cf - view on LGTM.com

new alerts:

  • 4 for Illegal raise

@vdahiya12 vdahiya12 requested a review from prgeor October 31, 2022 17:10
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 1, 2022

This pull request introduces 4 alerts when merging a71b11b into 9ae551f - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 1, 2022

This pull request introduces 4 alerts when merging e92c74f into 9ae551f - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 2, 2022

This pull request introduces 4 alerts when merging 9745e6a into 9ae551f - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@prgeor
Copy link
Collaborator

prgeor commented Nov 7, 2022

@vdahiya12 there are 4 new LGTM alerts. could you fix?

logical_port_name = "Ethernet0"
mux_tbl = Table("STATE_DB", "Y_CABLE_INFO_TABLE")
rc = post_port_mux_info_to_db(logical_port_name, mux_tbl,'active-standby')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a test to crash at known location and see if we print backtrace ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UT is added, fixed

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 7, 2022

This pull request introduces 4 alerts when merging 1ac2a2b into 905874d - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 8, 2022

This pull request introduces 4 alerts when merging 56c48ef into 905874d - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 8, 2022

This pull request introduces 4 alerts when merging 2e19b16 into 905874d - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 8, 2022

This pull request introduces 4 alerts when merging e23540c into 905874d - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 9, 2022

This pull request introduces 4 alerts when merging 0bd7cfb into 905874d - view on LGTM.com

new alerts:

  • 4 for Illegal raise

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 9, 2022

This pull request introduces 4 alerts when merging 2061ab1 into 905874d - view on LGTM.com

new alerts:

  • 4 for Illegal raise

@vdahiya12 vdahiya12 requested a review from prgeor November 9, 2022 00:56
@@ -32,6 +33,20 @@

os.environ["YCABLE_UNIT_TESTING"] = "1"

exception_output_expected= """\
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this work if python version changes in future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would not be backwards compatible, thats why I had added some string checks only,
Better to omit this check I think.

assert(expected_exception_start.args == expected_exception_join.args)
assert("NotImplementedError" in str(trace) and "effect" in str(trace))
assert("sonic-ycabled/ycable/ycable_utilities/y_cable_helper.py" in str(trace))
assert("line 3655" in str(trace))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the line no change if the code changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes line number will change and assert will have to be updated.
Maybe we need to just focus on the trace lines

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
@lgtm-com
Copy link

lgtm-com bot commented Nov 10, 2022

This pull request introduces 4 alerts when merging 419e742 into 905874d - view on LGTM.com

new alerts:

  • 4 for Illegal raise

@prgeor prgeor merged commit e8c5657 into sonic-net:master Nov 11, 2022
vdahiya12 added a commit that referenced this pull request Nov 11, 2022
Cherry-pick conflict
#306

Signed-off-by: vaibhav-dahiya vdahiya@microsoft.com
This PR is changing how exception handling is handled in python threading.
In python3.7 if child thread gets an exception, there is no way of knowing an exception occured, unless a join is called. The problem is resolved in this PR
Also task classes are structured this way

class YcableInfoUpdateTask(threading.Thread):

    def __init__(self):
        threading.Thread.__init__(self)
        self.exc = None
        
   def run(self):
        if self.task_stopping_event.is_set():
            return

        self.task_thread = threading.Thread(target=self.task_worker, args=(y_cable_presence,))
        self.task_thread.start()
        try:
            self.task_worker(self.y_cable_presence)
        except Exception as e:
            helper_logger.log_error("Exception occured at child thread YcableInfoUpdateTask due to {} {}".format(repr(e), traceback.format_exc()))

            self.exc = e

    def join(self):
        threading.Thread.join(self)

        if self.exc:
            raise self.exc

This allows helping catching the exception thrown by a child thread get caught along with traceback.
Main thread has a while loop, which monitors each thread, if not running will kill the whole process and supervisord will bring back the ycabled up.

Description
Motivation and Context
How Has This Been Tested?
Unit-tests and deploying changes on testbed

Additional Information (Optional)
sonic-mgmt tests which pass with this change

/var/src/sonic-mgmt-int/tests/logs/dualtor/test_ipinip.py::test_decap_active_tor
/var/src/sonic-mgmt-int/tests/logs/dualtor/test_ipinip.py::test_decap_standby_tor
/var/src/sonic-mgmt-int/tests/logs/dualtor/test_toggle_mux
/var/src/sonic-mgmt-int/dualtor_io/test_link_failure.py::test_standby_tor_downlink_down_downstream_active
/var/src/sonic-mgmt-int/tests/logs/platform_tests/api/test_chassis.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants