1030 ips: SBE 1 should be automatically switched if the SBE 0 is broken #265

Closed
lxwinspur opened this issue Feb 13, 2023 · 10 comments

@lxwinspur
Contributor

The current logic is: if the BMC reboot fails three times, it automatically switches to SBE 1 (this logic treats SBE 0 as broken).

In practice, we ran into the following:
When the BMC executes a host power on and SBE 0 is found to be broken, the expected behavior is that the BMC automatically restarts and retries three times, and if all attempts fail, it switches to SBE 1.
However, when the first power on fails, the BMC gets stuck after the SBE 0 startup failure and is not restarted automatically. The BMC reboot is therefore never executed, so the switch to SBE 1 never happens.

Is this a problem?
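
For reference, the behavior expected above can be written down as a small decision function. This is only a sketch of the policy as described in this report, not the phosphor-state-manager implementation: the function name `next_action`, its arguments, and the returned labels are hypothetical, and the "quiesce" outcome for a failure on side 1 is an assumption that the report does not state.

```shell
#!/bin/sh
# Hypothetical sketch of the expected policy; NOT the real firmware logic.
# next_action ATTEMPTS_LEFT CURRENT_SIDE
#   ATTEMPTS_LEFT: remaining automatic retries (the report assumes a budget of 3)
#   CURRENT_SIDE:  SBE side currently in use, 0 or 1
next_action() {
    attempts_left=$1
    side=$2
    if [ "$attempts_left" -gt 0 ]; then
        echo "retry"            # still within the retry budget: reboot and try again
    elif [ "$side" -eq 0 ]; then
        echo "switch-to-sbe1"   # retries exhausted on side 0: treat SBE 0 as broken
    else
        echo "quiesce"          # assumption: nothing left to try after side 1 fails
    fi
}
```

In these terms, the reported hang means the first failure never reaches the "retry" branch, so the counter is never decremented and "switch-to-sbe1" is never selected.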

@lxwinspur
Contributor Author

@mzipse @geissonator @ojayanth
FYI

@ojayanth
Contributor

Auto reboot is governed by the reboot policy; AutoReboot should be true for an automatic reboot to be initiated during the boot window.
root@xxxx:~# busctl get-property $(mapper get-service /xyz/openbmc_project/control/host0/auto_reboot) /xyz/openbmc_project/control/host0/auto_reboot xyz.openbmc_project.Control.Boot.RebootPolicy AutoReboot
b true

You also need to look at the host reboot counter value; by default it is three. @geissonator can comment on this behaviour. Upstream has support for updating this via the Redfish API in case the value is not being set correctly.
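
When scripting this check, the `busctl get-property` output shown above ("b true") has to be split into its type signature and value. The helpers below are a minimal sketch of that parsing; the function names `busctl_value` and `auto_reboot_enabled` are hypothetical and are not part of any OpenBMC tool.

```shell
#!/bin/sh
# busctl prints a property as "<type> <value>", e.g. "b true" for a boolean.
# busctl_value (hypothetical helper) strips the leading type signature.
busctl_value() {
    # $1: one line of `busctl get-property` output
    echo "${1#* }"
}

# auto_reboot_enabled (hypothetical helper) succeeds only for "b true".
auto_reboot_enabled() {
    [ "$(busctl_value "$1")" = "true" ]
}

# Usage on a BMC, assuming the object path quoted in this thread:
#   path=/xyz/openbmc_project/control/host0/auto_reboot
#   out=$(busctl get-property "$(mapper get-service "$path")" "$path" \
#         xyz.openbmc_project.Control.Boot.RebootPolicy AutoReboot)
#   auto_reboot_enabled "$out" && echo "auto reboot is enabled"
```

The same `busctl_value` helper would work for the reboot counter once its object path and interface are known; they are not given in this thread, so they are left out here.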

@lxwinspur
Contributor Author

Auto reboot is governed by the reboot policy; AutoReboot should be true for an automatic reboot to be initiated during the boot window.
root@xxxx:~# busctl get-property $(mapper get-service /xyz/openbmc_project/control/host0/auto_reboot) /xyz/openbmc_project/control/host0/auto_reboot xyz.openbmc_project.Control.Boot.RebootPolicy AutoReboot
b true

Yes, I enabled auto_reboot and this problem still exists.

You also need to look at the host reboot counter value; by default it is three. @geissonator can comment on this behaviour. Upstream has support for updating this via the Redfish API in case the value is not being set correctly.

@geissonator
Contributor

Please provide a bmc dump, or at least a journal so we can see what's going on. Reboot policy is only utilized if we get far enough into the boot.

@lxwinspur
Contributor Author

Please provide a bmc dump, or at least a journal so we can see what's going on. Reboot policy is only utilized if we get far enough into the boot.

Related logs and dump files are at #263

@geissonator
Contributor

@lxwinspur I took a look at the logs, and it appears you aren't testing with the latest 1030.ips code? I put up a fix for the "why do we not switch to sbe side 1" issue via ibm-openbmc/phosphor-state-manager@39d5673, and I verified that the bump is in the latest version of meta-phosphor/recipes-phosphor/state/phosphor-state-manager_git.bb in 1030.ips, but I don't see the new traces I added for that in the journal data from #263.

@lxwinspur
Contributor Author

@geissonator

it appears you aren't testing with the latest 1030.ips code?

No. For this issue, I tested on the latest 1030.ips branch (9a5e35f).

@geissonator
Contributor

Hmm, I'm not sure what's going on then, @lxwinspur. If you look at my commit in ibm-openbmc/phosphor-state-manager@39d5673, you can see the change I made to the log when that script does a quiesce. Your journal showed the older log (without the "and host crashed"). Please double check your firmware level, and maybe look at that script, host-reboot, on your system to ensure it has the new logic.
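
One quick way to tell the two firmware levels apart is to search the journal for the "and host crashed" wording mentioned above. The snippet below is a sketch only: `has_new_trace` is a hypothetical helper name, and the exact text of the full log line is not reproduced in this thread, so it matches only the quoted fragment.

```shell
#!/bin/sh
# has_new_trace (hypothetical helper): reads journal text on stdin and
# succeeds if the quiesce message containing "and host crashed" is present.
has_new_trace() {
    grep -q "and host crashed"
}

# Usage on a BMC:
#   journalctl --no-pager | has_new_trace \
#       && echo "new host-reboot logic present" \
#       || echo "older host-reboot log format"
```

A negative result here only shows that the trace was not logged; it is still worth inspecting the host-reboot script itself, as suggested above.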

@lxwinspur
Contributor Author

@sampmisr
FYI

@lxwinspur
Contributor Author

After updating and applying the following fix, the problem is solved:

2a0c183

rfrandse added a commit that referenced this issue Mar 28, 2024
Sandeepa Singh (47):
  Firmware-change (#66)
  Allow only tar file upload (#71)
  Hardware Deconfiguration Page (#84)
  Deconfig-Toggles (#110)
  Filter SNMP data (#112)
  Upload acf certificate on login page (#126)
  Hardware deconfiguration fix (#128)
  TFTP firmware update (#104)
  Add filter to remove absent dimms form GUI (#139)
  Add abiliy to sort hardware deconfig columns (#162)
  Add helptext for FQDN (#164)
  Add deconfiguration type as None (#163)
  Fix link to deconfiguration records (#155)
  Remove regex from firmware (#151)
  Add alert for HMC connection disconnect (#152)
  Update hardware deconfiguration per Demo feedback (#180)
  Remove Default option from Server power operations page (#188)
  reverting removal of Default partition environment dropdown (#190)
  Add Lateral cast out page (#177)
  fix toggle issue (#191)
  Add details on login page (#193)
  Remove TFTP server option from firmware page (#194)
  Real time post codes converted to ASCII (#207)
  fix TFTP bug (#213)
  Show/Hide ACF upload button (#214)
  Fix toggle issue (#219)
  Change the toggle text to configure/deconfigure (#223)
  Fix the sorting issue in progress logs (#240)
  Translate severity to fatal,predictive and manual (#235)
  Add Pel ID column on HW deconfiguration page (#244)
  Show FW_boot_side_current attribute value (#262)
  Added filter to remove 00000000 from post code table (#272)
  Fix toast msg for HW deconfiguration page (#251)
  Add location code of Deconfig records page (#293)
  Make memory page consistent (#308)
  Add pel id column (#332)
  Update service login condition (#326)
  Edit app nav and login file (#335)
  Update Automatic helptext (#340)
  Grey out toggle when DHCP is disabled (#338)
  Disable delete when system is powered on (#327)
  Renamed added optimization page (#346)
  Fix deconfiguration record translation bug (#360)
  Fix power page translation bug (#361)
  Operating mode is translatable now (#363)
  Fix user management page translation bugs (#365)
  Fix server power ops translation bugs (#359)

Kenneth Fullbright (85):
  Removed irrelevant fields from the VET Capabilities table (#68)
  Update Firmware page interactions when system is powered on (#51)
  Updated CSR Modal & Service login Certificate Modal (#59)
  Removed OemIBMServiceAgent from  Group Privilege list (#76)
  Updated Power saver modes descriptions (#83)
  Popup SOL Console (Host Console) not showing correct connection status (#79)
  Removed irrelevant fields from the VET Capabilities table (#93)
  Added Initiate Resource Dump Function (#103)
  Fixed password change/reset code for expired password (#125)
  Fixed global action vuex error getUsers (#120)
  Fixed 'Promise.all' related errors on Overview (#119)
  Renamed "Serial over LAN (SOL) console" page (#54)
  Fixed event log table to be fully responsive (#122)
  Prevent service user password change (#88)
  Turned dumps PHYP alert into a toast (#140)
  Repaired Service login consoles links in the navbar (#145)
  Removed LDAP from navigation on non admin role accounts (#108)
  Updated the link to consoles and other nav related items
  Refactored Power page and power page related things (#109)
  Added Power restore policy missing alert on operating mode manual (#147)
  Made non-service roles not pass default password for resource dumps (#135)
  Fixed BMC Hypervisor console switch (#159)
  Enhanced user creation and current user failed message for password change (#81)
  Fixed translation double key error (#146)
  Removed service privilege option from edit user and add user (#161)
  Enhanced resource dump error messages (#168)
  Refactored Power page code for efficiency and clarity (#158)
  Fixed init system dump from resource dump (#136)
  Added toast for invalid privilege (#172)
  Fixed Service consoles (#176)
  Fix user management delete table action (#179)
  Fixed service account resource dumps password field to allow any string (#183)
  Fixed Idle power saving missing reset button option (#184)
  Removed lower and upper limit and warning sensors (#186)
  Fixed missing fields for add user on user modal (#185)
  Fixed maximum amount of users toast error (#196)
  Fixed delete and replace function in Certificates table (#197)
  Fixed navbar missing error (#206)
  Fixed popup BMC and Hypervisor consoles. (#205)
  Fixed init system dump PHYP in standby check error (#204)
  Fixed closing console conntections. (#220)
  Fixed upload certificate button not being disabled on max certificates (#224)
  Added info tool tips on password changing fields. (#225)
  Removed operator role from add role group modal (#229) [SW550540]
  Removed Operator and NoAccess roles from desciption table (#228) [SW550558]
  Fixed proxy logout error (#226)
  Created info icon for enhanced information about power consumption (#232)
  Fixed some tables not being fully responsive (#222)
  Set autocomplete option to off for password fields (#231)
  Added dump being offloaded warning for reboot and shutdown (#241)
  Fixed system dump error messages (#238)
  Fixed factory reset to default code (#243)
  Changed OemIBMServiceAgent to ServiceAgent (#261)
  Add safe mode to user interface (#250)
  Fixed fresh install set password and login error (#263)
  Fixed DHCP delete button not disabled (#273)
  Removed unsupported ServiceAgent group from LDAP group privilege modal (#268)
  Fixed Zombie state when factory resetting (#270)
  Fixed unauthorized error toast on page loading (#267)
  Fixed firmware swapping confusion (#271)
  Fixed console connection indicators (#275)
  Fixed account polocy settings displaying not updated info on refresh (#276)
  Fixed running and backup image info render problem (#287)
  Fixed event logs not updating upon delete all button (#290)
  Fixed account policy radio buttons (#289)
  Fixed secure LDAP checkbox not showing correct values (#291)
  Fixed firmware update function (#296)
  Fixed JSON.parse error from localStorage (#298)
  Fixed factory reset function to be fully async (#306)
  Removed host console access from ReadOnly roles (#307)
  Fixed SRC Details not showing on non manual records (#300)
  Fixed page memory validation error (#313)
  Fixed location code not showing on Deconfiguration records table (#317)
  Disabled users from changing username on user management table (#321)
  Added Location codes for TPM (#324)
  Fixed console indicators not updating status (#304)
  Added Location codes for TPM (#325)
  Made more meaningful toasts (#314)
  Fixed manage access keys hyperlink being disabled problems on Firmware page (#322)
  Fixed asset tag info not showing up in modal after app refresh and tag update (#333)
  Removed hashes from files (#334)
  Created real time indicator postCodeValue filter (#302)
  Fixed Deconfig table download additional data button (#328)
  Changed page "Lateral cast out" to "Added optimization" (#341)
  Added notices page (#336)

A Nikhil (47):
  Update Inventory DIMM table (#74)
  Update Inventory Assemblies table (#87)
  Update Inventory Processors table (#86)
  Incorrect Power mode value (#89)
  Dumps available on BMC are not displayed on BMC-GUI (#72)
  Components on the hardware page not in order (#101)
  No values populated for licensed and configured cores (#91)
  Update GUI as IBM (#116)
  Rename Update Firmware access key (#117)
  Health and state field of assembly components is missing in inventory page (#99)
  Event logs add missing information (#111)
  GUI has no way to turn off System attention LED (#129)
  Event log does not show information for service (#133)
  GUI missing detailed COD (#124)
  Rename count in system table (#149)
  FCO page accepts value greater than the number of licensed cores (#142)
  Part number field is showing spare part number value (#165)
  Wrong lable on SRC for logs (#156)
  Inventory and LEDs page has two system entries (#137)
  Add toggle to enable/disable the secure version lock in (#167)
  Factory reset option should only be provided at power off (#174)
  Health in critical state after marking critical errors as resolved (#189)
  Concurrent maintenance Page (#202)
  Download implementation in Event logs (#192)
   Missing host USB enable/disable (#239)
  Prevent system power on when BMC is not in Ready state (#227)
  Adding mex chassis Info (#233)
  Mex IO enclosure firmware version not displayed (#265)
  PCIe Hardware Topology (#181)
  Warning in PcieTopology.vue (#282)
  Pcie-topology and Inventory fixes (#288)
  Unable to edit group name in the Add Role group field. (#303)
  PCIe Topology Save changes (#309)
  Invalid range for I/O Adapter enlarged capacity (#311)
  Status for both system and chassis comes as absent at host power off state (#312)
  Status for system table should be Present (#320)
  Fixed Identify LED error in MEX chassis (#330)
  Assemblies section does not has search option in Inventory page (#315)
  PCIe link width for empty slots is showing as -1 (#319)
  Warning message only in manual mode (#323)
  Fixed incorrect Identity LEDs error message (#331)
  Unwanted fields for MEX components removed (#329)
  PCIe topology performance improved (#337)
  AIX/LINUX and IBM i partition are only for non-HMC manage system (#318)
  Severity values is now translatable (#357)
  Enabled value taken from translation file (#362)
  Removed .tar.xz extension from dumps (#410)

whitesource-ets[bot] (1):
  Add .whitesource configuration file

sandeepasingh116 (17):
  Add new toggles on CM page (#3)
  Changed connection status logic for Hypervisor console (#6)
  Remove dump download option from overview page (#9)
  Add text on user management page (#8)
  Rename the save setting button (#20)
  Add success toast (#18)
  Fix network eth1 error (#21)
  Disable date and time page (#24)
  Update password helptext (#19)
  Add info tooltip to frequency cap (#25)
  Read only user will not be able to toggle switches (#28)
  Make filters translatable (#33)
  Fix translations of vet capabilities (#35)
  fix english texts containing links (#38)
  Remove service login label for read only user (#45)
  fix translation defect for server power ops (#52)
  add toogle on Policies page (#73)

Reed Frandsen (1):
  Removed alert message from Update firmware component (#90)

Gunnar Mills (3):
  Enable hmc proxy (#208)
  Update notices to 1030 (#50)
  Revert "Refresh only once after login (#42)" (#59)

Nikhil Ashoka (33):
  pdated the text of server power ops documentation (#7)
  Displaying Sensors table one row at a time (#11)
  NTP server duplicate entry is not accepted (#4)
  Fabric Adapters Info in Inventory page (#12)
  Fixed Secure LDAP using SSL checkbox value (#2)
  Added progress bar for activate access key (#1)
  Error message displayed if fails to authenticate the user (#10)
  Memory page made HMC-managed independent (#15)
  Sorting fixed for status (#17)
  Sensors table now updating on refresh (#22)
  Secure LDAP is disabled when LDAP authentication disables (#23)
  Removed Service consoles page for read-only users (#14)
  Additional message added on Disable SSH (#30)
  Default partition value taken from translation file (#36)
  Updated password Max Limit (#26)
  New Error message displayed if fails to authenticate the user (#27)
  Added Status and roles values to the translation file (#31)
  Title translation (#34)
  Power values added to translation file (#32)
  Health and Date format taken from translation file (#37)
  Added possible property values in translation file (#39)
  Displaying System Anchor value (#40)
  Added Info tooltip to VirtualTPM (#47)
  Added max limit based on selected user (#46)
  Refresh only once after login (#42)
  Lamp test switch disabled once ON (#48)
  Tab names translated in Inventory page (#54)
  Using privilege values from the translation file (#56)
  Deconfiguration type is taken from translation file (#57)
  Fabric Adapter table showing Name (#55)
  PCIe topology overlapping fix (#53)
  Added Identity LED to Fabric Adapters (#49)
  Removed Error message from Accounts verification (#44)

Dixsie Wolmers (14):
  Fix network settings defects - FQDN, link info, and MAC address (#113)
  Audit translation file (#115)
  Network settings - update DHCP section (#114)
  Add deconfiguration logs page (#121)
  Fix host console route (#157)
  Fix language dropdown on login page (#166)
  Network settings fixes - dhcp modal, edit ipv4, default gateway (#175)
  Update deconfig log table (#200)
  Update  network settings ipv4 table (#199)
  Fix network settings hostname and IUM errors (#210)
  Add  ability to edit asset tag (#211)
  Fix LDAP form values when LDAP disabled - SW546990 (#245)
  Fix deconfig records defects (#246)
  Update maintainers - Remove Dixsie and add Sandeepa (#286)

aixt9n aixt9n (2):
  i18n: KO_KR: Drop latest translated files for webui-vue (#257)
  i18n: ES_ES: Drop latest translated files for webui-vue (#258)

Change-Id: Ib5cb6cfccace5b718d22173ff1df4e8ce2a1e05c