Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SYSTEM READY #875

Merged
merged 1 commit into from
Nov 9, 2021
Merged

SYSTEM READY #875

merged 1 commit into from
Nov 9, 2021

Conversation

sg893052
Copy link
Contributor

@sg893052 sg893052 commented Oct 1, 2021

Support for System Ready

Repo PR title State
sonic-buildimage System Ready
sonic-utilities Show commands for System Ready

@ghost
Copy link

ghost commented Oct 1, 2021

CLA assistant check
All CLA requirements met.

@sg893052 sg893052 force-pushed the SYSTEM_READY branch 3 times, most recently from 257540f to be1191a Compare October 5, 2021 11:55
"type": "hash",
"value": {
"FAIL_REASON": "",
"TIME": "20211005 04:19:30",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use consistent lower case field names like other fields we have today?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will update as per suggestion.

"SYSTEM_READY|SYSTEM_STATE": {
"type": "hash",
"value": {
"Status": "UP"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use "status"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will update as per suggestion.

SONiC package installation process will register new feature in CONFIG DB.
Third party dockers(signature verified) gets integrated into sonic os and runs similar to the existing dockers accessing db etc.
Now, once the feature is enabled, it becomes part of either sonic.target or multi-user.target and when it starts, it automatically comes under the system monitor framework watchlist.
However, app ready status for those dockers cant be tracked unless they comply with the framework logic.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how we do app ready status, please add some details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will add the following for application extension:
App ready status for those dockers cant be tracked unless they comply with the framework logic. Hence any third party docker needs to follow the framework logic by including "check_up_status" field while registering itself in CONFIG_DB and also make use of the provision given to docker apps to mark its closest up status in STATE_DB.

"<dockername>": {
...
"state": "enabled",
"check_up_status": "True"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature table is for docker based features. Can we introduce a new table for host services like caclmgrd, hostcfgd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, Will introduce a new table in CONFIG_DB called HOST_FEATURE for this purpose.

- sonic-db-cli STATE_DB HSET "FEATURE|<dockername>" TIME "<timestamp in <datetime.now().strftime('%Y%m%d %H:%M:%S')> format >"

- Schema in STATE_DB
sonic-db-dump -n STATE_DB output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The host services don't belong to FEATURE table and we need a different table for host services like hostcfgd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

"<dockername>": {
...
"state": "enabled",
"check_up_status": "True"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update the feature yang on adding the new field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will update the feature yang.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgsudharsan I couldn't find the yang code for FEATURE table. Could you please point it out? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox
Copy link
Contributor

How does it monitor hardware status? Like fan/PSU/ASIC


## 1.1 Limitation of Existing tools:
- Monit tool is a poll based approach which monitors the configured services for every 1 minute.
- Monit tools feature of critical process monitoring was deprecated as supervisor does the job. Hence system-health tool which depends on monit does not work.
Copy link
Contributor

@Junchao-Mellanox Junchao-Mellanox Oct 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate on how does this feature monitor critical process? .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As supervisor does the job of monitoring the docker critical processes, system ready framework need not explicitly monitor them.
The flow is that critical process exit is notified to supervisor. supervisor exit is notified to systemd. sysmonitor monitors systemd service.



For instances,
- swss docker app is considered to be UP only when its systemd service is in running state and when its dependencies are met by checking state-db entries populated by *mgrs, like intfmgr, vrfmgr, vxlanmgr.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions here:

  1. Which process/script is responsible for checking state-db entries populated by *mgrs?
  2. Which state-db entries should be checked?
  3. What could be the behavior by default if no one marks UP_STATUS to True?
  4. Does it mean that each docker service/app extension should be extended and find a way to mark its UP_STATUS to True in order to work with this framework?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In simple, each app is responsible in marking its closest up status in STATE_DB. Sysmonitor tool just reads from it.
Any docker app which has multiple independent daemons can maintain a separate intermediate key-value in the redis-db for each of the daemons and the startup script that invokes each of these daemons can determine the status from the redis entries by each daemon and finally update the STATE_DB UP_STATUS.
Eg, swss docker app can wait for port init done and wait for Vrfmgr, Intfmgr and Vxlanmgr to be ready before marking its up status.

Any app which agrees to comply with the system ready framework, should add an entry "check_up_status" to "true" in FEATURE table in CONFIG_DB first before marking its closest UP_STATUS in STATE_DB.
If "check_up_status" field in FEATURE table of CONFIG_DB is set to true for an app, then its UP_STATUS will be checked in STATE_DB, post checking the running status of that app, by sysmonitor tool.
If the entry is set to false or not set at all, sysmonitor just reads the running status for that service but not the app ready status.

@sg893052
Copy link
Contributor Author

How does it monitor hardware status? Like fan/PSU/ASIC

System ready framework monitors all the sonic services(docker+host) along with its app readiness to declare the system is ready for network traffic. It does not monitor hardware status.

@zhangyanzhao
Copy link
Collaborator

Support for System Ready

Repo PR title State
sonic-buildimage System Ready  Open
sonic-utilities Show commands for System Ready  Open

Can you update this table with template #806 which display the status with better view? Thanks

@sg893052
Copy link
Contributor Author

sg893052 commented Nov 5, 2021

Support for System Ready
Repo PR title State
sonic-buildimage System Ready  Open
sonic-utilities Show commands for System Ready  Open

Can you update this table with template #806 which display the status with better view? Thanks

Have updated the table as suggested.

@zhangyanzhao zhangyanzhao merged commit 984ab4b into sonic-net:master Nov 9, 2021
@lguohan
Copy link
Contributor

lguohan commented Nov 10, 2021

this pr is reverted, need sign off from @dgsudharsan

@dgsudharsan
Copy link
Contributor

dgsudharsan commented Nov 11, 2021

My comments are addressed. @Junchao-Mellanox Can you please review from your end so I can sign off?

@Junchao-Mellanox
Copy link
Contributor

I don't have comment to the design itself. There are many good point in the PR (like using the event driven way to monit service status), but I am not sure that we need another service to do similar job like system-health. See what we have:

  1. We have container_checker runs on top of monit which monitors the containers status
  2. We have supervisord runs on each container which monitors the critical processes status
  3. We also have system-health which monitors containers/critical processes/hardware status. The function of system-health was broken by other PR, but there is a PR to fix it ([system-health] No longer check critical process/service status via monit sonic-buildimage#9068)

So, my opinion is like this: how about we extend system health instead of creating another new service?

@liat-grozovik
Copy link
Collaborator

@zhangyanzhao i don't understand why this PR has been merged while there is outstanding feedback.
the feature originally was contributed by Nvidia and we have provided valuable feedback to this design.
@sg893052 please note that the HLD is now reverted. Please refer to the comments provided and we will be happy to discuss offline if this is needed.

@sg893052
Copy link
Contributor Author

sg893052 commented Nov 11, 2021

I don't have comment to the design itself. There are many good point in the PR (like using the event driven way to monit service status), but I am not sure that we need another service to do similar job like system-health. See what we have:

  1. We have container_checker runs on top of monit which monitors the containers status
  2. We have supervisord runs on each container which monitors the critical processes status
  3. We also have system-health which monitors containers/critical processes/hardware status. The function of system-health was broken by other PR, but there is a PR to fix it ([system-health] No longer check critical process/service status via monit sonic-buildimage#9068)

So, my opinion is like this: how about we extend system health instead of creating another new service?

SONiC runtime environment is made up of systemd services, dockers and processes. A systemd service encompasses docker or processes running on host, and thus monitoring the status of the services provide a snapshot of the overall system health. Also a real time feedback is thus helpful to know when a particular service / docker or a critical process went down. Thus System ready feature emphasises the following design philosophy:

  1. Service Monitoring over Process Monitoring
  2. Event driven monitoring over Polling for real time feedback

In SONIC, we have service restart policies based on critical processes. So it is essential that we need to monitor all the services. We already have supervisord monitoring the critical processes within a container. Supervisord's proc-exit-listener basically flags which critical process exited and then because of which when the service restarts, our system ready feature will catch that the particular service is down instantly. When it comes up, it will flag that as up without any delay. Service monitoring will help us know that some critical processes died, and we will get a notification that one of the services went down. This also helps with container monitoring indirectly as docker wait is a typical service criteria. Once docker goes down, service goes down. The existing container_checker in monit is poll based and does not provide real time feedback. So, we really don't need anything from outside to monitor critical processes of dockers once we have the service monitoring in place.

So System ready is all on top of the existing infrastructure and with that point#3 mentioned by Junchao-Mellanox for critical processes monitoring is not required. I understand that the new fix(sonic-net/sonic-buildimage#9068) is to only show the status of the critical process through docker exec and what if the docker dies? And since everything(monit/container_checker/newone in 9068) is polling based and it is polled every minute, it is not realtime and so likely to miss critical service restarts.

Benefits of system ready:

  1. System ready is event driven where the feedback is immediate.
  2. Since system ready monitors systemd services, docker monitoring is achieved indirectly. No need for polling based container_checker utility in monit.
  3. It monitors all the sonic host systemd services.
  4. In addition, System ready feature also brings in the concept of application readiness to allow each application/service/docker to declare themselves as ready based on different application specific criteria.

Agree, it makes sense to put the entire System ready feature implementation into System health monitor code, and also leverage the system ready feature to check for service readiness inside the system health monitor. This can definitely be considered for the next release, as we don’t have much time for 202111 release.

@liat-grozovik
Copy link
Collaborator

Not sure pushing a feature that need to be redesign is the right way.
We need to agree on the design and do it once even it if means the feature is out of 202111 as it is coming late without HLD approval first.

@Junchao-Mellanox
Copy link
Contributor

Hi @sg893052 , like I said, this design brings some good points, I like that part especially for the event driven idea. But it is still something similar to system-health (Azure/sonic-buildimage#9068 also support monitoring services). I am not sure that adding a new service "SYSTEM READY" to 202111 and then merge it to system health in future is a good idea. We cannot handle the backward compatible for user.

@Junchao-Mellanox
Copy link
Contributor

If we want to merge this feature to system health, some thought here:

  1. The subprocess to listen service status can be easily integrated to system health, and it is still a subprocess. I would call it "LS" process
  2. There is a queue between main process and "LS" process, "LS" process notify service change event via the queue. This is the same as what "SYSTEM READY" is doing.
  3. Currently, there is a class ServiceChecker to monitor services/critical processes status. It shall no longer monitor services because "LS" process will do it. But it shall still monitor the critical processes as we don't want any "function degradation".
  4. The concept of service readiness can be also integrated into system health, the integration would not be difficult as it is also a subprocess in current design.
  5. System health CLI can be extended if there is more information provided to user.

@zhangyanzhao
Copy link
Collaborator

@zhangyanzhao i don't understand why this PR has been merged while there is outstanding feedback. the feature originally was contributed by Nvidia and we have provided valuable feedback to this design. @sg893052 please note that the HLD is now reverted. Please refer to the comments provided and we will be happy to discuss offline if this is needed.

I am ok to revert my merge. Please go ahead

@sg893052 sg893052 mentioned this pull request Apr 6, 2022
sujinmkang pushed a commit that referenced this pull request May 17, 2022
Support for System Ready ( Followup to #875)

Repo	                 PR title	                                          State
sonic-buildimage	 System                                              Ready	
sonic-utilities	         Show commands for System            Ready
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants