SYSTEM READY #875
257540f
to
be1191a
Compare
doc/system-ready/system-ready-HLD.md
Outdated
"type": "hash",
"value": {
  "FAIL_REASON": "",
  "TIME": "20211005 04:19:30",
Can we use consistent lower case field names like other fields we have today?
Agreed. Will update as per suggestion.
doc/system-ready/system-ready-HLD.md
Outdated
"SYSTEM_READY|SYSTEM_STATE": {
  "type": "hash",
  "value": {
    "Status": "UP"
Can we use "status"?
Agreed. Will update as per suggestion.
The SONiC package installation process registers the new feature in CONFIG_DB.
Third-party dockers (signature verified) are integrated into the SONiC OS and run like the existing dockers, accessing the DB and so on.
Once the feature is enabled, it becomes part of either sonic.target or multi-user.target, and when it starts it automatically comes under the system monitor framework watchlist.
However, app ready status for those dockers cannot be tracked unless they comply with the framework logic.
Not sure how we do app ready status, please add some details.
Sure. Will add the following for application extension:
App ready status for those dockers cannot be tracked unless they comply with the framework logic. Hence any third-party docker needs to follow the framework logic by including the "check_up_status" field when registering itself in CONFIG_DB, and by using the provision given to docker apps to mark its closest up status in STATE_DB.
doc/system-ready/system-ready-HLD.md
Outdated
"<dockername>": {
  ...
  "state": "enabled",
  "check_up_status": "True"
Feature table is for docker based features. Can we introduce a new table for host services like caclmgrd, hostcfgd?
Agreed. Will introduce a new table in CONFIG_DB called HOST_FEATURE for this purpose.
- sonic-db-cli STATE_DB HSET "FEATURE|<dockername>" TIME "<timestamp in <datetime.now().strftime('%Y%m%d %H:%M:%S')> format >"
- Schema in STATE_DB
sonic-db-dump -n STATE_DB output
The host services don't belong to FEATURE table and we need a different table for host services like hostcfgd
Agreed.
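As a rough illustration of the timestamp format in the `sonic-db-cli` line quoted in this thread, the sketch below builds the argument list for that HSET call. The helper name `build_time_update` and the docker name `swss` in the demo are illustrative assumptions; the key still uses the FEATURE table as quoted, even though the review asks for a separate table for host services.

```python
from datetime import datetime

# Format string taken from the HLD line quoted in this thread.
TIME_FORMAT = "%Y%m%d %H:%M:%S"


def build_time_update(dockername, now=None):
    """Return the argv list for the quoted sonic-db-cli HSET call.

    `build_time_update` is a hypothetical helper, not part of SONiC.
    """
    if now is None:
        now = datetime.now()
    key = f"FEATURE|{dockername}"
    return ["sonic-db-cli", "STATE_DB", "HSET", key, "TIME", now.strftime(TIME_FORMAT)]


# Fixed timestamp so the result is reproducible.
argv = build_time_update("swss", datetime(2021, 10, 5, 4, 19, 30))
print(argv)
```

With the fixed timestamp above, the last element is `20211005 04:19:30`, which matches the sample TIME value shown earlier in the review.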
doc/system-ready/system-ready-HLD.md
Outdated
"<dockername>": {
  ...
  "state": "enabled",
  "check_up_status": "True"
Please also update the feature yang on adding the new field.
Agreed. Will update the feature yang.
@dgsudharsan I couldn't find the yang code for FEATURE table. Could you please point it out? Thanks.
How does it monitor hardware status, like fan/PSU/ASIC?
## 1.1 Limitation of Existing tools:
- Monit is a poll-based tool that checks the configured services once every minute.
- Monit's critical process monitoring feature was deprecated because supervisor does that job. Hence the system-health tool, which depends on monit, does not work.
Could you please elaborate on how this feature monitors critical processes?
As supervisor already monitors the critical processes inside each docker, the system ready framework need not monitor them explicitly.
The flow is: a critical process exit is notified to supervisor, a supervisor exit is notified to systemd, and sysmonitor monitors the systemd service.
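The last hop of that flow, reading the state of a systemd service, could look roughly like the sketch below. It is an assumption about one possible check, not the sysmonitor implementation: it parses the `KEY=VALUE` output of `systemctl show` and treats a unit as running only when `ActiveState` is `active` and `SubState` is `running`. The function names are hypothetical, and a one-shot query like this does not by itself provide the event-driven behaviour discussed in the PR.

```python
import subprocess


def parse_show_output(output):
    """Parse KEY=VALUE lines, as printed by `systemctl show`, into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def service_is_running(props):
    """Assumed criterion: the unit is active and its sub-state is running."""
    return props.get("ActiveState") == "active" and props.get("SubState") == "running"


def query_service(unit):
    """Query a live unit; defined for completeness, not called in the demo."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState"],
        capture_output=True, text=True, check=False,
    )
    return parse_show_output(result.stdout)


# Canned output so the example runs without systemd.
sample_up = "ActiveState=active\nSubState=running\n"
sample_down = "ActiveState=failed\nSubState=failed\n"
print(service_is_running(parse_show_output(sample_up)))
print(service_is_running(parse_show_output(sample_down)))
```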
doc/system-ready/system-ready-HLD.md
Outdated
For instance,
- swss docker app is considered to be UP only when its systemd service is in running state and when its dependencies are met by checking state-db entries populated by *mgrs, like intfmgr, vrfmgr, vxlanmgr.
A few questions here:
- Which process/script is responsible for checking state-db entries populated by *mgrs?
- Which state-db entries should be checked?
- What could be the behavior by default if no one marks UP_STATUS to True?
- Does it mean that each docker service/app extension should be extended and find a way to mark its UP_STATUS to True in order to work with this framework?
Put simply, each app is responsible for marking its closest up status in STATE_DB. The sysmonitor tool just reads from it.
Any docker app that has multiple independent daemons can maintain a separate intermediate key-value in the redis-db for each daemon. The startup script that invokes these daemons can then determine the status from each daemon's redis entry and finally update the UP_STATUS in STATE_DB.
E.g., the swss docker app can wait for port init done and wait for Vrfmgr, Intfmgr and Vxlanmgr to be ready before marking its up status.
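A minimal sketch of that aggregation step follows, with a plain dict standing in for the per-daemon redis entries. The daemon key names and the `aggregate_up_status` helper are hypothetical; the reply does not specify which keys swss would use.

```python
def aggregate_up_status(daemon_ready, required):
    """Return True only when every required daemon has reported ready.

    `daemon_ready` stands in for the intermediate per-daemon redis entries;
    a missing entry is treated as not ready.
    """
    return all(daemon_ready.get(name, False) for name in required)


# Hypothetical key names for the swss example in the reply.
REQUIRED = ["portinit", "vrfmgr", "intfmgr", "vxlanmgr"]

partial = {"portinit": True, "vrfmgr": True, "intfmgr": False}
complete = {name: True for name in REQUIRED}

print(aggregate_up_status(partial, REQUIRED))
print(aggregate_up_status(complete, REQUIRED))
```

Only when the aggregate is true would the startup script go on to mark the app's up status in STATE_DB.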
Any app that agrees to comply with the system ready framework should first set "check_up_status" to "true" in the FEATURE table in CONFIG_DB, before marking its closest UP_STATUS in STATE_DB.
If the "check_up_status" field in the FEATURE table of CONFIG_DB is set to true for an app, the sysmonitor tool checks the running status of that app and then checks its UP_STATUS in STATE_DB.
If the entry is set to false or not set at all, sysmonitor reads only the running status for that service, not the app ready status.
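The decision described in the last three sentences can be sketched as below, with dicts standing in for the FEATURE tables in CONFIG_DB and STATE_DB. The function name, the lower-case `up_status` field name, and the choice to treat a missing up status as "not ready" when `check_up_status` is true are assumptions made for illustration; the thread does not settle them.

```python
def _is_true(value):
    """Treat "True"/"true" strings (and the bool True) as true."""
    return str(value).strip().lower() == "true"


def is_app_ready(name, config_feature, state_feature, service_running):
    """Hypothetical sketch of the readiness check described in the reply."""
    if not service_running:
        # Running status is always checked first.
        return False
    entry = config_feature.get(name, {})
    if not _is_true(entry.get("check_up_status", "false")):
        # Field false or absent: only the running status is consulted.
        return True
    # Field true: the app must also have marked its up status in STATE_DB.
    # A missing up status counts as not ready (assumption).
    return _is_true(state_feature.get(name, {}).get("up_status", "false"))


config = {
    "swss": {"state": "enabled", "check_up_status": "True"},
    "lldp": {"state": "enabled"},
}
state = {"swss": {"up_status": "false"}}

print(is_app_ready("swss", config, state, service_running=True))
print(is_app_ready("lldp", config, state, service_running=True))
```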
The system ready framework monitors all the SONiC services (docker + host) along with their app readiness to declare that the system is ready for network traffic. It does not monitor hardware status.
be1191a
to
3f9d005
Compare
Can you update this table using template #806, which displays the status in a better view? Thanks
Have updated the table as suggested.
This PR is reverted; it needs sign-off from @dgsudharsan.
My comments are addressed. @Junchao-Mellanox, can you please review from your end so I can sign off?
I don't have comments on the design itself. There are many good points in the PR (like using an event-driven way to monitor service status), but I am not sure that we need another service to do a job similar to system-health. See what we have:
So, my opinion is this: how about we extend system health instead of creating another new service?
@zhangyanzhao I don't understand why this PR has been merged while there is outstanding feedback.
The SONiC runtime environment is made up of systemd services, dockers and processes. A systemd service encompasses a docker or processes running on the host, so monitoring the status of the services provides a snapshot of overall system health. Real-time feedback is also helpful for knowing when a particular service, docker or critical process went down. The System ready feature therefore emphasises the following design philosophy:
In SONiC, we have service restart policies based on critical processes, so it is essential to monitor all the services. We already have supervisord monitoring the critical processes within a container. Supervisord's proc-exit-listener flags which critical process exited; when the service restarts as a result, the system ready feature instantly catches that the particular service is down. When it comes back up, it is flagged as up without any delay.

Service monitoring tells us that some critical process died, because we get a notification that one of the services went down. It also helps with container monitoring indirectly, as docker wait is a typical service criterion: once the docker goes down, the service goes down. The existing container_checker in monit is poll based and does not provide real-time feedback. So we don't need anything external to monitor the critical processes of dockers once service monitoring is in place. System ready is built entirely on top of the existing infrastructure, and with that, point #3 mentioned by Junchao-Mellanox for critical process monitoring is not required.

I understand that the new fix (sonic-net/sonic-buildimage#9068) only shows the status of the critical process through docker exec. What if the docker dies? And since everything (monit, container_checker, the new one in 9068) is polling based and polled every minute, it is not real time and is likely to miss critical service restarts. Benefits of system ready:
Agree, it makes sense to put the entire System ready feature implementation into the System health monitor code, and also to leverage the system ready feature to check for service readiness inside the system health monitor. This can definitely be considered for the next release, as we don't have much time for the 202111 release.
Not sure pushing a feature that needs to be redesigned is the right way.
Hi @sg893052, like I said, this design brings some good points, and I especially like the event-driven idea. But it is still similar to system-health (Azure/sonic-buildimage#9068 also supports monitoring services). I am not sure that adding a new service "SYSTEM READY" to 202111 and then merging it into system health in the future is a good idea. We cannot handle backward compatibility for users.
If we want to merge this feature into system health, some thoughts here:
I am OK to revert my merge. Please go ahead.
Support for System Ready (follow-up to #875)

| Repo | PR title | State |
| --- | --- | --- |
| sonic-buildimage | System Ready | |
| sonic-utilities | Show commands for System Ready | |