Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[healthd] Use unix_socket_path instead of loopback ip #14843

Merged
merged 5 commits into from
May 26, 2023

Conversation

vivekrnv
Copy link
Contributor

@vivekrnv vivekrnv commented Apr 25, 2023

Why I did it

interfaces-config service restarts networking service, which in-turn results in loopback interface address is being removed and reassigned back

If the system-health happens to start during that instance expections and logs like this are seen:

Apr 15 18:14:49.357869 r-panther-20 ERR healthd: update system status exception:Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:49.429778 r-panther-20 ERR healthd: subscribe_statedb exited- Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:52.218594 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:52.219714 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218636 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218722 r-panther-20 ERR healthd: system_service_Map_base::at

How I did it

use unix socket path

How to verify it

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@vivekrnv vivekrnv requested a review from lguohan as a code owner April 25, 2023 13:37
@vivekrnv
Copy link
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vivekrnv
Copy link
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Copy link
Collaborator

@qiluo-msft could you please help to review?

@liat-grozovik liat-grozovik requested a review from qiluo-msft May 2, 2023 07:31
@moshemos
Copy link

moshemos commented May 3, 2023

@qiluo-msft can you please help merge it, we need this fix for our 202211 release

@liat-grozovik
Copy link
Collaborator

@Junchao-Mellanox could you please help to review the concept?

@Junchao-Mellanox
Copy link
Collaborator

Junchao-Mellanox commented May 9, 2023

Try to understand the issue. Is system-health the only service that will connect redis during loopback interface flapping?

Maybe we should fix this line https://github.com/sonic-net/sonic-buildimage/blob/a738c39328b3e069a83162fe036ae693a8bd3d27/src/system-health/health_checker/sysmonitor.py#LL34C31-L34C42 . Use unix socket instead of tcp socket.

dgsudharsan
dgsudharsan previously approved these changes May 9, 2023
@vivekrnv
Copy link
Contributor Author

vivekrnv commented May 9, 2023

Try to understand the issue. Is system-health the only service that will connect redis during loopback interface flapping?

Maybe we should fix this line https://github.com/sonic-net/sonic-buildimage/blob/a738c39328b3e069a83162fe036ae693a8bd3d27/src/system-health/health_checker/sysmonitor.py#LL34C31-L34C42 . Use unix socket instead of tcp socket.

Regarding the first proposal, even other processes of system health will be using the loopback to connect to redis. So, the problem can still be seen even if just the system ready is updated.

There were a few instances of the same kind of issue reported. Using unix socket apparently causes some permission issues #10179 (comment)

@qiluo-msft can clarify on that.

@Junchao-Mellanox
Copy link
Collaborator

system health is a service managed by systemd, it is not a CLI. So I assume the permission issue is not relevant.

@vivekrnv
Copy link
Contributor Author

vivekrnv commented May 9, 2023

system health is a service managed by systemd, it is not a CLI. So I assume the permission issue is not relevant.

Hmm, FYI another instance of where this approach is used. c93716a. Seems to be a common practice across sonic.

@Junchao-Mellanox
Copy link
Collaborator

rsyslog service is a different case. It uses UDP which depends on LO interface. System health does not rely on UDP/TCP. So, I don't see an issue to use unix socket here.

@vivekrnv
Copy link
Contributor Author

vivekrnv commented May 9, 2023

rsyslog service is a different case. It uses UDP which depends on LO interface. System health does not rely on UDP/TCP. So, I don't see an issue to use unix socket here.

So you suggest moving all DBconnector initialization in system-health to Unix socket?

@vivekrnv vivekrnv changed the title [healthd] Start system-health after interfaces-config [healthd] Use unix_socket_path instead of loopback ip in system-health May 10, 2023
@vivekrnv vivekrnv changed the title [healthd] Use unix_socket_path instead of loopback ip in system-health [healthd] Use unix_socket_path instead of loopback ip May 10, 2023
@vivekrnv
Copy link
Contributor Author

@Junchao-Mellanox, updated with your recommendation. Can you review?

@vivekrnv
Copy link
Contributor Author

@judyjoseph to review for multi-asic.

@qiluo-msft Since this service runs in host, it is same behavior for multi-asic device too.

One question I have is which user is system-health service running on ? If it is not "root" we will have permissions issues accessing the unix socket path ?

cat ./system-health.service
[Unit]
Description=SONiC system health monitor
Requires=database.service updategraph.service
After=database.service updategraph.service

[Service]
ExecStart=/usr/local/bin/healthd
Restart=always

[Install]
WantedBy=multi-user.target

This comment might help answer your question. #14843 (comment)

@vivekrnv
Copy link
Contributor Author

@judyjoseph any other comments?

Copy link
Contributor

@judyjoseph judyjoseph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. And please make sure it is validated in a JIT based environment -- so that there is no permission issues.

@vivekrnv
Copy link
Contributor Author

vivekrnv commented May 19, 2023

LGTM. And please make sure it is validated in a JIT based environment -- so that there is no permission issues.

Can you expand on what JIT means? #Resolved

@keboliu
Copy link
Collaborator

keboliu commented May 26, 2023

Hi @qiluo-msft would you please merge this since it's approved?

@liat-grozovik liat-grozovik merged commit bc9c054 into sonic-net:master May 26, 2023
@keboliu
Copy link
Collaborator

keboliu commented May 29, 2023

@StormLiangMS would you please help to cherry-pick?

@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202211: #15249

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request May 29, 2023
- Why I did it

interfaces-config service restarts networking service, which in-turn results in loopback interface address is being removed and reassigned back

If the system-health happens to start during that instance expections and logs like this are seen:

Apr 15 18:14:49.357869 r-panther-20 ERR healthd: update system status exception:Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:49.429778 r-panther-20 ERR healthd: subscribe_statedb exited- Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:52.218594 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:52.219714 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218636 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218722 r-panther-20 ERR healthd: system_service_Map_base::at

- How I did it
use unix socket path

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
@vivekrnv
Copy link
Contributor Author

vivekrnv commented Jun 7, 2023

Issue was also seen on 202205 as reported here #15364 and thus request to backport to 202205

@bingwang-ms
Copy link
Contributor

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jun 15, 2023
- Why I did it

interfaces-config service restarts networking service, which in-turn results in loopback interface address is being removed and reassigned back

If the system-health happens to start during that instance expections and logs like this are seen:

Apr 15 18:14:49.357869 r-panther-20 ERR healthd: update system status exception:Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:49.429778 r-panther-20 ERR healthd: subscribe_statedb exited- Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:52.218594 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:52.219714 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218636 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218722 r-panther-20 ERR healthd: system_service_Map_base::at

- How I did it
use unix socket path

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202205: #15480

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.