Skip to content

Conversation

@zhengkezhou1
Copy link
Contributor

fix: #141

introduces a configurable number of retries for ZMQ connections. Previously, connection failures would terminate the process, but this new setting allows for more resilient network communication.

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
@zhengkezhou1 zhengkezhou1 force-pushed the ZMQ-connection-retries branch from 6ddab40 to e5f6f98 Compare August 20, 2025 09:14
@irar2
Copy link
Collaborator

irar2 commented Aug 20, 2025

@zhengkezhou1 Thanks for the PR!
Please add the new command line parameter to README, and please add a test to config_test.go

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
@irar2
Copy link
Collaborator

irar2 commented Aug 20, 2025

@zhengkezhou1 Please also add a test that actually checks the retries:
Don't start the sub, and try to start the pub, then after a delay start the pub and check that the publisher is successfully created.

Also, please validate that the number of retries is not negative (it can be set to a negative value in a config file - actually, I am not sure about that, maybe the fact that the field is uint is sufficient for yaml unmarshal to fail for negative values?).
And add a test that checks that invalid values are rejected (in config_test.go) both from command line and from a yaml file.

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
@zhengkezhou1
Copy link
Contributor Author

@zhengkezhou1 Please also add a test that actually checks the retries: Don't start the sub, and try to start the pub, then after a delay start the pub and check that the publisher is successfully created.

The ZMQ Connect operation is asynchronous, which means it immediately returns success without waiting to establish a physical connection.

Comment on lines 382 to 385
{
name: "invalid ZMQ retries time",
args: []string{"cmd", "zmq-retries-times", "-1", "--config", "../../manifests/config.yaml"},
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a scenario here for when retries is a negative number.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, please also check what happens when it is set to a negative number in config file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i add a new file invalid-config.yaml to address this. also test default value(0) in config.yaml

Comment on lines 382 to 385
{
name: "invalid ZMQ retries time",
args: []string{"cmd", "zmq-retries-times", "-1", "--config", "../../manifests/config.yaml"},
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, please also check what happens when it is set to a negative number in config file

Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
Copy link
Collaborator

@irar2 irar2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@irar2 irar2 merged commit 34c29ca into llm-d:main Aug 21, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add retries to connect to ZMQ

2 participants