Conversation

@jukofyork (Collaborator) commented Oct 31, 2025

This PR just joins the 1-byte command and 8-byte size packets with the main payload, so they are sent together.
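Roughly, the idea looks like the sketch below (illustrative only, not the exact PR code; the name send_rpc_msg and the raw sockfd parameter are assumptions):

```cpp
// Sketch: pack [1-byte cmd][8-byte size][payload] into one contiguous buffer
// and hand it to send() in a single call, instead of three separate send()s.
#include <cstdint>
#include <cstring>
#include <vector>
#include <sys/socket.h>

static bool send_rpc_msg(int sockfd, uint8_t cmd, const void * payload, uint64_t size) {
    std::vector<uint8_t> buf(1 + sizeof(size) + size);
    buf[0] = cmd;
    memcpy(buf.data() + 1, &size, sizeof(size));           // 8-byte size header
    memcpy(buf.data() + 1 + sizeof(size), payload, size);  // main payload (extra copy)
    size_t sent = 0;
    while (sent < buf.size()) {
        ssize_t n = send(sockfd, buf.data() + sent, buf.size() - sent, 0);
        if (n <= 0) {
            return false;
        }
        sent += (size_t) n;
    }
    return true;
}
```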

It didn't seem to make much difference for me at first, but different network setups may handle TCP_NODELAY differently.

Combined with #15405 it does seem to give a large speedup to TG now (more testing is needed to be sure).

It may be better to create the buffer once, outside the function, if it doesn't need to be thread-safe, so I'm just opening this as a draft for now to see whether it is worthwhile for others.
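If the per-call allocation turned out to matter, one option (just an assumption about how the "create the buffer once" idea could look, not what the draft does) is a per-thread scratch buffer that is grown once and reused, which stays safe without locking:

```cpp
// Hypothetical helper: each thread keeps its own scratch buffer, so no
// synchronisation is needed and the allocation cost is only paid on growth.
#include <cstddef>
#include <cstdint>
#include <vector>

static std::vector<uint8_t> & rpc_scratch_buffer(size_t needed) {
    thread_local std::vector<uint8_t> buf;
    if (buf.size() < needed) {
        buf.resize(needed);
    }
    return buf;
}
```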

@rgerganov (Collaborator) commented:

Combining multiple send calls into a single one doesn't bring measurable improvement on my setup. The reason for this is that TCP/IP doesn't have to wait for an ACK after sending a packet before sending another packet. This can be easily illustrated with tcpdump.

Here I am sending 3 SET_TENSOR commands with 256-byte payloads to an RPC server (master):

13:51:50.553828 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 1:2, ack 1, win 502, length 1
13:51:50.553836 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 2:10, ack 1, win 502, length 8
13:51:50.553841 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 10:266, ack 1, win 502, length 256
13:51:50.553859 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 266:267, ack 1, win 502, length 1
13:51:50.553865 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 267:275, ack 1, win 502, length 8
13:51:50.553870 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 275:531, ack 1, win 502, length 256
13:51:50.553875 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 531:532, ack 1, win 502, length 1
13:51:50.553877 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 532:540, ack 1, win 502, length 8
13:51:50.553912 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 540:796, ack 1, win 502, length 256
13:51:50.586978 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 2, win 251, length 0
13:51:50.587134 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 10, win 251, length 0
13:51:50.587236 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 266, win 249, length 0
13:51:50.587237 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 267, win 249, length 0
13:51:50.587591 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 531, win 249, length 0
13:51:50.587593 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 532, win 249, length 0
13:51:50.587594 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 540, win 249, length 0
13:51:50.587595 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 796, win 249, length 0

The total time is 50.587595 - 50.553828 = 0.033767 sec.

Here I am sending 3 SET_TENSOR commands, each command in a single packet (this PR):

13:52:52.190746 enx08920454eea4 Out IP 10.65.96.126.54706 > 10.6.123.118.8443: Flags [P.], seq 1:266, ack 1, win 502, length 265
13:52:52.190773 enx08920454eea4 Out IP 10.65.96.126.54706 > 10.6.123.118.8443: Flags [P.], seq 266:531, ack 1, win 502, length 265
13:52:52.190784 enx08920454eea4 Out IP 10.65.96.126.54706 > 10.6.123.118.8443: Flags [P.], seq 531:796, ack 1, win 502, length 265
13:52:52.224278 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.54706: Flags [.], ack 266, win 249, length 0
13:52:52.224280 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.54706: Flags [.], ack 531, win 249, length 0
13:52:52.224281 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.54706: Flags [.], ack 796, win 249, length 0

The total time is 52.224281 - 52.190746 = 0.033535 sec.

In both cases we are bound by the network latency, which is ~33ms. In other words, combining multiple send calls doesn't save us a round-trip to the server.

For this simple example the difference is less than 1ms for a very small tensor. For large tensors it'd be worse due to the additional copies being made.
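For what it's worth, the extra copy could in principle be avoided with scatter-gather I/O (POSIX writev), which hands several buffers to the kernel in one syscall without coalescing them first. This is only a sketch of that idea, not something either version of the code does, and partial-write handling is omitted:

```cpp
// Sketch: send [cmd][size][payload] in one syscall without copying the payload.
#include <cstdint>
#include <sys/uio.h>

static bool send_rpc_msg_iov(int sockfd, uint8_t cmd, uint64_t size, const void * payload) {
    struct iovec iov[3];
    iov[0].iov_base = &cmd;                         iov[0].iov_len = 1;
    iov[1].iov_base = &size;                        iov[1].iov_len = sizeof(size);
    iov[2].iov_base = const_cast<void *>(payload);  iov[2].iov_len = size;
    ssize_t n = writev(sockfd, iov, 3);
    return n == (ssize_t) (1 + sizeof(size) + size);
}
```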

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Oct 31, 2025
@jukofyork (Collaborator, Author) commented Oct 31, 2025

> Combining multiple send calls into a single one doesn't bring measurable improvement on my setup. The reason for this is that TCP/IP doesn't have to wait for an ACK after sending a packet before sending another packet. [...]
>
> In both cases we are bound by the network latency, which is ~33ms. In other words, combining multiple send calls doesn't save us a round-trip to the server.

Yeah, it didn't help me at all when I tried this last week and:

> For this simple example the difference is less than 1ms for a very small tensor. For large tensors it'd be worse due to the additional copies being made.

The extra memcpy did reduce TG slightly (~0.25 tokens/s).

The only reason I tried it again was to get it working with the volatile cache idea: it was sending the first 9 bytes as 2 packets and then spending significant time hashing the payload to send, so I suspected it might help more there.

It was only when I tried it with your other graph-reuse PR that I seemed to gain an extra 2-3 tokens/s on top of the already big increase from that PR (e.g. 15.5 --> 19.5 --> 22.5), but I need to test much more carefully to make sure this isn't due to some other change in the codebase since I ran the last tests.

I will try and run tcpdump too when I get back after the weekend and report if I find anything.

@jukofyork (Collaborator, Author) commented:

I'm closing this, as it doesn't seem to help, for the reason regarding latency pointed out by @rgerganov.

@jukofyork closed this Nov 5, 2025