Conversation

@jukofyork (Collaborator) commented Oct 31, 2025

This PR just joins the 1-byte command and 8-byte size packets with the main payload, so they are sent together.
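Roughly, the idea looks like the sketch below (illustrative only, not the exact PR code; the name send_rpc_msg and the raw sockfd parameter are assumptions):

```cpp
// Sketch: pack [1-byte cmd][8-byte size][payload] into one contiguous buffer
// and hand it to send() in a single call, instead of three separate send()s.
#include <cstdint>
#include <cstring>
#include <vector>
#include <sys/socket.h>

static bool send_rpc_msg(int sockfd, uint8_t cmd, const void * payload, uint64_t size) {
    std::vector<uint8_t> buf(1 + sizeof(size) + size);
    buf[0] = cmd;
    memcpy(buf.data() + 1, &size, sizeof(size));           // 8-byte size header
    memcpy(buf.data() + 1 + sizeof(size), payload, size);  // main payload (extra copy)
    size_t sent = 0;
    while (sent < buf.size()) {
        ssize_t n = send(sockfd, buf.data() + sent, buf.size() - sent, 0);
        if (n <= 0) {
            return false;
        }
        sent += (size_t) n;
    }
    return true;
}
```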

It didn't seem to make much difference for me at first, but different network setups may handle TCP_NODELAY differently.

Combined with #15405 it does seem to give a large speedup to TG now (more testing is needed to be sure).

It may be better to create the buffer once, outside the function, if it doesn't need to be thread-safe, so I'm just opening this as a draft for now to see whether it is worthwhile for others.
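If the per-call allocation turned out to matter, one option (just an assumption about how the "create the buffer once" idea could look, not what the draft does) is a per-thread scratch buffer that is grown once and reused, which stays safe without locking:

```cpp
// Hypothetical helper: each thread keeps its own scratch buffer, so no
// synchronisation is needed and the allocation cost is only paid on growth.
#include <cstddef>
#include <cstdint>
#include <vector>

static std::vector<uint8_t> & rpc_scratch_buffer(size_t needed) {
    thread_local std::vector<uint8_t> buf;
    if (buf.size() < needed) {
        buf.resize(needed);
    }
    return buf;
}
```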

@rgerganov (Collaborator) commented:

Combining multiple send calls into a single one doesn't bring measurable improvement on my setup. The reason for this is that TCP/IP doesn't have to wait for an ACK after sending a packet before sending another packet. This can be easily illustrated with tcpdump.

Here I am sending 3 SET_TENSOR commands with 256-byte payloads to an RPC server (master):

13:51:50.553828 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 1:2, ack 1, win 502, length 1
13:51:50.553836 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 2:10, ack 1, win 502, length 8
13:51:50.553841 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 10:266, ack 1, win 502, length 256
13:51:50.553859 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 266:267, ack 1, win 502, length 1
13:51:50.553865 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 267:275, ack 1, win 502, length 8
13:51:50.553870 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 275:531, ack 1, win 502, length 256
13:51:50.553875 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 531:532, ack 1, win 502, length 1
13:51:50.553877 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 532:540, ack 1, win 502, length 8
13:51:50.553912 enx08920454eea4 Out IP 10.65.96.126.36248 > 10.6.123.118.8443: Flags [P.], seq 540:796, ack 1, win 502, length 256
13:51:50.586978 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 2, win 251, length 0
13:51:50.587134 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 10, win 251, length 0
13:51:50.587236 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 266, win 249, length 0
13:51:50.587237 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 267, win 249, length 0
13:51:50.587591 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 531, win 249, length 0
13:51:50.587593 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 532, win 249, length 0
13:51:50.587594 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 540, win 249, length 0
13:51:50.587595 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.36248: Flags [.], ack 796, win 249, length 0

The total time is 50.587595 - 50.553828 = 0.033767 sec.

Here I am sending 3 SET_TENSOR commands, each command in a single packet (this PR):

13:52:52.190746 enx08920454eea4 Out IP 10.65.96.126.54706 > 10.6.123.118.8443: Flags [P.], seq 1:266, ack 1, win 502, length 265
13:52:52.190773 enx08920454eea4 Out IP 10.65.96.126.54706 > 10.6.123.118.8443: Flags [P.], seq 266:531, ack 1, win 502, length 265
13:52:52.190784 enx08920454eea4 Out IP 10.65.96.126.54706 > 10.6.123.118.8443: Flags [P.], seq 531:796, ack 1, win 502, length 265
13:52:52.224278 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.54706: Flags [.], ack 266, win 249, length 0
13:52:52.224280 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.54706: Flags [.], ack 531, win 249, length 0
13:52:52.224281 enx08920454eea4 In  IP 10.6.123.118.8443 > 10.65.96.126.54706: Flags [.], ack 796, win 249, length 0

The total time is 52.224281 - 52.190746 = 0.033535 sec.

In both cases we are bound by the network latency, which is ~33ms. In other words, combining multiple send calls doesn't save us a round-trip to the server.

For this simple example the difference is less than 1ms for a very small tensor. For large tensors it'd be worse due to the additional copies being made.
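For what it's worth, the extra copy could in principle be avoided with scatter-gather I/O (POSIX writev), which hands several buffers to the kernel in one syscall without coalescing them first. This is only a sketch of that idea, not something either version of the code does, and partial-write handling is omitted:

```cpp
// Sketch: send [cmd][size][payload] in one syscall without copying the payload.
#include <cstdint>
#include <sys/uio.h>

static bool send_rpc_msg_iov(int sockfd, uint8_t cmd, uint64_t size, const void * payload) {
    struct iovec iov[3];
    iov[0].iov_base = &cmd;                         iov[0].iov_len = 1;
    iov[1].iov_base = &size;                        iov[1].iov_len = sizeof(size);
    iov[2].iov_base = const_cast<void *>(payload);  iov[2].iov_len = size;
    ssize_t n = writev(sockfd, iov, 3);
    return n == (ssize_t) (1 + sizeof(size) + size);
}
```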

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Oct 31, 2025
@jukofyork (Collaborator, Author) commented Oct 31, 2025

> Combining multiple send calls into a single one doesn't bring measurable improvement on my setup. The reason for this is that TCP/IP doesn't have to wait for an ACK after sending a packet before sending another packet. [...]
>
> In both cases we are bound by the network latency, which is ~33ms. In other words, combining multiple send calls doesn't save us a round-trip to the server.

Yeah, it didn't help me at all when I tried this last week and:

> For this simple example the difference is less than 1ms for a very small tensor. For large tensors it'd be worse due to the additional copies being made.

The extra memcpy did reduce TG slightly (~0.25 tokens/s).

The only reason I tried it again was to get it working with the volatile cache idea: it was sending the first 9 bytes as 2 packets and then spending significant time hashing the payload to send, so I suspected it might help more there.

It was only when I tried it with your other graph-reuse PR that I seemed to gain an extra 2-3 tokens/s on top of the already big increase from that PR (e.g. 15.5 --> 19.5 --> 22.5), but I need to test much more carefully to make sure this isn't due to some other change in the codebase since I ran the last tests.

I will try and run tcpdump too when I get back after the weekend and report if I find anything.

@jukofyork (Collaborator, Author) commented:

I'm closing this, as it doesn't seem to help, for the reason regarding latency pointed out by @rgerganov.

@jukofyork closed this Nov 5, 2025