Worker communication optimization (aka removing netstring dependency) #644
Conversation
Makes a lot of sense to me 👍
Please don't merge this yet. I have to do a proper review and there are changes I don't like. For instance, the channel message size needs to be more than 4MB in order to allow big stats JSON files.
Sure, take your time. This doesn't change the message size to the best of my knowledge. Also, I do not fully understand why there is a size limit in the first place; it should be possible to have a vector that grows on demand instead of a fixed buffer size, without noticeable performance impact. In fact, on the receiving side Rust already supports anything up to 4GB, and on the TypeScript side it is artificially limited as well (we can just remove it).
…t to worker over channels, unify channel and payload channel message handling
…ding large numbers of memory re-allocations
@nazar-pc I need help with how to proceed with this PR and the Meson one. This one removes netstring and the other moves it to the Meson build system, so obviously they are gonna conflict.
@nazar-pc some cosmetic changes requested and some questions. I'm a bit afraid of these changes because this worked reliably for years (not that I doubt this PR). Are you testing this code already in your mediasoup setups?
…ation # Conflicts: # rust/CHANGELOG.md
> @nazar-pc I need help with how to proceed with this PR and the Meson one. This one removes netstring and the other moves it to the Meson build system, so obviously they are gonna conflict.
I didn't assume that all my changes would necessarily be accepted, so I made two independent PRs. This one is simpler (I think) and should go first if it seems reasonable. I will update the Meson PR to resolve any conflicts, including netstring removal.
> I'm a bit afraid of these changes because this worked reliably for years (not that I doubt this PR). Are you testing this code already in your mediasoup setups?
I do not have a production deployment with these changes, but I have been using them for ~2 weeks in local development, various tests and benchmarks and have not found any problems so far.
To summarize, the messaging mechanism with this PR is the following:
- Reads
  - Try to read 4 bytes
    - If there is not enough data, do nothing and wait for more data
    - Otherwise interpret them as the 32-bit length of the upcoming message, in the platform's native endianness
  - Try to read the message according to the length in the above 4 bytes
    - If there is not enough data, do nothing and wait for more data
    - Otherwise interpret the bytes as a new message and remove the length and message from the read buffer
  - Repeat
  - Both C++ and TypeScript have an optimization that doesn't remove the beginning of the read buffer immediately; instead all complete messages are read and just one `std::memmove()` is done at the very end, which reduces the number of unnecessary memory copies
- Writes
  - Write 4 bytes with the 32-bit length of the message in native endianness
  - Write the message itself
So while the changes are invasive, I would argue the algorithm in the end is simpler than the one that was in place before.
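To make the read loop above concrete, here is a minimal TypeScript sketch of the length-prefixed framing. The names and the little-endian assumption are mine for illustration; this is not the actual mediasoup `Channel` implementation.

```ts
import { Buffer } from 'node:buffer';

// Minimal sketch of length-prefixed message reading (hypothetical, not mediasoup's code).
class MessageReader {
  private buffer: Buffer = Buffer.alloc(0);

  // Append newly received bytes and return every complete message found.
  push(chunk: Buffer): Buffer[] {
    this.buffer = Buffer.concat([this.buffer, chunk]);

    const messages: Buffer[] = [];
    let offset = 0;

    while (this.buffer.length - offset >= 4) {
      // First 4 bytes: message length (little-endian assumed in this sketch).
      const length = this.buffer.readUInt32LE(offset);

      if (this.buffer.length - offset - 4 < length) {
        // Incomplete message - do nothing and wait for more data.
        break;
      }

      messages.push(this.buffer.subarray(offset + 4, offset + 4 + length));
      offset += 4 + length;
    }

    // Drop all consumed bytes in one step, mirroring the single std::memmove()
    // optimization mentioned above.
    if (offset > 0) {
      this.buffer = this.buffer.subarray(offset);
    }

    return messages;
  }
}
```

Feeding it chunks from a socket, e.g. `for (const msg of reader.push(chunk)) handle(msg)`, yields whole messages regardless of how the byte stream happens to be split.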
Testing this
No issues in my tests.
Merged in v3 :)
👍
Hi, we recently upgraded mediasoup in production and observed some really weird behaviour in direct transports and datachannels. I think I've tracked it down to the 3.9.0 release and this PR. What we observed was crazy high memory usage and a multi-minute delay in chat messages being delivered, but only when we had 8 people connected to a room. What's interesting is that they'd come in waves: everyone in the room would get the messages all at once, then nothing. Normal usage of data channels with webrtc transports appears normal. We think the issue relates to direct transports in conjunction with webrtc transports. Our architecture looks something like this: sender webrtc data producer -> sender direct data consumer -> sender direct data producer -> receiver direct data consumer -> receiver direct data producer -> receiver webrtc data consumer. We can repro this in our testing environment but I'm not sure how to debug this behaviour.
@GEverding your question should be posted on the forum, not in this PR
It is not like someone complained about it, but it bothered me quite a bit while working on worker communication.
The basic argument is that we use a human-readable format on pipes where no one would be able to read the content anyway, yet we pay for things like number formatting on every single message going back and forth between the library and the worker.
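As an illustration of that argument, here is a hedged sketch (hypothetical helpers, not mediasoup's actual code) contrasting netstring framing with the plain 32-bit length prefix this PR switches to:

```ts
import { Buffer } from 'node:buffer';

// Netstring framing: length formatted as ASCII digits, then ':', payload, ','.
// Every single message pays for formatting and parsing a decimal number.
function encodeNetstring(payload: Buffer): Buffer {
  return Buffer.concat([
    Buffer.from(`${payload.length}:`),
    payload,
    Buffer.from(',')
  ]);
}

// Length-prefix framing: a raw 32-bit length followed by the payload.
// No per-message number formatting or parsing is needed.
function encodeLengthPrefixed(payload: Buffer): Buffer {
  const header = Buffer.alloc(4);
  // Native endianness; little-endian assumed for this sketch.
  header.writeUInt32LE(payload.length, 0);
  return Buffer.concat([header, payload]);
}

// encodeNetstring(Buffer.from('{"ok":true}'))      -> 11:{"ok":true},
// encodeLengthPrefixed(Buffer.from('{"ok":true}')) -> \x0b\x00\x00\x00{"ok":true}
```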
This PR, like the others, is best reviewed commit by commit; that way it will make the most sense.
What this PR does
Results
Future work
One of the motivating factors, besides those mentioned above, is that I want to eventually eliminate pipes in Rust<->C++ communication and instead access each other's memory directly if possible. Having just a size prefix as bytes followed by the data, without netstring framing sprinkled in, makes it slightly easier to refactor things later.
Once pipes are eliminated it should be (theoretically) possible to implement custom SFU logic/extensions more or less efficiently in Rust land by grabbing and dispatching packets using a direct transport.