
Slow performance on Linux? #336

Closed

alshdavid opened this issue May 11, 2024 · 13 comments

Comments

alshdavid commented May 11, 2024

Hi, I have written a wrapper utility on top of ipc_channel that handles the handshake, swaps channels between the host and child, and adds a request/response API.

The performance on my M1 MBP was great, but I was surprised to find that the performance on Linux was significantly slower!

So I wrote a benchmark to test it out. The benchmark sends n requests, blocking on their responses (100k requests means 200k messages over the channel).
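
For reference, here is a minimal sketch of the blocking round-trip pattern the benchmark exercises. The message types are hypothetical stand-ins, and both endpoints live in one process here, whereas the real harness runs them in separate processes connected via a handshake:

```rust
use std::time::Instant;

use ipc_channel::ipc;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Request(u32);

#[derive(Serialize, Deserialize)]
struct Response(u32);

fn main() {
    let (req_tx, req_rx) = ipc::channel::<Request>().unwrap();
    let (res_tx, res_rx) = ipc::channel::<Response>().unwrap();

    // "Child" side: echo every request back as a response.
    std::thread::spawn(move || {
        while let Ok(Request(n)) = req_rx.recv() {
            res_tx.send(Response(n)).unwrap();
        }
    });

    let n: u32 = 100_000;
    let start = Instant::now();
    for i in 0..n {
        req_tx.send(Request(i)).unwrap(); // message 1: the request
        res_rx.recv().unwrap(); // message 2: block on the response
    }
    println!("{n} round trips in {:?}", start.elapsed());
}
```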

I'm not sure if it's my configuration (perhaps something else is interfering), but here are my results.

Hardware

  • Windows: AMD 5950x - Windows 10
  • Linux: AMD 5950x - Fedora 39
  • macOS: M1 MacBook Pro

Results

| Platform | Message count | Duration |
| --- | --- | --- |
| macOS | 10k | 0.487s |
| Windows | 10k | 0.356s |
| Linux | 10k | 2.301s |
| macOS | 100k | 1.550s |
| Windows | 100k | 3.497s |
| Linux | 100k | 13.608s |
| macOS | 1m | 14.404s |
| Windows | 1m | 34.769s |
| Linux | 1m | 150.514s |

Time taken for n round trip messages - Lower is better


I have tried with and without the memfd option enabled, and I have tried making this async (using tokio channels/threads), with the same outcome.

This is my wrapper (benchmarks are under examples)
https://github.com/alshdavid/ipc-channel-adapter

To run the benchmark, run `just bench {number_of_requests}`, e.g. `just bench 100000`.

I'm investigating whether another dependency is interfering and will update with my findings, but on the surface, any idea why this might be?

@alshdavid
Copy link
Author

alshdavid commented May 11, 2024

When running the benchmark using tokio, sending all the requests at once and waiting for the responses concurrently, it's a lot better.

Tested with `just bench-async`:

| Platform | Message count | Duration |
| --- | --- | --- |
| macOS | 100k | 1.176s |
| Windows | 100k | 0.368s |
| Linux | 100k | 4.026s |
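
For comparison, here is a thread-based approximation of that pipelined pattern (the actual `just bench-async` recipe uses tokio tasks; the message type is again a hypothetical stand-in):

```rust
use std::time::Instant;

use ipc_channel::ipc;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Msg(u32);

fn main() {
    let (req_tx, req_rx) = ipc::channel::<Msg>().unwrap();
    let (res_tx, res_rx) = ipc::channel::<Msg>().unwrap();

    // Echo server, as in the blocking sketch.
    std::thread::spawn(move || {
        while let Ok(m) = req_rx.recv() {
            res_tx.send(m).unwrap();
        }
    });

    let n: u32 = 100_000;
    let start = Instant::now();
    // Fire every request without waiting for its reply...
    let producer = std::thread::spawn(move || {
        for i in 0..n {
            req_tx.send(Msg(i)).unwrap();
        }
    });
    // ...and drain the replies as they stream back, so per-message
    // latency overlaps instead of being paid once per round trip.
    for _ in 0..n {
        res_rx.recv().unwrap();
    }
    producer.join().unwrap();
    println!("{n} pipelined round trips in {:?}", start.elapsed());
}
```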

alshdavid (Author) commented:

I was able to replicate this on Ubuntu. Wonder where the performance loss is occurring.

mrobinson (Member) commented Oct 2, 2024

Hello, we've been testing this benchmark on our own systems. When plugged in, we see benchmark results in line with the ones you have posted for non-Linux platforms, @alshdavid. That said, we've noticed that power saving mode, or throttling due to being unplugged, has a massive effect on the results. For instance, when I switch my machine to "Power Saver" in GNOME, the results I get are:

Ryzen 7 7840U / Ubuntu

| Power Save | Count | Duration |
| --- | --- | --- |
| Off | 100k | 3.072s |
| Off | 1m | 34.654s |
| On | 100k | 7.392s |
| On | 1m | 71.326s |

MacBook M3 Max

| Energy Mode | Count | Duration |
| --- | --- | --- |
| High | 100k | 2.389s |
| High | 1m | 22.772s |
| Low | 100k | 2.720s |
| Low | 1m | 26.808s |

Perhaps what's happening here is that the Linux implementation is very sensitive to power saving mode.

mukilan (Member) commented Oct 2, 2024

I can confirm the same (i.e., worse performance in power-saving mode and numbers on par with the OP's Windows and macOS results in performance mode) on NixOS 24.05, 24 × 12th Gen Intel® Core™ i7-12800HX, 64 GB RAM:

| Message count | Power saving | Performance mode |
| --- | --- | --- |
| 100K | 21.441s | 3.821s |
| 1M | 67.033s | 24.460s |

glyn (Contributor) commented Dec 17, 2024

Although the above measurements show Linux to be ten times slower than macOS and five times slower than Windows, it's not clear to me why this is unexpected. The platform layer has distinct code for Linux, macOS, and Windows based on completely different OS primitives, so some performance differences would not be surprising. In particular, I wonder if the macOS support benefits from using Mach ports, rather than BSD features, for better performance.

I'm also intrigued as to whether a factor of ten in these benchmarks represents a measurable performance problem for Servo (or for other projects consuming IPC channel, if there are any).

(I found one Servo issue specifically about layout of real world web pages being up to two times slower on Linux, when using "parallel" rather than "sequential" layout, but I have no idea if that could be caused by IPC channel performance differences.)

alshdavid (Author) commented:

We were evaluating using IPC channels at Atlassian for a project that has a Rust core which calls out to external processes (Node.js and other runtimes) to execute "plugin" code.

However, the messaging overhead on Linux machines made it impractical, so we looked at alternative options. IPC is certainly still preferred, as it's far simpler and a much nicer mental model than the alternatives.

glyn (Contributor) commented Dec 18, 2024

Thanks @alshdavid. Although Servo is probably the main consumer of IPC channel, I would be grateful for more information about your use case:

  1. How much faster would IPC channel have had to be to make its use practical for you?
  2. Did you find an alternative on Linux with acceptable performance?
  3. If so, was the alternative IPC-based, did it avoid IPC completely, or what?

alshdavid (Author) commented Dec 18, 2024

We are writing web build tooling, specifically the Atlaspack bundler, in Rust to help improve the feedback loop for developers working on internal projects.

At the moment Atlaspack is a fork of Parcel that is being incrementally rewritten in Rust.

The Rust core needs to call out to plugins written in JavaScript (essentially middleware for phases of the build). We intend to expand support for other languages.

Node.js has the capability to consume Rust code in the form of a dynamic C library, where we use Node's bindings to expose the Rust API to JavaScript (Go, Python, etc. share this capability).
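
As a rough illustration of that capability, here is a minimal napi-rs sketch exposing a Rust function to Node.js; `transform` is a hypothetical plugin hook, not Atlaspack's actual API:

```rust
// Compiled as a `cdylib` and loaded by Node.js as a native addon.
use napi_derive::napi;

/// Hypothetical plugin hook; a real one would parse and transform
/// module source rather than merely uppercasing it.
#[napi]
pub fn transform(source: String) -> String {
    source.to_uppercase()
}
```

From JavaScript this is then just `require('./index.node').transform(src)`.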

The initial thinking was that we could create a separate Node.js package that acted as a client for the IPC server provided by the core. That way, to add language support, we would just need to create a new language-specific client package that consumes the IPC API we design.


The problem is that this is very chatty (millions of requests over IPC), and the overhead quickly adds up to be substantial.

Alternatives

Embed the runtime

One option we looked at is embedding the runtime within the core, either statically or by loading it as a dynamic C library.

The downside is that this increases the binary size, locks the version of Node.js to the one supplied by the library (which can cause incompatibilities), complicates the story for statically compiled libraries, and increases the complexity/build time for CI/CD.

Wasm / Wasi

Maybe? We need to explore this option further.

Embed the bundler within Nodejs

This is what we are currently going with until we can think of a better solution. The Rust port will take some time, so we are hoping to find a better approach eventually.

This involves building the entire bundler as a Node.js NAPI module (compiling the bundler as a dynamic C library consumed by a Node.js entry point) and running it from within a Node.js host process.


This limits the ability to use different languages, increases the complexity, and is harder to reason about, as the entry point is a JavaScript file that jumps into Rust, which in turn jumps back and forth into Node.js and Node.js worker threads.

Compared to this approach, 1 million IPC messages add an overhead of +30s to +60s (roughly 30-60 µs per message), which is important because we are aiming for an overall complete build time of ~60s.

glyn (Contributor) commented Dec 18, 2024

That's helpful - thank you. So it seems we don't yet have evidence that any multi-process implementation could perform sufficiently well for very chatty use cases such as yours on Linux.

alshdavid (Author) commented Dec 18, 2024

Unlikely. Is the overhead seen here a result of the serialization/deserialization of values across the IPC bridge? If that's the case, can we just send pointers?

I am toying around with the idea of using shared memory between the processes to store Rust channels that act as a bridge, though I don't know enough about how that actually works yet. Still quite new to working with OS APIs.

Naively, I'm hoping I can store only a Rust channel in shared memory and send pointers to heap values between processes, though I don't know whether the receiving process can access the referenced value or the OS prevents this (virtual memory?).

Perhaps I can have access to a shared heap by forking the parent process? Or perhaps there is a custom Rust allocator that manages a cross-process shared heap.

glyn (Contributor) commented Dec 18, 2024

> Unlikely. Is the overhead seen here a result of the serialization/deserialization of values across the IPC bridge? If that's the case, can we just send pointers?

I believe IPC channel is predicated on (de)serialising values sent across the channel. So I suspect "direct" transmission of values is beyond the scope of IPC channel.

> I am toying around with the idea of using shared memory between the processes to store Rust channels that act as a bridge, though I don't know enough about how that actually works yet. Still quite new to working with OS APIs.

Shared memory or memory-mapped files are likely to be part of any performant solution. Indeed, the current implementation already uses shared memory.

These resources may be useful:

https://users.rust-lang.org/t/shared-memory-for-interprocess-communication/92408
https://stackoverflow.com/questions/14225010/fastest-technique-to-pass-messages-between-processes-on-linux
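
As a concrete starting point, note that IPC channel already exposes its shared-memory primitive directly, so a large payload can travel out of band while only a handle to the region crosses the channel. A minimal sketch:

```rust
use ipc_channel::ipc::{self, IpcSharedMemory};

fn main() {
    // IpcSharedMemory is serialisable, so it can itself be sent over
    // a channel; the bytes live in a shared region rather than being
    // copied through the channel body.
    let (tx, rx) = ipc::channel::<IpcSharedMemory>().unwrap();

    let payload = vec![42u8; 8 * 1024 * 1024]; // 8 MiB of data
    tx.send(IpcSharedMemory::from_bytes(&payload)).unwrap();

    // The receiver maps the same region; IpcSharedMemory derefs to &[u8].
    let shmem = rx.recv().unwrap();
    assert_eq!(shmem.len(), payload.len());
}
```

Note that this shares raw bytes, not Rust heap objects: the sender still copies its data into the region, and the receiver still sees only a byte slice.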

> Naively, I'm hoping I can store only a Rust channel in shared memory and send pointers to heap values between processes, though I don't know whether the receiving process can access the referenced value or the OS prevents this (virtual memory?).

> Perhaps I can have access to a shared heap by forking the parent process? Or perhaps there is a custom Rust allocator that manages a cross-process shared heap.

I personally think sharing (part of) the Rust heap between processes is a non-starter. It might be possible to build a library for managing shared memory or memory-mapped files as a way of passing values between processes, but that's likely to be a large piece of work.


That said, it feels to me that this discussion is going beyond an issue against the current IPC channel implementation and is getting into the realm of speculating about better alternatives. Would you be comfortable closing the issue?

alshdavid (Author) commented:

True, I am happy to close this issue. Thanks for helping out 🙏

glyn (Contributor) commented Dec 20, 2024

@alshdavid Thanks, and I wish you good progress with https://github.com/atlassian-labs/atlaspack.
