
Slow performance on Linux? #336

Closed

alshdavid opened this issue May 11, 2024 · 13 comments

Comments

alshdavid commented May 11, 2024

Hi, I have written a wrapper utility on top of ipc_channel that handles the handshake, swaps channels between the host and child, and adds a request/response API.

The performance on my M1 MBP was great, but I was surprised to find that the performance on Linux was significantly slower!

So I wrote a benchmark to test it out. The benchmark sends n requests, blocking on their responses (100k requests means 200k messages over the channel).
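
For reference, here is a minimal sketch of the blocking round-trip pattern the benchmark exercises. The message types are hypothetical stand-ins, and both endpoints live in one process here, whereas the real harness runs them in separate processes connected via a handshake:

```rust
use std::time::Instant;

use ipc_channel::ipc;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Request(u32);

#[derive(Serialize, Deserialize)]
struct Response(u32);

fn main() {
    let (req_tx, req_rx) = ipc::channel::<Request>().unwrap();
    let (res_tx, res_rx) = ipc::channel::<Response>().unwrap();

    // "Child" side: echo every request back as a response.
    std::thread::spawn(move || {
        while let Ok(Request(n)) = req_rx.recv() {
            res_tx.send(Response(n)).unwrap();
        }
    });

    let n: u32 = 100_000;
    let start = Instant::now();
    for i in 0..n {
        req_tx.send(Request(i)).unwrap(); // message 1: the request
        res_rx.recv().unwrap(); // message 2: block on the response
    }
    println!("{n} round trips in {:?}", start.elapsed());
}
```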

I'm not sure if it's my configuration (perhaps something else is interfering), but here are my results.

Hardware

  • Windows: AMD 5950x - Windows 10
  • Linux: AMD 5950x - Fedora 39
  • macOS: M1 MacBook Pro

Results

| Platform | Message count | Duration |
| --- | --- | --- |
| macOS | 10k | 0.487s |
| Windows | 10k | 0.356s |
| Linux | 10k | 2.301s |
| macOS | 100k | 1.550s |
| Windows | 100k | 3.497s |
| Linux | 100k | 13.608s |
| macOS | 1m | 14.404s |
| Windows | 1m | 34.769s |
| Linux | 1m | 150.514s |

Time taken for n round trip messages - Lower is better


I have tried with and without the memfd option enabled, and I have tried making this async (using tokio channels/threads), with the same outcome.

This is my wrapper (benchmarks are under examples)
https://github.com/alshdavid/ipc-channel-adapter

To run the benchmark, run `just bench {number_of_requests}`, e.g. `just bench 100000`.

I'm investigating whether another dependency is interfering and will update with my findings, but on the surface, any idea why this might be?

@alshdavid
Copy link
Author

alshdavid commented May 11, 2024

When running the benchmark using tokio, sending all the requests at once and waiting for the responses concurrently, it's a lot better.

Tested with `just bench-async`:

| Platform | Message count | Duration |
| --- | --- | --- |
| macOS | 100k | 1.176s |
| Windows | 100k | 0.368s |
| Linux | 100k | 4.026s |
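
For comparison, here is a thread-based approximation of that pipelined pattern (the actual `just bench-async` recipe uses tokio tasks; the message type is again a hypothetical stand-in):

```rust
use std::time::Instant;

use ipc_channel::ipc;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Msg(u32);

fn main() {
    let (req_tx, req_rx) = ipc::channel::<Msg>().unwrap();
    let (res_tx, res_rx) = ipc::channel::<Msg>().unwrap();

    // Echo server, as in the blocking sketch.
    std::thread::spawn(move || {
        while let Ok(m) = req_rx.recv() {
            res_tx.send(m).unwrap();
        }
    });

    let n: u32 = 100_000;
    let start = Instant::now();
    // Fire every request without waiting for its reply...
    let producer = std::thread::spawn(move || {
        for i in 0..n {
            req_tx.send(Msg(i)).unwrap();
        }
    });
    // ...and drain the replies as they stream back, so per-message
    // latency overlaps instead of being paid once per round trip.
    for _ in 0..n {
        res_rx.recv().unwrap();
    }
    producer.join().unwrap();
    println!("{n} pipelined round trips in {:?}", start.elapsed());
}
```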

alshdavid (Author) commented:

I was able to replicate this on Ubuntu. Wonder where the performance loss is occurring.

mrobinson (Member) commented Oct 2, 2024

Hello, we've been testing this benchmark on our own systems. When plugged in, we see benchmark results in line with the ones you have posted for non-Linux platforms, @alshdavid. That said, we've noticed that power saving mode, or throttling due to being unplugged, has a massive effect on the results. For instance, when I switch my machine to "Power Saver" in GNOME, the results I get are:

Ryzen 7 7840U / Ubuntu

| Power Save | Count | Duration |
| --- | --- | --- |
| Off | 100k | 3.072s |
| Off | 1m | 34.654s |
| On | 100k | 7.392s |
| On | 1m | 71.326s |

MacBook M3 Max

| Energy Mode | Count | Duration |
| --- | --- | --- |
| High | 100k | 2.389s |
| High | 1m | 22.772s |
| Low | 100k | 2.720s |
| Low | 1m | 26.808s |

Perhaps what's happening here is that the Linux implementation is very sensitive to power saving mode.

mukilan (Member) commented Oct 2, 2024

I can confirm the same (i.e., worse performance in power-saving mode and numbers on par with the OP's Windows and macOS results in performance mode) on NixOS 24.05, 24 × 12th Gen Intel® Core™ i7-12800HX, 64 GB RAM:

| Message count | Power saving | Performance mode |
| --- | --- | --- |
| 100K | 21.441s | 3.821s |
| 1M | 67.033s | 24.460s |

glyn (Contributor) commented Dec 17, 2024

Although the above measurements show Linux to be ten times slower than macOS and five times slower than Windows, it's not clear to me why this is unexpected. The platform layer has distinct code for Linux, macOS, and Windows based on completely different OS primitives, so some performance differences would not be surprising. In particular, I wonder if the macOS support benefits from using Mach ports, rather than BSD features, for better performance.

I'm also intrigued as to whether a factor of ten in these benchmarks represents a measurable performance problem for Servo (or for other projects consuming IPC channel, if there are any).

(I found one Servo issue specifically about layout of real world web pages being up to two times slower on Linux, when using "parallel" rather than "sequential" layout, but I have no idea if that could be caused by IPC channel performance differences.)

alshdavid (Author) commented:

We were evaluating using IPC channels at Atlassian for a project that has a Rust core which calls out to external processes (Node.js and other runtimes) to execute "plugin" code.

However, the messaging overhead on Linux machines made it impractical, so we looked at alternative options. IPC is certainly still preferred, as it's far simpler and a much nicer mental model than the alternatives.

glyn (Contributor) commented Dec 18, 2024

Thanks @alshdavid. Although Servo is probably the main consumer of IPC channel, I would be grateful for more information about your use case:

  1. How much faster would IPC channel have had to be to make its use practical for you?
  2. Did you find an alternative on Linux with acceptable performance?
  3. If so, was the alternative IPC-based, did it avoid IPC completely, or what?

alshdavid (Author) commented Dec 18, 2024

We are writing web build tooling, specifically the Atlaspack bundler, in Rust to help improve the feedback loop for developers working on internal projects.

At the moment Atlaspack is a fork of Parcel that is being incrementally rewritten in Rust.

The Rust core needs to call out to plugins written in JavaScript (essentially middleware for phases of the build). We intend to expand support for other languages.

Node.js has the capability to consume Rust code in the form of a dynamic C library, where we use Node's bindings to expose the Rust API to JavaScript (Go, Python, etc. share this capability).
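
As a rough illustration of that capability, here is a minimal napi-rs sketch exposing a Rust function to Node.js; `transform` is a hypothetical plugin hook, not Atlaspack's actual API:

```rust
// Compiled as a `cdylib` and loaded by Node.js as a native addon.
use napi_derive::napi;

/// Hypothetical plugin hook; a real one would parse and transform
/// module source rather than merely uppercasing it.
#[napi]
pub fn transform(source: String) -> String {
    source.to_uppercase()
}
```

From JavaScript this is then just `require('./index.node').transform(src)`.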

The initial thinking was that we could create a separate Node.js package that acted as a client for the IPC server provided by the core. That way, to add language support, we would just need to create a new language-specific client package that consumes the IPC API we design.


The problem is that this is very chatty (millions of requests over IPC), and the overhead quickly adds up to be substantial.

Alternatives

Embed the runtime

One option we looked at is embedding the runtime within the core, either statically or by loading it as a dynamic C library.

The downside is that this increases the binary size, locks the version of Node.js to the one supplied by the library (which can cause incompatibilities), complicates the story for statically compiled libraries, and increases the complexity/build time for CI/CD.

Wasm / Wasi

Maybe? We need to explore this option further.

Embed the bundler within Nodejs

This is what we are currently going with until we can think of a better solution. The Rust port will take some time, so we are hoping to find a better approach eventually.

This involves building the entire bundler as a Node.js NAPI module (compiling the bundler as a dynamic C library consumed by a Node.js entry point) and running it from within a Node.js host process.


This limits the ability to use different languages, increases the complexity, and is harder to reason about, as the entry point is a JavaScript file that jumps into Rust, which in turn jumps back and forth into Node.js and Node.js worker threads.

Compared to this approach, 1 million IPC messages add an overhead of +30s to +60s (roughly 30-60 µs per message), which is important because we are aiming for an overall complete build time of ~60s.

glyn (Contributor) commented Dec 18, 2024

That's helpful - thank you. So it seems we don't yet have evidence that any multi-process implementation could perform sufficiently well for very chatty use cases such as yours on Linux.

alshdavid (Author) commented Dec 18, 2024

Unlikely. Is the overhead seen here a result of the serialization/deserialization of values across the IPC bridge? If that's the case, can we just send pointers?

I am toying around with the idea of using shared memory between the processes to store Rust channels that act as a bridge, though I don't know enough about how that actually works yet. Still quite new to working with OS APIs.

Naively, I'm hoping I can store only a Rust channel in shared memory and send pointers to heap values between processes, though I don't know whether the receiving process can access the referenced value or the OS prevents this (virtual memory?).

Perhaps I can have access to a shared heap by forking the parent process? Or perhaps there is a custom Rust allocator that manages a cross-process shared heap.

glyn (Contributor) commented Dec 18, 2024

> Unlikely. Is the overhead seen here a result of the serialization/deserialization of values across the IPC bridge? If that's the case, can we just send pointers?

I believe IPC channel is predicated on (de)serialising values sent across the channel. So I suspect "direct" transmission of values is beyond the scope of IPC channel.

> I am toying around with the idea of using shared memory between the processes to store Rust channels that act as a bridge, though I don't know enough about how that actually works yet. Still quite new to working with OS APIs.

Shared memory or memory-mapped files are likely to be part of any performant solution. Indeed, the current implementation already uses shared memory.

These resources may be useful:

https://users.rust-lang.org/t/shared-memory-for-interprocess-communication/92408
https://stackoverflow.com/questions/14225010/fastest-technique-to-pass-messages-between-processes-on-linux
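
As a concrete starting point, note that IPC channel already exposes its shared-memory primitive directly, so a large payload can travel out of band while only a handle to the region crosses the channel. A minimal sketch:

```rust
use ipc_channel::ipc::{self, IpcSharedMemory};

fn main() {
    // IpcSharedMemory is serialisable, so it can itself be sent over
    // a channel; the bytes live in a shared region rather than being
    // copied through the channel body.
    let (tx, rx) = ipc::channel::<IpcSharedMemory>().unwrap();

    let payload = vec![42u8; 8 * 1024 * 1024]; // 8 MiB of data
    tx.send(IpcSharedMemory::from_bytes(&payload)).unwrap();

    // The receiver maps the same region; IpcSharedMemory derefs to &[u8].
    let shmem = rx.recv().unwrap();
    assert_eq!(shmem.len(), payload.len());
}
```

Note that this shares raw bytes, not Rust heap objects: the sender still copies its data into the region, and the receiver still sees only a byte slice.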

> Naively, I'm hoping I can store only a Rust channel in shared memory and send pointers to heap values between processes, though I don't know whether the receiving process can access the referenced value or the OS prevents this (virtual memory?).

> Perhaps I can have access to a shared heap by forking the parent process? Or perhaps there is a custom Rust allocator that manages a cross-process shared heap.

I personally think sharing (part of) the Rust heap between processes is a non-starter. It might be possible to build a library for managing shared memory or memory-mapped files as a way of passing values between processes, but that's likely to be a large piece of work.


That said, it feels to me that this discussion is going beyond an issue against the current IPC channel implementation and is getting into the realm of speculating about better alternatives. Would you be comfortable closing the issue?

alshdavid (Author) commented:

True, I am happy to close this issue. Thanks for helping out 🙏

glyn (Contributor) commented Dec 20, 2024

@alshdavid Thanks, and I wish you good progress with https://github.com/atlassian-labs/atlaspack.
