jemalloc probably won't work well on aarch64-linux #91

Closed
thoughtpolice opened this issue Feb 9, 2023 · 7 comments

@thoughtpolice
Contributor

thoughtpolice commented Feb 9, 2023

Leaving this here while I'm using the laptop, so that I don't forget it. Maybe something can be done, or not. But this will probably come back to bite someone eventually, I suspect.


jemalloc seems to currently be a dependency as Buck2's global allocator. While I understand jemalloc is a big part of what makes Facebook tick, and it's excellent, there is a problem: jemalloc compiles the page size of the host operating system into the library, effectively making it part of its ABI. In other words, if you build jemalloc on a host with page size X, and then run it on an OS with page size Y, and X != Y, then things get bad; your program just crashes.

Until relatively recently, this hasn't been a problem. Why? Because most systems have collectively decided that 4096-byte pages are good enough (that's wrong, but not important). So almost everything uses that, except for the new fancy Apple Silicon M-series, such as my M2 MBA. These systems exclusively make use of not 4k, but 16k pages. This page size is perfectly allowed by the architecture (aarch64 defines 4k, 16k, and 64k translation granules) and 16k pages are a great choice for many platforms, especially client ones.

So the problem begins to crop up once people start building aarch64-linux binaries for their platforms; e.g. Arch Linux ARM or NixOS, which distribute aarch64 binaries. Until the advent of Apple Silicon, you could reasonably expect everything to use the same page size. But now we have this new, reasonably popular platform using 16k pages. There's a weird thing happening here: most of the systems building packages for users are some weird ARM board (or VM) in a lab churning out binaries 24/7. They just need to run Linux and not catch fire. They aren't very fast, they typically have old CPUs, and they often run custom, hacked Linux kernels that barely work. But most developers or end users? They want good performance and lots of features, with a stable kernel. For ARM platforms, the only options they reasonably have today for supported ARM systems are Raspberry Pis, the Nvidia Jetson series, and now Apple Silicon. And Apple Silicon is, without comparison, the best bang for your buck and the highest performer. So I feel users are more likely to end up on one platform, which is becoming more popular, while the systems churning out packages will use another, incompatible one.

This isn't a theoretical concern; Asahi Linux users like myself still (somewhat often) run into broken software. jemalloc isn't the only thing that doesn't support non-4k pages easily; it's just one of the more notorious and easy-to-spot culprits, and it turns otherwise working packages into non-working ones: https://github.com/AsahiLinux/docs/wiki/Broken-Software

Right now, I'm building buck2 manually, so this isn't a concern. But it means my binaries aren't applicable to non-AS users, and vice versa.


So there are a few reasonable avenues of attack here:

  • Don't use a custom allocator at all, and rely on libc.
    • Probably not good; most libc's notoriously aim for "good" steady state performance, not peak performance under harsher conditions.
  • Turn off jemalloc only on aarch64
    • Maybe OK, though a weird incongruence.
  • Turn on jemalloc only when the user (e.g. internal FB builds) asks for it.
    • Maybe OK; at least you could make the argument y'all have enough money to support customized builds like this while the rest of us need something else.
    • You're already doing your own custom builds, so maybe this isn't a big deal
  • Switch to another allocator, wholesale
    • Could also make it a configurable toggle
    • Making it a toggle is potentially a footgun though; it's the kind of "useless knob" that people only bang on once the other ones don't work and they're desperate. This makes it more likely to bitrot, for it to lag in testing and performance eval, etc.
    • I've had very good experience with mimalloc; much like jemalloc it also has an excellent design, fun codebase, and respectable author (Daan Leijen fan club.)
      • But I haven't confirmed it avoids this particular quirk of jemalloc's design. Maybe a dead end.
    • It would probably require a bunch of testing on a large codebase to see what kind of impact this change has. I suspect the FB codebase is a good place to try. ;)

I don't know which one of these is the best option.

@thoughtpolice
Contributor Author

Please note that this isn't causing huge problems for me. Yet. But eventually I want to distribute aarch64-linux builds of my Nix package for buck2. So, this is mainly just to catalogue the issue since I suspect any movement towards an actual solution will require a bit of stakeholder input, and because someone else may eventually run into it.

@ndmitchell
Contributor

We have now disabled jemalloc everywhere apart from Mac/Linux, since it doesn't play well on other OSes like Illumos (#120). We also disable jemalloc if you are doing a build of Buck2 with Buck2, mostly because we haven't set up Buck2 to build jemalloc. Is that enough?

@thoughtpolice
Contributor Author

That's close, but an aarch64-linux package for e.g. NixOS (just as an example) won't be able to work cross-system unless we also turn off jemalloc there, too.

Would a patch to make jemalloc an optional Cargo feature be accepted? Then we could just turn it off, and it could default to on, to leave the current behavior. I could write this.

It might also be worth exploring if other allocators can boost performance while more gracefully handling these requirements.

@ndmitchell
Contributor

Internally we'll probably always use jemalloc, and it's been well tuned, so I am skeptical that there is anything else out there with higher performance. But if you find something, we'd switch.

Happy for it to be an optional Cargo feature. Note that it is already gated on a few things. Alternatively, have you tried asking upstream at jemalloc, in case they can have a fallback path for the NixOS example?

@thoughtpolice
Contributor Author

I believe Jason has stated multiple times that the page size being (effectively) part of the ABI isn't going to change because it would require a large rework; see jemalloc/jemalloc#467. That said, apparently jemalloc can support page sizes smaller than the baked-in one (e.g. build for 16k, run on 4k using the --with-lg-page option), and it looks like NixOS enables that feature! Which is nice, but...

It would probably be good to still add a flag for places that don't enable this, and also because packages like jemallocator tend to do things like build their own copy of jemalloc as part of their build.rs, which then thwarts usage of the NixOS version with the appropriate flags. While this can (and probably should) be fixed on our side, it's likely not the last time something like this will happen.
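
To make the page-size rule concrete, here is a minimal sketch of the compatibility check as I understand it: a jemalloc configured with `--with-lg-page=N` assumes pages of `2^N` bytes, and that should work as long as the kernel's real page size is no larger. The function name `page_size_compatible` is made up for illustration and is not part of jemalloc or Buck2.

```rust
/// Hypothetical helper: returns true when a jemalloc built with
/// `--with-lg-page=<built_lg_page>` could run on a kernel whose page size
/// is `runtime_page_size` bytes.
///
/// Assumption: the allocator's page size must be at least as large as the
/// system page size, and page sizes are powers of two.
fn page_size_compatible(built_lg_page: u32, runtime_page_size: usize) -> bool {
    // 2^built_lg_page, guarding against an absurdly large exponent.
    let built_page_size = match 1usize.checked_shl(built_lg_page) {
        Some(size) => size,
        None => return false,
    };
    runtime_page_size.is_power_of_two() && runtime_page_size <= built_page_size
}

fn main() {
    // (lg-page used at build time, page size of the machine running the binary)
    let cases = [(12, 4096), (12, 16384), (14, 4096), (14, 16384), (16, 16384)];
    for (lg, runtime) in cases {
        println!(
            "built for {:>5} B pages, running on {:>5} B pages: compatible = {}",
            1usize << lg,
            runtime,
            page_size_compatible(lg, runtime)
        );
    }
}
```

The second case (built for 4k, run on 16k) is the one that bites Asahi users; building for 16k covers both 4k and 16k hosts.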

@ndmitchell
Contributor

A feature flag to disable jemalloc seems reasonable, and like it would solve all the issues here. Patch welcome.
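
For reference, a feature-gated global allocator could look roughly like the sketch below. This is not the actual Buck2 code: the feature name `jemalloc` is hypothetical, and it assumes Cargo.toml lists the `jemallocator` crate as an optional dependency enabled by that feature. The `allocator_name` helper is also invented for illustration.

```rust
// Assumption: Cargo.toml declares `jemallocator` as an optional dependency
// and a (hypothetical) feature `jemalloc` that turns it on, possibly by default.

// With the feature enabled, route all allocations through jemalloc.
#[cfg(feature = "jemalloc")]
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

// With the feature disabled, fall back to the platform's libc allocator,
// which does not bake a page size into the binary.
#[cfg(not(feature = "jemalloc"))]
#[global_allocator]
static GLOBAL: std::alloc::System = std::alloc::System;

/// Hypothetical helper: reports which allocator was selected at compile time.
fn allocator_name() -> &'static str {
    if cfg!(feature = "jemalloc") {
        "jemalloc"
    } else {
        "system"
    }
}

fn main() {
    // Any heap allocation goes through whichever allocator was selected above.
    let buffer: Vec<u8> = vec![0; 1024];
    println!("allocator: {}, allocated {} bytes", allocator_name(), buffer.len());
}
```

A packager who cannot control the page size of the build machine would then build with the feature switched off, e.g. via `cargo build --no-default-features` if the feature is on by default.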

@thoughtpolice
Contributor Author

This was fixed by #693: the pre-built binaries are now built for 16k pages. Users who build aarch64 buck2 binaries on their own still need to enable that themselves, though.
