-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove seccomp and landlock #47
Conversation
This patch completely removes landlock from the Linux sandbox. While landlock provides solid filesystem isolation even without requiring any intricate Linux sandboxing knowledge, it is currently still too limited to build a "bulletproof" filesystem sandbox. As such, it is better suited for best-effort isolation of "assumed safe" applications, rather than sandboxing of "potentially hazardous" software. For Birdcage, these are the biggest limitations: - Locking down sockets/pipes - High kernel requirement for all basic features (FS_TRUNCATE is 6.2)
This patch completely removes seccomp from the Linux sandbox. Currently the only usage of seccomp was to block system calls for network filtering. However since user namespaces already isolate networking, seccomp isn't necessary anymore. Seccomp could be useful in the future to limit some system calls that could cause undesired system changes, but these types of system calls usually require elevated permissions already.
723ae65
to
aab3123
Compare
Pinging @l0kod since I know you've documented our landlock usage in some places, so I just wanted to make sure you're aware we're not using it anymore. I think in our past discussions you've already pointed out that landlock is better suited as a "best-effort sandboxing" mechanism for "normal" applications. So this might not come as much of a surprise to you. I definitely do feel like it's a pleasure to use and I'm sure once development has progressed a little further its application in areas other than "best-effort sandboxing" will become more prevalent too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, the kernel requirement now should be Linux 3.8 (for user namespaces). That's awesome.
Yeah I came to the same conclusion and considered adding it, but I wasn't 100% sure and that is so low that most setups should just work anyway, so no need to make guarantees you can't keep. |
Thanks for the heads up.
To clarify a bit, Landlock is designed with backward and forward compatibility in mind, and we encourage everyone to sandbox as much as possible. Given the nature of the Linux kernel and distros, most application developers cannot have guarantees on which kernel version (and then which security features) will be available at run time. To protect users as much as possible, we then encourage to follow a best-effort security approach, which means to not disable the whole sandboxing even if only some security features are available on users' kernels. This is easier said than done, hence the ongoing effort on the rust-landlock library to make developers' live easier.
Good to know, thanks! Commenting on the commit messages:
This is correct, Landlock is not feature-complete yet (e.g. compared to SELinux), but as explained above still very useful. The reason is because developing new kernel features requires time, and it is much better to follow an incremental development and adoption. About the "bulletproof filesystem sandbox", only SELinux could be described like that.
With the current Linux kernel, it is correct to say that Landlock alone is not enough to protect against any malicious code that would like to harm the system, but as always, it depends on the threat model. Landlock can already protect against illegitimate access to (sensitive) data stored on the filesystem, but it cannot protect against e.g., network data transfers yet.
FYI, restricting sockets according to their type is planned, and control of other IPC mechanisms should follow. I'm wondering about the threat model specific to pipes though.
We indeed need a 6.2 kernel (e.g. Ubuntu 22.04 LTS) to be able to restrict truncate operations . It should be noted that it will become more and more common, especially thanks to the LTS kernel lifetime shrinking (and then bringing new features more quickly).
The initial goal of seccomp is to limit kernel's attack surface. Replacing its use with user namespaces would mainly increase the kernel attack surface.
Not only "privileged" syscalls can harm the kernel. This is why most container runtimes use seccomp. Linux namespaces are not intended to be used by untrusted processes. For any sandboxer, I strongly suggest to use seccomp (to limit kernel attack surface and avoid sandbox escape) and Landlock (for an ephemeral and run-time-configurable access control system), and to disable user namespaces (and drop capabilities) inside the sandbox. |
Looking forward to test this out.
There's no threat model with pipes, I just copied that from the landlock kernel docs. The issue is just with sockets specifically because of Wayland/X11 allowing nefarious stuff done through socket communication.
That's fair, though the previous seccomp sandbox was extremely limited in its restrictions anyway, focusing only on networking. I definitely want to bring it back, but I'd like to focus more on a general "sensible syscall" list rather than just locking down networking.
That's fair and our previous seccomp filter was heavily inspired by docker's filter already. It just needs to be reworked in its focus.
Namespace usage can definitely be locked down with seccomp inside the sandbox in the future and specifically the comment about complexity ( However for now, they still seem like a more reliable tool to lock down a sandbox than landlock, which is probably why most sanboxes I've looked into use namespaces extensively.
While this might be true in 2-3 years, I'm not sure I can see this working today. As you've said socket restrictions are upcoming, but not supported yet. So it'll be years before the common LTS releases have this functionality. Layering of landlock/mount namespaces isn't possible, so the only options are using landlock without locking down sockets, or using mount namespaces which supports locking down everything. So considering the options available today, it seems to me like user mount namespaces are the only viable choice, despite their numerous issues. Especially for an application that doesn't want to provide an suid sandbox binary. Also I really appreciate your feedback, so thanks for volunteering your time. |
Unfortunately, this has been our exact dilemma with the first Landlock ABI because of But, of course, the newer Landlock ABIs have resolved this issue.
Looking forward to this! Once kernels with this feature are generally available, I hope to see Birdcage make the switch back to Landlock. |
Indeed. I was curious because I think restricting pipes/fifos doesn't make sense for a sandbox, but we'll see if there is a need.
Sounds good.
The main reason most sandboxes use namespaces is because Landlock is pretty new and it wasn't available a few years ago: not everyone is aware of Landlock, not every users have an up-to-date kernel, and namespaces may seem good-enough for sandboxers that invested in them. And as you pointed out, some filesystem access types cannot be restricted with Landlock yet.
Not requiring a SUID sandbox binary is definitely a good thing. Using namespaces (when available) makes sense too (e.g. to create a limited view of the filesystem). I understand that seccomp is not the priority, which is OK. If "layering of landlock/mount namespace isn't possible", layering of mount/Landlock definitely works. I'm suggesting to use namespaces, then seccomp (locking down the availability of user namespaces and uncommon syscalls), then Landlock (partial restrictions but could still help with fine-grained access control, if supported by the running system). I'm not sure about the availability of user namespaces to unprivileged users though. I know some (most?) distros disable it. I encourage you to check that. For instance, see Unprivileged user namespace restrictions via AppArmor in Ubuntu.
I also appreciate this serene discussion and your feedback on Landlock. It's of course up to you to choose what suits you best, taking into account different constraints. As a Landlock developer, I might be biased, but these are genuine security suggestions. One reason Landlock exists is because of the lack of real sandboxing features for Linux: namespaces are very useful, but we'd like the right tool for the right job. |
If the |
About the |
Using landlock for more granular control on top of mount namespaces might be worthwhile. My main concern with landlock on top of mount namespaces to restrict hierarchy access was that the filesystem tree is basically, fake, so you're not really enforcing anything that you haven't enforced with mount namespaces already anyway. But one feature regression I did notice with mount namespaces is that you cannot create a write-only bind mount. So that's where landlock could come in again. That said, I don't expect write-only access to be a major requirement of any application we sandbox. |
This might be close but not the same. For instance, with Landlock you can allow access to a file or directory, but not its parent (e.g. |
This is fair, but very hard to abstract on top of with a generic sandbox that should work on macOS too. And for most scenarios this is probably too granular anyway, at least with our sandbox I don't actually know which operations exactly are going to be performed. |
We're already working on sandboxing by [blocking all unneeded syscalls](#882). However, due to the wide scope it will take a while longer. This PR starts with a much smaller scope, only blocking network-related syscalls until the above is ready. For security we block the following with `seccomp`: - creation of new sockets - these are unneeded in PVF jobs, and we can safely block them without affecting consensus. - `io_uring` - as discussed [here](paritytech/polkadot#7334 (comment)), io_uring allows for networking and needs to be blocked. See below for a discussion on the safety of doing this. - `connect`ing to sockets - the above two points are enough for networking and is what birdcage does (or [used to do](phylum-dev/birdcage#47)) to restrict networking. However, it is possible to [connect to abstract unix sockets](https://lore.kernel.org/landlock/20231023.ahphah4Wii4v@digikod.net/T/#u) to do some kinds of sandbox escapes, so we also block the `connect` syscall. (Intentionally left out of implementer's guide because it felt like too much detail.) `io_uring` is just a way of issuing system calls in an async manner, and there is nothing stopping wasmtime from legitimately using it. Fortunately, at the moment it does not. Generally, not many applications use `io_uring` in production yet, because of the numerous kernel CVEs discovered. It's still under a lot of development. Android outright banned `io_uring` for these reasons. Considering `io_uring`'s status, and that it very likely would get detected either by our [recently-added static analysis](#1663) or by testing, I think it is fairly safe to block it. If execution hits an edge case code path unique to a given machine, it's already taken a non-deterministic branch anyway. After all, we just care that the majority of validators reach the same result and preserve consensus. So worst-case scenario, there's a dispute, and we can always admit fault and refund the wrong validator. On the other hand, if all validators take the code path that results in a seccomp violation, then they would all vote against the current candidate, which is also fine. The violation would get logged (in big scary letters) and hopefully some validator reports it to us. Actually, a worst-worse-case scenario is that 50% of validators vote against, so that there is no consensus. But so many things would have to go wrong for that to happen: 1. An update to `wasmtime` is introduced that uses io_uring (unlikely as io_uring is mainly for IO-heavy applications) 2. The new syscall is not detected by our static analysis 3. It is never triggered in any of our tests 4. It then gets triggered on some super edge case in production on 50% of validators causing a stall (bad but very unlikely) 5. Or, it triggers on only a few validators causing a dispute (more likely but not as bad?) Considering how many things would have to go wrong here, we believe it's safe to block `io_uring`. Closes #619 Original PR in Polkadot repo: paritytech/polkadot#7334
See individual commits for details.