Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hvt_drop_privileges() behaviour #282

Open
mato opened this issue Oct 9, 2018 · 7 comments
Open

hvt_drop_privileges() behaviour #282

mato opened this issue Oct 9, 2018 · 7 comments
Labels
enhancement host/linux Applicable to Linux hosts target/hvt Applicable to hvt target

Comments

@mato
Copy link
Member

mato commented Oct 9, 2018

#276 adds scaffolding for hvt_drop_privileges() and an OpenBSD pledge implementation.

hvt_drop_privileges() is called by the tender just before entering the VCUP loop. i.e.

  • after all host resources have been acquired (all modules set up)
  • guest ELF has been loaded and memory set up.

We should decide what the behaviour should be on Linux and FreeBSD, to get "secure by default" behaviour. At this time I'm not prepared to add actual seccomp() filtering or Capsicum here, but I think that the behaviour should include at least:

  1. Droppng privileges to an unprivileged user if the tender is running as root.
  2. chroot() to a suitable directory.

@hannesm: Presumably the FreeBSD case is easier, since there will be a suitable non-privileged UID always available and /var/empty could be used for the chroot?

For the Linux case, I'm not sure what the best default behaviour that is guaranteed to work cross-distribution is -- could nobody be used as the unprivileged user? I don't know of any equivalent of /var/empty, but there might be something in the LSB/FHS.

A second case is when this is run out of a container in Linux -- can we guarantee that a nobody will always be available there?

/cc @adamsteen

@hannesm
Copy link
Contributor

hannesm commented Oct 9, 2018

@mato sounds sensible for FreeBSD /cc @sg2342 who may have an opinion here

@sg2342
Copy link
Contributor

sg2342 commented Oct 15, 2018

i think for FreeBSD a call to cap_enter(2) would be the way to go.

@sg2342
Copy link
Contributor

sg2342 commented Oct 21, 2018

turns out, that cap_enter(2) will not work because because ppoll(2) is not permitted in capability mode.
However: the poll(2) code in the FreeBSD kernel is capsicum enabled and shares the relevant parts with the ppoll(2). So the only thing missing in the FreeBSD source tree is an entry for ppoll(2) in sys/kern/capabilities.conf (and make -C sys/kern/ sysent to regen the syscall configuration).

FreeBSD bug 232495

@mato
Copy link
Member Author

mato commented Nov 6, 2018

I've been thinking about what to do here for the Linux case. There are two cases where the behaviour should be quite different:

  1. In a non-containerized setup: unlike the BSDs there should be no need to run the hvt tender as root. The only thing the tender needs access to is /dev/kvm, which is easily granted through normal permission bits on the device file.
  2. If the tender is running in a container (irrespective of container runtime): It is legitimate to run as root as we can assume it's not "real root", and in a minimal, fully deprivileged container where the tender is the only process running there's not much point in doing anything else (e.g. chroot).

Therefore, I think that for the first case (classic system) we should require running as non-root and make it the user's responsibility to ensure that access to /dev/kvm is available (via being a member of the kvm group -- I believe some distros may even enable this by default for everyone). No chroot()-ing should be done, as there is no sensible default.

In the second case, this requires defining what "running in a container" means and then being able to detect it. I'd prefer to avoid various heuristics (see e.g. here: https://github.com/genuinetools/bpfd/blob/master/proc/proc.go) and instead explicitly support only the case where the tender is running as the sole PID 1, which implies it's inside a PID namespace, which to me seems a "good enough" way to detect if it is running in a "container". So, if and only if:

if (getpid() == 1 && getppid() == 0)

returns true, we allow running as root (probably easiest to test just with getuid() for now rather than CAP_SYS_ADMIN) and do not do anything else, as in this setup there's likely no point in (or meaningful default for) a chroot().

Will try to pass this by some Linux container experts for opinions...

@mato
Copy link
Member Author

mato commented Jan 21, 2019

On the FreeBSD side, it looks like we might have to revert this entirely (see #312).

Update on the Linux side -- I'm no longer convinced we should do anything along the lines of classic privilege dropping there. Rather, with the introduction of "spt" (#310) we should look into applying a seccomp sandbox to the hvt tender also.

@mato mato added host/freebsd Applicable to FreeBSD hosts host/linux Applicable to Linux hosts labels Feb 20, 2019
@mato mato added the target/hvt Applicable to hvt target label Mar 27, 2019
@mato
Copy link
Member Author

mato commented Mar 27, 2019

Ok, so, in the light of:

  1. Discussion in improve FreeBSD create / destroy behaviour #316, the real fix here is as described in improve FreeBSD create / destroy behaviour #316 (comment) (fixing the FreeBSD vmm APIs).
  2. The build system refactoring in Remove specialization of hvt tender at unikernel compile time #326 changes the way hvt modules work, and the HVT_DROP_PRIVILEGES=0 case will effectively be non-functional until I re-design that to not use compile-time flags. However, in the mean time, I need the tests to pass!

I'm going to make an executive decision here, which is: The FreeBSD privilege dropping introduced in #286 will be reverted on master shortly.

Also, I'm going to close all issues/PRs related to this except #282, since the multiple discussions are confusing. We can discuss how to proceed forward there.

Note to self: TODO: The existing VM cleanup code for FreeBSD and OpenBSD should be audited and at warn() should be used in the atexit handler(s) if any syscalls fail.
/cc @sg2342 @hannesm

@mato mato closed this as completed Mar 27, 2019
@mato mato reopened this Mar 27, 2019
mato added a commit to mato/solo5 that referenced this issue Mar 27, 2019
This is a large change, and the majority of it is effectively a full
re-write of the build system. The actual removal of hvt compile-time
specialization is fairly straightforward.

User-visible changes:

- 'configure.sh' now needs to be run manually before running 'make',
  this is consistent with POLA.
- conversely, 'make clean' no longer cleans Makeconf. Use 'distclean' or
  'clobber' for that.
- 'configure.sh' will now print the targets that can (will) be built on
  this system. The strategy is still "build everything we can", however
  I have disabled Genode on all systems except Linux due to toolchain
  issues.
- You can now build a subset of targets from the top-level 'make', by
  specifying 'CONFIG_XXX=' (disable) or 'CONFIG_XXX=1' (enable) either
  on the command line, or editing the generated Makeconf.
- Makefiles use silent rules by default. To get the old verbose ones
  back, use 'make V=1'.
- The 'solo5-hvt' tender is no longer "specialized" to the unikernel.
  We build two tenders, 'solo5-hvt' with all non-debug modules
  configured and 'solo5-hvt-debug' with additional debug modules (gdb,
  dumpcore where available).
- 'solo5-hvt-configure' is kept around for now for backward
  compatibility with OPAM/MirageOS but is essentially a NOP.

Developer-visible changes:

- The build system now has proper support for auto-generation of
  dependencies. This means you can safely edit source files, run make
  and be sure you will get a complete incremental build.
- Makefiles have been refactored to use common best practices, remove
  repetition, consistent variable names and clear interfaces between
  configure.sh/Makeconf/Makefiles, all the while keeping them simple
  enough to understand for me on a Monday morning before coffee. I.e.
  limit use of macros, eval, etc.
- hvt tender modules are no longer defined by compile-time flags,
  instead a dynamic array is placed into a special ELF section
  (.modules).  This means that a hvt tender binary can be combined from
  an arbitrary set of hvt_module_XXX object files, which is the right
  way to do things going forward and also simplifies the build system
  (not needing to build multiple targets from the same set of sources).

Shortcomings / TODOs:

- Dependency files (*.d) are stored in-tree. I spent several days on
  trying to figure out how to get them to work out of tree, but in
  combination with the non-recursive use of subdirectories in 'bindings'
  I could not figure out the required Makefile magic.
- HVT_DROP_PRIVILEGES=0 is non-functional with the new modules
  arrangement, but needs a re-design anyway.

Other changes included as part of this PR:

- Revert privilege dropping on FreeBSD (see discussion in Solo5#282).
- The build system changes effectively implement option 1 in Solo5#292, i.e.
  on x86_64 -m no-red-zone is only used for bindings, not for
  application code.
- tests/tests.bats has been refactored for DRY as it was getting totally
  unmaintainable.
@mato
Copy link
Member Author

mato commented Apr 29, 2020

#366 implements a capsicum(4) sandbox for Solo5/hvt on FreeBSD 12+. Removing the FreeBSD labels from this, as I consider this done there.

@mato mato removed the host/freebsd Applicable to FreeBSD hosts label Apr 29, 2020
@mato mato changed the title hvt_drop_privileges() behaviour for Linux and FreeBSD hvt_drop_privileges() behaviour Apr 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement host/linux Applicable to Linux hosts target/hvt Applicable to hvt target
Projects
None yet
Development

No branches or pull requests

3 participants