Author: Ed Schouten
Date: 2022-05-23
Buildbarn's worker process (bb_worker) can be configured to populate build directories of actions in two different ways:
- By instantiating it as a native directory on a local file system. To speed up this process, it may use a directory where recently used files are cached. Files are hardlinked from/to this cache directory.
- By instantiating it in-memory, while using FUSE to make it accessible to the build action. An instance of the LocalBlobAccess storage backend needs to be used to cache file contents.
While the advantage of the former is that it introduces no overhead during execution, populating the build directory may be slow for large input roots, especially if only a fraction of the input root gets used in practice. The FUSE file system has the advantage that data is loaded lazily, meaning files and directories of the input root are only downloaded from the CAS if their contents are read during execution. This is particularly useful for actions that ship their own SDKs.
An issue with FUSE is that it remains fairly Linux-specific. Other operating systems also ship with implementations of FUSE or allow it to be installed as a kernel module/extension, but these implementations tend to vary in terms of quality and conformance. For example, for macOS there is macFUSE (previously called OSXFUSE). Though bb_worker can be configured to work with macFUSE, it tends to cause system lockups under high load. Fixing this is not easy, as macFUSE is no longer Open Source Software.
For this reason we would like to offer an alternative to FUSE, namely an
integrated NFSv4 server that listens on localhost
and is mounted on
the same system. FUSE will remain supported and recommended for use on
Linux; NFSv4 should only be used on systems where the use of FUSE is
undesirable and a high-quality NFSv4 client is available.
We will focus on implementing NFSv4.0 as defined in RFC 7530. Implementing newer versions such as NFSv4.1 (RFC 8881) and NFSv4.2 (RFC 7862) is of little use, as clients such as the one shipped with macOS don't support them. We should also not be using NFSv3 (RFC 1813), as due to its lack of compound operations, it is far more 'chatty' than NFSv4. This would lead to unnecessary context switching between bb_worker and build actions.
Our existing FUSE file system consists of about 7,500 lines of code. We have
already invested heavily in it, and it has received many bugfixes for issues
that we have observed in production use. The last thing we want to do is to
add a brand new pkg/filesystem/nfsv4 package that has to reimplement all of
this for NFSv4. Not only would this be undesirable from a maintenance
perspective, it would also put us at risk of introducing behavioral
differences between the FUSE and NFSv4 implementations, making it hard to
switch between the two.
As FUSE and NFSv4 are conceptually identical (i.e., request-response
based services for accessing a POSIX-like file system), we should try to
move to a shared codebase. This ADR therefore proposes that the existing
pkg/filesystem/fuse package be decomposed into three new packages:
- pkg/filesystem/virtual, which will contain the vast majority of code that can be made independent of go-fuse.
- pkg/filesystem/virtual/fuse, which contains the coupling with go-fuse.
- pkg/filesystem/virtual/configuration, which contains the code for instantiating a virtual file system based on configuration settings and exposing (mounting) it.
This decomposition does require us to make various refactoring changes.
Most notably, the Directory
and Leaf
interfaces currently depend on many data types that are part of go-fuse,
such as status codes, directory entry structures, and file attribute
structures. All of these need to be replaced with equivalents that are
generic enough to support both the semantics of FUSE and NFSv4.
Differences between these two protocols that need to be bridged include
the following:
- FUSE's READDIR operation is stateful, in that it's surrounded by OPENDIR and RELEASEDIR operations. This means that we currently load all of the directory contents at once, and paginate the results as part of READDIR. With NFSv4 this operation needs to be stateless. We solve this by pushing down the pagination into the Directory implementations. The new VirtualReadDir() method we add takes a starting offset and pushes directory entries into a DirectoryEntryReporter until space has run out. This makes both FUSE and NFSv4 use a stateless approach (see the sketch after this list).
- FUSE has the distinction between READDIR and READDIRPLUS. The former only returns filenames, file types and inode numbers, while the latter returns full stat information. NFSv4 only provides a single READDIR operation, but just like with the GETATTR operation, the caller can provide a bitmask of attributes that it's interested in receiving. As FUSE's semantics can be emulated on top of NFSv4's, we'll change our API to use an AttributesMask type as well.
- When FUSE creates a regular file for reading/writing, it calls CREATE, providing it the identifier of the parent directory and the filename. This differs from opening an existing file, which is done through OPEN, providing it the identifier of the file. NFSv4, however, only provides a single OPEN operation that is used in all cases. We'll therefore replace Directory.FUSECreate() with Directory.VirtualOpenChild(), which can be used by FUSE CREATE and NFSv4 OPEN. Leaf.FUSEOpen() will remain available under the name Leaf.VirtualOpenSelf(), to be used by FUSE OPEN.
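To give an impression of what this looks like in code, the sketch below shows a stripped-down version of such a stateless directory reading API. Apart from the names VirtualReadDir and DirectoryEntryReporter, all types and signatures are simplified assumptions made for this example rather than the actual interfaces in pkg/filesystem/virtual.

```go
package virtual // hypothetical package for this sketch

import "context"

// AttributesMask is a bitmask of the file attributes requested by the
// caller, comparable to NFSv4's attribute bitmaps.
type AttributesMask uint64

// Attributes holds the attributes of a file or directory, populated
// according to an AttributesMask.
type Attributes struct {
	InodeNumber uint64
	// ... further attributes omitted.
}

// DirectoryEntryReporter receives directory entries one at a time. It
// returns false once the response buffer is full, allowing the directory
// to stop producing entries.
type DirectoryEntryReporter interface {
	ReportEntry(nextCookie uint64, name string, attributes *Attributes) bool
}

// Directory is reduced to the single method of interest here.
// VirtualReadDir resumes at a caller-provided cookie (offset) and pushes
// entries into the reporter until space runs out, so that no per-client
// state needs to be kept between calls.
type Directory interface {
	VirtualReadDir(ctx context.Context, firstCookie uint64, requested AttributesMask, reporter DirectoryEntryReporter) error
}
```

Under this scheme a FUSE READDIRPLUS request would simply pass a mask requesting full attributes, whereas a plain READDIR would request little more than file types and inode numbers.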
With all of these changes landed, we'll be able to instantiate a virtual file system as part of bb_worker that both the FUSE and the NFSv4 server can interact with.
Status: This work has been completed as part of commit c4bbd24.
The new virtual file system layer can be found in pkg/filesystem/virtual.
NFSv4 clients refer to files and directories on the server by file handle. File handles are byte arrays that can be between 0 and 128 bytes in size (0 and 64 bytes for NFSv3). Many NFS servers on UNIX-like platforms construct file handles by concatenating an inode number with a file generation count to ensure that file handles remain distinct, even if inodes are recycled. As handles are opaque to the client, a server can choose any format it desires.
As NFSv4 is intended to be a (mostly) stateless protocol, the server has absolutely no information on when a client is going to stop interacting with a file handle. At any point in time, a request can contain a file handle that was returned previously. Our virtual file system must therefore not just allow resolution by path, but also by file handle. This differs fundamentally from FUSE, where the kernel and userspace service share knowledge on which subset of the file system is in play between the two. The kernel issues FORGET calls when purging file entries from its cache, allowing the userspace service to release the resource as well.
Where NFSv4's semantics become problematic is not necessarily in
bb_worker, but in bb_clientd. This service provides certain
directories that are infinitely big (i.e., /cas/*
, which allow you to
access arbitrary CAS contents). Unlike build directories, files in these
directories have an infinite lifetime, meaning that any dynamic
allocation scheme for file handles would result in memory leaks. For
files in these directories we will need to dump their state (i.e., the
REv2 digest) into the file handle itself, allowing the file to be
reconstructed when needed.
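As a rough illustration of how such state could be embedded, the sketch below packs a SHA-256 hash, a file size and an executable bit into a byte array that stays well below NFSv4's 128-byte file handle limit. The layout, the leading tag byte and the function name are inventions for this example, not the encoding Buildbarn actually uses.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// tagCASFile is a hypothetical type tag that distinguishes "resolvable"
// CAS file handles from dynamically allocated ones.
const tagCASFile = 0x01

// encodeCASFileHandle packs the state of a CAS-backed file into a file
// handle: 1 tag byte + 32 hash bytes + 8 size bytes + 1 executable byte =
// 42 bytes in total.
func encodeCASFileHandle(hashSHA256 [32]byte, sizeBytes uint64, executable bool) []byte {
	handle := make([]byte, 0, 42)
	handle = append(handle, tagCASFile)
	handle = append(handle, hashSHA256[:]...)
	var size [8]byte
	binary.BigEndian.PutUint64(size[:], sizeBytes)
	handle = append(handle, size[:]...)
	if executable {
		handle = append(handle, 1)
	} else {
		handle = append(handle, 0)
	}
	return handle
}

func main() {
	var hash [32]byte // In practice this would be the REv2 digest hash.
	handle := encodeCASFileHandle(hash, 4096, true)
	fmt.Printf("%d bytes: %s\n", len(handle), hex.EncodeToString(handle))
}
```

Whenever a client presents such a handle, the server can decode it back into a digest and recreate the file object, without having kept any per-file state in memory.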
To achieve the above, we will add a new HandleAllocator
API that is
capable of decorating Directory
and Leaf
objects to give them their
own identity and perform lifecycle management. In practice, this means
that a plain Directory
or Leaf
implementation will no longer have an
inode number or link count. Only by wrapping the object using
HandleAllocator
will this information become available through
VirtualGetAttributes(). In the case of FUSE, HandleAllocator
will do
little more than compute an inode number, just like InodeNumberTree
already did. For NFSv4, it will additionally generate a file handle and
store the object in a global map, so that the object can be resolved by
the NFSv4 server if the client performs a PUTFH operation.
The HandleAllocator
API will distinguish between three types of
allocations:
- Stateful objects: Files or directories that are mutable. Each instance has its own dynamically allocated inode number/file handle.
- Stateless objects: Files or directories that are immutable, but have a state that cannot be reproduced from just a file handle. Examples may include symbolic links created by build actions. As a symlink's target can be larger than 128 bytes, there is no way to embed a symlink's state into a file handle. This means that even though the symlink may have a deterministic inode number/file handle, its lifecycle should be tracked explicitly.
- Resolvable objects: Files or directories that are immutable, and have state that is small enough to embed into the file handle. Examples include CAS backed files, as a SHA-256 sum, file size and executable bit can easily fit in a file handle.
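The hypothetical interface below sketches how these three kinds of allocations could be expressed. The method names and signatures are made up for this example; the real HandleAllocator in pkg/filesystem/virtual is more elaborate.

```go
package virtual // illustrative sketch, not the actual Buildbarn API

// Leaf stands in for a file-like object in the virtual file system; the
// real interface provides methods such as VirtualGetAttributes().
type Leaf interface{}

// HandleAllocator decorates objects with an identity (an inode number
// and, for NFSv4, a file handle) and manages their lifecycle.
type HandleAllocator interface {
	// Stateful objects: mutable files receive a dynamically allocated
	// inode number/file handle.
	NewStatefulLeaf(leaf Leaf) Leaf

	// Stateless objects: immutable objects receive a deterministic
	// identity derived from their state (e.g., a symlink target), but
	// their lifecycle is still tracked explicitly, as the state may not
	// fit in a 128-byte file handle.
	NewStatelessLeaf(state []byte, leaf Leaf) Leaf

	// Resolvable objects: the state is embedded into the file handle
	// itself, so that the object can be recreated on demand when a
	// client presents the handle (e.g., CAS-backed files).
	NewResolvableLeaf(state []byte, resolve func(state []byte) (Leaf, error)) Leaf
}
```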
Status: This work has also been completed as part of commit c4bbd24.
Implementations of HandleAllocator
have been added for FUSE
and NFSv4.
All versions of NFS are built on top of ONC RPCv2 (RFC 5531), also known as "Sun RPC". Like most protocols built on top of ONC RPCv2, NFS uses XDR (RFC 4506) for encoding request and response payloads. The XDR description for NFSv4.0 can be found in RFC 7531.
A nice feature of schema languages like XDR is that they can be used to perform code generation. For each of the types described in the RFC, we may emit an equivalent native type in Go, together with serialization and deserialization methods. In addition to making our server's code more readable, it is less error prone than attempting to read/write raw bytes from/to a socket.
A disadvantage of this approach is that it does add overhead. When converted to native Go types, requests are no longer stored contiguously in some buffer, but may be split up into multiple objects, which may become heap allocated (thus garbage collector backed). Though this is a valid concern, we will initially assume that this overhead is acceptable. In-kernel implementations tend to make a different trade-off in this regard, but this is likely a result of memory management in kernel space being far more restricted.
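As an illustration, an XDR type such as nfstime4 from RFC 7531 could be translated into Go along the following lines. The naming scheme and (de)serialization methods shown here are merely a plausible shape for such generated code, not the actual output of the compiler described below.

```go
package nfsv4 // hypothetical package for generated code

import (
	"encoding/binary"
	"io"
)

// XDR input (RFC 7531):
//
//	struct nfstime4 {
//	        int64_t  seconds;
//	        uint32_t nseconds;
//	};
//
// A generated Go equivalent might look like this.
type Nfstime4 struct {
	Seconds  int64
	Nseconds uint32
}

// WriteTo serializes the structure using XDR's big-endian encoding.
func (t *Nfstime4) WriteTo(w io.Writer) (int64, error) {
	var buf [12]byte
	binary.BigEndian.PutUint64(buf[0:8], uint64(t.Seconds))
	binary.BigEndian.PutUint32(buf[8:12], t.Nseconds)
	n, err := w.Write(buf[:])
	return int64(n), err
}

// ReadFrom deserializes the structure from an XDR stream.
func (t *Nfstime4) ReadFrom(r io.Reader) (int64, error) {
	var buf [12]byte
	n, err := io.ReadFull(r, buf[:])
	if err != nil {
		return int64(n), err
	}
	t.Seconds = int64(binary.BigEndian.Uint64(buf[0:8]))
	t.Nseconds = binary.BigEndian.Uint32(buf[8:12])
	return int64(n), nil
}
```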
A couple of implementations of XDR for Go already exist. Unfortunately, none of these implementations are complete enough to be of use for this specific use case. We will therefore design our own implementation, which we will release as a separate project that does not depend on any Buildbarn code.
Status: The XDR to Go compiler has been released on GitHub.
With all of the previous tasks completed, we have all of the building
blocks in place to be able to add an NFSv4 server to the
bb-remote-execution codebase. All that is left is to write an
implementation of program NFS4_PROGRAM,
for which the XDR to Go compiler automatically generates the following
interface:
```go
type Nfs4Program interface {
	NfsV4Nfsproc4Null(context.Context) error
	NfsV4Nfsproc4Compound(context.Context, *Compound4args) (*Compound4res, error)
}
```
This implementation, which we will call baseProgram, needs to process the operations provided in Compound4args by translating them to calls against instances of Directory and Leaf.
For most NFSv4 operations this implementation will be relatively simple. For example, for RENAME it involves little more than extracting the directory objects and filenames from the request, followed by calling Directory.VirtualRename().
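A sketch of such a handler is shown below. The surrounding types are heavily simplified stand-ins invented for this example; a real implementation first has to verify that the current and saved filehandles refer to directories and map failures to NFSv4 status codes.

```go
package nfsv4 // illustrative sketch only; not the actual baseProgram code

// Minimal stand-ins for the generated XDR types and the virtual file
// system interface. All names and signatures here are assumptions.
type Status int

type Directory interface {
	VirtualRename(oldName string, newDirectory Directory, newName string) Status
}

type Rename4args struct {
	Oldname string
	Newname string
}

type Rename4res struct {
	Status Status
}

// compoundState tracks the state of a single COMPOUND request. NFSv4
// operations act on a "current" and a "saved" filehandle; for RENAME
// these identify the target and source directories respectively.
type compoundState struct {
	currentDirectory Directory
	savedDirectory   Directory
}

// opRename extracts the two directories and filenames and delegates the
// actual work to the virtual file system.
func (s *compoundState) opRename(args *Rename4args) Rename4res {
	return Rename4res{
		Status: s.savedDirectory.VirtualRename(args.Oldname, s.currentDirectory, args.Newname),
	}
}
```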
Most of the complexity of baseProgram
will lie in how operations like
OPEN, CLOSE, LOCK, and LOCKU are implemented. These operations establish
and alter state on the server, meaning that they need to be guarded
against replays and out-of-order execution. To solve this, NFSv4.0
requires that these operations are executed in the context of
open-owners and lock-owners. Each open-owner and lock-owner has a
sequence ID associated with it, which gets incremented whenever an
operation succeeds. The server can thus detect replays of previous
requests by comparing the sequence ID in the client's request with the
value stored on the server. If the sequence ID corresponds to the last
operation to execute, a cached response is returned. A new transaction
will only be performed if the sequence ID is one larger than the last
observed.
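The heart of that replay check can be summarized as follows. This is a simplified sketch that assumes an invented openOwnerState layout; the real bookkeeping also has to return the previously cached response verbatim and cover the confirmation and error cases spelled out in RFC 7530.

```go
package nfsv4 // simplified sketch of NFSv4.0 sequence ID handling

// cachedResponse holds the XDR-encoded response of the last
// sequence-modifying operation, so that it can be replayed.
type cachedResponse struct {
	response []byte
}

// openOwnerState is a hypothetical per-open-owner record; lock-owners
// are handled analogously.
type openOwnerState struct {
	lastSequenceID uint32
	lastResponse   cachedResponse
}

// checkSequenceID classifies an incoming sequence-modifying operation
// (OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE, CLOSE, ...) against the
// open-owner's state.
func (oos *openOwnerState) checkSequenceID(seqID uint32) (replay bool, ok bool) {
	switch seqID {
	case oos.lastSequenceID:
		// Retransmission of the previous request: return the cached
		// response instead of executing the operation again.
		return true, true
	case oos.lastSequenceID + 1:
		// The next transaction in the sequence: execute it. On success
		// the server stores the response and advances lastSequenceID.
		return false, true
	default:
		// A misordered or stale request, which NFSv4.0 answers with
		// NFS4ERR_BAD_SEQID.
		return false, false
	}
}
```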
The exact semantics of this sequencing model are fairly complex. They are
covered extensively in chapter 9 of RFC 7530, which is about 40 pages long.
The following attempts to summarize which data types we have declared as
part of baseProgram, and how they map to the NFSv4.0 sequencing model.
- baseProgram: In the case of bb_worker, zero or more build directories are declared to be exposed through an NFSv4 server.
- clientState: Zero or more clients may be connected to the NFSv4 server.
- clientConfirmationState: The client may have one or more client records, which are created through SETCLIENTID. Multiple client records can exist if the client loses all state and reconnects (e.g., due to a reboot).
- confirmedClientState: Up to one of these client records can be confirmed using SETCLIENTID_CONFIRM. This structure stores the state of a healthy client that is capable of opening files and acquiring locks.
- openOwnerState: Confirmed clients may have zero or more open-owners. This structure stores the current sequence number of the open-owner. It also holds the response of the last CLOSE, OPEN, OPEN_CONFIRM or OPEN_DOWNGRADE call for replays.
- openOwnerFileState: An open-owner may have zero or more open files. The first time a file is opened through this open-owner, the client needs to call OPEN_CONFIRM.
- lockOwnerState: Confirmed clients may have zero or more lock-owners. This structure stores the current sequence number of the lock-owner. It also holds the response of the last LOCK or LOCKU call for replays.
- lockOwnerFileState: A lock-owner may have one or more files with lock state.
- ByteRangeLock: A lock state may hold locks on byte ranges in the file.
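To make the containment relationships explicit, these types roughly nest as in the sketch below. The fields shown only express the hierarchy summarized above; actual field names, key types, locking and the remaining bookkeeping are omitted or invented.

```go
package nfsv4 // illustrative nesting only; not the actual declarations

type baseProgram struct {
	clients map[string]*clientState // one per connected client
}

type clientState struct {
	confirmations []*clientConfirmationState // records created by SETCLIENTID
	confirmed     *confirmedClientState      // at most one, after SETCLIENTID_CONFIRM
}

type clientConfirmationState struct {
	// Client record created by SETCLIENTID, awaiting SETCLIENTID_CONFIRM.
}

type confirmedClientState struct {
	openOwners map[string]*openOwnerState
	lockOwners map[string]*lockOwnerState
}

type openOwnerState struct {
	sequenceID uint32 // advanced by OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE and CLOSE
	openFiles  map[uint64]*openOwnerFileState
}

type openOwnerFileState struct {
	// Open state of a single file, confirmed through OPEN_CONFIRM the
	// first time the open-owner opens a file.
}

type lockOwnerState struct {
	sequenceID uint32 // advanced by LOCK and LOCKU
	files      map[uint64]*lockOwnerFileState
}

type lockOwnerFileState struct {
	locks []ByteRangeLock
}

type ByteRangeLock struct {
	start, end uint64 // byte range covered by the lock
	writable   bool   // write lock as opposed to a read lock
}
```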
Even though NFSv4.0 does provide a RELEASE_LOCKOWNER operation for
removing lock-owners, no facilities are provided for removing unused
open-owners and client records. baseProgram
will be implemented in
such a way that these objects are removed automatically if a
configurable amount of time passes. This is done as part of general lock
acquisition, meaning clients are collectively responsible for cleaning
stale state.
A disadvantage of NFSv4.0's sequencing model is that open-owners are not capable of sending OPEN requests in parallel. This is not expected to cause a bottleneck in our situation, as running the NFSv4 server on the worker itself means that latency is virtually nonexistent. It is worth noting that NFSv4.1 has completely overhauled this part of the protocol, thereby removing this restriction. Implementing that model is, as explained earlier, out of scope.
Status: This work has been completed as part of commit f00b857.
The new NFSv4 server can be found in pkg/filesystem/virtual/nfsv4.
The FUSE file system is currently configured through
the MountConfiguration
message.
Our plan is to split this message up, moving all FUSE-specific options
into a FUSEMountConfiguration
message. This message can then be placed
in a oneof
together with an NFSv4MountConfiguration
message that
enables the use of the NFSv4 server. Switching back and forth between
FUSE and NFSv4 should thus be trivial.
Status: This work has been completed as part of commit f00b857.
The new MountConfiguration
message with separate FUSE and NFSv4 backends can
be found here.
In this ADR we have mainly focused on the use of NFSv4 for bb_worker. These changes will also make it possible to launch bb_clientd with an NFSv4 server. bb_clientd's use case differs from bb_worker's, in that the use of the Remote Output Service heavily depends on being able to quickly invalidate directory entries when lazy-loading files are inserted into the file system. FUSE is capable of facilitating this by sending FUSE_NOTIFY_INVAL_ENTRY messages to the kernel. A similar feature named CB_NOTIFY is present in NFSv4.1 and later, but rarely implemented by clients.
Maybe bb_clientd can be made to work by disabling client-side directory caching. Would performance still be acceptable in that case?