Author: Ed Schouten
Date: 2022-05-23
Buildbarn's worker process (bb_worker) can be configured to populate build directories of actions in two different ways:
- By instantiating it as a native directory on a local file system. To speed up this process, it may use a directory where recently used files are cached. Files are hardlinked from/to this cache directory.
- By instantiating it in-memory, while using FUSE to make it accessible to the build action. An instance of the LocalBlobAccess storage backend needs to be used to cache file contents.
While the advantage of the former is that it introduces no overhead during execution, populating the build directory may be slow for large input roots, especially if only a fraction of the input root gets used in practice. The FUSE file system has the advantage that data is loaded lazily, meaning files and directories of the input root are only downloaded from the CAS if their contents are read during execution. This is particularly useful for actions that ship their own SDKs.
An issue with FUSE is that it remains fairly Linux-specific. Other operating systems also ship with implementations of FUSE or allow it to be installed as a kernel module/extension, but these implementations tend to vary in terms of quality and conformance. For example, for macOS there is macFUSE (previously called OSXFUSE). Though bb_worker can be configured to work with macFUSE, it tends to cause system lockups under high load. Fixing this is not easy, as macFUSE is no longer Open Source Software.
For this reason we would like to offer an alternative to FUSE, namely an
integrated NFSv4 server that listens on localhost
and is mounted on
the same system. FUSE will remain supported and recommended for use on
Linux; NFSv4 should only be used on systems where the use of FUSE is
undesirable and a high-quality NFSv4 client is available.
We will focus on implementing NFSv4.0 as defined in RFC 7530. Implementing newer versions such as NFSv4.1 (RFC 8881) and NFSv4.2 (RFC 7862) is of little use, as clients such as the one shipped with macOS don't support them. We should also not be using NFSv3 (RFC 1813), as due to its lack of compound operations, it is far more 'chatty' than NFSv4. This would lead to unnecessary context switching between bb_worker and build actions.
Our existing FUSE file system consists of about 7,500 lines of code. We have
already invested heavily in it, and it has received many bugfixes for issues
that we have observed in production use. The last thing we want to do is to
add a brand new pkg/filesystem/nfsv4 package that has to reimplement all of
this for NFSv4. Not only would this be undesirable from a maintenance
perspective, it would also put us at risk of introducing behavioral
differences between the FUSE and NFSv4 implementations, making it hard to
switch between the two.
As FUSE and NFSv4 are conceptually identical (i.e., request-response
based services for accessing a POSIX-like file system), we should try to
move to a shared codebase. This ADR therefore proposes that the existing
pkg/filesystem/fuse package be decomposed into three new packages:
- pkg/filesystem/virtual, which will contain the vast majority of code that can be made independent of go-fuse.
- pkg/filesystem/virtual/fuse, which contains the coupling with go-fuse.
- pkg/filesystem/virtual/configuration, which contains the code for instantiating a virtual file system based on configuration settings and exposing (mounting) it.
This decomposition does require us to make various refactoring changes.
Most notably, the Directory
and Leaf
interfaces currently depend on many data types that are part of go-fuse,
such as status codes, directory entry structures, and file attribute
structures. All of these need to be replaced with equivalents that are
generic enough to support both the semantics of FUSE and NFSv4.
Differences between these two protocols that need to be bridged include
the following:
- FUSE's READDIR operation is stateful, in that it's surrounded by OPENDIR and RELEASEDIR operations. This means that we currently load all of the directory contents at once, and paginate the results as part of READDIR. With NFSv4 this operation needs to be stateless. We solve this by pushing down the pagination into the Directory implementations. The new VirtualReadDir() method we add takes a starting offset and pushes directory entries into a DirectoryEntryReporter until space has run out. This makes both FUSE and NFSv4 use a stateless approach (see the sketch after this list).
- FUSE has the distinction between READDIR and READDIRPLUS. The former only returns filenames, file types and inode numbers, while the latter returns full stat information. NFSv4 only provides a single READDIR operation, but just like with the GETATTR operation, the caller can provide a bitmask of attributes that it's interested in receiving. As FUSE's semantics can be emulated on top of NFSv4's, we'll change our API to use an AttributesMask type as well.
- When FUSE creates a regular file for reading/writing, it calls CREATE, providing it the identifier of the parent directory and the filename. This differs from opening an existing file, which is done through OPEN, providing it the identifier of the file. NFSv4, however, only provides a single OPEN operation that is used in all cases. We'll therefore replace Directory.FUSECreate() with Directory.VirtualOpenChild(), which can be used by FUSE CREATE and NFSv4 OPEN. Leaf.FUSEOpen() will remain available under the name Leaf.VirtualOpenSelf(), to be used by FUSE OPEN.
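To give an impression of what this looks like in code, the sketch below shows a stripped-down version of such a stateless directory reading API. Apart from the names VirtualReadDir and DirectoryEntryReporter, all types and signatures are simplified assumptions made for this example rather than the actual interfaces in pkg/filesystem/virtual.

```go
package virtual // hypothetical package for this sketch

import "context"

// AttributesMask is a bitmask of the file attributes requested by the
// caller, comparable to NFSv4's attribute bitmaps.
type AttributesMask uint64

// Attributes holds the attributes of a file or directory, populated
// according to an AttributesMask.
type Attributes struct {
	InodeNumber uint64
	// ... further attributes omitted.
}

// DirectoryEntryReporter receives directory entries one at a time. It
// returns false once the response buffer is full, allowing the directory
// to stop producing entries.
type DirectoryEntryReporter interface {
	ReportEntry(nextCookie uint64, name string, attributes *Attributes) bool
}

// Directory is reduced to the single method of interest here.
// VirtualReadDir resumes at a caller-provided cookie (offset) and pushes
// entries into the reporter until space runs out, so that no per-client
// state needs to be kept between calls.
type Directory interface {
	VirtualReadDir(ctx context.Context, firstCookie uint64, requested AttributesMask, reporter DirectoryEntryReporter) error
}
```

Under this scheme a FUSE READDIRPLUS request would simply pass a mask requesting full attributes, whereas a plain READDIR would request little more than file types and inode numbers.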
With all of these changes landed, we'll be able to instantiate a virtual file system as part of bb_worker that both the FUSE and the NFSv4 server can interact with.
Status: This work has been completed as part of commit c4bbd24.
The new virtual file system layer can be found in pkg/filesystem/virtual.
NFSv4 clients refer to files and directories on the server by file handle. File handles are byte arrays that can be between 0 and 128 bytes in size (0 and 64 bytes for NFSv3). Many NFS servers on UNIX-like platforms construct file handles by concatenating an inode number with a file generation count to ensure that file handles remain distinct, even if inodes are recycled. As handles are opaque to the client, a server can choose any format it desires.
As NFSv4 is intended to be a (mostly) stateless protocol, the server has absolutely no information on when a client is going to stop interacting with a file handle. At any point in time, a request can contain a file handle that was returned previously. Our virtual file system must therefore not just allow resolution by path, but also by file handle. This differs fundamentally from FUSE, where the kernel and userspace service share knowledge on which subset of the file system is in play between the two. The kernel issues FORGET calls when purging file entries from its cache, allowing the userspace service to release the resource as well.
Where NFSv4's semantics become problematic is not necessarily in
bb_worker, but in bb_clientd. This service provides certain
directories that are infinitely big (i.e., /cas/*
, which allow you to
access arbitrary CAS contents). Unlike build directories, files in these
directories have an infinite lifetime, meaning that any dynamic
allocation scheme for file handles would result in memory leaks. For
files in these directories we will need to dump their state (i.e., the
REv2 digest) into the file handle itself, allowing the file to be
reconstructed when needed.
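As a rough illustration of how such state could be embedded, the sketch below packs a SHA-256 hash, a file size and an executable bit into a byte array that stays well below NFSv4's 128-byte file handle limit. The layout, the leading tag byte and the function name are inventions for this example, not the encoding Buildbarn actually uses.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// tagCASFile is a hypothetical type tag that distinguishes "resolvable"
// CAS file handles from dynamically allocated ones.
const tagCASFile = 0x01

// encodeCASFileHandle packs the state of a CAS-backed file into a file
// handle: 1 tag byte + 32 hash bytes + 8 size bytes + 1 executable byte =
// 42 bytes in total.
func encodeCASFileHandle(hashSHA256 [32]byte, sizeBytes uint64, executable bool) []byte {
	handle := make([]byte, 0, 42)
	handle = append(handle, tagCASFile)
	handle = append(handle, hashSHA256[:]...)
	var size [8]byte
	binary.BigEndian.PutUint64(size[:], sizeBytes)
	handle = append(handle, size[:]...)
	if executable {
		handle = append(handle, 1)
	} else {
		handle = append(handle, 0)
	}
	return handle
}

func main() {
	var hash [32]byte // In practice this would be the REv2 digest hash.
	handle := encodeCASFileHandle(hash, 4096, true)
	fmt.Printf("%d bytes: %s\n", len(handle), hex.EncodeToString(handle))
}
```

Whenever a client presents such a handle, the server can decode it back into a digest and recreate the file object, without having kept any per-file state in memory.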
To achieve the above, we will add a new HandleAllocator
API that is
capable of decorating Directory
and Leaf
objects to give them their
own identity and perform lifecycle management. In practice, this means
that a plain Directory
or Leaf
implementation will no longer have an
inode number or link count. Only by wrapping the object using
HandleAllocator
will this information become available through
VirtualGetAttributes(). In the case of FUSE, HandleAllocator
will do
little more than compute an inode number, just like InodeNumberTree
already did. For NFSv4, it will additionally generate a file handle and
store the object in a global map, so that the object can be resolved by
the NFSv4 server if the client performs a PUTFH operation.
The HandleAllocator
API will distinguish between three types of
allocations:
- Stateful objects: Files or directories that are mutable. Each instance has its own dynamically allocated inode number/file handle.
- Stateless objects: Files or directories that are immutable, but have a state that cannot be reproduced from just a file handle. Examples may include symbolic links created by build actions. As a symlink's target can be larger than 128 bytes, there is no way to embed a symlink's state into a file handle. This means that even though the symlink may have a deterministic inode number/file handle, its lifecycle should be tracked explicitly.
- Resolvable objects: Files or directories that are immutable, and have state that is small enough to embed into the file handle. Examples include CAS backed files, as a SHA-256 sum, file size and executable bit can easily fit in a file handle.
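The hypothetical interface below sketches how these three kinds of allocations could be expressed. The method names and signatures are made up for this example; the real HandleAllocator in pkg/filesystem/virtual is more elaborate.

```go
package virtual // illustrative sketch, not the actual Buildbarn API

// Leaf stands in for a file-like object in the virtual file system; the
// real interface provides methods such as VirtualGetAttributes().
type Leaf interface{}

// HandleAllocator decorates objects with an identity (an inode number
// and, for NFSv4, a file handle) and manages their lifecycle.
type HandleAllocator interface {
	// Stateful objects: mutable files receive a dynamically allocated
	// inode number/file handle.
	NewStatefulLeaf(leaf Leaf) Leaf

	// Stateless objects: immutable objects receive a deterministic
	// identity derived from their state (e.g., a symlink target), but
	// their lifecycle is still tracked explicitly, as the state may not
	// fit in a 128-byte file handle.
	NewStatelessLeaf(state []byte, leaf Leaf) Leaf

	// Resolvable objects: the state is embedded into the file handle
	// itself, so that the object can be recreated on demand when a
	// client presents the handle (e.g., CAS-backed files).
	NewResolvableLeaf(state []byte, resolve func(state []byte) (Leaf, error)) Leaf
}
```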
Status: This work has also been completed as part of commit c4bbd24.
Implementations of HandleAllocator
have been added for FUSE
and NFSv4.
All versions of NFS are built on top of ONC RPCv2 (RFC 5531), also known as "Sun RPC". Like most protocols built on top of ONC RPCv2, NFS uses XDR (RFC 4506) for encoding request and response payloads. The XDR description for NFSv4.0 can be found in RFC 7531.
A nice feature of schema languages like XDR is that they can be used to perform code generation. For each of the types described in the RFC, we may emit an equivalent native type in Go, together with serialization and deserialization methods. In addition to making our server's code more readable, it is less error prone than attempting to read/write raw bytes from/to a socket.
A disadvantage of this approach is that it does add overhead. When converted to native Go types, requests are no longer stored contiguously in some buffer, but may be split up into multiple objects, which may become heap allocated (thus garbage collector backed). Though this is a valid concern, we will initially assume that this overhead is acceptable. In-kernel implementations tend to make a different trade-off in this regard, but this is likely a result of memory management in kernel space being far more restricted.
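As an illustration, an XDR type such as nfstime4 from RFC 7531 could be translated into Go along the following lines. The naming scheme and (de)serialization methods shown here are merely a plausible shape for such generated code, not the actual output of the compiler described below.

```go
package nfsv4 // hypothetical package for generated code

import (
	"encoding/binary"
	"io"
)

// XDR input (RFC 7531):
//
//	struct nfstime4 {
//	        int64_t  seconds;
//	        uint32_t nseconds;
//	};
//
// A generated Go equivalent might look like this.
type Nfstime4 struct {
	Seconds  int64
	Nseconds uint32
}

// WriteTo serializes the structure using XDR's big-endian encoding.
func (t *Nfstime4) WriteTo(w io.Writer) (int64, error) {
	var buf [12]byte
	binary.BigEndian.PutUint64(buf[0:8], uint64(t.Seconds))
	binary.BigEndian.PutUint32(buf[8:12], t.Nseconds)
	n, err := w.Write(buf[:])
	return int64(n), err
}

// ReadFrom deserializes the structure from an XDR stream.
func (t *Nfstime4) ReadFrom(r io.Reader) (int64, error) {
	var buf [12]byte
	n, err := io.ReadFull(r, buf[:])
	if err != nil {
		return int64(n), err
	}
	t.Seconds = int64(binary.BigEndian.Uint64(buf[0:8]))
	t.Nseconds = binary.BigEndian.Uint32(buf[8:12])
	return int64(n), nil
}
```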
A couple of implementations of XDR for Go already exist. Unfortunately, none of these implementations are complete enough to be of use for this specific use case. We will therefore design our own implementation, which we will release as a separate project that does not depend on any Buildbarn code.
Status: The XDR to Go compiler has been released on GitHub.
With all of the previous tasks completed, we have all of the building
blocks in place to be able to add an NFSv4 server to the
bb-remote-execution codebase. All that is left is to write an
implementation of program NFS4_PROGRAM,
for which the XDR to Go compiler automatically generates the following
interface:
```go
type Nfs4Program interface {
	NfsV4Nfsproc4Null(context.Context) error
	NfsV4Nfsproc4Compound(context.Context, *Compound4args) (*Compound4res, error)
}
```
This implementation, which we will call baseProgram, needs to process the operations provided in Compound4args by translating them to calls against instances of Directory and Leaf.
For most NFSv4 operations this implementation will be relatively simple. For example, for RENAME it involves little more than extracting the directory objects and filenames from the request, followed by calling Directory.VirtualRename().
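A sketch of such a handler is shown below. The surrounding types are heavily simplified stand-ins invented for this example; a real implementation first has to verify that the current and saved filehandles refer to directories and map failures to NFSv4 status codes.

```go
package nfsv4 // illustrative sketch only; not the actual baseProgram code

// Minimal stand-ins for the generated XDR types and the virtual file
// system interface. All names and signatures here are assumptions.
type Status int

type Directory interface {
	VirtualRename(oldName string, newDirectory Directory, newName string) Status
}

type Rename4args struct {
	Oldname string
	Newname string
}

type Rename4res struct {
	Status Status
}

// compoundState tracks the state of a single COMPOUND request. NFSv4
// operations act on a "current" and a "saved" filehandle; for RENAME
// these identify the target and source directories respectively.
type compoundState struct {
	currentDirectory Directory
	savedDirectory   Directory
}

// opRename extracts the two directories and filenames and delegates the
// actual work to the virtual file system.
func (s *compoundState) opRename(args *Rename4args) Rename4res {
	return Rename4res{
		Status: s.savedDirectory.VirtualRename(args.Oldname, s.currentDirectory, args.Newname),
	}
}
```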
Most of the complexity of baseProgram
will lie in how operations like
OPEN, CLOSE, LOCK, and LOCKU are implemented. These operations establish
and alter state on the server, meaning that they need to be guarded
against replays and out-of-order execution. To solve this, NFSv4.0
requires that these operations are executed in the context of
open-owners and lock-owners. Each open-owner and lock-owner has a
sequence ID associated with it, which gets incremented whenever an
operation succeeds. The server can thus detect replays of previous
requests by comparing the sequence ID in the client's request with the
value stored on the server. If the sequence ID corresponds to the last
operation to execute, a cached response is returned. A new transaction
will only be performed if the sequence ID is one larger than the last
observed.
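The heart of that replay check can be summarized as follows. This is a simplified sketch that assumes an invented openOwnerState layout; the real bookkeeping also has to return the previously cached response verbatim and cover the confirmation and error cases spelled out in RFC 7530.

```go
package nfsv4 // simplified sketch of NFSv4.0 sequence ID handling

// cachedResponse holds the XDR-encoded response of the last
// sequence-modifying operation, so that it can be replayed.
type cachedResponse struct {
	response []byte
}

// openOwnerState is a hypothetical per-open-owner record; lock-owners
// are handled analogously.
type openOwnerState struct {
	lastSequenceID uint32
	lastResponse   cachedResponse
}

// checkSequenceID classifies an incoming sequence-modifying operation
// (OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE, CLOSE, ...) against the
// open-owner's state.
func (oos *openOwnerState) checkSequenceID(seqID uint32) (replay bool, ok bool) {
	switch seqID {
	case oos.lastSequenceID:
		// Retransmission of the previous request: return the cached
		// response instead of executing the operation again.
		return true, true
	case oos.lastSequenceID + 1:
		// The next transaction in the sequence: execute it. On success
		// the server stores the response and advances lastSequenceID.
		return false, true
	default:
		// A misordered or stale request, which NFSv4.0 answers with
		// NFS4ERR_BAD_SEQID.
		return false, false
	}
}
```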
The exact semantics of this sequencing model are fairly complex. They are
covered extensively in chapter 9 of RFC 7530, which is about 40 pages long.
The following attempts to summarize which data types we have declared as
part of baseProgram, and how they map to the NFSv4.0 sequencing model.
- baseProgram: In the case of bb_worker, zero or more build directories are declared to be exposed through an NFSv4 server.
- clientState: Zero or more clients may be connected to the NFSv4 server.
- clientConfirmationState: The client may have one or more client records, which are created through SETCLIENTID. Multiple client records can exist if the client loses all state and reconnects (e.g., due to a reboot).
- confirmedClientState: Up to one of these client records can be confirmed using SETCLIENTID_CONFIRM. This structure stores the state of a healthy client that is capable of opening files and acquiring locks.
- openOwnerState: Confirmed clients may have zero or more open-owners. This structure stores the current sequence number of the open-owner. It also holds the response of the last CLOSE, OPEN, OPEN_CONFIRM or OPEN_DOWNGRADE call for replays.
- openOwnerFileState: An open-owner may have zero or more open files. The first time a file is opened through this open-owner, the client needs to call OPEN_CONFIRM.
- lockOwnerState: Confirmed clients may have zero or more lock-owners. This structure stores the current sequence number of the lock-owner. It also holds the response of the last LOCK or LOCKU call for replays.
- lockOwnerFileState: A lock-owner may have one or more files with lock state.
- ByteRangeLock: A lock state may hold locks on byte ranges in the file.
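To make the containment relationships explicit, these types roughly nest as in the sketch below. The fields shown only express the hierarchy summarized above; actual field names, key types, locking and the remaining bookkeeping are omitted or invented.

```go
package nfsv4 // illustrative nesting only; not the actual declarations

type baseProgram struct {
	clients map[string]*clientState // one per connected client
}

type clientState struct {
	confirmations []*clientConfirmationState // records created by SETCLIENTID
	confirmed     *confirmedClientState      // at most one, after SETCLIENTID_CONFIRM
}

type clientConfirmationState struct {
	// Client record created by SETCLIENTID, awaiting SETCLIENTID_CONFIRM.
}

type confirmedClientState struct {
	openOwners map[string]*openOwnerState
	lockOwners map[string]*lockOwnerState
}

type openOwnerState struct {
	sequenceID uint32 // advanced by OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE and CLOSE
	openFiles  map[uint64]*openOwnerFileState
}

type openOwnerFileState struct {
	// Open state of a single file, confirmed through OPEN_CONFIRM the
	// first time the open-owner opens a file.
}

type lockOwnerState struct {
	sequenceID uint32 // advanced by LOCK and LOCKU
	files      map[uint64]*lockOwnerFileState
}

type lockOwnerFileState struct {
	locks []ByteRangeLock
}

type ByteRangeLock struct {
	start, end uint64 // byte range covered by the lock
	writable   bool   // write lock as opposed to a read lock
}
```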
Even though NFSv4.0 does provide a RELEASE_LOCKOWNER operation for
removing lock-owners, no facilities are provided for removing unused
open-owners and client records. baseProgram
will be implemented in
such a way that these objects are removed automatically if a
configurable amount of time passes. This is done as part of general lock
acquisition, meaning clients are collectively responsible for cleaning
stale state.
A disadvantage of NFSv4.0's sequencing model is that open-owners are not capable of sending OPEN requests in parallel. This is not expected to cause a bottleneck in our situation, as running the NFSv4 server on the worker itself means that latency is virtually nonexistent. It is worth noting that NFSv4.1 has completely overhauled this part of the protocol, thereby removing this restriction. Implementing that model is, as explained earlier, out of scope.
Status: This work has been completed as part of commit f00b857.
The new NFSv4 server can be found in pkg/filesystem/virtual/nfsv4.
The FUSE file system is currently configured through
the MountConfiguration
message.
Our plan is to split this message up, moving all FUSE-specific options
into a FUSEMountConfiguration
message. This message can then be placed
in a oneof
together with an NFSv4MountConfiguration
message that
enables the use of the NFSv4 server. Switching back and forth between
FUSE and NFSv4 should thus be trivial.
Status: This work has been completed as part of commit f00b857.
The new MountConfiguration
message with separate FUSE and NFSv4 backends can
be found here.
In this ADR we have mainly focused on the use of NFSv4 for bb_worker. These changes will also make it possible to launch bb_clientd with an NFSv4 server. bb_clientd's use case differs from bb_worker's, in that the use of the Remote Output Service heavily depends on being able to quickly invalidate directory entries when lazy-loading files are inserted into the file system. FUSE is capable of facilitating this by sending FUSE_NOTIFY_INVAL_ENTRY messages to the kernel. A similar feature named CB_NOTIFY is present in NFSv4.1 and later, but rarely implemented by clients.
Maybe bb_clientd can be made to work by disabling client-side directory caching. Would performance still be acceptable in that case?