
RFC: Graphene multi-process synchronization #2158

Closed
pwmarcz opened this issue Feb 15, 2021 · 15 comments

@pwmarcz
Contributor

pwmarcz commented Feb 15, 2021

Background

We want to rewrite Graphene's filesystem code; the goals include improving code quality, fixing numerous bugs, improving Linux compatibility, and better supporting multi-process use cases (see the main GitHub issue for a summary).

We probably need to consider synchronization first. This seems to be the deciding factor for shaping the architecture, and the main way in which Graphene is different from traditional OSes that share memory.

The following document attempts to focus the discussion by specifying common assumptions and proposing some solutions.

Assumptions

  • No shared memory. There is no shared memory in SGX, and future platforms supported by Graphene might not have any either. We can store information on the host, or communicate over IPC.

  • Main process. We assume that the initial process will always be running (i.e. other processes cannot continue when the main process is killed). Thanks to that assumption, we can use the main process for storing authoritative state, broadcasting, etc.

  • Untrusted host, trusted internal state. Without fully discussing Graphene's trust model: we prefer keeping things in memory over writing them to the host, and prefer Graphene's primitives (e.g. IPC) over host ones (e.g. host sockets).

  • Performance. We care about performance, and want at least the non-shared case (e.g. one process, or only one process using a given resource) to have reasonably low overhead.

Use cases

We want to support the following use cases for shared state:

  • Share modifications of files between programs: There should be a shared, writable filesystem. If one process modifies, renames or deletes a file, other processes should see it.

    This can be implemented in various ways: directly on a host filesystem, encrypted in some way, or as IPC.

  • Share in-memory filesystem (tmpfs): We want to have a way to store files in memory. Reasons include performance (not having to exit the SGX enclave), and confidentiality and integrity (data is protected from access and modification by the host).

  • Share file handles: Processes can share file handles, and their state (mostly file offset) should be synchronized, so that e.g. write calls from multiple processes can be atomic and not overwrite one another.

  • See external modifications of host files? For instance, a long-running Graphene process should notice that a new file has been created from outside. (Is this important?)

Solution: use the main process?

The nuclear option.

Do not store anything filesystem-related, except maybe a table of process handles. Forward all requests (openat, read, write, readdir, ...) to the main process.

This solves most problems with synchronization, at the cost of overhead.

Not all requests can be handled this way (e.g. mmap).

It might make sense to keep metadata in the main process, and handle I/O in the requesting process somehow (e.g. if the application requests read, we notify the main process of the changed position, and read the data ourselves). Then again, that doesn't make sense for tmpfs.

Note that the IPC requests can have a "fast path" for the main process itself.

Solution: treat dentry/inode as cache

This is roughly what FUSE does: assume that the objects we have (dentry/inode) are just proxies for remote objects, and the information stored in them can expire or otherwise become invalid.

After any change that modifies a given file (e.g. a size change, or a change to a directory listing), use IPC to notify other processes of the change. If other processes have cached information about the file, they can evict it from their cache.

An expiry time could also help, so that changes introduced from outside eventually become visible.
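
To make this concrete, here is a minimal sketch (illustrative only, not Graphene code) of a per-process metadata cache whose entries can be dropped either by an explicit IPC invalidation or by an expiry timeout. All names (MetadataCache, fetch_remote, EXPIRY_SECONDS) are made up for this example.

```python
import time

EXPIRY_SECONDS = 2.0  # hypothetical expiry, so outside changes become visible eventually

class MetadataCache:
    def __init__(self):
        self._entries = {}  # path -> (metadata, cached_at)

    def get(self, path, fetch_remote):
        entry = self._entries.get(path)
        if entry is not None:
            metadata, cached_at = entry
            if time.monotonic() - cached_at < EXPIRY_SECONDS:
                return metadata           # fast path: no IPC, no host access
        metadata = fetch_remote(path)     # e.g. ask the authoritative process over IPC
        self._entries[path] = (metadata, time.monotonic())
        return metadata

    def invalidate(self, path):
        # Called on an IPC notification that another process changed this file.
        self._entries.pop(path, None)
```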

Solution: private/shared file handles?

Updates such as the file position might be problematic, since they happen pretty often. We might want to distinguish the case where a handle is held by one process only from the case where it is shared between processes. Note that a fork might convert a private handle to a shared one, and a close or process exit might convert it back.

Private handles might be handled directly in the process using them, and shared ones in the main process. This way the non-shared case has no overhead.
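
A rough sketch of this state tracking (hypothetical names, not an actual Graphene structure): a handle starts out private, is upgraded on fork, and drops back to private once only one holder remains.

```python
class HandleSharing:
    """Tracks whether a handle is private to one process or shared."""

    def __init__(self):
        self.holders = 1          # processes currently holding this handle

    @property
    def shared(self):
        return self.holders > 1   # shared handles would go through the main process

    def on_fork(self):
        self.holders += 1         # the child inherits a copy: now shared

    def on_holder_gone(self):
        # A holder closed its copy or exited; with a single holder left,
        # the handle is effectively private again (no IPC overhead).
        self.holders -= 1
```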

Probably a bad idea? Looks too complex.

@pwmarcz
Contributor Author

pwmarcz commented Feb 16, 2021

Updated after comments by @boryspoplawski and @mkow:

  • it's not "modifications of host files" that's important, but having some kind of shared filesystem;
  • in-memory tmpfs might be important because we don't have to exit the enclave to access it.

By the way, if tmpfs is implemented over IPC (read/write involves calls over IPC), does that negate the performance benefits?

@dimakuv

dimakuv commented Feb 16, 2021

A remark about mmap. I think mmapped files are quite common in real-world applications. On the other hand, we have the assumption of no shared memory. This leads to only one solution: synchronize file contents only on flush/close.

By extension, since we cannot guarantee any access sync on mmap, there is no point in guaranteeing this access sync on read/write.

In other words, it looks like we can only meaningfully synchronize at flush/close points in time, not on every read/write (and obviously not on every access to the mmapped region).
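
For illustration, a sketch of what "synchronize only on flush/close" could look like without shared memory (PrivateFileMapping is a made-up name, not real Graphene code): each process works on a private copy of the mapped region and writes it back only when flushing or closing.

```python
import os

class PrivateFileMapping:
    """Private in-memory copy of a file region; other processes see
    modifications only after flush() (i.e. at msync/close time)."""

    def __init__(self, path, length):
        self.path = path
        with open(path, "rb") as f:
            data = f.read(length)
        self.data = bytearray(data.ljust(length, b"\0"))

    def write(self, offset, buf):
        # Visible only to this process until flush() is called.
        self.data[offset:offset + len(buf)] = buf

    def flush(self):
        # Write the whole private copy back; this is the only synchronization point.
        with open(self.path, "r+b") as f:
            f.write(self.data)
```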

Given this, I wonder:

  1. Are there FS implementations / research prototypes that argue in favor of such weak synchronization?
  2. Are there applications that definitely break with such weak synchronization?

@dimakuv

dimakuv commented Feb 16, 2021

We might want to distinguish the case where a handle is held by one process only from the case where it is shared between processes. Note that a fork might convert a private handle to a shared one, and a close or process exit might convert it back.

I guess you didn't mean close or process exit, but rather a successful waitpid/arrived SIGCHLD? Because from the perspective of the parent process, the handle is "private" again when all children have terminated.

Also, I think we can have a simple scheme of "I am the main process, do I have any children?" to switch between shared/private file handles. So, we don't do this switch at the granularity of each file handle, but rather have a simple rule: "If I am the main process, and there are no children now, I will always use private file handles; otherwise I will always use shared file handles". This covers many Bash-like multi-process scenarios, where Bash/Python/R spawn a couple of child processes to query and prepare/clean up the system, and then the main workload runs in a single process. (Maybe that's what you meant.)
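
That rule is simple enough to write down directly (just a sketch; is_main_process and live_children stand in for whatever process bookkeeping Graphene already has):

```python
def use_private_handles(is_main_process: bool, live_children: int) -> bool:
    # Only the main process with no live children can safely treat all of its
    # handles as private; in every other situation assume they may be shared.
    return is_main_process and live_children == 0
```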

On the other hand, yes, the solution of shared file handles (vs the naive "let the host OS deal with them") sounds complicated.

@mkow
Member

mkow commented Feb 16, 2021

This covers many Bash-like multi-process scenarios, where Bash/Python/R spawn a couple of child processes to query and prepare/clean up the system, and then the main workload runs in a single process. (Maybe that's what you meant.)

Wait, but the main workload process in this scenario will be a child of the Bash process, so it won't be able to benefit from private handles in this implementation?

@dimakuv

dimakuv commented Feb 16, 2021

Wait, but the main workload process in this scenario will be a child of the Bash process, so it won't be able to benefit from private handles in this implementation?

I was thinking about Python and R cases. Python likes to spawn gcc/ldd/whatever before doing some actual work in the main process. R likes to spawn rm after doing actual work.

@pwmarcz
Contributor Author

pwmarcz commented Feb 19, 2021

I guess you didn't mean close or process exit, but rather a successful waitpid/arrived SIGCHLD? Because from the perspective of the parent process, the handle is "private" again when all children have terminated.

Well, the handle becomes "private" when the children terminate (which indeed can be detected using waitpid/SIGCHLD, but in Graphene could also be a result of an IPC notification), but it can also become private as a result of the child process closing its own copy.

It might help to track ownership in the main process, especially if we need locking (see below).

In other words, it looks like we can only meaningfully synchronize at flush/close points in time, not on every read/write (and obviously not on every access to the mmapped region).

What about the file position? I don't know if it happens in practice, but I can imagine multiple processes writing to the same handle for a log file (for example, a forking HTTP server). And with O_APPEND, this applies not only to shared handles, but also to inodes.

To make this work correctly, we don't need to synchronize data after each write, but we do need to synchronize the file position. That means some kind of lock during read and write.
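
As a sketch of what such a lock would protect (illustrative only; a threading.Lock stands in here for the cross-process lock, which in Graphene would have to go over IPC):

```python
import os
import threading

class SharedHandleState:
    """Stand-in for per-handle state owned by the main process:
    the authoritative file position plus a lock protecting it."""

    def __init__(self):
        self.lock = threading.Lock()   # would be an IPC-backed lock in Graphene
        self.offset = 0

def shared_write(fd, state, buf):
    # Lock, write at the authoritative offset, publish the new offset --
    # so concurrent writers cannot overwrite each other's data.
    with state.lock:
        written = os.pwrite(fd, buf, state.offset)
        state.offset += written
        return written
```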

@dimakuv

dimakuv commented Feb 22, 2021

What about the file position? I don't know if it happens in practice, but I can imagine multiple processes writing to the same handle for a log file (for example, a forking HTTP server). And with O_APPEND, this applies not only to shared handles, but also to inodes.
To make this work correctly, we don't need to synchronize data after each write, but we do need to synchronize the file position. That means some kind of lock during read and write.

That's true, we need to synchronize the file position... So yes, we still need some locking on read/write, if only to broadcast/look up the updated file position. Also, I'm pretty sure Apache httpd and Nginx do exactly what you described: several processes append to the same log file.

This will probably be slow... I'm still curious whether any research papers/blogs did this and evaluated the overheads (or maybe we should just look at some docs on NFS?).

@pwmarcz
Contributor Author

pwmarcz commented Feb 22, 2021

I only know that NFS does not support O_APPEND. From man 2 open:

O_APPEND may lead to corrupted files on NFS filesystems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can't be done without a race condition.

Also, our situation is a bit harder because file handles can be shared as well, not only files.

I'm starting to have an idea:

  • Make shared objects (inode, handle) serializable, so that they can be sent between processes (a rough sketch follows after this list)
  • Main process runs an IPC server that handles locking and synchronization
  • Optimize common cases:
    • multiple processes reading an object that doesn't change: shouldn't call the server every time
    • a single process updating an object that no other process accesses: shouldn't communicate with the server until the object is actually shared
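
For the first point above, a minimal sketch of what "serializable" could mean here (toy field set and JSON encoding, purely illustrative):

```python
import json

class SharedInode:
    """Toy 'inode' that can be turned into bytes and rebuilt in another process."""

    def __init__(self, ino, size, mtime):
        self.ino, self.size, self.mtime = ino, size, mtime

    def serialize(self) -> bytes:
        return json.dumps({"ino": self.ino, "size": self.size,
                           "mtime": self.mtime}).encode()

    @classmethod
    def deserialize(cls, data: bytes) -> "SharedInode":
        fields = json.loads(data)
        return cls(fields["ino"], fields["size"], fields["mtime"])
```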

I'm working on a draft now, but I'm wondering if a protocol / distributed locking scheme like that already exists. It would be better to base it on prior art.

@dimakuv

dimakuv commented Feb 22, 2021

Make shared objects (inode, handle) serializable, so that they can be sent between processes

We already have this logic, but it is tightly coupled to the checkpointing subsystem of Graphene. We need to make this code usable by other subsystems (and refactor all that macro magic with CP and RS functions). One other use case that I wanted to enable in Graphene some time ago, but got stuck on for similar reasons: #1511.

But this is definitely needed, and we already have the code. Just need to decouple it from the checkpointing subsystem.

Main process runs an IPC server that handles locking and synchronization

Yes! Spreading this responsibility across several "leader" nodes in the original code of Graphene proved to be extremely buggy and complicated. We definitely must have only one (main-process) leader.

Optimize common cases.

Yes.

@pwmarcz
Contributor Author

pwmarcz commented Feb 22, 2021

@mkow suggests that what I describe is similar to cache coherence protocols such as MESI.

On the surface, it looks somewhat similar: I want to track which clients have a copy of an object, and whether it is modified compared to the baseline (in which case the other cached copies become invalid).

Some differences I see:

  • Processes may die. What if a process dies holding an object that it has modified? Or maybe we can assume that if a process dies without a proper cleanup from Graphene's side, it's an unrecoverable error?
  • In addition to coherence, we need a facility for some locking: for instance, an exclusive lock for situations such as an O_APPEND write (so that only one writer holds this lock at a time).

With luck, the same abstraction could be used for file locks (fcntl/flock), and maybe for more features that normally would require shared memory between processes?

@boryspoplawski
Contributor

@mkow suggests that what I describe is similar to cache coherence protocols such as MESI.

I think such protocols generally assume that inter-cache (processor) communication is cheaper than a full write to memory, don't they? In our case, a write to the underlying file might be cheaper than a round trip to the IPC master process.
(I'm only talking about performance, not correctness.)

@pwmarcz
Contributor Author

pwmarcz commented Feb 22, 2021

That's true, and the protocol might look different in our case. But I'm hoping the common cases will not require either IPC or file access. For instance, a process could respond to repeated stat, getdents etc. with cached values, and IPC/file access would be necessary only when the cache is invalidated.

@pwmarcz
Contributor Author

pwmarcz commented Mar 9, 2021

I thought about how to write such a filesystem and decided to write a prototype in Python: https://github.com/pwmarcz/fs-demo

Key points:

  • There is an abstraction ("sync handle") that behaves more or less like a reader/writer lock. After use, the lock is not returned immediately, so that subsequent uses are cheap and do not require IPC.
  • For synchronizing directory entries, it's enough to use the locking functionality (and reload the dentry when we had to take the lock from someone else). For file handles (and possibly other objects), we also share some data together with the lock.
  • The IPC protocol is not too complicated (it has 4 commands: req_get, get, req_put, put). Implementing it is a bit more complicated, unfortunately (you have to keep track of which processes have a given handle); a sketch follows below.
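
This is roughly what that bookkeeping could look like (simplified, not the prototype's actual code; the message names are the four commands above, everything else is illustrative): the main process remembers which process currently holds each handle and recalls the handle before granting it to someone else.

```python
class SyncServer:
    """Simplified sketch of the main-process server."""

    def __init__(self, send):
        self.send = send    # send(process_id, message_dict) over IPC
        self.holder = {}    # handle id -> process currently holding it
        self.waiting = {}   # handle id -> processes waiting for it

    def on_message(self, sender, msg):
        handle_id = msg["id"]
        if msg["cmd"] == "req_get":                 # a process wants the handle
            holder = self.holder.get(handle_id)
            if holder is None:
                self.holder[handle_id] = sender
                self.send(sender, {"cmd": "get", "id": handle_id, "data": None})
            else:
                self.waiting.setdefault(handle_id, []).append(sender)
                self.send(holder, {"cmd": "req_put", "id": handle_id})
        elif msg["cmd"] == "put":                   # the holder gave it back
            self.holder.pop(handle_id, None)
            waiters = self.waiting.get(handle_id, [])
            if waiters:
                next_proc = waiters.pop(0)
                self.holder[handle_id] = next_proc
                self.send(next_proc,
                          {"cmd": "get", "id": handle_id, "data": msg.get("data")})
```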

Please take a look at the linked repo, and tell me what you think! There is a more detailed explanation, with diagrams, and some additional considerations at the bottom.

A sanity check would be very valuable, as would suggestions on how to simplify this, or pointers to some prior art. And of course, I'll try to explain better if something is unclear.

@dimakuv

dimakuv commented Jul 22, 2021

@pwmarcz Since this is (mostly) implemented, do you still want to keep this issue open? Or can we close it?

@pwmarcz
Contributor Author

pwmarcz commented Jul 22, 2021

So far I have only implemented a basic version, but this discussion seems finished for now. Closing.
