RFC: Graphene multi-process synchronization #2158
Comments
Updated after comments by @boryspoplawski and @mkow:
By the way, if tmpfs is implemented over IPC (read/write involves calls over IPC), does that negate the performance benefits?
A remark about […]. By extension, since we cannot guarantee any access sync on […]. In other words, it looks like we can only meaningfully synchronize at […]. Given this, I wonder: […]
I guess you didn't mean […]. Also, I think we can have a simple scheme of "I am the main process, do I have any children?" to switch between shared/private file handles. So, we don't do this switch at the granularity of each file handle, but rather have a simple rule: "If I am the main process, and there are no children now, I will always use private file handles; otherwise I will always use shared file handles". This covers many Bash-like multi-process scenarios, where Bash/Python/R spawn a couple of child processes to query and prepare/clean up the system, and then the main workload runs in a single process. (Maybe that's what you meant.) On the other hand, yes, the solution of shared file handles (vs the naive "let the host OS deal with them") sounds complicated.
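For concreteness, a minimal sketch of how that rule could be expressed (the names here are hypothetical, not actual Graphene code):

```python
# Hypothetical sketch: the main process uses fast "private" handles as long as
# it has no live children; as soon as children exist, it falls back to shared
# handles. Child processes always use shared handles.

class ProcessState:
    def __init__(self, is_main_process):
        self.is_main_process = is_main_process
        self.live_children = 0      # updated on fork and on child exit (waitpid/SIGCHLD or IPC)

    def use_private_handles(self):
        # "If I am the main process, and there are no children now, I will always
        # use private file handles; otherwise I will always use shared file handles."
        return self.is_main_process and self.live_children == 0
```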
Wait, but the main workload process in this scenario will be a child of the Bash process, so it won't be able to benefit from private handles in this implementation?
I was thinking about Python and R cases. Python likes to spawn […]
Well, the handle is "private" when the children terminate (which indeed can be detected using waitpid/SIGCHLD, but in Graphene could also be a result of an IPC notification), but it can also be a result of the child process closing its own copy. It might help to track ownership in the main process, especially if we need locking (see below).
What about the file position? I don't know if it happens in practice, but I can imagine multiple processes writing to the same handle for a log file (for example, a forking HTTP server). And with O_APPEND, this applies not only to shared handles, but also to inodes. To make this work correctly, we don't need to synchronize data after each […]
That's true, we need to synchronize the file position... So yes, we still need some locking on read/write, if only to broadcast/lookup the updated file position. Also, I'm pretty sure Apache httpd and Nginx do exactly what you described: several processes append to the same log file. This will probably be slow... I'm still curious if any research papers/blogs did this and evaluated the overheads (or maybe just look at some docs on NFS?).
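As a rough sketch of that, assuming the main process keeps the authoritative offset of a shared handle and writers reserve a range over IPC before each write (all names are made up for illustration):

```python
# Sketch only: the main process owns the file position of a shared handle.
# A writer asks it to reserve a byte range, then writes at the returned offset,
# so concurrent appends from several processes don't overwrite one another.

import threading

class SharedHandleState:                # lives in the main process
    def __init__(self):
        self.lock = threading.Lock()
        self.pos = 0

    def reserve_write(self, length):
        # Called (over IPC) by the process that wants to write; returns the offset.
        with self.lock:
            offset = self.pos
            self.pos += length
            return offset

# In the writing process, roughly:
#   offset = ipc_call(main_process, "reserve_write", handle_id, len(buf))
#   os.pwrite(fd, buf, offset)          # write at the reserved offset
```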
I only know that NFS does not support O_APPEND. From […]
Also, our situation is a bit harder because file handles can be shared as well, not only files. I'm starting to have an idea: […]
I'm working on a draft now, but I'm wondering whether a protocol / distributed locking scheme like that already exists? It would be better to base it on prior art.
We already have this logic, but it is tightly coupled to the checkpointing subsystem of Graphene. We need to make this code usable by other subsystems (and refactor all that macro-magic with […]). But this is definitely needed, and we already have the code. We just need to decouple it from the checkpointing subsystem.
Yes! Spreading this responsibility across several "leader" nodes in the original code of Graphene proved to be extremely buggy and complicated. We definitely must have only one (main-process) leader.
Yes.
@mkow suggests that what I describe is similar to cache coherence protocols such as MESI. On the surface, it looks somewhat similar: I want to track which clients have a copy of an object, and whether it is modified compared to the baseline (in which case the other cached copies become invalid). Some differences I see: […]
With luck, the same abstraction could be used for file locks (fcntl/flock), and maybe for more features that normally would require shared memory between processes?
I think such protocols generally assume that inter-cache (processor) communication is cheaper than a full write to memory, don't they? In our case, a write to the underlying file might be cheaper than a round-trip communication with the IPC master process.
That's true, and the protocol might look different in our case. But I'm hoping the common cases will not require either IPC or file access. For instance, a process could respond to repeated […]
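As a toy illustration of that hoped-for common case, using MESI-like states (purely hypothetical, not a worked-out protocol):

```python
# Sketch: once a process holds a file in the SHARED state, repeated reads are
# served locally with no IPC and no file access; only a state change (or an
# invalidation from another process) costs a round-trip to the main process.

from enum import Enum, auto

class State(Enum):
    INVALID = auto()    # must ask the main process before using cached data
    SHARED = auto()     # cached copy is valid for reads, no IPC needed

class CachedFile:
    def __init__(self, ipc):
        self.ipc = ipc              # stand-in for a request/response IPC channel
        self.state = State.INVALID
        self.data = None

    def read(self):
        if self.state is State.INVALID:
            # one IPC round-trip; later reads stay local until invalidated
            self.data = self.ipc.acquire_shared()
            self.state = State.SHARED
        return self.data

    def on_invalidate(self):        # broadcast from the main process on modification
        self.state = State.INVALID
        self.data = None
```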
I thought about how to write such a filesystem and decided to write a prototype, in Python: https://github.com/pwmarcz/fs-demo
Key points: […]
Please take a look at the linked repo, and tell me what you think! There is a more detailed explanation, with diagrams, and some additional considerations at the bottom. A sanity check would be very valuable, as would suggestions on how to simplify this, or pointers to some prior art. And of course, I'll try to explain better if something is unclear.
@pwmarcz Since this is (mostly) implemented, do you still want to keep this issue open? Or can we close it?
So far I have only implemented a basic version, but this discussion seems finished for now. Closing. |
Background
We want to rewrite the Graphene filesystem code; the goals include improving code quality, fixing numerous bugs, improving Linux compatibility, and improving multi-process use cases (see the main GitHub issue for a summary).
We probably need to consider synchronization first. This seems to be the deciding factor for shaping the architecture, and the main way in which Graphene is different from traditional OSes that share memory.
The following document attempts to focus the discussion by specifying common assumptions and proposing some solutions.
Assumptions
No shared memory. There is no shared memory in SGX, and future platforms supported by Graphene might not have any either. We can store information on the host, or communicate over IPC.
Main process. We assume that the initial process will always be running (i.e. other processes cannot continue when the main process is killed). Thanks to that assumption, we can use the main process for storing authoritative state, broadcasting, etc.
Untrusted host, trusted internal state. Without fully discussing Graphene's trust model, we prefer keeping things in memory to writing them on the host, and prefer Graphene's primitives (e.g. IPC) to host ones (e.g. host sockets).
Performance. We care about performance, and want at least the non-shared case (e.g. one process, or only one process using a given resource) to have reasonably low overhead.
Use cases
We want to support the following use cases for shared state:
Share modifications of files between programs: There should be a shared, writable filesystem. If one process modifies, renames or deletes a file, other processes should see it.
This can be implemented in various ways: directly on a host filesystem, encrypted in some way, or as IPC.
Share in-memory filesystem (tmpfs): We want to have a way to store files in memory. Reasons include performance (not having to exit SGX enclave), confidentiality and integrity (data is protected from access and modification by host).
Share file handles: Processes can share file handles, and their state (mostly file offset) should be synchronized, so that e.g. write calls from multiple processes can be atomic and not overwrite one another.
See external modifications of host files? For instance, a long-running Graphene process should notice that a new file has been created from outside. (Is this important?)
Solution: use the main process?
The nuclear option.
Do not store anything filesystem-related, except maybe a table of process handles. Forward all requests (openat, read, write, readdir, ...) to the host.
This solves most problems with synchronization, at the cost of overhead.
Not all requests can be handled this way (e.g. mmap).
It might make sense to keep metadata in the main process, and handle I/O in the requesting process somehow (e.g. if the application requests read, we notify the main process of the changed position, and read the data ourselves). Then again, that doesn't make sense for tmpfs.
Note that the IPC requests can have a "fast path" for the main process itself.
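A minimal sketch of what such forwarding could look like, assuming requests travel over IPC to the main process (which then performs them against the host); FsClient, local_fs and ipc are hypothetical stand-ins, not Graphene APIs:

```python
# Sketch: every filesystem request becomes an IPC message to the main process,
# except in the main process itself, which takes the "fast path" and handles
# the request directly.

class FsClient:
    def __init__(self, is_main_process, local_fs, ipc):
        self.is_main_process = is_main_process
        self.local_fs = local_fs    # real implementation; only used in the main process
        self.ipc = ipc              # request/response channel to the main process

    def call(self, op, *args):
        if self.is_main_process:
            return getattr(self.local_fs, op)(*args)    # fast path, no IPC
        return self.ipc.request(op, args)               # e.g. ("read", (fd, size))

# Usage, roughly: data = fs.call("read", fd, size); fs.call("write", fd, buf)
```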
Solution: treat dentry/inode as cache
This is roughly what FUSE does: assume that the objects we have (dentry/inode) are just proxies for remote objects, and that the information stored in them can be expired or otherwise invalid.
After any change that modifies a given file (e.g. change size, change directory listing), use IPC to notify other processes of the change. If other processes have cached information about a file, they can evict it from cache.
Expiry time could also help, so that changes introduced from outside are visible eventually.
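One possible shape of such a cache entry, combining change notifications with an expiry time (names and the TTL value are made up for illustration):

```python
# Sketch: cached metadata is dropped either when another process notifies us of
# a change over IPC, or when it simply expires, so outside modifications become
# visible eventually.

import time

CACHE_TTL = 1.0                     # seconds; arbitrary, would need tuning

class CachedInode:
    def __init__(self, fetch):
        self.fetch = fetch          # asks the main process / host for fresh metadata
        self.meta = None
        self.valid_until = 0.0

    def get_meta(self):
        if self.meta is None or time.monotonic() > self.valid_until:
            self.meta = self.fetch()                        # IPC or host round-trip
            self.valid_until = time.monotonic() + CACHE_TTL
        return self.meta

    def on_change_notification(self):                       # IPC from a writer
        self.meta = None            # evict; next get_meta() re-fetches
```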
Solution: private/shared file handles?
Updates such as the file position might be problematic, since they happen pretty often. We might want to distinguish the case where a handle is held by one process only from the case where it is shared between processes. Note that a fork might convert a private handle to a shared one, and close or process exit might convert it back.
Private handles might be handled directly in the process using them, and shared ones in the main process. This way the non-shared case has no overhead.
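As a sketch (hypothetical and simplified to a single handle object), the fork/close transitions could look roughly like this:

```python
# Sketch: fork promotes a private handle to shared (its offset would migrate to
# the main process), and when only one owner remains it is demoted back, so the
# single-process case stays cheap.

class Handle:
    def __init__(self):
        self.owners = 1             # number of processes holding this handle
        self.shared = False         # False: offset and other state kept locally

    def on_fork(self):
        self.owners += 1
        if not self.shared:
            self.shared = True      # hand the offset over to the main process

    def on_close(self):             # also triggered by process exit
        self.owners -= 1
        if self.owners == 1 and self.shared:
            self.shared = False     # sole owner again: take the offset back locally
```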
Probably a bad idea? Looks too complex.