[Proposal] Native code/module access via WASI/VFS #237

hugovincent · 2021-10-14T15:03:47Z

hugovincent
Oct 14, 2021
Maintainer

Status: work in progress.

Rationale:

We want a way of exposing certain native functionality (such as side-channel safe
cryptography, or efficient NN inference) to Wasm programs on Veracruz.

The normal Wasm way of doing this is with imports, directly giving the Wasm
program functions it can call. Depending on the implementation of the Wasm runtime,
this may imply copying of data in and out of the Wasm linear memory. It also implies
giving the native functionality access to the Wasm Linear Memory (to read or write
any data not passed by value), which is a security risk for arbitrary native code.

While Wasm Interface Types [1] would be helpful here, for Veracruz we additionally
want to restrict (at least for now) the functionality that is made available with
native modules, and for now don't want programs to be able to bring their own arbitrary
native modules. This is to limit the policy complexity, to maximise the benefit we
get from sandboxing in the first place, and to make our future lives easier as we
start looking at bounding program behaviour.

Therefore, for now, we propose to add only necessary native functionality as trusted,
Veracruz-provided modules, and to access them primarily using the WASI system interface
and VFS. The first such module will be for cryptography, where the functionality cannot
be adequately implemented in Wasm due to the inability to write side-channel safe
code portably in Wasm (and also, where native instructions such as Armv8-CE or AES-NI
provide a very significant performance uplift).

Patterns:

Direct VFS access: native services are accessed using a "device" file on the VFS.
This has the advantages of simplicity and familiarity, and disadvantages of
requiring multiple copies of data in memory (e.g. to decrypt a file in the VFS, the
program needs to read the ciphertext into linear memory, write it to the device file,
and read back the results into linear memory). This pattern also avoids confused deputy
threats (see below), since the native module cannot directly access the VFS.
Indirect VFS access: native services are accessed using a device file as above, but
these native services can also directly access the VFS. Instead of copying via the
linear memory, the program hands the native functionality a handle/capability to an
object on the VFS (such as an encrypted blob) and another capability to
a VFS object which will contain the result. We must use capabilities here to avoid
confused deputy attacks, wherein a program can e.g. confuse the decryption module into
giving it access to a VFS file it does not have policy-granted permission to access.

As with POSIX and other programming of this sort, both approaches suffer from the challenge
of in-band vs out-of-band communications and signalling, e.g. for the decryption service
example, where the program supplies the decryption key. Traditionally this is solved
with ioctls and signals; we want to avoid that complexity if possible in our VFS,
however the alternatives (e.g. sockets) don't seem better. For the first prototype, we
propose to handle comms (e.g. for keying the decryptor module) in band using an RPC style
mechanism (e.g. commands and arguments serialized to protobuf and written to the device file)
and to ignore out of band signalling.

Note that it's possible to combine these models, e.g. input indirectly, result directly, etc.
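
As a rough illustration of the direct pattern, here is a minimal Rust sketch. It assumes the device file behaves like any object implementing `Read` and `Write`. `MockDevice` and `decrypt_direct` are hypothetical names, and the XOR "decryption" is only a stand-in so the example runs without a real native module.

```rust
use std::io::{self, Read, Write};

/// Hypothetical stand-in for a device file such as /crypto/file-decryption/aes.
/// It "decrypts" by XOR-ing each byte with a fixed value, purely for illustration.
struct MockDevice {
    pending: Vec<u8>, // processed output waiting to be read back
    pos: usize,       // read cursor into `pending`
}

impl Write for MockDevice {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.pending.extend(buf.iter().map(|b| b ^ 0x5a));
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl Read for MockDevice {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = buf.len().min(self.pending.len() - self.pos);
        buf[..n].copy_from_slice(&self.pending[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

/// Direct pattern: the ciphertext is already in linear memory, is copied into
/// the device, and the result is copied back out into linear memory.
fn decrypt_direct<D: Read + Write>(device: &mut D, ciphertext: &[u8]) -> io::Result<Vec<u8>> {
    device.write_all(ciphertext)?;
    let mut plaintext = Vec::new();
    device.read_to_end(&mut plaintext)?;
    Ok(plaintext)
}
```

The two copies through linear memory (one in, one out) are visible in `decrypt_direct`; the indirect pattern would replace both with capabilities handed to the module.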

Example:

Here is a very rough sketch of how a file decryption service might look (from the Wasm program's perspective):

Program opens /crypto/file-decryption/aes as <cap-aes>
Program creates a temporary file on the VFS for the decrypted plaintext,
as <cap-result>. No one else will have a capability to access this.
Program writes the following JSON-coded [2] string to <cap-aes>:
```
{
   "mode":   "cbc",
   "key":    "0x1234...",
   "source": <capability/FD of ciphertext file>,
   "dest":   <cap-result>,
}
```
and closes <cap-aes>. Conceptually, this triggers the operation itself,
perhaps blocking until it completes [3].
Program reads results from <cap-result> and goes on with its day.

Internally, the AES file decryption module is implemented as part of the Veracruz
binary, like the rest of the VFS – a dynamically extensible design (i.e. runtime loadable
native modules) is possible within this interface, but out of scope for this proposal.

(Note: some operations might benefit from in-place/destructive processing;
in such cases, the API could permit the same cap be passed as both source
and dest, although this is messy and complicates the capability system).
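
For what it's worth, the request string from the example can be built with nothing but the standard library. The Rust sketch below assumes capabilities are serialized as integer file descriptors (how to serialize them is still an open question) and that `mode` and `key_hex` contain no characters that need JSON escaping. `build_aes_request` is a hypothetical helper name.

```rust
/// Build the command string for the hypothetical AES device file.
/// Assumption: capabilities are written as plain integer FDs, and the string
/// arguments are simple ASCII tokens, so no JSON escaping is performed.
fn build_aes_request(mode: &str, key_hex: &str, source_fd: u32, dest_fd: u32) -> String {
    format!(
        "{{\"mode\":\"{}\",\"key\":\"{}\",\"source\":{},\"dest\":{}}}",
        mode, key_hex, source_fd, dest_fd
    )
}
```

The program would then write this string to <cap-aes> and close it, as in the steps above.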

[1] https://github.com/WebAssembly/interface-types/blob/main/proposals/interface-types/Explainer.md

[2] JSON in strings avoids requiring the application to have a dependency on a library for
object serialization (such as protobuf or Cap'n Proto) for the common case where the
application only has to write to (not read from) these synthetic files. Such strings can
easily be constructed with standard-library functionality in most languages. If the
application wants to, it can of course include a JSON library for constructing and parsing
these strings.

[3] There are obvious problems and obvious solutions to this; we can deal with them later. A
simple solution would be to have this operation return immediately, and have subsequent
reads from <cap-result> block until the operation has produced sufficient result for the
read to return. If the application closes <cap-source> before reading from <cap-result>
(and no other entities hold a capability to <cap-source>) then the system could
transparently perform the operation destructively. This could be extended to chaining
multiple native operations together (e.g. file decrypt and video decode) without having
to roundtrip in and out of Wasm linear memory.

ShaleXIONG · 2021-10-14T15:10:05Z

ShaleXIONG
Oct 14, 2021
Maintainer

Question: "Wasm execution blocks until the operation completes [3]." How does the Wasm execution know when to block? The Wasm program might write the file, e.g. /crypto/file-decryption/aes, in several wasi_fd_write calls (it really depends on the compiler), or it might never close the file /crypto/file-decryption/aes.

14 replies

ericvh Oct 14, 2021
Collaborator

The fd_write call should pass control to the native module before it returns. Since the command write could be fragmented into several fd_write calls, we can adopt one of the possibilities above, e.g. ~~greedily parse, or wait til closure before parsing and performing the command,~~ block the subsequent result read.

this could (and maybe should) be managed in the wasi filesystem interface layer -- essentially buffering the write until we detect the read at which point we pass the buffer to the aes function?

hugovincent Oct 14, 2021
Maintainer Author

Good point about multithreading – but that doesn't exist in Wasm at this time, so I ignored it :-)
Triggering the operation and blocking on whatever write call closes a complete command (i.e. closing } here) is probably the cleanest way (though still a bit nasty) of avoiding the hazard.

ew...that's nasty. What is the basis for it chopping up your write?

The standard library (Rust std, libc etc) is free to chop up fwrite()/etc for whatever reason, e.g. for buffer management. Lots of systems (used to?) buffer small writes to optimize the common case of printf() by reducing the number of syscalls. AFAIK it shouldn't chop up a raw write() (here, fd_write()) call.

ShaleXIONG Oct 14, 2021
Maintainer

@ericvh I do not know the full story of why the Rust-to-Wasm compiler decides to chop up writes. My guess is:

The Rust-to-Wasm compiler assumes the file system might run out of space, so it writes as many blocks as possible;
The content might be located in non-contiguous blocks, depending on how Rust decides to implement basic types such as Vec and String.

To conclude, I think we can block on the read of <cap-result>. This might not apply to other modules, for example an ML module. However, it is a starting point.
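
A minimal Rust sketch of the "buffer the writes, run on first read" idea, under the assumption that the WASI layer routes each fd_write and the first result read to something like the hypothetical `LazyDevice` below. The native operation is passed in as a closure so the example stays self-contained.

```rust
/// Hypothetical synthetic file: buffers a possibly fragmented command and runs
/// the native operation lazily, on the first read of the result.
struct LazyDevice<F: Fn(&[u8]) -> Vec<u8>> {
    command: Vec<u8>,        // bytes accumulated across fd_write calls
    result: Option<Vec<u8>>, // filled in once the operation has run
    op: F,                   // stand-in for the native module entry point
}

impl<F: Fn(&[u8]) -> Vec<u8>> LazyDevice<F> {
    fn new(op: F) -> Self {
        LazyDevice { command: Vec::new(), result: None, op }
    }

    /// Called for each fd_write; the command may arrive in several pieces.
    fn write(&mut self, chunk: &[u8]) -> usize {
        self.command.extend_from_slice(chunk);
        chunk.len()
    }

    /// Called when the result is read; triggers the operation exactly once.
    fn read_result(&mut self) -> &[u8] {
        if self.result.is_none() {
            self.result = Some((self.op)(&self.command));
        }
        self.result.as_deref().unwrap()
    }
}
```

With this shape it does not matter how the standard library chops up the write, since nothing is parsed until the read arrives.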

hugovincent Oct 14, 2021
Maintainer Author

@ShaleXIONG are you seeing chopping on standard library fwrite (or equivalent) or on something akin to POSIX write itself? See https://stackoverflow.com/a/11414265

ShaleXIONG Oct 15, 2021
Maintainer

@hugovincent Some fwrite-equivalent calls from the Rust standard library have been chopped up when compiled to Wasm.

egrimley-arm · 2021-10-14T15:15:23Z

egrimley-arm
Oct 14, 2021
Maintainer

"and receives back a capability to another object containing the result": Is that bit wrong/misleading? Because the JSON example suggests that the capability for the result is created by the caller and passed into the service.

1 reply

hugovincent Oct 14, 2021
Maintainer Author

Yeah, it's a simplification. I'll change the text to be more clear.

ericvh · 2021-10-14T15:15:59Z

ericvh
Oct 14, 2021
Collaborator

I'd simplify to include source data in the request, this removes any requirement for the component program to have to access the wasi visible VFS. The response includes the result. Any sort of streaming (seek style operations) gets handled by the WASM accessing the interface through multiple invocations. (which is pattern 1). Pattern 2 will be a useful optimization, but maybe get the MVP first.

Pattern 1 makes the interface on the native module essentially equivalent to what one might expect from a web application, all data it needs to process the request is in the request body, and it sends the result.

13 replies

ShaleXIONG Oct 14, 2021
Maintainer

@ericvh the capabilities are bound to file descriptors. The capabilities do not do anything to the Wasm linear memory. To be more precise, the capabilities in an FD constrain which WASI API functions the Wasm program can call, and hence how it can manipulate the underlying VFS. Also, to clarify, WASI does not specify any memory space; I believe when you say WASI memory, you mean the memory space that the VFS occupies.

ericvh Oct 14, 2021
Collaborator

Yes, I meant the memory space where the VFS occupies -- my assumption is that this memory is allocated within the runtime, and the wasm linear address space is allocated in the same runtime. I suppose that doesn't mean that it is allocated in the linear address space, but there's also probably no OS enforced separation between the two -- just that the wasm code can't access that buffer directly. But the AES library code could, and by extension probably access the rest of the memory space (including the linear address space map).

I guess I need a better description of how the runtime is organizing and protecting this memory. A block diagram might be useful.

hugovincent Oct 14, 2021
Maintainer Author

@ericvh Rust, mostly.

dominic-mulligan-arm Oct 15, 2021
Maintainer

Capabilities are referencing objects in the file system data structure in the isolate's RAM, not within the linear memory of the Wasm program (though as @ericvh surmised above, the two are basically both data structures in the isolate's RAM, but morally kept separate).

ShaleXIONG Oct 15, 2021
Maintainer

@ericvh the entire Veracruz runtime (memory space) is split into the following few components (boxes represent memory space):

+----------+                      +--------------------------------------+
|   VFS    |  <------WASI------>  | WASM program + WASM execution engine |
+----------+          ^           +--------------------------------------+
                      |
  public Client API for clients to read and write files in VFS
                      |
                      |
                   Clients

The WASM execution engine occupies some memory inside the Veracruz runtime and uses it as the linear memory space for the WASM program. Note that different WASM execution engines have their own ways of tightly controlling and managing this memory space, and they often just provide simple read-memory and write-memory functions. We do not think we are in a position to hack any execution engine so that we can do fancy memory tricks.
The VFS lives in a separate memory space. There is no restriction that it must be accessed via WASI, but we currently assume this is the case.
WASI is, in effect, a bunch of imported functions for the WASM program, but standardized (in the sense that the compiler knows how to call these functions). The WASM execution engine just passes parameters of primitive types, e.g. 32-bit numbers, and pointers, which point to some memory in linear memory. In this proposal the idea is to let the AES module directly access VFS memory space, e.g. <cap-source>, but the WASM program needs to pass some crucial information to the AES module, e.g. <cap-aes>. However, the challenge here is how to ensure the AES module accesses the VFS securely, and also how the WASM program passes information with the help of the VFS and standardized WASI, for example by writing to a predefined special file, <cap-aes> here.
"public API for clients to read and write files in VFS": these are basically read, write and append operations on a file. When clients send requests in protocol buffer form, they eventually land here. These client API calls will be converted to WASI function calls, to unify the VFS access pattern and, most importantly, access control. I think the AES module can (1) use these APIs to access the VFS, (2) have its own layer that links to the WASI API, or (3) directly access the VFS, bypassing the WASI calling convention?

hugovincent · 2021-10-14T15:48:01Z

hugovincent
Oct 14, 2021
Maintainer Author

Open questions:

How do we serialize (into strings) capabilities/file handles?
How do we create temporary files for results? Perhaps our WASI implementation could support something akin to mktemp (see: Allow mk*temp on WASI WebAssembly/wasi-libc#229).

8 replies

hugovincent Oct 14, 2021
Maintainer Author

I disagree that a path is the same as a capability. A path is forgeable, and it's not a token of authority (rather the authorization check is done when the path is opened).

but I guess that begs the question of what capabilities does the library have (as specified differently from the application), which means we'll have to specify that in the policy

The native module, in the general case, should have no capabilities; any capabilities to access the VFS should be passed to it; this is a fundamental part of the proposal. There may be specific cases where we want the native module to have ambient authority (i.e. have always-usable capabilities), or perhaps ambient authority to a private directory for that module, but I can't think of a need in the general case.

The simple/obvious way of serializing capabilities is just to serialize the (integer) FD/handle; this requires the runtime to reverse-map back to the underlying VFS object in runtime memory.

A nice property of something like mktemp (and sorry, I really meant more like tmpfile) in the capability context is it doesn't imply that a path ever even exists, it just creates a file and gives you a handle to it (and it's automatically destroyed on close).
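
To make the integer-FD serialization concrete, here is a minimal Rust sketch of the reverse-mapping step, assuming the runtime keeps a per-program table from FD to VFS object. `FdTable`, `CapError`, and the use of a `String` as the object identifier are all hypothetical. The point is that a forged or guessed value only resolves if the calling program actually holds that FD, and a path never resolves at all.

```rust
use std::collections::HashMap;

/// Hypothetical per-program FD table: integer handle -> VFS object identifier.
struct FdTable {
    entries: HashMap<u32, String>,
}

#[derive(Debug, PartialEq)]
enum CapError {
    NotANumber, // the serialized value was not a decimal FD (e.g. a path)
    NotHeld,    // the caller does not hold this FD
}

impl FdTable {
    /// Resolve a serialized capability (a decimal FD) against the caller's own table.
    fn resolve(&self, serialized: &str) -> Result<&str, CapError> {
        let fd: u32 = serialized
            .trim()
            .parse()
            .map_err(|_| CapError::NotANumber)?;
        self.entries
            .get(&fd)
            .map(String::as_str)
            .ok_or(CapError::NotHeld)
    }
}
```

Because the lookup goes through the caller's table, the native module never gains more authority than the program that invoked it.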

egrimley-arm Oct 14, 2021
Maintainer

tmpfile is exactly the sort of thing I was looking for and failing to find in WASI, though CloudABI seems to have something similar. (I couldn't even see how to implement tmpfile seeing as we don't have the POSIX-specified /tmp directory.) But if we're allowed to create new "syscalls", we can add something like tmpfile.

hugovincent Oct 14, 2021
Maintainer Author

@egrimley-arm yep we don't have it since WASI doesn't have it. Since at the point we add it, we strictly speaking are diverging from WASI, we don't need to be WASI (or POSIX) compliant, so the lack of a /tmp is a non-issue. I would recommend we just implement it in a way that doesn't actually create a named file at all.

ericvh Oct 14, 2021
Collaborator

The fundamental assumption here is that all native module file system calls have to have a path to the VFS that is identical to the wasi path. So we'll need that library interface specified for whatever bindings we use for the native modules (I guess based on the verbal discussion we are only worried about Rust which probably simplifies things a great deal)

dominic-mulligan-arm Oct 15, 2021
Maintainer

Hmm, it's hard to keep track of the current state of consensus (or lack of it) in this discussion, so apologies if I'm retreading old ground, here... But, let's get much more concrete:

The native module, in the general case, should have no capabilities; any capabilities to access the VFS should be passed to it; this is a fundamental part of the proposal. There may be specific cases where we want the native module to have ambient authority (i.e. have always-usable capabilities), or perhaps ambient authority to a private directory for that module, but I can't think of a need in the general case.

To my mind, as a first attempt, the native module should know nothing about the VFS at all (though I suppose this depends on where you draw the boundary around the "native module"), and should therefore never have a need to receive a capability. We should try to hammer it into some restricted interface, e.g.:

trait Service {
  type Configuration;
  type Error;
  fn new() -> Self;
  fn configure(&mut self, config: Self::Configuration) -> Result<(), Self::Error>;
  // `in` is a reserved keyword in Rust, so the buffer parameter is named `input`.
  fn serve(&self, input: &mut [u8]) -> Result<(), Self::Error>;
  fn about(&self) -> String;
}

or something equally straitjacketed. The runtime state then keeps track of the native services, and in particular we then have one of these registered with the runtime state that calls out to e.g. some aes.c code, or similar, when its methods are invoked.

On top of that, in the runtime, we build a layer that interposes between these services and the FS, reads appropriate configuration files and configures the services, before calling out to them and writing results back. This is triggered by the Wasm program reading from the results file: either a pre-existing file in e.g. /native_srv/aes/, or a temporary file created by a new system call analogue of tempfile (note: I don't think there's a strong argument that we should stay within the confines of wasm32-wasi if that's becoming a problem for us).

This has the advantages of:

Decoupling actually interfacing with services, and linking them into the runtime, from concerns about who reads what, and so on, from the file system.
Being general enough to cover a large number of e.g. crypto operations, including encryption, signing, key generation, and similar, which are of interest in the first instance.
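
To show how a service might sit behind such an interface, here is a minimal Rust sketch. It repeats the trait (with the buffer parameter named `input`, since `in` is a reserved word) and adds a toy implementation. `XorService`, `XorError`, and `run_service` are hypothetical names, and the XOR transform is emphatically not real cryptography; it only stands in for a call out to native code.

```rust
trait Service {
    type Configuration;
    type Error;
    fn new() -> Self;
    fn configure(&mut self, config: Self::Configuration) -> Result<(), Self::Error>;
    fn serve(&self, input: &mut [u8]) -> Result<(), Self::Error>;
    fn about(&self) -> String;
}

/// Toy service (NOT real cryptography): XORs the buffer in place with a key byte.
struct XorService {
    key: Option<u8>,
}

#[derive(Debug, PartialEq)]
enum XorError {
    NotConfigured,
}

impl Service for XorService {
    type Configuration = u8;
    type Error = XorError;

    fn new() -> Self {
        XorService { key: None }
    }

    fn configure(&mut self, config: u8) -> Result<(), XorError> {
        self.key = Some(config);
        Ok(())
    }

    fn serve(&self, input: &mut [u8]) -> Result<(), XorError> {
        let key = self.key.ok_or(XorError::NotConfigured)?;
        for byte in input.iter_mut() {
            *byte ^= key;
        }
        Ok(())
    }

    fn about(&self) -> String {
        "toy XOR service".to_string()
    }
}

/// Runtime-side glue: configure the service, run it on data read from the VFS,
/// and return the bytes that the runtime would write to the results file.
fn run_service<S: Service>(
    config: S::Configuration,
    mut data: Vec<u8>,
) -> Result<Vec<u8>, S::Error> {
    let mut service = S::new();
    service.configure(config)?;
    service.serve(&mut data)?;
    Ok(data)
}
```

Note that the service itself never sees a file descriptor or a path; only `run_service`, which would live in the runtime, touches data that came from the VFS.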

ericvh · 2021-10-14T15:50:43Z

ericvh
Oct 14, 2021
Collaborator

Just to expand on my mount table derivation -- right now the wasi fs has a couple of structs managing the filesystem.
At the top level you have:

fd_table of pointers to a particular file table entry (are these capabilities?)
path_table of paths and their associated inode
inode_table - the actual buffers backing each file
rights_table - permissions? (based on policy?)
prestat_table - pre-open FD table (based on policy?)

So really at the file level, we are talking about inodes. My suggestion is that we extend the inode to have a file system operation function table per inode. For normal files (the ramfs) it just points to the current functions. For synthetic files (such as aes) it points to the library implementations. There can be a mount function which registers a library's synthetic path in the path_table and creates the inode with the right function hooks.

How to reference the library in the mount could be done one of two ways:

predefined by policy (in which case it shows up as a preopen fd?) - we used namespace files for this, which basically specified the root filesystem of your process (which would include both the "root" and the synthetics). It basically looks like a per-process fstab.
mount system call exposed to the wasm program. Since we will have a well-defined set of built-ins, you can use a canonical name for each. In Inferno and Plan 9 we used a single character prefixed with a # -- for example, the SSL "device" in Inferno was #D (http://www.vitanuova.com/inferno/man/3/ssl.html) -- it's also a good example of the clone file system from earlier.
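
A minimal Rust sketch of the per-inode operation table and mount function described above, using a trait object in place of a C-style function table. All names here (`FileOps`, `RamFile`, `SyntheticFile`, `Vfs`, `to_upper`) are hypothetical and do not reflect the actual Veracruz data structures; the uppercase transform is only a stand-in for a native library.

```rust
use std::collections::HashMap;

/// Per-inode operations; ordinary and synthetic files differ only in this table.
trait FileOps {
    fn write(&mut self, data: &[u8]);
    fn read(&self) -> Vec<u8>;
}

/// Ordinary RAM-backed file.
struct RamFile {
    data: Vec<u8>,
}

impl FileOps for RamFile {
    fn write(&mut self, data: &[u8]) {
        self.data.extend_from_slice(data);
    }
    fn read(&self) -> Vec<u8> {
        self.data.clone()
    }
}

/// Synthetic file whose reads are computed by a native function.
struct SyntheticFile {
    input: Vec<u8>,
    op: fn(&[u8]) -> Vec<u8>,
}

impl FileOps for SyntheticFile {
    fn write(&mut self, data: &[u8]) {
        self.input.extend_from_slice(data);
    }
    fn read(&self) -> Vec<u8> {
        (self.op)(&self.input)
    }
}

struct Vfs {
    path_table: HashMap<String, usize>, // path -> inode number
    inode_table: Vec<Box<dyn FileOps>>, // inode number -> operations
}

impl Vfs {
    fn new() -> Self {
        Vfs { path_table: HashMap::new(), inode_table: Vec::new() }
    }

    /// Register a file (ordinary or synthetic) at `path` with its own hooks.
    fn mount(&mut self, path: &str, ops: Box<dyn FileOps>) {
        self.inode_table.push(ops);
        self.path_table.insert(path.to_string(), self.inode_table.len() - 1);
    }

    /// Look up the operations for a path, if it is mounted.
    fn lookup(&mut self, path: &str) -> Option<&mut Box<dyn FileOps>> {
        let inode = *self.path_table.get(path)?;
        self.inode_table.get_mut(inode)
    }
}

/// Stand-in for a native library entry point.
fn to_upper(input: &[u8]) -> Vec<u8> {
    input.to_ascii_uppercase()
}
```

From the Wasm program's side, both kinds of file are reached through the same lookup, and a path with no module behind it simply fails to resolve.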

2 replies

ShaleXIONG Oct 14, 2021
Maintainer

fd_table of pointers to a particular file table entry (are these capabilities?) <- YES, the capabilities are bound to FDs. fd_entry also includes other necessary information, e.g. where is the cursor etc.
rights_table - permissions? (based on policy?) <- YES, based on policy.
prestat_table - pre-open FD table (based on policy?) <- YES, based on policy.

There is a pending merge, #193, in which I already extend the inode with file system operations, e.g. reading and writing files. We can further extend it to include synthetic files.

"How to reference the library in the mount" -> I think we predefine them by policy, similarly to "stdin", "stdout" and "stderr"?

ericvh Oct 14, 2021
Collaborator

predefine works, but we'll still need a mapping method -- although I suppose if we have a table of native modules those could have a string to match against.

egrimley-arm · 2021-11-12T11:53:57Z

egrimley-arm
Nov 12, 2021
Maintainer

Some proposals:

There should be a single "device" file (/dev/module?) for accessing all native modules, rather than a different one for each module.
A native module must be invoked with a single call to fd_write. That way we don't need to worry about parsing the command to see if it's finished or commands from different libraries getting interleaved. We could provide library functions to facilitate making a single call to fd_write but it does not seem justified to complicate Veracruz code just so that applications can use standard library functions for writing to /dev/module.
The data written to /dev/module will consist of the target module's name, null-terminated, followed by the data for that module. If the name of the target module is not recognised then the call to fd_write will fail.
If the module name is recognised then the module is invoked and the call to fd_write will return with Success after the operation has completed. So the call to fd_write returns with Success even if the module encounters an error: that error would need to be communicated in some other module-dependent way; see below. (This does not preclude having a module with a command that sets off an asynchronous process that signals its completion by writing data to some other file, but in general there should be no need for the caller to do anything else, such as closing the file descriptor used for writing to /dev/module or reading from some other file, to make the operation happen.)
The format of the data that follows the null-terminated module name is defined by the module. JSON would be a good default choice but it would be a bad idea to require JSON because some modules might require binary data, like a cryptographic key, and it would be daft to translate such things in and out of Base64 for no good reason.
Native modules can access the file system, but only through file descriptors provided by the caller. So a command to invoke a very simple AES decryption module might look something like: "AESD\0{\"in\":\"123\",\"out\":\"456\",\"err\":\"789\",\"key\":...,}"
Most modules will want to have a way of reporting an error, so one of the parameters should be a file descriptor of a file where an error message can be written. That means there is no way for the module to complain about the JSON being ill-formed, but a module that expects JSON could always provide a friendly response to "MODULE\0{\"err\":\"123\"}". On the other hand, if the code is working and stable and the caller is not expecting there to be an error then there is no need for the caller to provide the err parameter.
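
A minimal Rust sketch of the framing proposed above, where a single fd_write carries a null-terminated module name followed by module-defined data. `split_command` and `dispatch` are hypothetical names, the error strings are placeholders for whatever errno the write would return, and the module invocation itself is left as a comment.

```rust
/// Split a single fd_write buffer into (module name, module payload).
/// Returns None if there is no NUL terminator or the name is not valid UTF-8.
fn split_command(buf: &[u8]) -> Option<(&str, &[u8])> {
    let nul = buf.iter().position(|&b| b == 0)?;
    let name = std::str::from_utf8(&buf[..nul]).ok()?;
    Some((name, &buf[nul + 1..]))
}

/// Hypothetical dispatcher: an unrecognised module name fails the write,
/// a recognised one reports the whole buffer as written.
fn dispatch(buf: &[u8]) -> Result<usize, &'static str> {
    match split_command(buf) {
        // The AES decryption module would be invoked with `_payload` here.
        Some(("AESD", _payload)) => Ok(buf.len()),
        Some(_) => Err("unknown module"),
        None => Err("malformed command"),
    }
}
```

The payload is passed through as raw bytes, so a module is free to expect JSON or binary data.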

2 replies

ericvh Nov 13, 2021
Collaborator

Spelling out in, out, and err here (#6) is probably less necessary -- in fact, not keeping to strictly ordered operands is likely to create extra overhead unless we know we have lots of potentially optional arguments. I'd also agree with your statement that JSON is probably overkill here: since this is ultimately an API interface in the same sandbox on the same machine, formatting the input and then parsing it is just going to be extra overhead.

Having a single entry-point /dev/module has the negative side of not being able to spell out which native objects an application can access in the policy (well, at least the easiest version which is giving the app a capability handle to /dev/aes). Ultimately having the string be parsed in the input stream versus in the file system is just changing where the filename parsing happens, but filesystem may make more sense because then things like discovery "just work" and we can leverage existing file system ENOTFOUND for modules not actually existing versus having to express those in the module error stream.

egrimley-arm Nov 15, 2021
Maintainer

Yes, it's probably better to get the file system to parse the module name and map it to a function entry point and then also benefit from filesystem permissions. (That then makes it possible for the module to return a failure condition via the call to fd_write but I'm not sure we'd want to make use of that: it will probably be less confusing to have only the usual filesystem errors as return values from fd_write.)

dreemkiller · 2021-11-12T14:57:04Z

dreemkiller
Nov 12, 2021
Maintainer

This looks well thought-out, but there is a lot of string parsing in this proposal: the module name, JSON, etc. String parsing is extremely error-prone, and has a history of producing severe security vulnerabilities. Could we consider other options?

1 reply

ericvh Nov 13, 2021
Collaborator

not to be difficult, but even if we disambiguate module inputs using the file system versus a string -- the parsing is still happening somewhere, just somewhere different. Using JSON (or any other encoding to express arguments) could be less important, but at the end of the day parsing a binary stream versus a string...seems kinda like the same thing. So it would seem you are arguing against serialization in general which likely buggers using the file system for the operational interface.
