[LibOS] Update mmapped regions when writing to encrypted files #1523

kailun-qin · 2023-08-30T04:09:44Z

Description of the changes

Previously, Gramine didn't propagate changes to a mmaped encrypted file back to the memory. This is unexpected for apps that use both mmap and read/write on the same file (e.g., etcd-io/bbolt has such implementation to write data via write() and read it via the mmaped memory).

This commit introduces overwriting of affected mmapped regions on encrypted file writes. To reduce performance penalty in normal cases (where apps use either write() or mmap()), this commit also creates a new field in the file handle to record the number of mmaped regions of this file. We hence simply check on this field to determine whether to perform the updates to make the overhead negligible if none mmapped.

Fixes #1432 and #1591.

How to test this PR?

Newly introduced test + CI.

This change is

dimakuv

Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: all files reviewed, 15 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @kailun-qin)

-- commits line 2 at r1:
Why do you only concentrate on encrypted files? I think the same problem happens on chroot and tmpfs files. Please double-check these other files, and if the problem is there -- also update their code.

libos/include/libos_vma.h line 134 at r1 (raw file):

/* Call `read` for file mappings in given range (should be page-aligned) */
int read_range(uintptr_t begin, uintptr_t end);

These names are way too generic, and they are global to the whole codebase.

So let's change to something more specific. I suggest reload_mmaped_from_file_range() and reload_mmaped_from_file_handle()

libos/src/bookkeep/libos_vma.c line 516 at r1 (raw file):

    if (vma->file) {
        put_handle(vma->file);
        (void)__atomic_sub_fetch(&vma->file->num_mmapped, 1, __ATOMIC_RELEASE);

It can be simply __ATOMIC_RELAXED

libos/src/bookkeep/libos_vma.c line 804 at r1 (raw file):

    if (new_vma->file) {
        get_handle(new_vma->file);
        (void)__atomic_add_fetch(&new_vma->file->num_mmapped, 1, __ATOMIC_RELEASE);

ditto

libos/src/bookkeep/libos_vma.c line 1052 at r1 (raw file):

    if (new_vma->file) {
        get_handle(new_vma->file);
        (void)__atomic_add_fetch(&new_vma->file->num_mmapped, 1, __ATOMIC_RELEASE);

ditto

libos/src/bookkeep/libos_vma.c line 1368 at r1 (raw file):

    struct libos_handle* hdl = arg;

    if (vma->flags & (VMA_UNMAPPED | VMA_INTERNAL | MAP_ANONYMOUS))

I think you should also add | MAP_PRIVATE.

IIUC, privately-mapped pages are populated from file only once (on the first mmap), and then they are completely decoupled from a file.

Only shared-mapped pages track the changes in the file. Also, the case we're trying to fix has MAP_SHARED: https://github.com/haraldh/shmem-test/blob/d7ee6880a2c156bd8e73fd287a3f491962d6a1bf/main.c#L89

libos/src/fs/chroot/encrypted.c line 445 at r1 (raw file):

    if (ret < 0) {
        unlock(&hdl->inode->lock);
        goto out;

Because out now is only one line, I prefer here to simply return ret

libos/src/fs/chroot/encrypted.c line 459 at r1 (raw file):

        ret = read_handle(hdl);
out:
    return ret < 0 ? ret : (ssize_t)actual_count;

Here I would prefer to remove out: and simply have return (ssize_t)actual_count

libos/test/fs/read_write_mmap.c line 3 at r1 (raw file):

#include "common.h"

static void read_write(const char* file_path) {

Better name read_write_mmap()

libos/test/fs/read_write_mmap.c line 16 at r1 (raw file):

    void* buf_read = alloc_buffer(size);

    void* m = mmap_fd(file_path, fd, PROT_READ | PROT_WRITE, 0, size);

MAP_SHARED must be added. UPDATE: It is already implicitly added:

gramine/libos/test/fs/common.c

Line 101 in 10c2668

void* address = mmap(NULL, size, protection, MAP_SHARED, fd, offset);

libos/test/fs/read_write_mmap.c line 26 at r1 (raw file):

    read_fd(file_path, fd, buf_read, size);
    printf("read(%s) RW (mmap) OK\n", file_path);

You have the exact same string repeating twice. Maybe add numbers to make strings unique, and check in test_enc.py with these numbers?

Like this:

    ...
    printf("read(%s) 1 RW (mmap) OK\n", file_path);
    ...
    printf("read(%s) 2 RW (mmap) OK\n", file_path);

libos/test/fs/read_write_mmap.c line 28 at r1 (raw file):

    printf("read(%s) RW (mmap) OK\n", file_path);
    if (memcmp(m, buf_read, size) != 0)
        fatal_error("Read data is different from what was written in the mapping\n");

Read data via read()

libos/test/fs/read_write_mmap.c line 32 at r1 (raw file):

    fill_random(buf_write, size);
    seek_fd(file_path, fd, 0, SEEK_SET);
    printf("seek(%s) RW (mmap) OK\n", file_path);

ditto

libos/test/fs/read_write_mmap.c line 40 at r1 (raw file):

    printf("read(%s) RW (mmap) OK\n", file_path);
    if (memcmp(buf_write, buf_read, size) != 0)
        fatal_error("Read data is different from what was written\n");

Read data via read() is different from what was written via write()

libos/test/fs/read_write_mmap.c line 42 at r1 (raw file):

        fatal_error("Read data is different from what was written\n");
    if (memcmp(buf_write, m, size) != 0)
        fatal_error("Read data in the mapping is different from what was written\n");

... was written via write()

libos/test/fs/test_enc.py line 108 at r1 (raw file):

        self.assertIn('munmap_fd(' + size + ') OK', stdout)
        self.assertIn('close(' + file_path + ') RW (mmap) OK', stdout)

Why only test_enc.py? Why not make it generic for all: normal, enc, and tmpfs?

kailun-qin

Reviewable status: 3 of 12 files reviewed, 15 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

-- commits line 2 at r1:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why do you only concentrate on encrypted files? I think the same problem happens on chroot and tmpfs files. Please double-check these other files, and if the problem is there -- also update their code.

Done.

libos/include/libos_vma.h line 134 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

These names are way too generic, and they are global to the whole codebase.

So let's change to something more specific. I suggest reload_mmaped_from_file_range() and reload_mmaped_from_file_handle()

Done.

libos/src/bookkeep/libos_vma.c line 516 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

It can be simply __ATOMIC_RELAXED

Done.

libos/src/bookkeep/libos_vma.c line 804 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto

Done.

libos/src/bookkeep/libos_vma.c line 1052 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto

Done.

libos/src/bookkeep/libos_vma.c line 1368 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I think you should also add | MAP_PRIVATE.

IIUC, privately-mapped pages are populated from file only once (on the first mmap), and then they are completely decoupled from a file.

Only shared-mapped pages track the changes in the file. Also, the case we're trying to fix has MAP_SHARED: https://github.com/haraldh/shmem-test/blob/d7ee6880a2c156bd8e73fd287a3f491962d6a1bf/main.c#L89

Correct, fixed.

libos/src/fs/chroot/encrypted.c line 445 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Because out now is only one line, I prefer here to simply return ret

Done.

libos/src/fs/chroot/encrypted.c line 459 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Here I would prefer to remove out: and simply have return (ssize_t)actual_count

OK, now I explicitly ignore the return of reload_mmaped_from_file_handle().

libos/test/fs/read_write_mmap.c line 3 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Better name read_write_mmap()

Done.

libos/test/fs/read_write_mmap.c line 26 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You have the exact same string repeating twice. Maybe add numbers to make strings unique, and check in test_enc.py with these numbers?

Like this:
    ...
    printf("read(%s) 1 RW (mmap) OK\n", file_path);
    ...
    printf("read(%s) 2 RW (mmap) OK\n", file_path);

Done.

libos/test/fs/read_write_mmap.c line 28 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Read data via read()

Done.

libos/test/fs/read_write_mmap.c line 32 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto

Done.

libos/test/fs/read_write_mmap.c line 40 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Read data via read() is different from what was written via write()

Done.

libos/test/fs/read_write_mmap.c line 42 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

... was written via write()

Done.

libos/test/fs/test_enc.py line 108 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why only test_enc.py? Why not make it generic for all: normal, enc, and tmpfs?

Done.

kailun-qin

Reviewable status: 3 of 12 files reviewed, 15 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "squash! " found in commit messages' one-liners (waiting on @dimakuv)

-- commits line 2 at r1:

Previously, kailun-qin (Kailun Qin) wrote…

Done.

One more note is that for chroot, there's no such issue w/ Gramine-direct while w/ Gramine-SGX, shared memory-mapped files with write permission (e.g., mmap(PROT_WRITE, MAP_SHARED, fd)) is not allowed. But I updated it anyway.

dimakuv

Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "squash! " found in commit messages' one-liners (waiting on @kailun-qin)

a discussion (no related file):
PR looks good to me. Just to be sure, can you run your newly added test on chroot, enc and tmpfs in direct and sgx modes and make sure that all 6 runs fail without your fix?

-- commits line 2 at r1:

Previously, kailun-qin (Kailun Qin) wrote…

One more note is that for chroot, there's no such issue w/ Gramine-direct while w/ Gramine-SGX, shared memory-mapped files with write permission (e.g., mmap(PROT_WRITE, MAP_SHARED, fd)) is not allowed. But I updated it anyway.

Yes, you're right, that's correct. Thanks for updating it, that's exactly what I wanted.

libos/src/bookkeep/libos_vma.c line 1385 at r3 (raw file):

}

static int read_all(uintptr_t begin, uintptr_t end, struct libos_handle* hdl) {

Please also change this name, to reload_mmaped_from_file_all()

libos/test/fs/test_fs.py line 99 at r3 (raw file):

    # Gramine's implementation of file_map doesn't currently support shared memory-mapped
    # files with write permission in PAL/Linux-SGX (like mmap(PROT_WRITE, MAP_SHARED, fd)).

files -> regular chroot files

kailun-qin

Reviewable status: 10 of 12 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

PR looks good to me. Just to be sure, can you run your newly added test on chroot, enc and tmpfs in direct and sgx modes and make sure that all 6 runs fail without your fix?

Yes (actually 5 tests, 1 skipped), pls see below outputs tested based on commit

gramine-direct:

==================================================== test session starts =====================================================
...
test_enc.py::TC_50_EncryptedFiles::test_111_read_write_mmap FAILED                                                     [ 33%]
test_fs.py::TC_00_FileSystem::test_111_read_write_mmap PASSED                                                          [ 66%]
test_tmpfs.py::TC_10_Tmpfs::test_111_read_write_mmap FAILED                                                            [100%]
...
========================================================== FAILURES ==========================================================
_______________________________________ TC_50_EncryptedFiles.test_111_read_write_mmap ________________________________________
...
---------------------------------------------------- Captured stdout call ----------------------------------------------------
[0.024] Random seed: 1693377998
[0.024] open(tmp/enc_output/test_111) RW (mmap) OK
[0.051] mmap_fd(1048576) OK
[0.109] read(tmp/enc_output/test_111) 1 RW (mmap) OK
[0.135] seek(tmp/enc_output/test_111) 1 RW (mmap) OK
[0.151] write(tmp/enc_output/test_111) RW (mmap) OK
[0.151] seek(tmp/enc_output/test_111) 2 RW (mmap) OK
[0.163] read(tmp/enc_output/test_111) 2 RW (mmap) OK
[0.163] ERROR: Read data in the mapping is different from what was written via write()
____________________________________________ TC_10_Tmpfs.test_111_read_write_mmap ____________________________________________
...
---------------------------------------------------- Captured stdout call ----------------------------------------------------
[0.018] Random seed: 1693377998
[0.018] open(/mnt-tmpfs/test_111) RW (mmap) OK
[0.020] mmap_fd(1048576) OK
[0.046] read(/mnt-tmpfs/test_111) 1 RW (mmap) OK
[0.071] seek(/mnt-tmpfs/test_111) 1 RW (mmap) OK
[0.071] write(/mnt-tmpfs/test_111) RW (mmap) OK
[0.071] seek(/mnt-tmpfs/test_111) 2 RW (mmap) OK
[0.071] read(/mnt-tmpfs/test_111) 2 RW (mmap) OK
[0.072] ERROR: Read data in the mapping is different from what was written via write()
===================================== 2 failed, 1 passed, 58 deselected in 0.87 seconds

gramine-sgx:

==================================================== test session starts =====================================================
...
test_enc.py::TC_50_EncryptedFiles::test_111_read_write_mmap FAILED                                                     [ 33%]
test_fs.py::TC_00_FileSystem::test_111_read_write_mmap SKIPPED                                                         [ 66%]
test_tmpfs.py::TC_10_Tmpfs::test_111_read_write_mmap FAILED                                                            [100%]

========================================================== FAILURES ==========================================================
_______________________________________ TC_50_EncryptedFiles.test_111_read_write_mmap ________________________________________
...
---------------------------------------------------- Captured stdout call ----------------------------------------------------
[0.607] Random seed: 1693378128
[0.607] open(tmp/enc_output/test_111) RW (mmap) OK
[0.633] mmap_fd(1048576) OK
[0.683] read(tmp/enc_output/test_111) 1 RW (mmap) OK
[0.707] seek(tmp/enc_output/test_111) 1 RW (mmap) OK
[0.723] write(tmp/enc_output/test_111) RW (mmap) OK
[0.723] seek(tmp/enc_output/test_111) 2 RW (mmap) OK
[0.732] read(tmp/enc_output/test_111) 2 RW (mmap) OK
[0.732] ERROR: Read data in the mapping is different from what was written via write()
...
____________________________________________ TC_10_Tmpfs.test_111_read_write_mmap ____________________________________________
...
---------------------------------------------------- Captured stdout call ----------------------------------------------------
[0.591] Random seed: 1693378129
[0.591] open(/mnt-tmpfs/test_111) RW (mmap) OK
[0.598] mmap_fd(1048576) OK
[0.623] read(/mnt-tmpfs/test_111) 1 RW (mmap) OK
[0.648] seek(/mnt-tmpfs/test_111) 1 RW (mmap) OK
[0.648] write(/mnt-tmpfs/test_111) RW (mmap) OK
[0.648] seek(/mnt-tmpfs/test_111) 2 RW (mmap) OK
[0.648] read(/mnt-tmpfs/test_111) 2 RW (mmap) OK
[0.648] ERROR: Read data in the mapping is different from what was written via write()
===================================== 2 failed, 1 skipped, 58 deselected in 2.09 seconds =====================================

libos/src/bookkeep/libos_vma.c line 1385 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please also change this name, to reload_mmaped_from_file_all()

Done.

libos/test/fs/test_fs.py line 99 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

files -> regular chroot files

Done.

kailun-qin

Reviewable status: 10 of 12 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

a discussion (no related file):

pls see below outputs tested based on commit

commit 02f7a85, sorry about missing the ID somehow

dimakuv

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

a discussion (no related file):

Previously, kailun-qin (Kailun Qin) wrote…

pls see below outputs tested based on commit

commit 02f7a85, sorry about missing the ID somehow

Awesome, thanks for these logs! Everything looks good.

dzobbe · 2023-11-04T19:20:22Z

Hi @kailun-qin @dimakuv what is your plan with regard to this PR? When do you plan to merge it? I am asking because I have the same exact issue and this PR helps :)

Thanks a lot

kailun-qin

@dzobbe Thanks for the information and great to know that it works!

We have at least two requests now :). @mkow: would you mind taking a look when you get a chance? Thanks!

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

mkow

Sure, but probably not this week, I'm sick :(

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

dzobbe · 2023-11-07T11:57:29Z

Thanks @mkow much appreciated

dzobbe · 2023-11-14T08:40:38Z

Hi @mkow this is just a reminder on this PR. If you could have a look, that would be great. Thanks a lot in advance ;)

mkow

Reviewed 1 of 8 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)

libos/include/libos_handle.h line 173 at r4 (raw file):

We hence simply check on this field to determine whether to perform the updates to make the overhead negligible if none mmapped.

What if someone wrote to the same file using a different handle than it was mapped from?

libos/src/fs/chroot/encrypted.c line 457 at r4 (raw file):

    /* If there are any MAP_SHARED mappings for the file, this will read data from `enc`. */
    if (__atomic_load_n(&hdl->num_mmapped, __ATOMIC_ACQUIRE) != 0)
        (void)reload_mmaped_from_file_handle(hdl);

Shouldn't we fail here? (but probably without updating file position, in case the caller wants to retry? not sure)
The host on SGX can fail this single call, forcing the memory mappings to be outdated.

kailun-qin

Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

libos/include/libos_handle.h line 173 at r4 (raw file):

What if someone wrote to the same file using a different handle than it was mapped from?

Good question! Well, I think we can support it, albeit with certain complexities and overheads. Do we really want to do this?

libos/src/fs/chroot/encrypted.c line 457 at r4 (raw file):

Shouldn't we fail here? (but probably without updating file position, in case the caller wants to retry? not sure)

Not sure +1. But I'm fine w/ checking failure here.

dimakuv

Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

libos/include/libos_handle.h line 173 at r4 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

What if someone wrote to the same file using a different handle than it was mapped from?

Good question! Well, I think we can support it, albeit with certain complexities and overheads. Do we really want to do this?

Actually, very good point.

Can't we move this field to the inode? All the FSes that we care about correctly support inodes now (tmpfs, chroot, encrypted).

Then we can do things like:

static bool vma_filter_needs_read(struct libos_vma* vma, void* arg) {
    struct libos_handle* hdl = arg;

    if (vma->flags & (VMA_UNMAPPED | VMA_INTERNAL | MAP_ANONYMOUS | MAP_PRIVATE))
        return false;

    assert(vma->file);

    if (hdl && hdl->inode && vma->file->inode && vma->file->inode != hdl->inode)
        return false;

    ...
}

...
    /* If there are any MAP_SHARED mappings for the file, this will read data from `enc`. */
    if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0)
        (void)reload_mmaped_from_file_handle(hdl);
...

This should be a very simple change. And given that hdl->inode is an immutable field, we don't need to do any additional locking.

@kailun-qin Please check whether it works (I think it should) and update the PR and the test correspondingly (make two sub-tests, one as you have currently, the other one with two opens of the same file).

libos/src/bookkeep/libos_vma.c line 1410 at r4 (raw file):

        ret = file->fs->fs_ops->read(file, (void*)read_begin, read_end - read_begin,
                                     (file_off_t*)&vma_info->file_offset);

I just realized that we can be interrupted here (this func can return a benign PAL_ERROR_INTERRUPTED).

I think we should do a while loop like this:

gramine/libos/src/libos_utils.c

Lines 10 to 26 in d052d18

 int read_exact(PAL_HANDLE handle, void* buf, size_t size) { 

 size_t read = 0; 

 while (read < size) { 

 size_t tmp_read = size - read; 

 int ret = PalStreamRead(handle, /*offset=*/0, &tmp_read, (char*)buf + read); 

 if (ret < 0) { 

 if (ret == -PAL_ERROR_INTERRUPTED || ret == -PAL_ERROR_TRYAGAIN) { 

 continue; 

 } 

 return pal_to_unix_errno(ret); 

 } else if (tmp_read == 0) { 

 return -ENODATA; 

 } 

 read += tmp_read; 

 } 

 return 0; 

 }

(Unfortunately can't reuse that read_exact() function as it doesn't support the offset, so need to write our own loop.)

libos/src/fs/chroot/encrypted.c line 457 at r4 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Shouldn't we fail here? (but probably without updating file position, in case the caller wants to retry? not sure)

Not sure +1. But I'm fine w/ checking failure here.

+1. But I'm not sure if we want to restore the file position... After all, the underlying file operation already happened, and there's no way to rewind this operation. So e.g. the underlying file size was already extended, and what do we do with hdl->inode->size in this case? We can't restore hdl->inode->size to the before-the-write value, because this will lead to discrepancy between the real on-the-host size and our in-gramine size...

So I currently vote for writing a log_warning() in the error case here, but keep return actual_count (i.e., the success of the write function itself).

We can even put a BUG() here, as I don't think there are legitimate benign cases where this reload could fail. @mkow WDYT?

libos/src/fs/chroot/fs.c line 325 at r4 (raw file):

    /* If there are any MAP_SHARED mappings for the file, this will read data from `hdl`. */
    if (__atomic_load_n(&hdl->num_mmapped, __ATOMIC_ACQUIRE) != 0)
        (void)reload_mmaped_from_file_handle(hdl);

ditto (restore pos and hdl->inode->size on error)

libos/src/fs/tmpfs/fs.c line 272 at r4 (raw file):

    /* If there are any MAP_SHARED mappings for the file, this will read data from `hdl`. */
    if (__atomic_load_n(&hdl->num_mmapped, __ATOMIC_ACQUIRE) != 0)
        (void)reload_mmaped_from_file_handle(hdl);

ditto

kailun-qin

Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/include/libos_handle.h line 173 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Actually, very good point.

Can't we move this field to the inode? All the FSes that we care about correctly support inodes now (tmpfs, chroot, encrypted).

Then we can do things like:
static bool vma_filter_needs_read(struct libos_vma* vma, void* arg) {
    struct libos_handle* hdl = arg;

    if (vma->flags & (VMA_UNMAPPED | VMA_INTERNAL | MAP_ANONYMOUS | MAP_PRIVATE))
        return false;

    assert(vma->file);

    if (hdl && hdl->inode && vma->file->inode && vma->file->inode != hdl->inode)
        return false;

    ...
}

...
    /* If there are any MAP_SHARED mappings for the file, this will read data from `enc`. */
    if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0)
        (void)reload_mmaped_from_file_handle(hdl);
...
This should be a very simple change. And given that hdl->inode is an immutable field, we don't need to do any additional locking.

@kailun-qin Please check whether it works (I think it should) and update the PR and the test correspondingly (make two sub-tests, one as you have currently, the other one with two opens of the same file).

Thanks for the great suggestion! I was overcomplicating things. Done.

libos/src/bookkeep/libos_vma.c line 1410 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I just realized that we can be interrupted here (this func can return a benign PAL_ERROR_INTERRUPTED).

I think we should do a while loop like this:

gramine/libos/src/libos_utils.c

Lines 10 to 26 in d052d18

int read_exact(PAL_HANDLE handle, void* buf, size_t size) {

size_t read = 0;

while (read < size) {

size_t tmp_read = size - read;

int ret = PalStreamRead(handle, /*offset=*/0, &tmp_read, (char*)buf + read);

if (ret < 0) {

if (ret == -PAL_ERROR_INTERRUPTED || ret == -PAL_ERROR_TRYAGAIN) {

continue;

}

return pal_to_unix_errno(ret);

} else if (tmp_read == 0) {

return -ENODATA;

}

read += tmp_read;

}

return 0;

}

(Unfortunately can't reuse that read_exact() function as it doesn't support the offset, so need to write our own loop.)

Done.

libos/src/fs/chroot/encrypted.c line 457 at r4 (raw file):

After all, the underlying file operation already happened, and there's no way to rewind this operation.

I'm do doing a warning log for now

dimakuv

Reviewed 10 of 10 files at r5, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "squash! " and "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

libos/src/bookkeep/libos_vma.c line 1414 at r5 (raw file):

            size_t to_read = size - read;
            ret = file->fs->fs_ops->read(file, (void*)(read_begin + read), to_read,
                                         (file_off_t*)&vma_info->file_offset);

Wait, this looks like a bug. The last argument is updated by the read() callback, but that is not what we want. This helper function must not update the current file offset.

I think you need to do smth like this:

        size_t size = read_end - read_begin;
        size_t read = 0;
        file_off_t file_offset = (file_off_t)vma_info->file_offset;
        while (read < size) {
            size_t to_read = size - read;
            ret = file->fs->fs_ops->read(file, (void*)(read_begin + read), to_read,
                                         &file_offset);
        ...

This way you'll start reading with the vma_info->file_offset and on each iteration of the while loop, the new helper variable file_offset will be updated to reflect from what file offset to read the next chunk.

libos/src/bookkeep/libos_vma.c line 1419 at r5 (raw file):

                    continue;
                }
                goto out;

You also need to translate the PAL error code into the UNIX error code before going out:

ret = pal_to_unix_errno(ret);
goto out;

libos/test/fs/read_write_mmap.c line 64 at r5 (raw file):

    int fd2 = open_output_fd(file_path, /*rdwr=*/true);
    printf("open(%s) RW fd2 (mmap) OK\n", file_path);

Why do you say (mmap)? This fd2 is used for write().

libos/test/fs/read_write_mmap.c line 69 at r5 (raw file):

        close(fd2);
        fatal_error("ftruncate fd2\n");
    }

Why do you need this "extend to 1MB size" operation on fd2? It will call write() which will automatically extend the file, so looks this code is redundant?

libos/test/fs/read_write_mmap.c line 77 at r5 (raw file):

    fill_random(buf_write, size);
    write_fd(file_path, fd2, buf_write, size);
    printf("write(%s) RW fd2 (mmap) OK\n", file_path);

ditto (why (mmap))

libos/test/fs/read_write_mmap.c line 88 at r5 (raw file):

    printf("close(%s) RW fd1 (mmap) OK\n", file_path);
    close_fd(file_path, fd2);
    printf("close(%s) RW fd2 (mmap) OK\n", file_path);

ditto (why (mmap))

kailun-qin

Reviewable status: 8 of 13 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/src/bookkeep/libos_vma.c line 1414 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Wait, this looks like a bug. The last argument is updated by the read() callback, but that is not what we want. This helper function must not update the current file offset.

I think you need to do smth like this:
        size_t size = read_end - read_begin;
        size_t read = 0;
        file_off_t file_offset = (file_off_t)vma_info->file_offset;
        while (read < size) {
            size_t to_read = size - read;
            ret = file->fs->fs_ops->read(file, (void*)(read_begin + read), to_read,
                                         &file_offset);
        ...
This way you'll start reading with the vma_info->file_offset and on each iteration of the while loop, the new helper variable file_offset will be updated to reflect from what file offset to read the next chunk.

Good catch, done.

libos/src/bookkeep/libos_vma.c line 1419 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You also need to translate the PAL error code into the UNIX error code before going out:
ret = pal_to_unix_errno(ret);
goto out;

It's actually returning unix error codes (the above is actually wrong). Done.

libos/test/fs/read_write_mmap.c line 64 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why do you say (mmap)? This fd2 is used for write().

Done.

libos/test/fs/read_write_mmap.c line 69 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why do you need this "extend to 1MB size" operation on fd2? It will call write() which will automatically extend the file, so looks this code is redundant?

Done.

libos/test/fs/read_write_mmap.c line 77 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto (why (mmap))

Done.

libos/test/fs/read_write_mmap.c line 88 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto (why (mmap))

Done.

dimakuv

Reviewed 5 of 5 files at r6, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

mkow

Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)

libos/src/fs/chroot/encrypted.c line 457 at r4 (raw file):

We can even put a BUG() here, as I don't think there are legitimate benign cases where this reload could fail. @mkow WDYT?

+1, I think this is a good idea. The main problem here is that the untrusted host can force it too fail, but in benign cases this shouldn't happen. And in case of the failure we end up with inconsistent state visible to the app.

libos/include/libos_handle.h line 173 at r4 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Thanks for the great suggestion! I was overcomplicating things. Done.

Yeah, moving to inode looks good. Will continue reviewing now.

kailun-qin

Reviewable status: 9 of 13 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/src/fs/chroot/encrypted.c line 457 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

We can even put a BUG() here, as I don't think there are legitimate benign cases where this reload could fail. @mkow WDYT?

+1, I think this is a good idea. The main problem here is that the untrusted host can force it too fail, but in benign cases this shouldn't happen. And in case of the failure we end up with inconsistent state visible to the app.

Done.

libos/src/fs/chroot/fs.c line 325 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto (restore pos and hdl->inode->size on error)

Done.

libos/src/fs/tmpfs/fs.c line 272 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto

Done.

mkow

Reviewed 2 of 8 files at r1, 1 of 9 files at r2, 2 of 10 files at r5, 4 of 5 files at r6, 3 of 4 files at r7, all commit messages.
Reviewable status: 12 of 13 files reviewed, 10 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @kailun-qin)

libos/include/libos_vma.h line 133 at r8 (raw file):

Call read

This IMO doesn't describe well what it does, the read itself is not the essence, but reading and writing the data to the mapped region? I.e. reloading.

libos/include/libos_vma.h line 136 at r8 (raw file):

int reload_mmaped_from_file_range(uintptr_t begin, uintptr_t end);

/* Call `read` for file mappings of `hdl` */

ditto

libos/src/bookkeep/libos_vma.c line 516 at r8 (raw file):

    if (vma->file) {
        put_handle(vma->file);
        (void)__atomic_sub_fetch(&vma->file->inode->num_mmapped, 1, __ATOMIC_RELAXED);

inode field is optional, I think you should IF on it?

libos/src/bookkeep/libos_vma.c line 516 at r8 (raw file):

    if (vma->file) {
        put_handle(vma->file);
        (void)__atomic_sub_fetch(&vma->file->inode->num_mmapped, 1, __ATOMIC_RELAXED);

vma->file may have already been freed by this line.

libos/src/bookkeep/libos_vma.c line 516 at r8 (raw file):

    if (vma->file) {
        put_handle(vma->file);
        (void)__atomic_sub_fetch(&vma->file->inode->num_mmapped, 1, __ATOMIC_RELAXED);

Can we assert here that it doesn't go below 0? (or rather: doesn't underflow)

libos/src/bookkeep/libos_vma.c line 804 at r8 (raw file):

    if (new_vma->file) {
        get_handle(new_vma->file);
        (void)__atomic_add_fetch(&new_vma->file->inode->num_mmapped, 1, __ATOMIC_RELAXED);

ditto (inode)

libos/src/bookkeep/libos_vma.c line 1052 at r8 (raw file):

    if (new_vma->file) {
        get_handle(new_vma->file);
        (void)__atomic_add_fetch(&new_vma->file->inode->num_mmapped, 1, __ATOMIC_RELAXED);

ditto (inode)

libos/src/bookkeep/libos_vma.c line 1365 at r8 (raw file):

}

static bool vma_filter_needs_read(struct libos_vma* vma, void* arg) {

Suggestion:

vma_filter_needs_reload

libos/src/bookkeep/libos_vma.c line 1454 at r8 (raw file):

}

int reload_mmaped_from_file_range(uintptr_t begin, uintptr_t end) {

This looks unused? I guess we can simplify this whole code then?

libos/test/fs/read_write_mmap.c line 19 at r8 (raw file):

    printf("mmap_fd(%zu) OK\n", size);
    fill_random(m, size);
    int ret = msync(m, size, MS_SYNC);

Could you explain how does this case actually work with your PR? I could have missed something, but you didn't change anything in msync flow, and writing to memory is not triggering anything, so how is the writeback triggered? Was msync already doing that before the PR?

kailun-qin

Reviewable status: 11 of 13 files reviewed, 10 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/include/libos_vma.h line 133 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Call read

This IMO doesn't describe well what it does, the read itself is not the essence, but reading and writing the data to the mapped region? I.e. reloading.

Not relevant any more. Done.

libos/include/libos_vma.h line 136 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto

Done.

libos/src/bookkeep/libos_vma.c line 516 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

inode field is optional, I think you should IF on it?

Done.

libos/src/bookkeep/libos_vma.c line 516 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

vma->file may have already been freed by this line.

Good catch! Done.

libos/src/bookkeep/libos_vma.c line 516 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Can we assert here that it doesn't go below 0? (or rather: doesn't underflow)

Done.

libos/src/bookkeep/libos_vma.c line 804 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto (inode)

Done.

libos/src/bookkeep/libos_vma.c line 1052 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto (inode)

Done.

libos/src/bookkeep/libos_vma.c line 1365 at r8 (raw file):

}

static bool vma_filter_needs_read(struct libos_vma* vma, void* arg) {

Done.

libos/src/bookkeep/libos_vma.c line 1454 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

This looks unused? I guess we can simplify this whole code then?

Done.

libos/test/fs/read_write_mmap.c line 19 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Could you explain how does this case actually work with your PR? I could have missed something, but you didn't change anything in msync flow, and writing to memory is not triggering anything, so how is the writeback triggered? Was msync already doing that before the PR?

msync() will eventually go to the wirte() callback:

gramine/libos/src/fs/libos_fs_util.c

Line 209 in d052d18

ssize_t count = hdl->fs->fs_ops->write(hdl, write_addr, write_size, &pos);

. As we added the reloading logic into them, they should be triggered as expected.

dimakuv

Reviewed 3 of 4 files at r7, 2 of 2 files at r9, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

a discussion (no related file):
Something goes wrong:

04:08:29  ________________________ TC_01_Bootstrap.test_200_exec _________________________
04:08:29  test_libos.py:201: in test_200_exec
04:08:29      stdout, _ = self.run_binary(['exec'])
04:08:29  ../../../usr/lib/python3.8/site-packages/graminelibos/regression.py:224: in run_binary
04:08:29      _returncode, stdout, stderr = run_command(cmd, timeout=timeout, **kwds)
04:08:29  ../../../usr/lib/python3.8/site-packages/graminelibos/regression.py:162: in run_command
04:08:29      raise subprocess.CalledProcessError(proc.returncode, cmd, raw_stdout, raw_stderr)
04:08:29  E   subprocess.CalledProcessError: Command '['/home/jenkins/workspace/graphene-direct-sanitizers/usr/lib/x86_64-linux-gnu/gramine/direct/loader', '/home/jenkins/workspace/graphene-direct-sanitizers/usr/lib/x86_64-linux-gnu/gramine/direct/libpal.so', 'init', 'exec']' returned non-zero exit status 1.
04:08:29  ----------------------------- Captured stderr call -----------------------------
04:08:29  [0.012] [P1:T1:exec_victim] assert failed ../libos/src/bookkeep/libos_vma.c:518 num_mmapped < UINT64_MAX

Please debug and fix.

libos/src/bookkeep/libos_vma.c line 516 at r8 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Done.

It would be easier to understand your assert if you would change to __atomic_fetch_sub() (then you would compare that "the previous value was greater than 0" in the assert).

libos/src/bookkeep/libos_vma.c line 1454 at r8 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Done.

For the context: we initially mimicked the msync_all() functions (see below), but yes, technically reload_mmaped_from_file_range() was unused. I'm fine with removing it and simplifying a bit, though we lost a nice readability uniformity here.

libos/src/bookkeep/libos_vma.c line 1425 at r9 (raw file):

            if (ret < 0) {
                ret = pal_to_unix_errno(ret);
                break;

Why not goto out? Then you won't need a weird && !ret in the for loop.

libos/src/bookkeep/libos_vma.c line 1439 at r9 (raw file):

                break;
            } else if (count == 0) {
                break;

Why did you remove ret = -ENODATA?

libos/test/fs/read_write_mmap.c line 19 at r8 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

msync() will eventually go to the wirte() callback:

gramine/libos/src/fs/libos_fs_util.c

Line 209 in d052d18

ssize_t count = hdl->fs->fs_ops->write(hdl, write_addr, write_size, &pos);

. As we added the reloading logic into them, they should be triggered as expected.

Yep, as @kailun-qin explained. msync() -> write() callback -> this PR updates this callback with reload_mmaped_from_file_handle().

kailun-qin

Reviewable status: 12 of 13 files reviewed, 13 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/src/bookkeep/libos_vma.c line 516 at r8 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

It would be easier to understand your assert if you would change to __atomic_fetch_sub() (then you would compare that "the previous value was greater than 0" in the assert).

Done again.

libos/src/bookkeep/libos_vma.c line 1425 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why not goto out? Then you won't need a weird && !ret in the for loop.

hmm... && !ret servees for both the break here (outer loop) and the break below (inner loop). Note that here when mprotect fails we need to break the outer loop, while below when the read callback fails we need to break the inner loop. These two breaks need to goto different places -- with the first goto free_vma_info_array() directly while the latter goto restore permission w/ mprotect (which is still within the outer loop). I just don't want to introduce two gotos, so I used this approach.

libos/src/bookkeep/libos_vma.c line 1439 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why did you remove ret = -ENODATA?

This will cause tests like test_204_copy_dir_mmap_whole to fail when reloading on munmap(). The test copies size length of data (less than size of one page, say 256 bytes):

gramine/libos/test/fs/copy_mmap_whole.c

Lines 10 to 14 in d052d18

 // copy data 

 if (ftruncate(fo, size) != 0) 

 fatal_error("ftruncate(%s, %zu) failed: %s\n", output_path, size, strerror(errno)); 

 printf("ftruncate(%zu) output OK\n", size); 

 memcpy(out, in, size);

wheras we align up to page-aligned (4K) on munmap():

gramine/libos/src/sys/libos_mmap.c

Line 350 in d052d18

length = ALLOC_ALIGN_UP(length);

. If we return -ENODATA when nothing to read, then munmap() fails.

kailun-qin

Reviewable status: 12 of 13 files reviewed, 13 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Something goes wrong:

04:08:29  ________________________ TC_01_Bootstrap.test_200_exec _________________________
04:08:29  test_libos.py:201: in test_200_exec
04:08:29      stdout, _ = self.run_binary(['exec'])
04:08:29  ../../../usr/lib/python3.8/site-packages/graminelibos/regression.py:224: in run_binary
04:08:29      _returncode, stdout, stderr = run_command(cmd, timeout=timeout, **kwds)
04:08:29  ../../../usr/lib/python3.8/site-packages/graminelibos/regression.py:162: in run_command
04:08:29      raise subprocess.CalledProcessError(proc.returncode, cmd, raw_stdout, raw_stderr)
04:08:29  E   subprocess.CalledProcessError: Command '['/home/jenkins/workspace/graphene-direct-sanitizers/usr/lib/x86_64-linux-gnu/gramine/direct/loader', '/home/jenkins/workspace/graphene-direct-sanitizers/usr/lib/x86_64-linux-gnu/gramine/direct/libpal.so', 'init', 'exec']' returned non-zero exit status 1.
04:08:29  ----------------------------- Captured stderr call -----------------------------
04:08:29  [0.012] [P1:T1:exec_victim] assert failed ../libos/src/bookkeep/libos_vma.c:518 num_mmapped < UINT64_MAX

Please debug and fix.

Trying to fix by adding the missing ref counter increase on split_vma(), let's give it another try.

dimakuv

Reviewed 1 of 1 files at r10, all commit messages.
Reviewable status: all files reviewed, 15 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

libos/src/bookkeep/libos_vma.c line 1425 at r9 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

hmm... && !ret servees for both the break here (outer loop) and the break below (inner loop). Note that here when mprotect fails we need to break the outer loop, while below when the read callback fails we need to break the inner loop. These two breaks need to goto different places -- with the first goto free_vma_info_array() directly while the latter goto restore permission w/ mprotect (which is still within the outer loop). I just don't want to introduce two gotos, so I used this approach.

I'm sorry, but this is unreadable to me. Could you please split the body of this while loop (i.e., reloading a single VMA) into a helper static function? In ~2 years, when somebody will need to modify this code, they will definitely introduce a bug here...

libos/src/bookkeep/libos_vma.c line 1439 at r9 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

This will cause tests like test_204_copy_dir_mmap_whole to fail when reloading on munmap(). The test copies size length of data (less than size of one page, say 256 bytes):

gramine/libos/test/fs/copy_mmap_whole.c

Lines 10 to 14 in d052d18

// copy data

if (ftruncate(fo, size) != 0)

fatal_error("ftruncate(%s, %zu) failed: %s\n", output_path, size, strerror(errno));

printf("ftruncate(%zu) output OK\n", size);

memcpy(out, in, size);

wheras we align up to page-aligned (4K) on munmap():

gramine/libos/src/sys/libos_mmap.c

Line 350 in d052d18

length = ALLOC_ALIGN_UP(length);

. If we return -ENODATA when nothing to read, then munmap() fails.

Thanks, it makes sense. We previously copy-pasted that code from read_exact(), but you're right, here we don't read the exact size -- it can be less than that.

libos/src/bookkeep/libos_vma.c line 214 at r10 (raw file):

    if (new_vma->file) {
        if (new_vma->file->inode)
            (void)__atomic_add_fetch(&new_vma->file->inode->num_mmapped, 1, __ATOMIC_RELAXED);

Woudn't it make sense to move this inside copy_vma()?

libos/src/bookkeep/libos_vma.c line 1441 at r10 (raw file):

                break;
            } else if (count == 0) {
                break;

Ouch, @kailun-qin's comment made me realize that we have totally redundant paths like this:

munmap() -> msync_all() -> fs_ops->msync() -> fs_ops->write() -> reload_mmaped_from_file_handle()
mmap(MAP_FIXED_NOREPLACE) -> msync_all() -> fs_ops->msync() -> fs_ops->write() -> reload_mmaped_from_file_handle()
msync() -> -> msync_all() -> fs_ops->msync() -> fs_ops->write() -> reload_mmaped_from_file_handle()

In other words, when VMA contents are being msynced, it calls the write callback which in turn blindly reloads the VMA contents. Which is totally bogus.

@kailun-qin What if we add some boolean flag to struct libos_vma_info, and we set this flag inside msync_all() and then unset it? And we'll check this flag here -- if it is set, then we skip all this reloading logic for the VMA.

Otherwise we are doomed to have these meaningless read/write operations on each munmap, which is a frequent op.

mkow

Reviewed 1 of 2 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @kailun-qin)

libos/include/libos_vma.h line 133 at r10 (raw file):

int msync_handle(struct libos_handle* hdl);

/* Call `reload` for file mappings of `hdl` */

There's no "reload" function or anything like that, so it shouldn't be backticked. Just use it as an English verb.

Suggestion:

/* Reload file mappings of `hdl` */

libos/src/bookkeep/libos_vma.c line 1441 at r10 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Ouch, @kailun-qin's comment made me realize that we have totally redundant paths like this:

munmap() -> msync_all() -> fs_ops->msync() -> fs_ops->write() -> reload_mmaped_from_file_handle()

mmap(MAP_FIXED_NOREPLACE) -> msync_all() -> fs_ops->msync() -> fs_ops->write() -> reload_mmaped_from_file_handle()

msync() -> -> msync_all() -> fs_ops->msync() -> fs_ops->write() -> reload_mmaped_from_file_handle()

In other words, when VMA contents are being msynced, it calls the write callback which in turn blindly reloads the VMA contents. Which is totally bogus.

@kailun-qin What if we add some boolean flag to struct libos_vma_info, and we set this flag inside msync_all() and then unset it? And we'll check this flag here -- if it is set, then we skip all this reloading logic for the VMA.

Otherwise we are doomed to have these meaningless read/write operations on each munmap, which is a frequent op.

But this idea doesn't seem to be correct against race conditions? What in the meantime someone does something which actually should reload a mapping here? The msync-in-progress will turn the reloading into a no-op?

libos/test/fs/read_write_mmap.c line 19 at r8 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yep, as @kailun-qin explained. msync() -> write() callback -> this PR updates this callback with reload_mmaped_from_file_handle().

I see, thanks!

kailun-qin

Reviewable status: 11 of 13 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/include/libos_vma.h line 133 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

There's no "reload" function or anything like that, so it shouldn't be backticked. Just use it as an English verb.

Done.

libos/src/bookkeep/libos_vma.c line 1425 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I'm sorry, but this is unreadable to me. Could you please split the body of this while loop (i.e., reloading a single VMA) into a helper static function? In ~2 years, when somebody will need to modify this code, they will definitely introduce a bug here...

Done.

libos/src/bookkeep/libos_vma.c line 214 at r10 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Woudn't it make sense to move this inside copy_vma()?

Done.

libos/src/bookkeep/libos_vma.c line 1441 at r10 (raw file):

But this idea doesn't seem to be correct against race conditions? What in the meantime someone does something which actually should reload a mapping here?

Yes, it can be racy.

The msync-in-progress will turn the reloading into a no-op?

Yes, I think this was what @dimakuv would like to propose.

dimakuv

Reviewed 1 of 2 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

a discussion (no related file):

Previously, kailun-qin (Kailun Qin) wrote…

Trying to fix by adding the missing ref counter increase on split_vma(), let's give it another try.

Looks like problem solved, thanks.

libos/src/bookkeep/libos_vma.c line 1441 at r10 (raw file):

But this idea doesn't seem to be correct against race conditions? What in the meantime someone does something which actually should reload a mapping here? The msync-in-progress will turn the reloading into a no-op?

Yes, there is a possibility of data races. We even discussed it very long time ago with Pawel (when he implemented msync(), which structure Kailun took for this "reload VMAs"). Here's a relevant link: #550 (review)

Note that:

We already have possibility for data races in this func anyway, see the NOTE in the code.
It's impossible to take a global VMA lock here, because this lock is very widely used (e.g., if a fs_ops->read() callback wants to malloc, it will probably land into the VMA code and will try to re-acquire the VMA lock). In other words, we can't blindly protect with a global VMA lock because we'll have deadlocks due to nested locking.
An argument that it is fine: this reload_mmaped_from_file_handle() is always called from an operation on an active VMA (e.g. write() or munmap() syscalls). So at least the current VMA will not be freed/remaped while we're in this func. There is a possibility that another thread of the app will suddenly do something with other VMAs that are backed by this file handle, but it means a buggy application.

Otherwise we are doomed to have these meaningless read/write operations on each munmap, which is a frequent op.

On a second thought, this munmap-of-file-backed-region shouldn't be a frequent operation. Typically apps load e.g. a binary image inside process memory and just keep it there forever, or load e.g. a database file and keep it there also forever. So munmap should be rare actually.

Moreover, if we imagine the same file mmapped twice (I don't know why anyone would do it), then the current munmap logic starts to make sense -- yes, the first mmapped region (the one that is being munmapped) will be redundantly reloaded, but the second mmapped region will correct be reloaded. So this looks reasonable after all.

@kailun-qin Can we add a NOTE comment describing this inefficiency?

libos/src/bookkeep/libos_vma.c line 1429 at r12 (raw file):

            goto out;
        } else if (count == 0) {
            goto out;

Aren't you supposed to break here? Or at least specify ret = 0 before.

Otherwise looks like you have undefined value of ret in this case.

Also, you could add a quick comment for this case, that it's possible that the underlying file contents do not cover the whole VMA region, and thus count == 0.

kailun-qin

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/src/bookkeep/libos_vma.c line 1441 at r10 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

But this idea doesn't seem to be correct against race conditions? What in the meantime someone does something which actually should reload a mapping here? The msync-in-progress will turn the reloading into a no-op?

Yes, there is a possibility of data races. We even discussed it very long time ago with Pawel (when he implemented msync(), which structure Kailun took for this "reload VMAs"). Here's a relevant link: #550 (review)

Note that:

We already have possibility for data races in this func anyway, see the NOTE in the code.

It's impossible to take a global VMA lock here, because this lock is very widely used (e.g., if a fs_ops->read() callback wants to malloc, it will probably land into the VMA code and will try to re-acquire the VMA lock). In other words, we can't blindly protect with a global VMA lock because we'll have deadlocks due to nested locking.

An argument that it is fine: this reload_mmaped_from_file_handle() is always called from an operation on an active VMA (e.g. write() or munmap() syscalls). So at least the current VMA will not be freed/remaped while we're in this func. There is a possibility that another thread of the app will suddenly do something with other VMAs that are backed by this file handle, but it means a buggy application.

Otherwise we are doomed to have these meaningless read/write operations on each munmap, which is a frequent op.

On a second thought, this munmap-of-file-backed-region shouldn't be a frequent operation. Typically apps load e.g. a binary image inside process memory and just keep it there forever, or load e.g. a database file and keep it there also forever. So munmap should be rare actually.

Moreover, if we imagine the same file mmapped twice (I don't know why anyone would do it), then the current munmap logic starts to make sense -- yes, the first mmapped region (the one that is being munmapped) will be redundantly reloaded, but the second mmapped region will correct be reloaded. So this looks reasonable after all.

@kailun-qin Can we add a NOTE comment describing this inefficiency?

Done.

libos/src/bookkeep/libos_vma.c line 1429 at r12 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Aren't you supposed to break here? Or at least specify ret = 0 before.

Otherwise looks like you have undefined value of ret in this case.

Also, you could add a quick comment for this case, that it's possible that the underlying file contents do not cover the whole VMA region, and thus count == 0.

Done.

dimakuv

Reviewed 1 of 1 files at r13, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

mkow

Reviewed 1 of 2 files at r11, 1 of 1 files at r13, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)

libos/src/bookkeep/libos_vma.c line 1380 at r13 (raw file):

        return false;

    assert(vma->file);

Actually, why is this true? Aren't we iterating over all VMAs, including the ones without a backing file?

libos/src/bookkeep/libos_vma.c line 1382 at r13 (raw file):

    assert(vma->file);

    if (hdl && hdl->inode && vma->file->inode && vma->file->inode != hdl->inode)

Doesn't this check work a bit unexpectedly? I think that now, if you pass a handle without a file (hdl->inode == NULL), which is clearly expected/allowed by this function, it will reload all the mappings?

kailun-qin

Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

libos/src/bookkeep/libos_vma.c line 1380 at r13 (raw file):

Aren't we iterating over all VMAs, including the ones without a backing file?

For those VMAs w/o a backing file, shouldn't they contain MAP_ANONYMOUS and would return earlier? (Or am I missing sth?)

libos/src/bookkeep/libos_vma.c line 1382 at r13 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Doesn't this check work a bit unexpectedly? I think that now, if you pass a handle without a file (hdl->inode == NULL), which is clearly expected/allowed by this function, it will reload all the mappings?

But reload is invoked from the write callback where it's actually guaranteed to have hdl->inode?

mkow

Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)

libos/src/bookkeep/libos_vma.c line 1380 at r13 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Aren't we iterating over all VMAs, including the ones without a backing file?

For those VMAs w/o a backing file, shouldn't they contain MAP_ANONYMOUS and would return earlier? (Or am I missing sth?)

Ah, right.

libos/src/bookkeep/libos_vma.c line 1382 at r13 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

But reload is invoked from the write callback where it's actually guaranteed to have hdl->inode?

Ok, but then why checking it in runtime? It looks like this was a valid and possible case. If it's impossible then it should be an assert.

kailun-qin

Reviewable status: 12 of 13 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/src/bookkeep/libos_vma.c line 1382 at r13 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Ok, but then why checking it in runtime? It looks like this was a valid and possible case. If it's impossible then it should be an assert.

Done.

dimakuv

Reviewed 1 of 1 files at r14, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

libos/src/bookkeep/libos_vma.c line 1382 at r13 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Done.

@kailun-qin I think it's still a bit misleading.

I agree that hdl->inode must always exist (because this func is ultimately invoked from the write() callbacks that indeed have hdl->inode). The assert assert(hdl && hdl->inode) is fine with me.

However, I don't like the assert on assert(vma->file->inode). In this code, we're just browsing through all file-backed VMA mappings, and since our documentation says that libos_handle::inode can be optional -- even though I think that's not true for files that can be mmapped -- we must not blindly assert on vma->file->inode. This will also be similar to how you check vma->file->inode in other places in this PR.

So in all, that's my suggestion:

    struct libos_handle* hdl = arg;
    assert(hdl && hdl->inode); /* guaranteed to have inode because invoked from `write` callback */

    if (vma->flags & (VMA_UNMAPPED | VMA_INTERNAL | MAP_ANONYMOUS | MAP_PRIVATE))
        return false;

    assert(vma->file); /* check above filtered out non-file-backed mappings */

    if (!vma->file->inode || vma->file->inode != hdl->inode)
        return false;

kailun-qin

Reviewable status: 12 of 13 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

libos/src/bookkeep/libos_vma.c line 1382 at r13 (raw file):
Thanks for the suggestion!

even though I think that's not true for files that can be mmapped

Yes, this was my previous assumption.

since our documentation says that libos_handle::inode can be optional, ...,we must not blindly assert on vma->file->inode

Yes, I'm updating to the proposed version -- which makes more sense.

dimakuv

Reviewed 1 of 1 files at r15, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

mkow

Reviewed 1 of 1 files at r15, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners

Previously, Gramine didn't propagate changes to a mmaped file back to memory. This is unexpected for apps that use both mmap and read/write on the same file (e.g., etcd-io/bbolt has such implementation to write data via `write()` and read it via the mmaped memory). This commit introduces the reloading of affected mmaped regions on `chroot_encrypted`, `chroot` and `tmpfs` file writes. To reduce performance penalty in normal cases (where apps use either `write()` or `mmap()`), this commit also creates a new field in the inode to record the number of mmaped regions of this file. We hence simply check on this field to determine whether to perform the reloading to make the overhead negligible if none mmaped. Signed-off-by: Kailun Qin <kailun.qin@intel.com>

dimakuv

Reviewed 6 of 6 files at r16, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from different teams (1 more required, approved so far: Intel)

mkow

Reviewed 6 of 6 files at r16, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved

dimakuv requested changes Aug 30, 2023

View reviewed changes

kailun-qin commented Aug 30, 2023

View reviewed changes

dimakuv requested changes Aug 30, 2023

View reviewed changes

kailun-qin commented Aug 30, 2023

View reviewed changes

dimakuv approved these changes Aug 30, 2023

View reviewed changes

kailun-qin commented Nov 6, 2023

View reviewed changes

mkow reviewed Nov 7, 2023

View reviewed changes

mkow requested changes Nov 16, 2023

View reviewed changes

kailun-qin commented Nov 17, 2023

View reviewed changes

dimakuv requested changes Nov 17, 2023

View reviewed changes

kailun-qin commented Nov 17, 2023

View reviewed changes

dimakuv requested changes Nov 17, 2023

View reviewed changes

kailun-qin commented Nov 17, 2023

View reviewed changes

dimakuv previously approved these changes Nov 17, 2023

View reviewed changes

mkow requested changes Nov 17, 2023

View reviewed changes

kailun-qin dismissed dimakuv’s stale review via b09bf68 November 18, 2023 02:08

kailun-qin commented Nov 18, 2023

View reviewed changes

kailun-qin force-pushed the kailun-qin/update-mmap-enc-files branch 2 times, most recently from ae315f3 to 85fd0fa Compare November 18, 2023 03:37

mkow requested changes Nov 19, 2023

View reviewed changes

kailun-qin commented Nov 20, 2023

View reviewed changes

dimakuv requested changes Nov 20, 2023

View reviewed changes

kailun-qin commented Nov 20, 2023

View reviewed changes

dimakuv requested changes Nov 20, 2023

View reviewed changes

mkow requested changes Nov 20, 2023

View reviewed changes

kailun-qin commented Nov 21, 2023

View reviewed changes

dimakuv requested changes Nov 21, 2023

View reviewed changes

kailun-qin commented Nov 21, 2023

View reviewed changes

kailun-qin force-pushed the kailun-qin/update-mmap-enc-files branch from 342cd13 to fd57351 Compare November 21, 2023 10:09

dimakuv previously approved these changes Nov 21, 2023

View reviewed changes

mkow requested changes Nov 22, 2023

View reviewed changes

kailun-qin commented Nov 22, 2023

View reviewed changes

mkow requested changes Nov 22, 2023

View reviewed changes

kailun-qin dismissed dimakuv’s stale review via 49faaeb November 23, 2023 02:17

kailun-qin commented Nov 23, 2023

View reviewed changes

dimakuv requested changes Nov 23, 2023

View reviewed changes

kailun-qin commented Nov 23, 2023

View reviewed changes

dimakuv previously approved these changes Nov 23, 2023

View reviewed changes

mkow previously approved these changes Nov 23, 2023

View reviewed changes

kailun-qin dismissed stale reviews from mkow and dimakuv via 2426d53 November 23, 2023 11:13

kailun-qin force-pushed the kailun-qin/update-mmap-enc-files branch from fc9f497 to 2426d53 Compare November 23, 2023 11:13

kailun-qin force-pushed the kailun-qin/update-mmap-enc-files branch from 2426d53 to 1ea3e60 Compare November 23, 2023 11:15

dimakuv approved these changes Nov 23, 2023

View reviewed changes

mkow approved these changes Nov 23, 2023

View reviewed changes

dimakuv merged commit 1ea3e60 into gramineproject:master Nov 23, 2023
17 of 18 checks passed

	int read_exact(PAL_HANDLE handle, void* buf, size_t size) {
	size_t read = 0;
	while (read < size) {
	size_t tmp_read = size - read;
	int ret = PalStreamRead(handle, /offset=/0, &tmp_read, (char*)buf + read);
	if (ret < 0) {
	if (ret == -PAL_ERROR_INTERRUPTED \|\| ret == -PAL_ERROR_TRYAGAIN) {
	continue;
	}
	return pal_to_unix_errno(ret);
	} else if (tmp_read == 0) {
	return -ENODATA;
	}
	read += tmp_read;
	}
	return 0;
	}

	// copy data
	if (ftruncate(fo, size) != 0)
	fatal_error("ftruncate(%s, %zu) failed: %s\n", output_path, size, strerror(errno));
	printf("ftruncate(%zu) output OK\n", size);
	memcpy(out, in, size);

[LibOS] Update mmapped regions when writing to encrypted files #1523

[LibOS] Update mmapped regions when writing to encrypted files #1523

Conversation

kailun-qin commented Aug 30, 2023 • edited Loading

Description of the changes

How to test this PR?

dimakuv left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

dzobbe commented Nov 4, 2023

kailun-qin left a comment

Choose a reason for hiding this comment

mkow left a comment

Choose a reason for hiding this comment

dzobbe commented Nov 7, 2023

dzobbe commented Nov 14, 2023

mkow left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

mkow left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

mkow left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

mkow left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

mkow left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

mkow left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

Choose a reason for hiding this comment

kailun-qin left a comment

Choose a reason for hiding this comment

dimakuv left a comment

kailun-qin commented Aug 30, 2023 •

edited

Loading