-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flush profile data on demand using SIGUSR1
signal
#1751
Flush profile data on demand using SIGUSR1
signal
#1751
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ) (waiting on @jkr0103)
a discussion (no related file):
This PR looks surprisingly simple. Nice! Thanks for this.
a discussion (no related file):
Could you copy-paste some examples from your experiments? I'd like to see how things look like in reality (before testing myself locally).
Also, what applications would you recommend to test?
Documentation/performance.rst
line 636 at r1 (raw file):
signal which would help in cases where user would like to collect profile data only for a particular period (e.g., during benchmarks run). Send SIGUSR1 signal using command ``kill -SIGUSR1 <pid>``.
You should also mention that sending multiple SIGUSR1 will create multiple files, each containing profile data collected after the previous SIGUSR1. And that this is helpful when the application runs in clearly separated stages.
pal/src/host/linux-sgx/host_exception.c
line 191 at r1 (raw file):
} static void handle_async_sigusr1_signal(int signum, siginfo_t* info, struct ucontext* uc) {
That's a too long name. SIGUSR1 is already a "signal" and already a "asynchronous" one.
So just rename to handle_sigusr1()
pal/src/host/linux-sgx/host_exception.c
line 201 at r1 (raw file):
sgx_profile_init(); struct debug_map* map = g_debug_map;
Please add the same comment from here:
gramine/pal/src/host/linux-sgx/host_main.c
Line 314 in 1cf1f46
/* Report all ELFs already loaded (host part of PAL and dynamic libraries used by it) */ |
pal/src/host/linux-sgx/host_profile.c
line 127 at r1 (raw file):
ret = DO_SYSCALL(clock_gettime, CLOCK_REALTIME, &ts); if (ret < 0) { log_error("sgx_profile_sample: clock_gettime failed: %s", unix_strerror(ret));
sgx_profile_init:
pal/src/host/linux-sgx/host_profile.c
line 128 at r1 (raw file):
if (ret < 0) { log_error("sgx_profile_sample: clock_gettime failed: %s", unix_strerror(ret)); return false;
What is this? Should be goto out;
pal/src/host/linux-sgx/host_profile.c
line 133 at r1 (raw file):
if (g_pal_enclave.profile_append_pid_to_filename) { snprintf(g_pal_enclave.profile_filename, ARRAY_SIZE(g_pal_enclave.profile_filename), SGX_PROFILE_FILENAME_WITH_PID, (int)g_host_pid, ts.tv_sec);
profile_filename
is 64 chars in size. Now that we also add the number of seconds, are we sure we still fit in 64 chars limit? Please check how many chars in the filenames you see in your experiments.
I think we can increase to 128 chars in size, and this will be more than enough.
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
g_mem_fd = -1; log_always("Profile data written to %s (%lu bytes)", g_pal_enclave.profile_filename, size);
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 5 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Could you copy-paste some examples from your experiments? I'd like to see how things look like in reality (before testing myself locally).
Also, what applications would you recommend to test?
Done.
Documentation/performance.rst
line 636 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
You should also mention that sending multiple SIGUSR1 will create multiple files, each containing profile data collected after the previous SIGUSR1. And that this is helpful when the application runs in clearly separated stages.
Done.
pal/src/host/linux-sgx/host_exception.c
line 191 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
That's a too long name. SIGUSR1 is already a "signal" and already a "asynchronous" one.
So just rename to
handle_sigusr1()
Done.
pal/src/host/linux-sgx/host_exception.c
line 201 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please add the same comment from here:
gramine/pal/src/host/linux-sgx/host_main.c
Line 314 in 1cf1f46
/* Report all ELFs already loaded (host part of PAL and dynamic libraries used by it) */
Done.
pal/src/host/linux-sgx/host_profile.c
line 127 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
sgx_profile_init:
Done.
pal/src/host/linux-sgx/host_profile.c
line 128 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
What is this? Should be
goto out;
Done.
pal/src/host/linux-sgx/host_profile.c
line 133 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
profile_filename
is 64 chars in size. Now that we also add the number of seconds, are we sure we still fit in 64 chars limit? Please check how many chars in the filenames you see in your experiments.I think we can increase to 128 chars in size, and this will be more than enough.
Done.
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Why this change?
User might not want to enable debug logs due to performance reasons in that case user won't know the file where profile data is written. Also this log is seen only when profiling is enabled, in that case we should always inform user about the file name where profile data is written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103)
a discussion (no related file):
Previously, jkr0103 (Jitender Kumar) wrote…
Done.
Done where? UPDATE: You updated the PR description, I see now.
pal/src/host/linux-sgx/host_exception.c
line 201 at r2 (raw file):
sgx_profile_init(); /* Report all ELFs already loaded (host part of PAL and dynamic libraries used by it) */
Ah, sorry, the text in parenthesis is incorrect in this case -- it's not only the PAL binary and its dynamic-lib dependencies. It's also the application executable itself, the LibOS binary, etc.
So please shorten to just: /* Report all ELFs already loaded */
pal/src/host/linux-sgx/host_profile.c
line 133 at r1 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
Done.
Btw, I looked at the sizes of the longest possible strings (sgx-perf-
prefix + PID + timestamp + .data
suffix). Typically you'll get ~32 chars. So 64 chars would also be enough.
Nevertheless, I don't see any harm in increasing from 64 chars to 128 chars.
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
User might not want to enable debug logs due to performance reasons in that case user won't know the file where profile data is written. Also this log is seen only when profiling is enabled, in that case we should always inform user about the file name where profile data is written.
You're right, that's a good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 5 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)
pal/src/host/linux-sgx/host_exception.c
line 201 at r2 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Ah, sorry, the text in parenthesis is incorrect in this case -- it's not only the PAL binary and its dynamic-lib dependencies. It's also the application executable itself, the LibOS binary, etc.
So please shorten to just:
/* Report all ELFs already loaded */
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners
Jenkins, test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners
a discussion (no related file):
I tested the PR, looks like it's working. I followed the instructions in the PR description and tested Redis on my local machine.
I did 4 steps:
- Start Redis and perform
kill -SIGUSR1
after 15 seconds; this is the enclave startup time. - Wait for 15 seconds and perform
kill -SIGUSR1
again; this is the do-nothing idle time. - Start
./src/src/redis-benchmark -t set,get
, wait till it finishes (around 15 seconds) and performkill -SIGUSR1
again; this is the actual benchmark-working time. - Wait for 15 seconds and finally terminate Redis via
kill -SIGTERM
; this is the do-nothing idle time.
So, a total of 1 minute.
sgx-perf-1707208751.data
(178147 bytes) -- startup (15s)
Samples: 21
- 76.19% pal-sgx libpal.so [.] copy_u64s
copy_u64s
copy_u64s_from_untrusted
sgx_copy_to_enclave_verified
- sgx_copy_to_enclave
+ 66.67% load_trusted_or_allowed_file
+ 9.52% copy_and_verify_trusted_file
- 19.05% pal-sgx libpal.so [.] memset
_real_memset (inlined)
_PalVirtualMemoryAlloc
PalVirtualMemoryAlloc
libos_syscall_mmap
+ 4.76% pal-sgx libsysdb.so [.] snprintf_callback
sgx-perf-1707208766.data
(18547 bytes) -- idle before bench run (15s)
Samples: 2
- 50.00% pal-sgx libpal.so [.] copy_u64s
copy_u64s
copy_u64s_from_untrusted
sgx_copy_to_enclave_verified
alloc_untrusted_futex_word
_PalEventCreate
PalEventCreate
create_lock
get_new_handle
libos_syscall_openat
- 50.00% pal-sgx libpal.so [.] slab_alloc
slab_alloc
malloc
calloc
_PalEventCreate
PalEventCreate
create_lock
get_new_handle
libos_syscall_openat
sgx-perf-1707208782.data
(2059747 bytes) -- benchmark run (15s)
Samples: 245
+ 38.37% pal-sgx libpal.so [.] enclave_entry
+ 4.08% pal-sgx libpal.so [.] copy_u64s
+ 2.45% pal-sgx libpal.so [.] sgx_ocall
+ 2.04% pal-sgx libpal.so [.] __save_xregs
+ 2.04% pal-sgx redis-server [.] je_malloc_usable_size
+ 1.63% pal-sgx libpal.so [.] __restore_xregs
+ 1.63% pal-sgx libpal.so [.] sgx_copy_from_enclave_verified
+ 1.63% pal-sgx libsysdb.so [.] _return_from_syscall
+ 1.63% pal-sgx redis-server [.] siphash_nocase
+ 1.22% pal-sgx libpal.so [.] sgx_exitless_ocall
+ 1.22% pal-sgx libpal.so [.] sgx_reset_ustack
+ 1.22% pal-sgx libsysdb.so [.] cmp_addr_to_vma
+ 1.22% pal-sgx libsysdb.so [.] libos_syscall_gettimeofday
+ 1.22% pal-sgx libsysdb.so [.] truncate_file_name
+ 0.82% pal-sgx libc.so.6 [.] __GI___libc_write
+ 0.82% pal-sgx libc.so.6 [.] __strcasecmp_l_avx2
+ 0.82% pal-sgx libc.so.6 [.] read
+ 0.82% pal-sgx libpal.so [.] PalSocketSend
+ 0.82% pal-sgx libpal.so [.] copy_u64s_to_untrusted
+ 0.82% pal-sgx libpal.so [.] memcpy
+ 0.82% pal-sgx libpal.so [.] ocall_send
+ 0.82% pal-sgx libpal.so [.] sgx_is_valid_untrusted_ptr
+ 0.82% pal-sgx libsysdb.so [.] avl_tree_lower_bound_fn
sgx-perf-1707208796.data
(18547 bytes) -- idle after bench run (15s)
Samples: 2
- 100.00% pal-sgx libpal.so [.] copy_u64s
copy_u64s
copy_u64s_from_untrusted
sgx_copy_to_enclave_verified
alloc_untrusted_futex_word
_PalEventCreate
PalEventCreate
create_lock
get_new_handle
libos_syscall_openat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103)
pal/src/host/linux-sgx/host_exception.c
line 198 at r3 (raw file):
#ifdef DEBUG if (g_pal_enclave.profile_enable) { sgx_profile_finish();
This PR breaks on multi-threaded apps. I tested on Memcached and got SIGSEGV.
That's because some other thread may be in the middle of saving something into the profile, and the profile data/metadata is "swept" from its feet.
We already have a global spinlock that protects all operations on profiling, so we just need to surround all this logic with a single lock/unlock. Let me prepare a fixup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 6 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners
pal/src/host/linux-sgx/host_exception.c
line 198 at r3 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This PR breaks on multi-threaded apps. I tested on Memcached and got SIGSEGV.
That's because some other thread may be in the middle of saving something into the profile, and the profile data/metadata is "swept" from its feet.
We already have a global spinlock that protects all operations on profiling, so we just need to surround all this logic with a single lock/unlock. Let me prepare a fixup.
Done. The code is pretty ugly and will need some re-arrangements, and because we don't protect g_mem_fd
with a lock, we sometimes see (rather benign) error messages. Someone (me?) will need to make the code look better. But functionality-wise seems to be ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r2, all commit messages.
Reviewable status: 1 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)
a discussion (no related file):
Why is this implemented in PAL-SGX? Profiling is a generic feature in Gramine, not SGX-specific.
-- commits
line 2 at r4:
missing component tag
Documentation/performance.rst
line 639 at r4 (raw file):
multiple files, each containing profile data collected after the previous ``SIGUSR1`` which is helpful when the application runs in clearly separated stages.
This has to be rewritten, it's hard to read.
pal/src/host/linux-sgx/host_exception.c
line 198 at r4 (raw file):
#ifdef DEBUG if (g_pal_enclave.profile_enable) { sgx_profile_reinit();
You can't call this from a signal handler for like 10 different reasons, e.g. because it tries to take a lock which could have been already taken by the current thread when SIGUSR1 arrived, causing a deadlock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @jkr0103, and @mkow)
pal/src/host/linux-sgx/host_exception.c
line 198 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
You can't call this from a signal handler for like 10 different reasons, e.g. because it tries to take a lock which could have been already taken by the current thread when SIGUSR1 arrived, causing a deadlock.
Indeed. I think the best way would be to just set an atomic flag in the signal, and do the flush the next time profile data is collected (eg. in sgx_profile_sample_aex
), which should happen often enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
a discussion (no related file):
Previously, mkow (Michał Kowalczyk) wrote…
Why is this implemented in PAL-SGX? Profiling is a generic feature in Gramine, not SGX-specific.
There is nothing required for non-SGX PAL (gramine-direct
). This kind of profiling just works, it doesn't require any additional features.
So we're augmenting already-existing SGX PAL support for profiling. See https://gramine.readthedocs.io/en/stable/performance.html#profiling-with-perf
pal/src/host/linux-sgx/host_exception.c
line 198 at r4 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
Indeed. I think the best way would be to just set an atomic flag in the signal, and do the flush the next time profile data is collected (eg. in
sgx_profile_sample_aex
), which should happen often enough.
Yeah, I agree...
The code got super-ugly now, @jkr0103 or me will need to sort this all out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_exception.c
line 198 at r4 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Yeah, I agree...
The code got super-ugly now, @jkr0103 or me will need to sort this all out.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r5, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
Previously, mkow (Michał Kowalczyk) wrote…
missing component tag
Previously we had a prefix like this: [Pal/Linux-SGX] SGX perf:
So I suggest to change the commit message to
[Pal/Linux-SGX] SGX perf: flush profile data on SIGUSR1 signal
This allows to collect profile data only for a particular period
(e.g., during Gramine startup or during an actual benchmark run).
Documentation/performance.rst
line 639 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
This has to be rewritten, it's hard to read.
Proposed text:
It is also possible to flush the collected profile data interactively, using the
``SIGUSR1`` signal. This helps to collect profile data only for a particular
period, e.g., skipping the Gramine startup and application initialization time
and concentrating only on the actual application processing. Send ``SIGUSR1``
using command ``kill -SIGUSR1 <gramine-pid>``. Sending multiple ``SIGUSR1`` will
create multiple files, each containing profile data collected after the previous
``SIGUSR1``.
pal/src/host/linux-sgx/host_exception.c
line 198 at r3 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Done. The code is pretty ugly and will need some re-arrangements, and because we don't protect
g_mem_fd
with a lock, we sometimes see (rather benign) error messages. Someone (me?) will need to make the code look better. But functionality-wise seems to be ok.
This was fixed by @jkr0103
pal/src/host/linux-sgx/host_internal.h
line 134 at r5 (raw file):
#ifdef DEBUG #include "elf/elf.h"
What is this? Why is this needed in this header?
pal/src/host/linux-sgx/host_internal.h
line 155 at r5 (raw file):
/* Finalize and close file */ void sgx_profile_finish(void); void sgx_profile_finish_thread_unsafe(void);
This function must be declared as static
in the *.c
file, and removed from here.
pal/src/host/linux-sgx/host_internal.h
line 158 at r5 (raw file):
/* Re-initialize based on g_pal_enclave settings */ int sgx_profile_reinit(void);
This function must be declared as static
in the *.c
file, and removed from here.
pal/src/host/linux-sgx/host_internal.h
line 176 at r5 (raw file):
/* Record a new mapped ELF */ void sgx_profile_report_elf(const char* filename, void* addr); void sgx_profile_report_elf_thread_unsafe(const char* filename, void* addr);
Why do you need to declare these 4 functions? You should just declare those functions as static
in the *.c
file, and remove from here.
pal/src/host/linux-sgx/host_profile.c
line 177 at r5 (raw file):
} void sgx_profile_finish_thread_unsafe(void) {
static void
pal/src/host/linux-sgx/host_profile.c
line 197 at r5 (raw file):
g_profile_enabled = false; }
Please add a newline after this
pal/src/host/linux-sgx/host_profile.c
line 203 at r5 (raw file):
spinlock_unlock(&g_perf_data_lock); }
Please remove a newline
pal/src/host/linux-sgx/host_profile.c
line 454 at r5 (raw file):
log_error("sgx_profile_report_elf(%s): close failed: %s", filename, unix_strerror(ret)); } }
Please add a newline after this
pal/src/host/linux-sgx/host_profile.c
line 481 at r5 (raw file):
sgx_profile_report_elf_cleanup(filename, &elf_addr, elf_length, fd); } }
Sorry, this refactoring of sgx_profile_report_elf()
turns out to be a complete mess...
I suggest to do the same you did with sgx_profile_finish()
-- just remove the locks from the original function and rename it to _thread_unsafe()
, and in the wrapper function add the locks. Yes, this will be less performant than the original code (because the critical section in-between lock and unlock becomes larger), but this function is called very rarely, only on new binaries/libraries loaded into the application.
pal/src/host/linux-sgx/host_profile.c
line 484 at r5 (raw file):
int sgx_profile_reinit(void) { int ret;
Please add in the very beginning: assert(spinlock_is_locked(&g_perf_data_lock));
pal/src/host/linux-sgx/host_profile.c
line 489 at r5 (raw file):
ret = sgx_profile_init(); if (ret < 0) { log_error("sgx_profile_reinit: initializing again failed: %s", unix_strerror(ret));
Actually no need for "initializing again", as the name of the func already explains the issue. So: sgx_profile_reinit failed: %s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 6 files reviewed, 16 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Previously we had a prefix like this:
[Pal/Linux-SGX] SGX perf:
So I suggest to change the commit message to
[Pal/Linux-SGX] SGX perf: flush profile data on SIGUSR1 signal This allows to collect profile data only for a particular period (e.g., during Gramine startup or during an actual benchmark run).
Could you please add it while rebasing the PR?
Documentation/performance.rst
line 639 at r4 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Proposed text:
It is also possible to flush the collected profile data interactively, using the ``SIGUSR1`` signal. This helps to collect profile data only for a particular period, e.g., skipping the Gramine startup and application initialization time and concentrating only on the actual application processing. Send ``SIGUSR1`` using command ``kill -SIGUSR1 <gramine-pid>``. Sending multiple ``SIGUSR1`` will create multiple files, each containing profile data collected after the previous ``SIGUSR1``.
Done.
pal/src/host/linux-sgx/host_exception.c
line 198 at r3 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This was fixed by @jkr0103
Done.
pal/src/host/linux-sgx/host_internal.h
line 134 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
What is this? Why is this needed in this header?
It was required for elf_ehdr_t
but not required anymore as those function declarations are moved out of this file.
pal/src/host/linux-sgx/host_internal.h
line 155 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This function must be declared as
static
in the*.c
file, and removed from here.
Done.
pal/src/host/linux-sgx/host_internal.h
line 158 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This function must be declared as
static
in the*.c
file, and removed from here.
Done.
pal/src/host/linux-sgx/host_internal.h
line 176 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Why do you need to declare these 4 functions? You should just declare those functions as
static
in the*.c
file, and remove from here.
These functions are not required now, hence deleted.
pal/src/host/linux-sgx/host_profile.c
line 177 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
static void
Done.
pal/src/host/linux-sgx/host_profile.c
line 197 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please add a newline after this
Done.
pal/src/host/linux-sgx/host_profile.c
line 203 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please remove a newline
Done.
pal/src/host/linux-sgx/host_profile.c
line 454 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please add a newline after this
Done.
pal/src/host/linux-sgx/host_profile.c
line 481 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Sorry, this refactoring of
sgx_profile_report_elf()
turns out to be a complete mess...I suggest to do the same you did with
sgx_profile_finish()
-- just remove the locks from the original function and rename it to_thread_unsafe()
, and in the wrapper function add the locks. Yes, this will be less performant than the original code (because the critical section in-between lock and unlock becomes larger), but this function is called very rarely, only on new binaries/libraries loaded into the application.
Done.
pal/src/host/linux-sgx/host_profile.c
line 484 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please add in the very beginning:
assert(spinlock_is_locked(&g_perf_data_lock));
Done.
pal/src/host/linux-sgx/host_profile.c
line 489 at r5 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Actually no need for "initializing again", as the name of the func already explains the issue. So:
sgx_profile_reinit failed: %s
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @jkr0103, and @mkow)
Previously, jkr0103 (Jitender Kumar) wrote…
Could you please add it while rebasing the PR?
Yes I will
pal/src/host/linux-sgx/host_internal.h
line 165 at r6 (raw file):
/* Record a sample during OCALL (function to be executed) */ void sgx_profile_sample_ocall_outer(void* ocall_func);
Please re-add the accidentally removed line
pal/src/host/linux-sgx/host_profile.c
line 206 at r6 (raw file):
static void sgx_profile_report_elf_thread_unsafe(const char* filename, void* addr) { int ret;
Can you add assert(spinlock_is_locked(&g_perf_data_lock));
as the very first line?
pal/src/host/linux-sgx/host_profile.c
line 324 at r6 (raw file):
if (ret < 0) { log_error("sgx_profile_reinit failed: %s", unix_strerror(ret)); goto out;
There seems to be no reason to have out:
label. Just return ret
here, and return 0
at the end of the func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 6 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_internal.h
line 165 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please re-add the accidentally removed line
Done.
pal/src/host/linux-sgx/host_profile.c
line 206 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Can you add
assert(spinlock_is_locked(&g_perf_data_lock));
as the very first line?
This function is file-scope only and callers grabed or assert on lock, hence adding this assert seems redundant.
pal/src/host/linux-sgx/host_profile.c
line 324 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
There seems to be no reason to have
out:
label. Justreturn ret
here, andreturn 0
at the end of the func.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
pal/src/host/linux-sgx/host_profile.c
line 206 at r6 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
This function is file-scope only and callers grabed or assert on lock, hence adding this assert seems redundant.
LGTM, I'm ok with this explanation. (Indeed, very easy to verify that the lock is taken.)
pal/src/host/linux-sgx/host_profile.c
line 57 at r7 (raw file):
while (total < size) { ret = DO_SYSCALL(pread64, g_mem_fd, (uint8_t*)dest + total, size - total,
Technically accessing g_mem_fd
must be under the g_perf_data_lock
spinlock, because this variable can become -1
during re-opening of the file.
On the other hand, adding locking here would lead to tremendous overheads of profiling, which will seriously distort the perf results. So I'm very reluctant to do proper thread safety here...
In the worst case, pread64()
will return smth like EBADF
, and the sample will be lost. That's what I observe in practice when testing this PR. Sounds like a small correctness price to pay, so I would leave as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r8, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
a discussion (no related file):
Bar the (non-blocking) discussion point, I am fine with this PR.
I would like to test it again after all reviews are done and before merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
pal/src/host/linux-sgx/host_profile.c
line 57 at r7 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Technically accessing
g_mem_fd
must be under theg_perf_data_lock
spinlock, because this variable can become-1
during re-opening of the file.On the other hand, adding locking here would lead to tremendous overheads of profiling, which will seriously distort the perf results. So I'm very reluctant to do proper thread safety here...
In the worst case,
pread64()
will return smth likeEBADF
, and the sample will be lost. That's what I observe in practice when testing this PR. Sounds like a small correctness price to pay, so I would leave as is.
I thought about this a bit more, and I have the following compromise solution:
- We introduce a global
g_null_mem_fd
which is opened at init time and points to the pseudo-file "/dev/null" (i.e., all writes to this FD are no-ops). - Instead of closing previous
g_mem_fd
, wedup2()
theg_null_mem_fd
into the previous mem FD. In other words, all writes to the previous mem FD do not return with error, but instead are silently dropped. Note thatdup2()
is an atomic operation, which is what we want. - So all threads that are still using the previous mem FD will not fail but instead just do nothing. At some point these threads will notice the new mem FD and will start writing their samples into this new mem FD.
This also solves the problem that I didn't realize when I wrote my previous comment (and that problem is very serious):
- One thread closes
g_mem_fd
, now the FD number (say10
) that was used for mem FD is available to any other usage. - Another thread still wants to use this old mem FD (i.e.
pread(10)
), and is preempted. - Yet another thread wants to open some unrelated file. It notices that FD
10
is available and assigns10
to this new file. - Second thread is scheduled again. It now performs
pread()
from a completely irrelevant file. Gramine state becomes silently corrupted. Oops.
There is a problem that we will leak g_mem_fd
and g_perf_data::pd
on every SIGUSR1 signal, but it's a small price to pay. (I think the only alternative would be to use spinlocks everywhere, which is very expensive.)
All the above also applies to g_perf_data::pd
FD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 of 7 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_profile.c
line 57 at r7 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
I thought about this a bit more, and I have the following compromise solution:
- We introduce a global
g_null_mem_fd
which is opened at init time and points to the pseudo-file "/dev/null" (i.e., all writes to this FD are no-ops).- Instead of closing previous
g_mem_fd
, wedup2()
theg_null_mem_fd
into the previous mem FD. In other words, all writes to the previous mem FD do not return with error, but instead are silently dropped. Note thatdup2()
is an atomic operation, which is what we want.- So all threads that are still using the previous mem FD will not fail but instead just do nothing. At some point these threads will notice the new mem FD and will start writing their samples into this new mem FD.
This also solves the problem that I didn't realize when I wrote my previous comment (and that problem is very serious):
- One thread closes
g_mem_fd
, now the FD number (say10
) that was used for mem FD is available to any other usage.- Another thread still wants to use this old mem FD (i.e.
pread(10)
), and is preempted.- Yet another thread wants to open some unrelated file. It notices that FD
10
is available and assigns10
to this new file.- Second thread is scheduled again. It now performs
pread()
from a completely irrelevant file. Gramine state becomes silently corrupted. Oops.There is a problem that we will leak
g_mem_fd
andg_perf_data::pd
on every SIGUSR1 signal, but it's a small price to pay. (I think the only alternative would be to use spinlocks everywhere, which is very expensive.)
All the above also applies to
g_perf_data::pd
FD.
g_mem_fd
need not be closed/open again and again on SIGUSR1
, we can keep it open for the lifetime of the process. @dimakuv do you see any issue with this approach?
I have implemented the solution (using dup2
) suggestion above for g_perf_data::pd
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r9, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
pal/src/host/linux-sgx/host_perf_data.c
line 155 at r9 (raw file):
} // File `/dev/null` is open once and remain open during the lifecycle of the process
I would rephrase to this:
/* `/dev/null` is opened once and is kept open for the whole runtime of Gramine */
pal/src/host/linux-sgx/host_perf_data.c
line 158 at r9 (raw file):
if (g_dev_null_fd < 0) { ret = DO_SYSCALL(open, "/dev/null", O_WRONLY | O_TRUNC | O_CREAT | O_CLOEXEC, PERM_rw_r__r__);
Please align O_WRONLY ...
line to be exactly under open, ...
pal/src/host/linux-sgx/host_perf_data.c
line 158 at r9 (raw file):
if (g_dev_null_fd < 0) { ret = DO_SYSCALL(open, "/dev/null", O_WRONLY | O_TRUNC | O_CREAT | O_CLOEXEC, PERM_rw_r__r__);
Please remove O_TRUNC
and O_CREAT
-- they don't make sense in this context
pal/src/host/linux-sgx/host_perf_data.c
line 160 at r9 (raw file):
O_WRONLY | O_TRUNC | O_CREAT | O_CLOEXEC, PERM_rw_r__r__); if (ret < 0) { log_error("pd_open: cannot open `/dev/null` for writing: %s", unix_strerror(fd));
Not fd
but ret
pal/src/host/linux-sgx/host_perf_data.c
line 298 at r9 (raw file):
out: if ((dup2_ret = dup2(g_dev_null_fd, pd->fd)) < 0)
We don't use Glibc wrappers here (dup2()
) but instead raw syscalls (DO_SYSCALL(dup2, g_dev_null_fd, pd->fd)
). Please fix.
pal/src/host/linux-sgx/host_profile.c
line 57 at r7 (raw file):
g_mem_fd
need not be closed/open again and again onSIGUSR1
, we can keep it open for the lifetime of the process. @dimakuv do you see any issue with this approach?
Yes, you are right. This is good how you implemented it.
I have implemented the solution (using
dup2
) suggestion above forg_perf_data::pd
.
Yes, looks like what I wanted.
pal/src/host/linux-sgx/host_profile.c
line 116 at r9 (raw file):
g_profile_mode = g_pal_enclave.profile_mode; // File `/proc/self/mem` is open once and remain open during the lifecycle of the process
I would rephrase to this:
/* `/proc/self/mem` is opened once and is kept open for the whole runtime of Gramine */
pal/src/host/linux-sgx/host_profile.c
line 121 at r9 (raw file):
if (ret < 0) { log_error("sgx_profile_init: opening /proc/self/mem failed: %s", unix_strerror(ret)); return ret;
Why did you replace all goto out
with return ret
? This looked reasonable before. Now we have sometimes return ret
and sometimes goto out
, which is more confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 of 7 files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_perf_data.c
line 155 at r9 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
I would rephrase to this:
/* `/dev/null` is opened once and is kept open for the whole runtime of Gramine */
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 158 at r9 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please align
O_WRONLY ...
line to be exactly underopen, ...
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 158 at r9 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please remove
O_TRUNC
andO_CREAT
-- they don't make sense in this context
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 160 at r9 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Not
fd
butret
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 298 at r9 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
We don't use Glibc wrappers here (
dup2()
) but instead raw syscalls (DO_SYSCALL(dup2, g_dev_null_fd, pd->fd)
). Please fix.
Done.
pal/src/host/linux-sgx/host_profile.c
line 116 at r9 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
I would rephrase to this:
/* `/proc/self/mem` is opened once and is kept open for the whole runtime of Gramine */
Done.
pal/src/host/linux-sgx/host_profile.c
line 121 at r9 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Why did you replace all
goto out
withreturn ret
? This looked reasonable before. Now we have sometimesreturn ret
and sometimesgoto out
, which is more confusing.
reverted now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r10, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 5 files at r5, 1 of 4 files at r8, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Yes I will
[Pal/Linux-SGX] SGX perf
-> [Pal/Linux-SGX] perf
Documentation/performance.rst
line 638 at r10 (raw file):
initialization time and concentrating only on the actual application processing. Send ``SIGUSR1`` using command ``kill -SIGUSR1 -<gramine-pid>`` (note the minus sign before <gramine-pid>). Sending multiple ``SIGUSR1`` will
Do we want this? Why?
Also, then it's not gramine-pid
, but PGID.
Code quote:
(note the minus sign before <gramine-pid>).
pal/src/host/linux-sgx/host_profile.c
line 164 at r10 (raw file):
} static void sgx_profile_finish_thread_unsafe(void) {
What does "unsafe" mean in this context?
pal/src/host/linux-sgx/host_profile.c
line 186 at r10 (raw file):
} static void sgx_profile_report_elf_thread_unsafe(const char* filename, void* addr) {
Could you move this function down where it was before? It obscures the diff and it's hard to review. If you want to move functions around, I'd prefer if you did that in a follow-up PR where you only change the order of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 of 7 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
Previously, mkow (Michał Kowalczyk) wrote…
[Pal/Linux-SGX] SGX perf
->[Pal/Linux-SGX] perf
will be done while rebasing.
Documentation/performance.rst
line 638 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Do we want this? Why?
Also, then it's notgramine-pid
, but PGID.
Some workloads create multiple child process e.g. nginx have option worker_processes 4;
which means it will create 4 child process. Minus sign is required to send the signal to all the processes in the group.
pal/src/host/linux-sgx/host_profile.c
line 164 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
What does "unsafe" mean in this context?
unsafe
means it's not safe for multiple thread, hence require locking,. User of this function should take care of locking before using this function if expecting multiple threads to enter at the same time.
pal/src/host/linux-sgx/host_profile.c
line 186 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Could you move this function down where it was before? It obscures the diff and it's hard to review. If you want to move functions around, I'd prefer if you did that in a follow-up PR where you only change the order of them.
That required forward declaration. Anyways moved it down and added forward declaration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 5 files at r5, 1 of 4 files at r8, 1 of 2 files at r10, 1 of 2 files at r11, all commit messages.
Reviewable status: 6 of 7 files reviewed, 10 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)
pal/src/host/linux-sgx/host_exception.c
line 196 at r11 (raw file):
__UNUSED(uc); #ifdef DEBUG
Please move it up to #ifdef
the whole function.
pal/src/host/linux-sgx/host_exception.c
line 241 at r11 (raw file):
goto err; ret = set_signal_handler(SIGUSR1, handle_sigusr1);
Please also wrap this is #ifdef DEBUG
.
pal/src/host/linux-sgx/host_perf_data.c
line 297 at r11 (raw file):
out: if ((dup2_ret = DO_SYSCALL(dup2, g_dev_null_fd, pd->fd)) < 0)
Please keep it in a separate line, like other places in our code do.
pal/src/host/linux-sgx/host_perf_data.c
line 297 at r11 (raw file):
out: if ((dup2_ret = DO_SYSCALL(dup2, g_dev_null_fd, pd->fd)) < 0)
What is the exact race this change tries to prevent? And how could it possibly work if there's no synchronization point between closing/swapping this descriptor and freeing the memory holding pd
?
It seems to me that if close(...)
was racy before, then free(pd)
still is.
pal/src/host/linux-sgx/host_perf_data.c
line 300 at r11 (raw file):
log_error("pd_close: dup2 failed: %s", unix_strerror(dup2_ret)); free(pd);
pd->fd
leaks here
pal/src/host/linux-sgx/host_profile.c
line 164 at r10 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
unsafe
means it's not safe for multiple thread, hence require locking,. User of this function should take care of locking before using this function if expecting multiple threads to enter at the same time.
Ah, ok, but no other function in this file (nor any other in Gramine) uses such naming convention. I thought it's "finish thread" with an "unsafe" suffix, not "thread unsafe" suffix.
If you want to require locking by the caller then just assert the relevant lock state.
Update:
Hmm, I guess you used this name because there exists also sgx_profile_finish
with the locking, which is public. We prepend a single underscore for such cases + assert, see e.g.:
gramine/libos/src/sys/libos_futex.c
Lines 175 to 199 in 929bb9d
static void _maybe_dequeue_futex(struct libos_futex* futex) { | |
assert(spinlock_is_locked(&futex->lock)); | |
assert(spinlock_is_locked(&g_futex_tree_lock)); | |
if (check_dequeue_futex(futex)) { | |
avl_tree_delete(&g_futex_tree, &futex->tree_node); | |
futex->in_tree = false; | |
/* We still hold this futex reference (in the caller), so this won't call free. */ | |
put_futex(futex); | |
} | |
} | |
/* | |
* If `futex` has no waiters and is on `g_futex_tree`, takes it off that tree. | |
* | |
* Neither `g_futex_tree_lock` nor `futex->lock` should be held while calling this, | |
* it acquires these locks itself. | |
*/ | |
static void maybe_dequeue_futex(struct libos_futex* futex) { | |
spinlock_lock(&g_futex_tree_lock); | |
spinlock_lock(&futex->lock); | |
_maybe_dequeue_futex(futex); | |
spinlock_unlock(&futex->lock); | |
spinlock_unlock(&g_futex_tree_lock); | |
} |
pal/src/host/linux-sgx/host_profile.c
line 40 at r11 (raw file):
_WITH_PID
This suffix only made sense before, when there were more possible names. And it's even misleading now, it mentions the PID but not the time.
pal/src/host/linux-sgx/host_profile.c
line 178 at r11 (raw file):
log_always("Profile data written to %s (%lu bytes)", g_pal_enclave.profile_filename, size); g_profile_enabled = false;
This variable is read in many places without a lock because it was assumed to be constant after initialization, you cannot change it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 6 of 7 files reviewed, 10 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)
Previously, jkr0103 (Jitender Kumar) wrote…
will be done while rebasing.
(and also, Pal
-> PAL
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 7 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Ok, then let's keep as one PR.
@jkr0103 Will you do this new approach?
done
pal/src/host/linux-sgx/host_profile.c
line 178 at r11 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
It's still a race condition if you write with locks, but read without.
Reverted the changes here based on new approach discussed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r14, all commit messages.
Reviewable status: all files reviewed, 23 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
pal/src/host/linux-sgx/host_internal.h
line 59 at r14 (raw file):
extern struct pal_enclave g_pal_enclave; extern bool trigger_profile_reinit;
Please add g_
prefix -- all global (non-function-scope) variables must have this prefix.
pal/src/host/linux-sgx/host_internal.h
line 60 at r14 (raw file):
extern struct pal_enclave g_pal_enclave; extern bool trigger_profile_reinit; extern char profile_filename[128];
ditto (prefix)
pal/src/host/linux-sgx/host_internal.h
line 177 at r14 (raw file):
struct perf_data* pd_open(const char* file_name, bool with_stack); ssize_t pd_close_file(struct perf_data* pd);
Can you group the new helper functions together? Like this:
int pd_open_file(struct perf_data* pd, const char* file_name);
ssize_t pd_close_file(struct perf_data* pd);
struct perf_data* pd_open(const char* file_name, bool with_stack);
...
pal/src/host/linux-sgx/host_main.c
line 780 at r14 (raw file):
#ifdef DEBUG enclave_info->profile_enable = false; profile_filename[0] = '\0';
I think you can safely remove this line, it's not needed. (The global variable that you declare is already initialized to all zeros by the compiler.)
pal/src/host/linux-sgx/host_perf_data.c
line 144 at r14 (raw file):
int pd_open_file(struct perf_data* pd, const char* file_name) { int ret, close_ret;
Please move close_ret
to where it is used:
fail:
int close_ret = DO_SYSCALL(close, fd);
...
It's just easier to read that way -- there's only one place where it is used, so it doesn't have to be declared at the top of the func.
pal/src/host/linux-sgx/host_perf_data.c
line 148 at r14 (raw file):
int fd = DO_SYSCALL(open, file_name, O_WRONLY | O_TRUNC | O_CREAT | O_CLOEXEC, PERM_rw_r__r__); if (fd < 0) { log_error("pd_open: cannot open %s for writing: %s", file_name, unix_strerror(fd));
pd_open
-> pd_open_file
pal/src/host/linux-sgx/host_perf_data.c
line 149 at r14 (raw file):
if (fd < 0) { log_error("pd_open: cannot open %s for writing: %s", file_name, unix_strerror(fd)); return -EINVAL;
You can just return fd;
pal/src/host/linux-sgx/host_perf_data.c
line 171 at r14 (raw file):
close_ret = DO_SYSCALL(close, fd); if (close_ret < 0) log_error("pd_reopen: close failed: %s", unix_strerror(close_ret));
pd_reopen
-> pd_open_file
pal/src/host/linux-sgx/host_perf_data.c
line 176 at r14 (raw file):
struct perf_data* pd_open(const char* file_name, bool with_stack) { int ret;
Please move ret
directly to where it's used:
int ret = pd_open_file(pd, file_name);
...
pal/src/host/linux-sgx/host_perf_data.c
line 182 at r14 (raw file):
log_error("pd_open: out of memory"); return NULL; }
Please add an empty line after this one
pal/src/host/linux-sgx/host_perf_data.c
line 185 at r14 (raw file):
ret = pd_open_file(pd, file_name); if (ret < 0) { log_error("pd_open: close failed: %s", unix_strerror(ret));
The error message is wrong. I think you can just remove this line, because pd_open_file()
already prints an error message.
pal/src/host/linux-sgx/host_perf_data.c
line 186 at r14 (raw file):
if (ret < 0) { log_error("pd_open: close failed: %s", unix_strerror(ret)); return NULL;
You forgot to free(pd);
before returning
pal/src/host/linux-sgx/host_perf_data.c
line 282 at r14 (raw file):
ssize_t pd_close_file(struct perf_data* pd) { ssize_t ret = 0;
Please remove = 0
. Because this variable will be re-assigned anyway in the next line. I don't know why it was here in the first place.
pal/src/host/linux-sgx/host_perf_data.c
line 283 at r14 (raw file):
ssize_t pd_close_file(struct perf_data* pd) { ssize_t ret = 0; int close_ret;
ditto (move to where it's used)
pal/src/host/linux-sgx/host_perf_data.c
line 300 at r14 (raw file):
close_ret = DO_SYSCALL(close, pd->fd); if (close_ret < 0) log_error("pd_close: close failed: %s", unix_strerror(close_ret));
pf_close_file:
pal/src/host/linux-sgx/host_perf_data.c
line 302 at r14 (raw file):
log_error("pd_close: close failed: %s", unix_strerror(close_ret)); return ret;
It is a non-trivial property of this function that it returns "size of the finalized perf-data file" (the return value of lseek()
syscall above).
Please add a comment:
/* returns the size of the finalized perf-data file on success */
return ret;
pal/src/host/linux-sgx/host_profile.c
line 201 at r14 (raw file):
} /* Re-initialize based on g_pal_enclave settings */
Actually, I don't understand this comment. I think the function name is self-explanatory, and this comment can be deleted.
pal/src/host/linux-sgx/host_profile.c
line 210 at r14 (raw file):
size = pd_close_file(g_perf_data); if (size < 0) { log_error("sgx_profile_reinit: pd_close failed: %s", unix_strerror(size));
pd_close_file failed
pal/src/host/linux-sgx/host_profile.c
line 211 at r14 (raw file):
if (size < 0) { log_error("sgx_profile_reinit: pd_close failed: %s", unix_strerror(size)); return -size;
Why negation? Should be just return size;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 7 files reviewed, 23 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_internal.h
line 59 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please add
g_
prefix -- all global (non-function-scope) variables must have this prefix.
Done.
pal/src/host/linux-sgx/host_internal.h
line 60 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
ditto (prefix)
Done.
pal/src/host/linux-sgx/host_internal.h
line 177 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Can you group the new helper functions together? Like this:
int pd_open_file(struct perf_data* pd, const char* file_name); ssize_t pd_close_file(struct perf_data* pd); struct perf_data* pd_open(const char* file_name, bool with_stack); ...
Done.
pal/src/host/linux-sgx/host_main.c
line 780 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
I think you can safely remove this line, it's not needed. (The global variable that you declare is already initialized to all zeros by the compiler.)
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 148 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
pd_open
->pd_open_file
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 149 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
You can just
return fd;
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 171 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
pd_reopen
->pd_open_file
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 176 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please move
ret
directly to where it's used:int ret = pd_open_file(pd, file_name); ...
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 182 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please add an empty line after this one
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 185 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
The error message is wrong. I think you can just remove this line, because
pd_open_file()
already prints an error message.
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 186 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
You forgot to
free(pd);
before returning
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 282 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Please remove
= 0
. Because this variable will be re-assigned anyway in the next line. I don't know why it was here in the first place.
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 283 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
ditto (move to where it's used)
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 300 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
pf_close_file:
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 302 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
It is a non-trivial property of this function that it returns "size of the finalized perf-data file" (the return value of
lseek()
syscall above).Please add a comment:
/* returns the size of the finalized perf-data file on success */ return ret;
Done.
pal/src/host/linux-sgx/host_profile.c
line 201 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Actually, I don't understand this comment. I think the function name is self-explanatory, and this comment can be deleted.
Done.
pal/src/host/linux-sgx/host_profile.c
line 210 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
pd_close_file failed
Done.
pal/src/host/linux-sgx/host_profile.c
line 211 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Why negation? Should be just
return size;
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r15, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
pal/src/host/linux-sgx/host_internal.h
line 177 at r14 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
Done.
Sorry, that's not what I suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_internal.h
line 177 at r14 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Sorry, that's not what I suggested.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r16, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
a discussion (no related file):
I will test this PR when other reviewers also approve and before the merge. Otherwise looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 5 files at r15, 1 of 1 files at r16, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)
pal/src/host/linux-sgx/host_internal.h
line 59 at r16 (raw file):
extern struct pal_enclave g_pal_enclave; extern bool g_trigger_profile_reinit;
These two should be in #ifdef DEBUG
.
pal/src/host/linux-sgx/host_internal.h
line 176 at r16 (raw file):
ssize_t pd_close_file(struct perf_data* pd); struct perf_data* pd_open(const char* file_name, bool with_stack);
I'd list this function first, the order would be more logical then.
pal/src/host/linux-sgx/host_perf_data.c
line 297 at r11 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
Dmitrii and I have seen with
memcached
worklaod where on thread was trying to write but FD was already close, that's why we introduceddup2
which does two operations atomicallyclose
and swap the value of old to new FD.You are right on
free
, hence removed free.
This hack was deleted, resolving.
pal/src/host/linux-sgx/host_perf_data.c
line 300 at r11 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
We repalced
close
withdup2
to workaround the race condition. Please see above comments for more info on race condition.
This hack was deleted, resolving.
pal/src/host/linux-sgx/host_perf_data.c
line 165 at r16 (raw file):
goto fail; pd->fd = fd;
assert(pd->fd == -1)
before overwriting this
pal/src/host/linux-sgx/host_perf_data.c
line 168 at r16 (raw file):
return 0; fail: ;
Suggestion:
fail:;
pal/src/host/linux-sgx/host_perf_data.c
line 299 at r16 (raw file):
if (close_ret < 0) log_error("pd_close_file: close failed: %s", unix_strerror(close_ret));
Just to be sure, please set pd->fd
to -1.
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
You're right, that's a good point.
Then please make it log_error
or log_warning
. log_always
prints even if the user set loader.log_level = "none"
.
pal/src/host/linux-sgx/host_profile.c
line 39 at r16 (raw file):
#define NSEC_IN_SEC 1000000000 /* Filenames for saved data */
Suggestion:
/* Filename for saved data */
pal/src/host/linux-sgx/host_profile.c
line 156 at r16 (raw file):
out: if (g_mem_fd > 0) {
Suggestion:
if (g_mem_fd >= 0) {
pal/src/host/linux-sgx/host_profile.c
line 213 at r16 (raw file):
} log_always("Profile data written to %s (%lu bytes)", g_profile_filename, size);
ditto
pal/src/host/linux-sgx/host_profile.c
line 246 at r16 (raw file):
spinlock_lock(&g_perf_data_lock); if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) sgx_profile_reinit();
missing result check
pal/src/host/linux-sgx/host_profile.c
line 272 at r16 (raw file):
spinlock_lock(&g_perf_data_lock); if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) sgx_profile_reinit();
missing result check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
pal/src/host/linux-sgx/host_internal.h
line 176 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
I'd list this function first, the order would be more logical then.
It was me who wanted the order differently, but Ok, both Michal and Jitender think it's better to reorder. I am Ok with this.
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Then please make it
log_error
orlog_warning
.log_always
prints even if the user setloader.log_level = "none"
.
One issue with log_error
and log_warning
is that they will prepend error:
or warning:
words to the string. It's a bit misleading, as there is nothing wrong with the execution. It's more like info:
, but we currently don't have this convenience function. Maybe we should add it.
Anyway, I think we can go with log_error()
for now, and then add log_info()
in another PR.
pal/src/host/linux-sgx/host_profile.c
line 246 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
missing result check
Yeah, since we're in the exception handler and there's no reason to terminate the whole application, I think we can just print log_error()
, see several lines below for how we treat errors in pd_event_sample_simple()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
One issue with
log_error
andlog_warning
is that they will prependerror:
orwarning:
words to the string. It's a bit misleading, as there is nothing wrong with the execution. It's more likeinfo:
, but we currently don't have this convenience function. Maybe we should add it.Anyway, I think we can go with
log_error()
for now, and then addlog_info()
in another PR.
This log is printed in non-production case only and that too when a file is being create and this info must be passed to the user. We still can have it even if user set loader.log_level = "none"
.
If you still disagree, I'll make the changes.
pal/src/host/linux-sgx/host_profile.c
line 246 at r16 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Yeah, since we're in the exception handler and there's no reason to terminate the whole application, I think we can just print
log_error()
, see several lines below for how we treat errors inpd_event_sample_simple()
.
we already print error message inside sgx_profile_reinit
in all errors cases, do we really want to print error message again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 7 files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_internal.h
line 59 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
These two should be in
#ifdef DEBUG
.
Done.
pal/src/host/linux-sgx/host_internal.h
line 176 at r16 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
It was me who wanted the order differently, but Ok, both Michal and Jitender think it's better to reorder. I am Ok with this.
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 165 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
assert(pd->fd == -1)
before overwriting this
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 168 at r16 (raw file):
return 0; fail: ;
Done.
pal/src/host/linux-sgx/host_perf_data.c
line 299 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Just to be sure, please set
pd->fd
to -1.
Done.
pal/src/host/linux-sgx/host_profile.c
line 39 at r16 (raw file):
#define NSEC_IN_SEC 1000000000 /* Filenames for saved data */
Done.
pal/src/host/linux-sgx/host_profile.c
line 156 at r16 (raw file):
out: if (g_mem_fd > 0) {
Done.
pal/src/host/linux-sgx/host_profile.c
line 213 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto
ditto
pal/src/host/linux-sgx/host_profile.c
line 246 at r16 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
we already print error message inside
sgx_profile_reinit
in all errors cases, do we really want to print error message again?
Added a comment before calling sgx_profile_reinit
.
pal/src/host/linux-sgx/host_profile.c
line 272 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
missing result check
Added a comment before calling sgx_profile_reinit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r17, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r17, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)
pal/src/host/linux-sgx/host_internal.h
line 176 at r16 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
Done.
But not like that - pd_open()
is the function creating perf_data
object and it's always called first, it should be listed before pd_open_file
.
pal/src/host/linux-sgx/host_perf_data.c
line 301 at r17 (raw file):
log_error("pd_close_file: close failed: %s", unix_strerror(close_ret)); pd->fd = -1;
Should be before if
, even a failed close()
makes the descriptor unusable (see the corresponding manpage).
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
this info must be passed to the user
It doesn't have to, it's just some useful debug information.
We still can have it even if user set
loader.log_level = "none"
.
Yeah, and that's the very problem - it breaks the explicit user choice to stop Gramine from printing logs. All other places in our codebase respect this settings (except places before log_level is initialized).
I'm not sure what's the best solution here, I'd probably just go with log_debug
or log_error
with added [INFO]
.
pal/src/host/linux-sgx/host_profile.c
line 246 at r16 (raw file):
But we cannot continue just ignoring the failure, the whole profiling state is inconsistent?
there's no reason to terminate the whole application
I don't agree, the user wanted to perf-trace the whole app, doing that completely failed and is unusable after this point. I don't see why a user would want to continue execution in this state.
pal/src/host/linux-sgx/host_profile.c
line 247 at r17 (raw file):
if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) { /* No error checking since we're in exception handler and the function already prints errors */
Suggestion:
* errors */
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
this info must be passed to the user
It doesn't have to, it's just some useful debug information.
We still can have it even if user set
loader.log_level = "none"
.Yeah, and that's the very problem - it breaks the explicit user choice to stop Gramine from printing logs. All other places in our codebase respect this settings (except places before log_level is initialized).
I'm not sure what's the best solution here, I'd probably just go with
log_debug
orlog_error
with added[INFO]
.
Ok, then I suggest to follow @mkow's advice and do:
log_error("[INFO] Profile data written to %s (%lu bytes)", g_profile_filename, size);
pal/src/host/linux-sgx/host_profile.c
line 246 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
But we cannot continue just ignoring the failure, the whole profiling state is inconsistent?
there's no reason to terminate the whole application
I don't agree, the user wanted to perf-trace the whole app, doing that completely failed and is unusable after this point. I don't see why a user would want to continue execution in this state.
Hm, actually @mkow is right. The perf becomes unusable. Let's do this forceful termination then:
if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) {
ret = sgx_profile_reinit();
if (ret < 0) {
/* no need to print error message since sgx_profile_init() already prints it */
BUG();
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 7 files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/host/linux-sgx/host_internal.h
line 176 at r16 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
But not like that -
pd_open()
is the function creatingperf_data
object and it's always called first, it should be listed beforepd_open_file
.
Usually callee
is added before caller
, hence made the change accordingly. Anyways I updated based on the comments.
pal/src/host/linux-sgx/host_perf_data.c
line 301 at r17 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Should be before
if
, even a failedclose()
makes the descriptor unusable (see the corresponding manpage).
Done.
pal/src/host/linux-sgx/host_profile.c
line 197 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Ok, then I suggest to follow @mkow's advice and do:
log_error("[INFO] Profile data written to %s (%lu bytes)", g_profile_filename, size);
Done.
pal/src/host/linux-sgx/host_profile.c
line 213 at r16 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
ditto
Done
pal/src/host/linux-sgx/host_profile.c
line 246 at r16 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Hm, actually @mkow is right. The perf becomes unusable. Let's do this forceful termination then:
if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) { ret = sgx_profile_reinit(); if (ret < 0) { /* no need to print error message since sgx_profile_init() already prints it */ BUG(); } }
Done.
pal/src/host/linux-sgx/host_profile.c
line 247 at r17 (raw file):
if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) { /* No error checking since we're in exception handler and the function already prints errors */
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r18, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @mkow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r18, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)
pal/src/host/linux-sgx/host_internal.h
line 176 at r16 (raw file):
Previously, jkr0103 (Jitender Kumar) wrote…
Usually
callee
is added beforecaller
, hence made the change accordingly. Anyways I updated based on the comments.
That's mostly for static
functions I think, because C is old and annoying, and you either need a forward declaration or declare all the functions in their dependency order. But in the headers I prefer a more logical grouping, so that it's easier to read by the function's user.
bda60ac
to
4e2a7df
Compare
This allows to collect profile data only for a particular period (e.g., during Gramine startup or during an actual benchmark run). Signed-off-by: jkr0103 <jitender.kumar@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r19, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
I will test this PR when other reviewers also approve and before the merge. Otherwise looks good to me.
Tested on Redis and Memcached. Things look good to me.
Gotcha: If there's nothing to process, all Memcached threads go into sleep waiting on epoll_wait(<forever>)
. This means that a SIGUSR1 signal will set the flag to g_trigger_profile_reinit
, but the actual write-to-perf-file will happen only on the next SGX enclave exit, which will happen only after epoll_wait()
returns, i.e. on a next send/recv request from the client (or on SIGTERM). I don't consider it a problem, as this is anyway what we want to measure.
Previously, mkow (Michał Kowalczyk) wrote…
(and also,
Pal
->PAL
)
Done.
4e2a7df
to
6d77bcf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r11, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)
Jenkins, test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r19, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved
Description of the changes
Currently Gramine don't support flushing sgx profiling data if user wants to collect only for a particular duraion (e.g., only during perf benchmarking runs). This PR add feature to flush the profile data on demand using
SIGUSR1
signal. That way user can collect data only for a particular duration. This also eliminates the noise added in profile data duing start and stop of the process.Partially fixes issue #1711
How to test this PR?
Please follow the instruction here to enable SGX profiling. Now execute the workload in Gramine SGX. Send
SIGUSR1
signal using commandkill -SIGUSR1 <pid>
to flush the profile data. This should create a sgx profile data file which can be opened usingperf report
command.Example:
gramine/CI-Examples/redis/redis-server.manifest.template
.sgx.profile.enable = "main"
sgx.profile.with_stack = true
make clean && make SGX=1
gramine-sgx redis-server &
kill -SIGUSR1 <pid>
to flush. Gramine flush the profile data in a file as indicated by below log line:Profile data written to sgx-perf-1707196865.data (161042 bytes)
src/src/redis-benchmark
SIGUSR1
signal to flush profile data using commandkill -SIGUSR1 <pid>
. Again profile data is written to a new file as indicated by below log line:Profile data written to sgx-perf-1707196952.data (15096242 bytes)
sgx-perf-1707196952.data
have the profile data collected during benchmarks run.perf report --no-children -i sgx-perf-1707196952.data
Same steps can be tried with MongoDB, MySQL etc..
This change is