Remove hypervisor_handler thread #533
base: main
Conversation
Left some comments/questions, not quite finished reviewing, I will finish it in the morning.
@@ -72,7 +71,8 @@ impl MultiUseGuestCallContext {
    // !Send (and !Sync), we also don't need to worry about
    // synchronization

    call_function_on_guest(&mut self.sbox, func_name, func_ret_type, args)
    self.sbox
Why did this change to the `_no_reset` version?
The previous `call_function_on_guest` in the removed file `guest_dispatch.rs` didn't reset memory either; I just renamed it to make that more clear. I moved the method to `MultiUseSandbox` because I'm not sure what purpose `guest_dispatch.rs` served now that `SingleUseSandbox` is removed.
Justfile
cargo test {{ if features =="" {''} else if features=="no-default-features" {"--no-default-features" } else {"--no-default-features -F " + features } }} --profile={{ if target == "debug" { "dev" } else { target } }} -p hyperlight-host --lib -- metrics::tests::test_metrics_are_emitted --exact --ignored
cargo test {{ if features =="" {''} else if features=="no-default-features" {"--no-default-features" } else {"--no-default-features -F function_call_metrics," + features } }} --profile={{ if target == "debug" { "dev" } else { target } }} -p hyperlight-host --lib -- metrics::tests::test_metrics_are_emitted --exact --ignored
Why did these tests get deleted? They don't seem related to this change. I guess we potentially have different metrics now (although arguably we might want to have a kill metric and metrics about success of killing etc.), but even if we don't, we still have some metrics?
We don't have different metrics now. The test itself changed to use a local metrics recorder, which only captures metrics emitted while the recorder is active and on the same thread, so it records just the metrics emitted in the given test. We couldn't use a local metrics recorder before, because the metrics were emitted from a separate thread (the hypervisor_handler thread).
In short, we no longer require #[ignore] on the test, so it runs as part of the other tests. Your comment made me realize we should still test it with the metrics feature flag, so I added back that one.
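For reference, a minimal sketch of how a thread-local metrics recorder can scope captured metrics to a single test; this assumes the `metrics` and `metrics-util` crates, and the metric name is made up for illustration:

```rust
use metrics::counter;
use metrics_util::debugging::{DebugValue, DebuggingRecorder};

#[test]
fn test_metrics_are_emitted() {
    // The local recorder only captures metrics emitted on this thread while
    // the closure below runs, so other tests can't pollute the snapshot.
    let recorder = DebuggingRecorder::new();
    let snapshotter = recorder.snapshotter();

    metrics::with_local_recorder(&recorder, || {
        // Run the code under test; any metrics it emits are captured.
        counter!("guest_call_cancellations").increment(1);
    });

    let snapshot = snapshotter.snapshot().into_vec();
    assert!(snapshot.iter().any(|(key, _unit, _desc, value)| {
        key.key().name() == "guest_call_cancellations"
            && matches!(value, DebugValue::Counter(1))
    }));
}
```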
It seems that deleting this file has deleted a bunch of tests that we would want to keep (e.g. seccomp tests, etc.). I couldn't see that these had been relocated anywhere?
Whoops, you are correct. I added them back to initializedmultiuse.rs (I didn't keep the ones that made no sense).
@@ -89,17 +89,29 @@ fn main() -> Result<()> {
    let no_op = Noop::<UninitializedSandbox, MultiUseSandbox>::default();

    let mut multiuse_sandbox = usandbox.evolve(no_op)?;
    let interrupt_handle = multiuse_sandbox.interrupt_handle();

    const NUM_CALLS: i32 = 5;
Question: are we logging attempts and results of kills? I assume so given this code has changed; if we are not logging when we kill, we should.
The metrics are still emitted, unchanged from previous behavior: https://github.com/ludfjig/hyperlight/blob/210e5069b88e60d5398ff72abbb8ed031a152464/src/hyperlight_host/src/hypervisor/mod.rs#L297-L302.
I added logging for kill attempts.
@@ -102,10 +102,22 @@ fn do_hyperlight_stuff() {
    let no_op = Noop::<UninitializedSandbox, MultiUseSandbox>::default();

    let mut multiuse_sandbox = usandbox.evolve(no_op).expect("Failed to evolve sandbox");
    let interrupt_handle = multiuse_sandbox.interrupt_handle();
Same as logging: are we creating metrics on how many times kill has been called/succeeded, etc.? If not, we probably should.
We have the same metrics as before, which is how many sandbox cancellations have occurred (successfully).
    // Call a function that gets cancelled by the host function 5 times to generate some log entries.
    const NUM_CALLS: i32 = 5;
Same question as the logging example: if we log, then we should be creating trace records as well.
Yep, the metrics are still there, and I added logging for kill attempts.
@@ -390,6 +393,11 @@ impl HypervLinuxDriver {
    mem_regions,
    entrypoint: entrypoint_ptr.absolute()?,
    orig_rsp: rsp_ptr,
    interrupt_handle: Arc::new(LinuxInterruptHandle {
        running: AtomicBool::new(false),
        tid: AtomicU64::new(unsafe { libc::pthread_self() }),
What is the thread id here? What happens if the sandbox is created on thread 1, then used on thread 2 and killed? Will we send the kill signal to the wrong thread?
Also, is it possible that sandboxes A and B are created on thread 1, then a guest func is called on sandbox A, then the host tries to kill the guest func call, and meanwhile the func call ends on sandbox A and thread 1 calls a func on sandbox B, then we send the kill signal and kill the wrong one?
If we don't have tests for these scenarios then we should (and more that I haven't thought of).
We'll never send the signal to the wrong thread, because `tid` is always set before `running`, and we only send the signal when `running` is set.
However, it's possible that killing a sandbox on thread 1 will accidentally kill a new sandbox on the same thread. As far as I know, there is no way to avoid this:
- Thread 1 creates a sandbox and finishes its call to vcpu::run(), but has not yet set `running` to false
- Thread 2 calls `kill()`, and enters the while loop because `running` is still set to `true`
- Thread 1 sets `running` to false; on this thread we create a new sandbox and call vcpu::run() on the new sandbox
- Thread 2 sends the signal
- Thread 1's vcpu of the new sandbox gets interrupted, but the user intended to kill the old one :/
This occurs because checking `running` and sending the signal is not an atomic operation, and I'm not sure how to avoid it. Basically, between the check of `running` and the send of the signal, anything could happen on another thread.
We'll at most send 1 incorrect signal, because the next check of `running` in the while loop is guaranteed to see false.
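For context, here is a minimal sketch (not the PR's exact code) of the kill-loop shape being described, assuming a real-time signal with a no-op handler installed on the vcpu thread; the signal number and sleep interval are assumptions:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Sketch of the handle discussed above: `tid` is published before `running`
// is set, so observing `running == true` implies `tid` is valid.
pub struct LinuxInterruptHandle {
    running: AtomicBool, // true while the vcpu thread is inside VcpuFd::run()
    tid: AtomicU64,      // pthread id of the thread currently running the vcpu
}

impl LinuxInterruptHandle {
    pub fn kill(&self) {
        // The race lives between this load and pthread_kill(): the vcpu thread
        // may return from run(), clear `running`, and start a *new* sandbox's
        // vcpu before the signal lands. At most one stray signal is sent,
        // because the next iteration re-checks `running`.
        while self.running.load(Ordering::Relaxed) {
            let tid = self.tid.load(Ordering::Relaxed) as libc::pthread_t;
            unsafe {
                // Assumed signal; the handler is a no-op, so the only effect is
                // interrupting a blocking VcpuFd::run() call with EINTR.
                libc::pthread_kill(tid, libc::SIGRTMIN());
            }
            std::thread::sleep(std::time::Duration::from_micros(50));
        }
    }
}
```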
// Note: if a `InterruptHandle::kill()` signal is delivered to this thread **here**
// - after we've set the running to true,
// Then the signal does not have any effect, because the signal handler is a no-op.
// This is fine since we are already done with the `VcpuFd::run()` call.
I don't think that this statement is true. For example, the reason for the exit may be to call a host function, in which case we want to stop the execution. I think that this scenario is handled in the current implementation.
In the current implementation, if the user thread times out while waiting for the hypervisor_handler thread, the user thread gets control back. However, it doesn't actually stop the host function call from happening after the host thread has returned control. Let's add host-function cancellation in another PR, since it was never properly supported.
@@ -634,58 +666,8 @@ impl Hypervisor for KVMDriver {
    }
}

#[cfg(test)]
Why did these tests get deleted?
Deleted in kvm/windows/mshv in favor of running the test from mod.rs, since they all do the same thing regardless
#[cfg(gdb)]
Why did this get deleted? We still want to test initialize, don't we? It's still being executed, just not via the handler?
Added back! Whoopsie.
…eventing host functions to return control to guest after being interrupted. Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
This PR removes the hypervisor_handler thread and adds a `kill()` API. Currently, one can only cancel guest function calls, not the first VM initialization, but this should be fine as long as you trust your guest binary. You can only interrupt a guest while it's in a blocking call to `vcpufd.run()`. Host function calls still cannot be interrupted (Make it possible to kill guest execution when running a host function. #192).
These changes should improve performance and throughput. They should also avoid the incredible performance drop-off we observed under load when the hypervisor handler thread required joining after cancelling guest execution.
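A rough usage sketch of the interrupt flow described above, pieced together from the example diffs in this PR; the guest function name, call signature, and import paths are assumptions rather than the PR's exact API:

```rust
use std::thread;
use std::time::Duration;

// Assumed import paths; the real module layout may differ.
use hyperlight_common::flatbuffer_wrappers::function_types::ReturnType;
use hyperlight_host::sandbox_state::sandbox::EvolvableSandbox;
use hyperlight_host::sandbox_state::transition::Noop;
use hyperlight_host::{MultiUseSandbox, Result, UninitializedSandbox};

fn run_and_cancel(usandbox: UninitializedSandbox) -> Result<()> {
    let no_op = Noop::<UninitializedSandbox, MultiUseSandbox>::default();
    let mut multiuse_sandbox: MultiUseSandbox = usandbox.evolve(no_op)?;

    // New in this PR: a handle that can interrupt this sandbox from another thread.
    let interrupt_handle = multiuse_sandbox.interrupt_handle();

    // Kill the guest call from a separate thread after a short delay.
    let killer = thread::spawn(move || {
        thread::sleep(Duration::from_millis(100));
        interrupt_handle.kill();
    });

    // Hypothetical long-running guest function; the interrupt only takes effect
    // while the vcpu is blocked inside vcpufd.run(), and the call is expected
    // to return an error once cancelled.
    let result = multiuse_sandbox.call_guest_function_by_name("Spin", ReturnType::Void, None);
    assert!(result.is_err());

    killer.join().unwrap();
    Ok(())
}
```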
Added API changes:
Removed API changes:
closes #471
Note: On KVM, moving the vcpufd (sandbox) to a new thread will incur a performance overhead the first time the vcpu is run on the new thread, as per the KVM kernel docs.