
Remove hypervisor_handler thread #533


Open
wants to merge 6 commits into base: main

Conversation

Contributor

@ludfjig ludfjig commented May 27, 2025

  • Removes the per-VM hypervisor_handler thread in favor of running on the caller's thread.
  • Removes all timeout-based cancellation in favor of an explicit kill() API. Currently, only guest function calls can be cancelled, not the initial VM initialization, but this should be fine as long as you trust your guest binary. A guest can only be interrupted while it's in a blocking call to vcpufd.run(). Host function calls still cannot be interrupted (Make it possible to kill guest execution when running a host function. #192)

These changes should improve performance and throughput. They should also avoid the severe performance drop-off we observed under load when the hypervisor handler thread had to be joined after cancelling guest execution.

Added API changes:

impl MultiUseSandbox {
    /// Get a handle to the interrupt handler for this sandbox,
    /// capable of interrupting guest execution.
    pub fn interrupt_handle(&self) -> Arc<dyn InterruptHandle> {
         ...  
    }
    ...
}
/// A trait for handling interrupts to a sandbox's vcpu
pub trait InterruptHandle: Send + Sync {
    /// Interrupt the corresponding sandbox's vcpu if it's running.
    ///
    /// - If this is called while the vcpu is running, then it will interrupt the vcpu and return `true`.
    /// - If this is called while the vcpu is not running, then it will do nothing and return `false`.
    ///
    /// # Note
    /// This function will block for the duration of the time it takes for the vcpu thread to be interrupted.
    fn kill(&self) -> bool;

    /// Returns true iff the corresponding sandbox has been dropped
    fn dropped(&self) -> bool;
}
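
As a rough illustration of the contract documented above, here is a self-contained mock (MockHandle is an illustration-only name, not part of hyperlight) that models only the documented return values of kill() and dropped(), not the real signal-based implementation:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

/// The trait from the PR description.
pub trait InterruptHandle: Send + Sync {
    fn kill(&self) -> bool;
    fn dropped(&self) -> bool;
}

/// Illustration-only mock of an interrupt handle.
struct MockHandle {
    running: AtomicBool,
    dropped: AtomicBool,
}

impl InterruptHandle for MockHandle {
    fn kill(&self) -> bool {
        // The real implementation signals the vcpu thread and blocks until
        // it is interrupted; here we only model "was the vcpu running?".
        self.running.swap(false, Ordering::AcqRel)
    }
    fn dropped(&self) -> bool {
        self.dropped.load(Ordering::Acquire)
    }
}

fn main() {
    let handle: Arc<dyn InterruptHandle> = Arc::new(MockHandle {
        running: AtomicBool::new(true),
        dropped: AtomicBool::new(false),
    });
    assert!(handle.kill());   // vcpu was "running": interrupt succeeds, returns true
    assert!(!handle.kill());  // no longer running: no-op, returns false
    assert!(!handle.dropped());
    println!("contract holds");
}
```

In the real API the handle would come from `multiuse_sandbox.interrupt_handle()` and `kill()` would typically be called from a different thread than the one driving the guest call.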

Removed API changes:

  • All timeout-based configuration

closes #471

Note: On KVM, moving the vcpufd (sandbox) to a new thread incurs a performance overhead the first time the vcpu is run on the new thread, as per the KVM kernel docs.

@ludfjig ludfjig force-pushed the cancel_vm3 branch 6 times, most recently from eb85e5f to 4abdd49 Compare May 27, 2025 22:56
@ludfjig ludfjig added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label May 27, 2025
@ludfjig ludfjig force-pushed the cancel_vm3 branch 4 times, most recently from c4d0a52 to 210e506 Compare May 28, 2025 00:14
@ludfjig ludfjig marked this pull request as ready for review May 28, 2025 00:52
Contributor

@simongdavies simongdavies left a comment

Left some comments/questions; not quite finished reviewing, I will finish it in the morning.

@@ -72,7 +71,8 @@ impl MultiUseGuestCallContext {
// !Send (and !Sync), we also don't need to worry about
// synchronization

call_function_on_guest(&mut self.sbox, func_name, func_ret_type, args)
self.sbox
Contributor

Why did this change to the _no_reset version?

Contributor Author

The previous call_function_on_guest in the removed file guest_dispatch.rs didn't reset memory either; I just renamed it to make that clearer. I moved the method to MultiUseSandbox because I'm not sure what purpose guest_dispatch.rs served now that SingleUseSandbox is removed.

Justfile Outdated
cargo test {{ if features =="" {''} else if features=="no-default-features" {"--no-default-features" } else {"--no-default-features -F " + features } }} --profile={{ if target == "debug" { "dev" } else { target } }} -p hyperlight-host --lib -- metrics::tests::test_metrics_are_emitted --exact --ignored
cargo test {{ if features =="" {''} else if features=="no-default-features" {"--no-default-features" } else {"--no-default-features -F function_call_metrics," + features } }} --profile={{ if target == "debug" { "dev" } else { target } }} -p hyperlight-host --lib -- metrics::tests::test_metrics_are_emitted --exact --ignored


Contributor

Why did these tests get deleted? They don't seem related to this change. I guess we potentially have different metrics now (although arguably we might want a kill metric and metrics about the success of killing, etc.), but even if we don't, we still have some metrics?

Contributor Author

@ludfjig ludfjig May 28, 2025

We don't have different metrics now. The test itself changed to use a local metrics recorder, which can only capture metrics that were emitted while the recorder is active, and only those emitted on the same thread. We couldn't use the local metrics recorder before, because the metrics were emitted from a separate thread (the hypervisor_handler thread).

In short, we no longer require #[ignore] on the test, so it runs as part of the other tests. Your comment made me realize we should still test it with the metrics feature flag, so I added that one back.

Contributor

It seems that deleting this file has deleted a bunch of tests that we would want to keep (e.g. the seccomp tests). I couldn't see that these had been relocated anywhere?

Contributor Author

@ludfjig ludfjig May 29, 2025

Whoops, you are correct. I added them back to initializedmultiuse.rs (I didn't keep the ones that made no sense).

@@ -89,17 +89,29 @@ fn main() -> Result<()> {
let no_op = Noop::<UninitializedSandbox, MultiUseSandbox>::default();

let mut multiuse_sandbox = usandbox.evolve(no_op)?;
let interrupt_handle = multiuse_sandbox.interrupt_handle();

const NUM_CALLS: i32 = 5;
Contributor

Question: are we logging attempts and results of kills? I assume so given this code has changed; if we are not logging when we kill, we should.

Contributor Author

@ludfjig ludfjig May 28, 2025

@@ -102,10 +102,22 @@ fn do_hyperlight_stuff() {
let no_op = Noop::<UninitializedSandbox, MultiUseSandbox>::default();

let mut multiuse_sandbox = usandbox.evolve(no_op).expect("Failed to evolve sandbox");
let interrupt_handle = multiuse_sandbox.interrupt_handle();
Contributor

Same as logging: are we creating metrics on how many times kill has been called/succeeded etc.? If not, we probably should.

Contributor Author

We have the same metrics as before, which count how many sandbox cancellations have occurred (successfully).


// Call a function that gets cancelled by the host function 5 times to generate some log entries.
const NUM_CALLS: i32 = 5;
Contributor

Same question as the logging example: if we log, then we should be creating trace records as well.

Contributor Author

Yep the metrics are still there, and I added logging for kill attempts

@@ -390,6 +393,11 @@ impl HypervLinuxDriver {
mem_regions,
entrypoint: entrypoint_ptr.absolute()?,
orig_rsp: rsp_ptr,
interrupt_handle: Arc::new(LinuxInterruptHandle {
running: AtomicBool::new(false),
tid: AtomicU64::new(unsafe { libc::pthread_self() }),
Contributor

What is the thread id here? What happens if the sandbox is created on thread 1, then used on thread 2 and killed? Will we send the kill signal to the wrong thread?

Also, is it possible that sandboxes A and B are created on thread 1, then a guest function is called on sandbox A, then the host tries to kill that guest function call; meanwhile the call on sandbox A ends and thread 1 calls a function on sandbox B, then we send the kill signal and kill the wrong one?

If we don't have tests for these scenarios then we should (and more that I haven't thought of).

Contributor Author

@ludfjig ludfjig May 29, 2025

We'll never send the signal to the wrong thread, because tid is always set before running, and we only send the signal while running is set.

However, it's possible that killing a sandbox on thread 1 will accidentally interrupt a new sandbox on the same thread. As far as I know, there is no way to avoid this:

  • Thread 1 creates a sandbox and finishes its call to vcpu::run(), but has not yet set running to false
  • Thread 2 calls kill(), and enters the while loop because running is still true
  • Thread 1 sets running to false; then on the same thread we create a new sandbox and call vcpu::run on it
  • Thread 2 sends the signal
  • Thread 1's vcpu of the new sandbox gets interrupted, but the user intended to kill the old one :/

This occurs because checking running and sending the signal is not an atomic operation, and I'm not sure how to avoid it. Basically, between the check and the signal delivery, anything could happen on another thread.


We'll send at most 1 incorrect signal, because the next check of running in the while loop is guaranteed to be false.
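
The "at most one stray signal" bound can be re-enacted deterministically. The sketch below is illustration only (Handle, kill_iteration, and signals_sent are hypothetical names, and a counter stands in for pthread_kill); it interleaves the steps listed above by hand to show why the loop cannot send a second signal once running is observed false:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Illustrative stand-in for the interrupt handle's state.
struct Handle {
    running: AtomicBool,
    signals_sent: AtomicUsize,
}

impl Handle {
    /// One iteration of the hypothetical kill loop: check `running`,
    /// then (non-atomically) send a signal. Returns false once the
    /// loop would exit.
    fn kill_iteration(&self) -> bool {
        if !self.running.load(Ordering::Acquire) {
            return false; // running is false: stop signalling
        }
        // <-- race window: the vcpu thread may clear `running` and start
        //     a NEW vcpu on the same tid before this signal lands
        self.signals_sent.fetch_add(1, Ordering::SeqCst);
        true
    }
}

fn main() {
    let h = Handle {
        running: AtomicBool::new(true),
        signals_sent: AtomicUsize::new(0),
    };
    // Thread 2 sees running == true and commits to sending a signal...
    assert!(h.kill_iteration());
    // ...meanwhile thread 1 clears `running`; a new sandbox may now run there.
    h.running.store(false, Ordering::Release);
    // The next iteration observes running == false and stops, so at most
    // one signal could have hit the wrong vcpu.
    assert!(!h.kill_iteration());
    assert_eq!(h.signals_sent.load(Ordering::SeqCst), 1);
    println!("at most one stray signal");
}
```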

// Note: if a `InterruptHandle::kill()` signal is delivered to this thread **here**
// - after we've set the running to true,
// Then the signal does not have any effect, because the signal handler is a no-op.
// This is fine since we are already done with the `VcpuFd::run()` call.
Contributor

I don't think that this statement is true; for example, the reason for the exit may be to call a host function, in which case we want to stop the execution. I think that this scenario is handled in the current implementation.

Contributor Author

@ludfjig ludfjig May 28, 2025

In the current implementation, if the user thread times out while waiting for the hypervisor_handler thread, the user thread regains control, but that doesn't actually stop the host function call from happening. Let's add host-function cancellation in another PR, since it was never properly supported.

@@ -634,58 +666,8 @@ impl Hypervisor for KVMDriver {
}
}

#[cfg(test)]
Contributor

Why did these tests get deleted?

Contributor Author

Deleted in kvm/windows/mshv in favor of running the test from mod.rs, since they all do the same thing regardless


#[cfg(gdb)]
Contributor

Why did this get deleted? We still want to test initialize, don't we? It's still being executed, just not via the handler?

Contributor Author

Added back! Whopsie

@ludfjig ludfjig force-pushed the cancel_vm3 branch 10 times, most recently from 34a5772 to baf9912 Compare May 30, 2025 22:18
ludfjig added 5 commits May 30, 2025 16:58
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
…eventing host functions to return control to guest after being interrupted

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
Labels
kind/enhancement For PRs adding features, improving functionality, docs, tests, etc.
Development

Successfully merging this pull request may close these issues.

Remove the Sandbox Execution thread, replace elapsed-time-based timeouts with a Kill function
2 participants