[Bug] Processes get stuck after resuming VM from snapshot

## Describe the bug

After resuming a VM from a snapshot, processes occasionally get stuck. A minimal example is an `init` binary that just runs `while (true) { sleep 100ms ; print 'hello' }`: after resuming from a snapshot, it sometimes continues the loop, and other times it gets stuck and prints nothing at all.

## To Reproduce

1. Clone the following repo, which has the minimal code needed to reproduce: https://github.com/bduffany/firecracker-sleep-issue
2. Make sure you have `make`, `cc` (C compiler, from `gcc` package), `glibc-static` (for statically linking the init binary), and `jq` (for parsing firecracker API output)
3. Run `make test`

`make test` does the following:
- Fetches the firecracker v1.4.0 release binary from GitHub as well as vmlinux 4.14 from the https://github.com/firecracker-microvm/firecracker-demo repo
- Runs firecracker with an initrd where the init binary just loops infinitely, printing "running" and then sleeping for 100ms.
- Streams the VM logs to the terminal (using `tail -f` in the background)
- Every 1s, pauses the VM, takes a snapshot, then kills firecracker (SIGTERM), then restarts firecracker, resuming the VM from snapshot.

## Expected behaviour

When running `make test` in that repo, the init binary should print `running` several times after each resume. But on some resumes, it appears stuck, and does not print anything until the next resume.

Interesting details:
- The issue does NOT reproduce if I replace the `nanosleep` syscalls with an NOP loop of 1e9 iterations (`for (int i=0; i<1e9; i++) continue;`)
- The issue seems specific to snapshotting, not just pausing and resuming the VM. I tried only pausing and resuming, with no snapshot and no restart of the `firecracker` binary in between, and could not reproduce it.
- I could not reproduce this on an Intel CPU so far, only AMD (have tried 2 different Intel machines and 2 different AMD machines).

## Environment

- Firecracker v1.4.0
- Host kernel v6.2.0 (Ubuntu 22.04)
  - **UPDATE:** also reproduced on host kernel v5.10 - `m6a.metal` instance
- Guest kernel 4.14.55-84.37.amzn2.x86_64, from https://github.com/firecracker-microvm/firecracker-demo
  - Also tried compiling 5.10 using the recommended guest config - the issue still reproduces.
- Rootfs: none (initrd only)
- Architecture: x86_64 (AMD Ryzen CPU)
  - **UPDATE:** Also reproduced with AMD EPYC (`m6a.metal` instance)
- GLIBC 2.35 (for init binary)

## Additional context

The repro above is a minimal example of a much more troublesome issue: we are having trouble reconnecting to microVMs after resuming them from snapshots. We run a server inside the VM and have trouble connecting to it over vsock. I suspected that the guest process was "stuck" somehow, since when running `sleep 1 && print('hello')` in a background loop, it sometimes doesn't print anything. I came up with this minimal reproducer for that behavior.

## Checks

- Have you searched the Firecracker Issues database for similar problems?
  - https://github.com/firecracker-microvm/firecracker/issues/3020 - not sure if this is related, but it was closed without resolution. The final comment said to re-open if it could be reproduced with a supported kernel version
- Have you read the existing relevant Firecracker documentation?
  - I have read the FAQ about guest clock drift / NTP, but this appears more significant than just clock drift since I think nanosleep should work based on relative timing? I could be wrong, though.
- Are you certain the bug being reported is a Firecracker issue?
  - Not 100% certain, but given that it happens only when loading a snapshot, it seems like it could be Firecracker-related

