libcontainer/cgroups/fs: fix OCI runtime pause failed #4388

Open · wants to merge 1 commit into base: main

Conversation

botieking98

In some instances, runc pause still fails with
`ctr: OCI runtime pause failed: unable to freeze: unknown`.

We should let it sleep longer on very slow systems or machines.

	if i%500 == 499 {
		// Sleep longer on
		// very slow machines.
		time.Sleep(5 * time.Second)
	}
Contributor

This is a really very long sleep interval.

Author

> This is a really very long sleep interval.

Yes, on some machines the freeze may fail if we do not sleep longer.

Member

@lifubang lifubang Sep 14, 2024


In my opinion, this looks like a cat-and-mouse game: when the mouse runs faster, the cat needs more strength.
On another slow machine we may need 6 seconds or more. We already sleep many times and return an error to the caller on failure, so I think the caller should have a retry mechanism for errors on a slow machine. That is the caller's job, not this function's. WDYT?

Author

> In my opinion, this looks like a cat-and-mouse game: when the mouse runs faster, the cat needs more strength. On another slow machine we may need 6 seconds or more. We already sleep many times and return an error to the caller on failure, so I think the caller should have a retry mechanism for errors on a slow machine. That is the caller's job, not this function's. WDYT?

Indeed, as you mentioned, it's hard to determine the exact sleep time. In my case, I was running an LLM training container and needed to checkpoint it and export a snapshot. A relatively long sleep was required to read the correct value of freezer.state and so ensure that the checkpoint succeeded. Of course, if necessary, I think we could add a retry mechanism to make this more robust.


botieking98 commented Sep 23, 2024

@lifubang @kolyshkin I have added a retry mechanism, PTAL again :)

In some instances, runc pause still fails with
`ctr: OCI runtime pause failed: unable to freeze: unknown`.

We should let it sleep longer on very slow
systems or machines.

Signed-off-by: Song Zhang <zhangsong34@huawei.com>
3 participants