libcontainer/cgroups/fs: fix OCI runtime pause failed #4388
7ce71d3 to 296d0c8
libcontainer/cgroups/fs/freezer.go
Outdated
if i%500 == 499 {
	// should sleep a longer time for
	// some really very slow machine.
	time.Sleep(5 * time.Second)
This is a really very long sleep interval.
Yes. On some machines, the freeze may fail unless we sleep for longer.
In my opinion, this looks like a cat-and-mouse game: when the mouse runs faster, the cat needs more strength.
On another slow machine we might need 6 seconds or more. We already sleep many times and return an error to the caller on failure. I think the caller should have a retry mechanism for when it gets an error on a slow machine. That is the caller's job, not this function's. WDYT?
Indeed, as you mentioned, it's hard to determine the exact sleep time. In my case, I was running an LLM training container and needed to checkpoint it and export a snapshot. A relatively long sleep was required before `freezer.state` reported the correct value and the checkpoint could succeed. If necessary, I think we could also add a retry mechanism to make this more robust.
296d0c8 to f0936b9
f0936b9 to c6112ac
@lifubang @kolyshkin I have added a retry mechanism, PTAL again :)
c6112ac to e12061a
In some cases, runc pause still fails with `ctr: OCI runtime pause failed: unable to freeze: unknown`. We should let it sleep longer on very slow systems or machines.
Signed-off-by: Song Zhang <zhangsong34@huawei.com>
e12061a to b2f8637