Skip to content
This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

sandbox: don't exit if an app fails #3478

Merged
merged 6 commits into from
Jan 11, 2017

Conversation

s-urbaniak
Copy link
Contributor

@s-urbaniak s-urbaniak commented Dec 14, 2016

Currently the sandbox follows the same semantic rule as an immutable
pod. As soon as an app fails, it quits immediately.

This behavior does not hold for mutable sandboxes. As soon as an app
fails the sandbox needs to stay alive but track the exit code of the
failed app appropriately.

Also in case of restarts the /rkt/status files need to be readjusted.

This fixes it.

Also fixes #3472

TODOs:

  • functional test which starts two apps
  • functional test which starts a faulty app and an app
  • functional test which starts only faulty apps

@s-urbaniak s-urbaniak added this to the v1.22.0 milestone Dec 14, 2016
@s-urbaniak s-urbaniak force-pushed the sandbox-stay-alive-3472 branch 2 times, most recently from ebb3714 to b32937c Compare December 14, 2016 15:59
@s-urbaniak
Copy link
Contributor Author

/cc @euank

@euank
Copy link
Member

euank commented Dec 16, 2016

Add to the list of tests:

  • adds, starts, and removes an app.

Although with this change rkt app status ... --app=.. correctly shows exited, rkt app rm .... --app=... can't remove the application with this error. I didn't dig into it further, but my naive guess is that RemainAfterExit is turning the status of the app into stopped instead of inactive, or maybe inactive wasn't ever right.

@lucab lucab modified the milestones: v1.23.0, v1.22.0 Dec 19, 2016
@s-urbaniak
Copy link
Contributor Author

gah, thanks for the follow-up, I need to implement more functional tests :-/

@s-urbaniak
Copy link
Contributor Author

@euank while I am writing the tests, do you mind to provide the sequence that reproduces the issue for you?

At least the following works for me:

$ sudo -E rkt app add --debug $pod docker://debian:sid --name=debian --exec=/bin/sh -- -c 'while true; do echo hallo; sleep .5; done'
stage0: locking pod manifest
stage0: Loading image sha512-e68319869a62c391b77dbc77ec0db84290c9803718fbe5df89a230d08ee5bbc0
stage0: Writing image manifest
stage0: adding app to sandbox
$ sudo -E rkt app start --debug $pod --app=debian
$ sudo -E rkt app stop --debug $pod --app=debian
$ sudo -E rkt app rm --debug $pod --app=debian
stage0: locking sandbox manifest

@euank
Copy link
Member

euank commented Dec 19, 2016

It looks like it only happens if the app crashes, not if rkt stops it.

@s-urbaniak I can repro it with the same commands as #3472 for setup, and then the following:

$ rkt app status $(cat /tmp/uuid) --app=busybox           
name=busybox
state=exited
image_id=sha512-060ab846247313a0aa120aa60761a2beb2080e1392c2969631f8caa7aa7d1597
created_at=2016-12-19 11:51:45.373342304 -0800 PST
started_at=2016-12-19 11:51:55.273725672 -0800 PST
finished_at=2016-12-19 11:51:55.493734411 -0800 PST
exit_code=127
$ rkt app rm $(cat /tmp/uuid) --app=busybox    
rm: error removing app: error executing stage1 entrypoint: app-rm: cleanup error: app "busybox" is still running
$ rkt app stop $(cat /tmp/uuid) --app=busybox
$ echo $?
0
$ rkt app rm $(cat /tmp/uuid) --app=busybox    
rm: error removing app: error executing stage1 entrypoint: app-rm: cleanup error: app "busybox" is still running

@s-urbaniak
Copy link
Contributor Author

@euank 👍 now I can reproduce this now, thanks a lot. Current mode: investigating, fixing, writing tests.

Currently the sandbox follows the same semantic rule as an immutable
pod. As soon as an app fails, it quits immediately.

This behavior does not hold for mutable sandboxes. As soon as an app
fails the sandbox needs to stay alive but track the exit code of the
failed app appropriately.

Also in case of restarts the /rkt/status files need to be readjusted.

This fixes it.

Also fixes rkt#3472
Currently if a sandbox app failed, it cannot be removed because it is
assumed to be still running. This fixes it.
Currently if `rkt app --debug` is enabled we get stage1 debug output
only if an error occurs.

This fixes it by printing debug output always if it is enabled.
Currently we have mixture of file and directory conventions for app subcommands in
stage0 and stage1.

This fixes it by consolidating them following Go standards (underscore
delimiters in filenames).
@s-urbaniak s-urbaniak force-pushed the sandbox-stay-alive-3472 branch from b32937c to 12f92f4 Compare January 9, 2017 15:58
@s-urbaniak
Copy link
Contributor Author

@euank still writing the final functional tests, but do you mind to quickly verify if removing failed apps is fixed now?

This simplifies the sandbox tests.
@s-urbaniak s-urbaniak force-pushed the sandbox-stay-alive-3472 branch from 12f92f4 to 1a72e2a Compare January 9, 2017 16:26
@euank
Copy link
Member

euank commented Jan 9, 2017

Indeed, works as far as I can tell, thanks @s-urbaniak

This adds more functional tests for the app sandbox. It adds a test
which starts/stops an app within a running sandbox and another test
which adds three different kinds of apps and evaluates their respective
status in rkt.
@squeed
Copy link
Contributor

squeed commented Jan 10, 2017

This PR will expose some logic flaws in some status code. For example, the API service assumes that if a pod is running, all apps are running.
I'll file an issue to fix this.

@squeed
Copy link
Contributor

squeed commented Jan 10, 2017

I like that you cleaned up some of the stage0 files. Did you change any of the behavior, or just shuffle code?
Just wondering if we should review more carefully the new files.

@s-urbaniak
Copy link
Contributor Author

@squeed yes, the first two commits are behavioral changes (b308a80 and 0282d80), the rest is code cleanup + adding tests.

Admittedly I can make it separate PRs.

Copy link
Member

@lucab lucab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (one question inline, but really just a nit and I'm fine to land it as is).

@@ -107,7 +107,10 @@ func cleanupStage0(appName *types.ACName, enterCmd []string) error {
// rely only on the output, since is-active returns non-zero for inactive units
out, _ := cmd.Output()

if string(out) != "inactive\n" {
switch string(out) {
case "failed\n":
Copy link
Member

@lucab lucab Jan 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we don't TrimRight here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just pure sloppiness, I will address this in a follow-up.

@s-urbaniak s-urbaniak merged commit 9f7cc9c into rkt:master Jan 11, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CRI: unit failing leads to undesirable sandbox failure
4 participants