agent: add support for guest-hooks (v2) #393
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
+ Coverage   46.92%   46.99%   +0.07%
==========================================
  Files          16       17       +1
  Lines        2549     2911     +362
==========================================
+ Hits         1196     1368     +172
- Misses       1196     1378     +182
- Partials      157      165       +8
Thanks for raising @flx42 - just some minor comments.
agent.go
Outdated
// scanGuestHooks will search the given guestHookPath
// for any OCI hooks
func (s *sandbox) scanGuestHooks(guestHookPath string) {
	agentLog.Infof("Scanning guest filesystem for OCI hooks at %s", guestHookPath)
Thanks for adding the log call. You could change it slightly to be more in line with what we tend to do:
agentLog.WithField("oci-hook-path", guestHookPath).Info("Scanning guest filesystem for OCI hooks")
That makes the log fields searchable and minimises the fixed (hard-to-search) portion.
The same comment applies to many of the log calls throughout (I've added a few more examples, but not exhaustively ;)
agent.go
Outdated
if len(s.guestHooks.Prestart) > 0 || len(s.guestHooks.Poststart) > 0 || len(s.guestHooks.Poststop) > 0 {
	s.guestHooksPresent = true
} else {
	agentLog.Warnf("Guest hooks were requested but none were found at %s", guestHookPath)
This could be simplified as we've already logged the hook path:
agentLog.Warn("Guest hooks were requested but none were found")
agent_test.go
Outdated
defer os.RemoveAll(hookPath)

poststopPath := path.Join(hookPath, "poststop")
err = os.Mkdir(poststopPath, 0777)
I appreciate this is test code, but is there a reason for these perms as we wouldn't expect those perms on a real running system? If they can be tightened up, I'd be tempted to create something like the following to mirror what we do in the runtime repo:
const (
testDirMode = os.FileMode(0750)
testFileMode = os.FileMode(0640)
testExeFileMode = os.FileMode(0750)
)
I can tighten the permissions sure, but I do like the octal format since it's immediately readable for everyone.
Oh sure - octal is good :)
agent_test.go
Outdated
assert.NoError(err)

normalPath := path.Join(poststopPath, "normalfile")
f, err := os.OpenFile(normalPath, os.O_RDONLY|os.O_CREATE, 0666)
Could this become testExeFileMode?
oci.go
Outdated
@@ -0,0 +1,113 @@
//
// Copyright (c) 2018 Intel Corporation
"NVidia Corporation"? 😄
s.guestHooks.Prestart = findHooks(guestHookPath, "prestart")
s.guestHooks.Poststart = findHooks(guestHookPath, "poststart")
s.guestHooks.Poststop = findHooks(guestHookPath, "poststop")
This might be overkill, but I wonder if we want to run these three filesystem operations in separate goroutines and wait for them to finish to minimise the time cost?
Seems entirely overkill since every function call is likely to stop after the first ReadDir :), that is to say if <path>/{prestart,poststart,poststop} does not exist or is not a directory.
Ack ;)
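The early-exit behaviour being discussed can be sketched as follows. `findHookNames` is a simplified stand-in for the agent's findHooks (it returns only entry names, with no validity checks): when a hook-type directory is absent, the scan for that type costs a single failed ReadDir, so parallelising the three scans buys very little.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findHookNames lists <guestHookPath>/<hookType> and returns the entry
// names. If the directory does not exist (the common case), it stops
// after one ReadDir call and returns nil.
func findHookNames(guestHookPath, hookType string) []string {
	entries, err := os.ReadDir(filepath.Join(guestHookPath, hookType))
	if err != nil {
		return nil // hook-type directory absent: nothing to do
	}
	var names []string
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names
}

func main() {
	base, err := os.MkdirTemp("", "guest-hooks")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(base)

	// Populate only the prestart directory; the other two scans fail fast.
	os.MkdirAll(filepath.Join(base, "prestart"), 0750)
	os.WriteFile(filepath.Join(base, "prestart", "example-hook"), []byte("#!/bin/sh\n"), 0750)

	for _, t := range []string{"prestart", "poststart", "poststop"} {
		fmt.Println(t, len(findHookNames(base, t)))
	}
}
```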
"github.com/stretchr/testify/assert"
)

func TestChangeToBundlePath(t *testing.T) {
Could you add a test to handle the non-absolute spec.Root.Path case, which is handled differently in changeToBundlePath()?
After re-examining the code, I've decided to change the logic we have in grpc.go.
Instead of doing changeToBundlePath and then writeSpecToFile, I've swapped the order of these operations. The concept of a bundle doesn't make sense if we don't have config.json serialized in a directory. And the spec struct in the Go code does not define where the bundle should be.
I think it makes more sense to first make dirname(rootfs) a valid bundle (by convention, could be any other directory), then chdir to this bundle.
So the conclusion is that we don't need to treat absolute and relative paths differently. If spec.Root.Path is the recommended rootfs, then . will become an OCI bundle, and that's fine.
notifying @eguzman3 in case he notices something wrong with my reasoning :)
oci.go
Outdated
for _, file := range files {
	name := file.Name()
	if ok, err := isValidHook(file); !ok {
		agentLog.WithError(err).Warnf("Skipping hook %s", name)
This could become:
agentLog.WithError(err).WithField("oci-hook-name", name).Warn("Skipping hook")
oci.go
Outdated
files, err := ioutil.ReadDir(hooksPath)
if err != nil {
	agentLog.WithError(err).Infof("Skipping hook type %s", hookType)
This could become:
agentLog.WithError(err).WithField("oci-hook-type", hookType).Info("Skipping hook type")
oci.go
Outdated
	continue
}

agentLog.Infof("Adding hook %s of type %s", name, hookType)
agentLog.WithFields(logrus.Fields{
	"oci-hook-name": name,
	"oci-hook-type": hookType,
}).Info("Adding hook")
I think this is a nice example of why I don't like this logging model, but I will do the change :)
Adds support for running OCI hooks within the guest. A 'drop-in' path (guest_hook_path) is specified in the cli configuration file and if set, the agent will look for OCI hooks in this directory and inject them into the container life cycle.

Fixes: kata-containers#348
Replaces: kata-containers#365

Co-authored-by: Edward Guzman <eguzman@nvidia.com>
Co-authored-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Edward Guzman <eguzman@nvidia.com>
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
All of the above should have been addressed now, great feedback @jodh-intel :) I realize I didn't ask you if amending the existing commit is actually fine, or if you prefer squashing everything at the end.
Hi @flx42 - we normally amend and force push - so it looks like you chose the correct flow :-)
The metrics CI failure here is because we have merged kata-containers/tests#785, so I now need to go re-configure the build slave to take that into account. I'll jump on that.
Metrics slave config updated. I re-spun the build, and it has now passed.
Thanks @flx42! lgtm
// Change cwd because libcontainer assumes the bundle path is the cwd:
// https://github.com/opencontainers/runc/blob/v1.0.0-rc5/libcontainer/specconv/spec_linux.go#L157
oldcwd, err := changeToBundlePath(ociSpec)
It's worth noting that this code works because golang sets CLONE_FS when creating a new M (https://github.com/golang/go/blob/2595fe7fb6f272f9204ca3ef0b0c55e66fb8d90f/src/runtime/os_linux.go#L131), so the CWD is shared among all OS threads.
// Change cwd because libcontainer assumes the bundle path is the cwd:
// https://github.com/opencontainers/runc/blob/v1.0.0-rc5/libcontainer/specconv/spec_linux.go#L157
oldcwd, err := changeToBundlePath(ociSpec)
It's worth noting that this code works because golang sets CLONE_FS when creating a new M (https://github.com/golang/go/blob/2595fe7fb6f272f9204ca3ef0b0c55e66fb8d90f/src/runtime/os_linux.go#L131), so the CWD is shared among all OS threads. We rely on this model because the goroutine might be rescheduled to another OS thread.
That seems like an implementation detail. The runtime could very well store the cwd for each goroutine and then restore it when the goroutine is re-scheduled anywhere.
@flx42 I mentioned it because we do depend on the runtime implementation details here. If the Go runtime decides that CWD is an M local property, we are doomed because a goroutine can be rescheduled to another os thread since we do not lock os thread here, and then CWD is messed up in different threads.
a.sandbox.guestHooksPresent = false

if req.GuestHookPath != "" {
	a.sandbox.scanGuestHooks(req.GuestHookPath)
The error is still silently ignored. I wonder how a user can find it out later on if scanGuestHooks fails here. I think we should at least report it back in CreateSandboxResponse, and the runtime can decide if it wants to bail out or continue using the hooks-missing image.
Through the logs:
https://github.com/kata-containers/agent/pull/393/files/980023ec62b3144ef9f333209290c6215e295538#diff-2fe34e1b6f7f9aece0af74b1635502eeR259
I thought we agreed that it was fine to just log it for now. But I'm happy to change it if we have consensus.
Yeah, I am leaning towards just logging this for now.
Fair enough!
lgtm
/test
Re-cranking the CI to see if we get lucky and end up running on a VT-x-enabled system...
All passed apart from the metrics CI, which is showing a rather bizarre message:
/cc @grahamwhaley
That could be because I had a typo in the jenkins metrics job for the 'tests' repo (which I fixed earlier today), which meant part of that job was monitoring the agent repo instead of the tests repo :-(
I nudged a metrics rebuild - let's keep an eye on it.
I think that pre-existing metrics CI job is just broken from my previous mis-configuration. Let me try to kick a whole CI cycle here, and see what happens. If we end up with them all passing apart from metrics dying with some odd git merge/branch issue, then let's merge.
@chavafg, @mnaser - we're still seeing failures due to jobs being scheduled on non-VT-x systems:
Is there any way to bump the priority of this issue please?
@grahamwhaley - the metrics CI is still failing for the same reason. I'm wondering if this implies the delta between the main CI code and the metrics code is too wide (can we share more setup code?)
The 16.04 CI failed two of its tests:
/cc @devimc
@jodh-intel I'll check the metrics CI job setup against the QA setup on the Jenkins server. We share almost all of the setup/scripts already - the metrics CI invokes the same jenkins build script that the QA CI does. I'll check there have not been any updates to the pre-script env setup (the stuff that lives in the Jenkins UI/xml files) that we have missed on the metrics side.
Looks like the Jenkins plugin is too smart, and remembers what it was told to do in the first place (when it was mis-configured).
Yay, the hand-crafted metrics CI job has acked the PR ;-)
@grahamwhaley Nice. CI has passed :) Merging this one.
We need kata-containers/agent#393 in order for kata-containers#720 to be carried forward Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
We need kata-containers/agent#393 in order for kata-containers#720 to be carried forward

github.com/kata-containers/agent:
3a678a9 agent: Fix the issue of stdout hang on builtin proxy
d12910e agent: judge whether /sys/devices/system/memory/block_size_bytes exist.
980023e agent: add support for guest-hooks
e03f7d7 memory: Fix update memory path.
a396a23 grpc: Add seccomp status to guest details
7b71c10 grpc: Add agent details to guest details call
4fefa1a release: Kata Containers 1.3.0
a628496 device: rescan pci bus before waiting for new devices
1310f3d device: fix redefined notify channel
4163b8b uevent: Revert "device: check for existing device PCI path before waiting"
54f77cf agent: support agent as init process in rootfs images
ef62167 uevent: Add logs for uevents
fc907f6 agent: fix the issue of missing close process terminal
d71dd02 vendor: bump runtime-spec version
28a5ab2 release: Kata Containers 1.3.0-rc1
2b89a0a uevent: fix crash on read errors
9cf56c0 agent: auto-online hotplug memory
33aea09 CI: Add "make proto"
00a5588 Makefile: Conditionally build agent with tag seccomp
7caf1c8 device: check for existing device PCI path before waiting
3441244 device: do not close notify channel when wait timeout

github.com/opencontainers/runtime-spec:
5d9aa69 config-linux: Add Intel RDT/MBA Linux support
6f5fcd4 Support for network namespace in windows
06cf899 config-linux: add Intel RDT CLOS name sharing support
f3be7d2 config: clarify source mount
65fac2b Fix camelCasing on idType to align with other Windows spec conventions
da8adc9 incorporating edits from JTerry's feedback
c182ebc meeting: Bump July meeting from the 4th to the 11th
e01b694 config: Add Windows Devices to Schema
81d81f3 docs: Added kata-runtime to implementations
b0700ad Add gVisor to the implementations list
9e459a6 .travis.yml: Get schema dependencies in before_install
692abcb .travis: Bump minimum Go version to 1.9
fd39559 config: Clarify execution environment for hooks
cd9892d config-linux: Drop console(4) reference
e662e5c Linux devices: uid/gid relative to container
74b670e config: Add VM-based container configuration section
cd39042 uidMappings: change order of fields for clarity
2e241f7 specs-go/config: Define RDMA cgroup
9df387e schema/Makefile: fix test
de688f2 config: Fix Linux mount options links
ef008dd glossary: Bump JSON spec to RFC 8259
4e5a137 schema: Completely drop our JSON Schema 'id' properties
70ba4e6 meeting: Bump January meeting from the 3rd to the 10th
8558116 schema: add allowed values for defaultAction
5d9bbad config: Dedent root paragraphs, since they aren't a list entry
e566cf6 version: put master back to -dev
966a58d fix the link to hook

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
Adds support for running OCI hooks within the guest. A 'drop-in'
path (guest_hook_path) is specified in the cli configuration file
and if set, the agent will look for OCI hooks in this directory
and inject them into the container life cycle.
Fixes: #348
Replaces: #365
Changes compared to #365:
- &spec
- grpc package to main package, with a new file oci.go

/cc @jodh-intel