Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

agent: add support for guest-hooks (v2) #393

Merged
merged 1 commit into from
Oct 10, 2018

Conversation

flx42
Copy link
Contributor

@flx42 flx42 commented Oct 2, 2018

Adds support for running OCI hooks within the guest. A 'drop-in'
path (guest_hook_path) is specified in the cli configuration file
and if set, the agent will look for OCI hooks in this directory
and inject them into the container life cycle.

Fixes: #348
Replaces: #365

Changes compared to #365:

  • Rebased on current master
  • Fixed the &spec
  • Added more logging when discovering hooks, no error is fatal today
  • Moved utility functions from grpc package to main package, with a new file oci.go

@jodh-intel

@codecov
Copy link

codecov bot commented Oct 2, 2018

Codecov Report

Merging #393 into master will increase coverage by 0.07%.
The diff coverage is 51.32%.

@@            Coverage Diff             @@
##           master     #393      +/-   ##
==========================================
+ Coverage   46.92%   46.99%   +0.07%     
==========================================
  Files          16       17       +1     
  Lines        2549     2911     +362     
==========================================
+ Hits         1196     1368     +172     
- Misses       1196     1378     +182     
- Partials      157      165       +8

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising @flx42 - just some minor comments.

agent.go Outdated
// scanGuestHooks will search the given guestHookPath
// for any OCI hooks
func (s *sandbox) scanGuestHooks(guestHookPath string) {
agentLog.Infof("Scanning guest filesystem for OCI hooks at %s", guestHookPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the log call. You could change it slightly to be more in line with what we tend to do:

agentLog.WithField("oci-hook-path", guestHookPath).Info("Scanning guest filesystem for OCI hooks")

That makes the log fields searchable and minimises the fixed (hard-to-search) portion. See:

The same comment applies to lots of the log calls throughout (I've added a few more examples, but not exhaustively ;)

See:

agent.go Outdated
if len(s.guestHooks.Prestart) > 0 || len(s.guestHooks.Poststart) > 0 || len(s.guestHooks.Poststop) > 0 {
s.guestHooksPresent = true
} else {
agentLog.Warnf("Guest hooks were requested but none were found at %s", guestHookPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified as we've already logged the hook path:

agentLog.Warn("Guest hooks were requested but none were found")

agent_test.go Outdated
defer os.RemoveAll(hookPath)

poststopPath := path.Join(hookPath, "poststop")
err = os.Mkdir(poststopPath, 0777)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this is test code, but is there a reason for these perms as we wouldn't expect those perms on a real running system? If they can be tightened up, I'd be tempted to create something like the following to mirror what we do in the runtime repo:

const (
        testDirMode     = os.FileMode(0750)
        testFileMode    = os.FileMode(0640)
        testExeFileMode = os.FileMode(0750)
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can tighten the permissions sure, but I do like the octal format since it's immediately readable for everyone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sure - octal is good :)

agent_test.go Outdated
assert.NoError(err)

normalPath := path.Join(poststopPath, "normalfile")
f, err := os.OpenFile(normalPath, os.O_RDONLY|os.O_CREATE, 0666)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this become testExeFileMode?

oci.go Outdated
@@ -0,0 +1,113 @@
//
// Copyright (c) 2018 Intel Corporation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"NVidia Corporation"? 😄


s.guestHooks.Prestart = findHooks(guestHookPath, "prestart")
s.guestHooks.Poststart = findHooks(guestHookPath, "poststart")
s.guestHooks.Poststop = findHooks(guestHookPath, "poststop")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be overkill, but I wonder if we want to run these three filesystem operations in separate goroutines and wait for them to finish to minimise the time cost?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems entirely overkill since every function call is likely to stop after the first ReadDir :), that is to say if <path>/{prestart,poststart,poststop} does not exist or is not a directory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack ;)

"github.com/stretchr/testify/assert"
)

func TestChangeToBundlePath(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test to handle the non-absolute spec.Root.Path case which is handled differently in changeToBundlePath()?

Copy link
Contributor Author

@flx42 flx42 Oct 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After re-examining the code, I've decided to change the logic we have in grpc.go.
Instead of doing changeToBundlePath and then writeSpecToFile, I've swapped the order of these operations. The concept of bundle doesn't make sense if we don't have config.json serialized in a directory. And the spec struct in the Go code does not define where the bundle should be.

I think it makes more sense to first make dirname(rootfs) a valid bundle (by convention, could be any other directory), then chdir to this bundle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the conlusion is that we don't need to treat absolute and relative paths differently. If spec.Root.Path is the recommended rootfs, then . will become an OCI bundle, and that's fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notifying @eguzman3 in case he notices something wrong with my reasoning :)

oci.go Outdated
for _, file := range files {
name := file.Name()
if ok, err := isValidHook(file); !ok {
agentLog.WithError(err).Warnf("Skipping hook %s", name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could become:

agentLog.WithError(err).WithField("oci-hook-name", name).Warn("Skipping hook")

oci.go Outdated

files, err := ioutil.ReadDir(hooksPath)
if err != nil {
agentLog.WithError(err).Infof("Skipping hook type %s", hookType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could become:

agentLog.WithError(err).WithField("oci-hook-type", hookType).Info("Skipping hook type")

oci.go Outdated
continue
}

agentLog.Infof("Adding hook %s of type %s", name, hookType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agentLog.WithFields(logrus.Fields{
    "oci-hook-name" : name,
    "oci-hook-type" : hookType,
}).Info("Adding hook")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a nice example of why I don't like this logging model, but I will do the change :)

Adds support for running OCI hooks within the guest. A 'drop-in'
path (guest_hook_path) is specified in the cli configuration file
and if set, the agent will look for OCI hooks in this directory
and inject them into the container life cycle.

Fixes: kata-containers#348
Replaces: kata-containers#365

Co-authored-by: Edward Guzman <eguzman@nvidia.com>
Co-authored-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Edward Guzman <eguzman@nvidia.com>
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
@flx42
Copy link
Contributor Author

flx42 commented Oct 3, 2018

All of the above should have been addressed now, great feedback @jodh-intel :)

I realize I didn't ask you if amending the existing commit is actually fine, or if you prefer squashing everything at the end.

@grahamwhaley
Copy link
Contributor

grahamwhaley commented Oct 4, 2018

Hi @flx42 - we normally amend and force push - so looks like you chose the correct flow :-)
Hmm, we used to have that in the contribution guideline/docs, but I don't see it right now. Maybe it fell off during a docs rework... I'll go look harder.

@grahamwhaley
Copy link
Contributor

@grahamwhaley
Copy link
Contributor

The metrics CI failure here is because we have merged kata-containers/tests#785, so I now need to go re-configure the build slave to take that into account. I'll jump on that.

@grahamwhaley
Copy link
Contributor

Metrics slave config updated. I re-span the build, and that has now passed.

@jodh-intel
Copy link
Contributor

jodh-intel commented Oct 5, 2018

Thanks @flx42!

lgtm

Approved with PullApprove Approved with PullApprove


// Change cwd because libcontainer assumes the bundle path is the cwd:
// https://github.com/opencontainers/runc/blob/v1.0.0-rc5/libcontainer/specconv/spec_linux.go#L157
oldcwd, err := changeToBundlePath(ociSpec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that the code piece works because golang sets CLONE_FS when creating new M (https://github.com/golang/go/blob/2595fe7fb6f272f9204ca3ef0b0c55e66fb8d90f/src/runtime/os_linux.go#L131) so CWD is shared among all os threads.


// Change cwd because libcontainer assumes the bundle path is the cwd:
// https://github.com/opencontainers/runc/blob/v1.0.0-rc5/libcontainer/specconv/spec_linux.go#L157
oldcwd, err := changeToBundlePath(ociSpec)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that the code piece works because golang sets CLONE_FS when creating new M (https://github.com/golang/go/blob/2595fe7fb6f272f9204ca3ef0b0c55e66fb8d90f/src/runtime/os_linux.go#L131) so CWD is shared among all os threads. We rely on this model because the goroutine might be rescheduled to another os thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like an implementation detail. The runtime could very well store the cwd for each goroutine and then restore it when the goroutine is re-scheduled anywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flx42 I mentioned it because we do depend on the runtime implementation details here. If the Go runtime decides that CWD is an M local property, we are doomed because a goroutine can be rescheduled to another os thread since we do not lock os thread here, and then CWD is messed up in different threads.

a.sandbox.guestHooksPresent = false

if req.GuestHookPath != "" {
a.sandbox.scanGuestHooks(req.GuestHookPath)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is still silently ignored. I wonder how a user can find it out later on if scanGuestHooks fails here. I think we should at least report it back in CreateSandboxResponse and runtime can decide if it wants to bail out or continue using the hooks-missing image.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Through the logs:
https://github.com/kata-containers/agent/pull/393/files/980023ec62b3144ef9f333209290c6215e295538#diff-2fe34e1b6f7f9aece0af74b1635502eeR259
I thought we agreed that it was fine to just log it for now. But I'm happy to change it if we have consensus.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am leaning towards just logging this for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough!

Copy link
Member

@amshinde amshinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@bergwolf
Copy link
Member

bergwolf commented Oct 9, 2018

LGTM

Approved with PullApprove Approved with PullApprove

@bergwolf
Copy link
Member

bergwolf commented Oct 9, 2018

/test

@jodh-intel
Copy link
Contributor

Re-cranking the CI to see if we get lucky and end up running on a VT-x-enabled system...

@jodh-intel
Copy link
Contributor

All passed apart from the metrics CI which is showing a rather bizarre message:

09:34:23  > git rev-parse refs/remotes/origin/origin/pr/393/merge^{commit} # timeout=10
09:34:23  > git rev-parse origin/pr/393/merge^{commit} # timeout=10
09:34:23 ERROR: Couldn't find any revision to build. Verify the repository and branch configuration for this job.

/cc @grahamwhaley

@grahamwhaley
Copy link
Contributor

That could be because I had a typo in the jenkins metrics job for the 'tests' repo (which I fixed earlier today), which meant part of that job was monitoring the agent repo instead of the tests repo :-(
Let me see if we are able to re-spin the job - but, it might be that the rebuild carries over the broken info, in which case can might just skip metrics on this one...

@grahamwhaley
Copy link
Contributor

I nudged a metrics rebuild - let's keep an eye on it.

@grahamwhaley
Copy link
Contributor

I think that pre-existing metrics CI job is just broke from my previous mis-configuration. Let me try to kick a whole CI cycle here, and see what happens. If we end up with them all passed apart from metrics dying with some odd git merge/branch issue, then let's merge.
/test

@jodh-intel
Copy link
Contributor

@chavafg, @mnaser - we're still seeing failures due to jobs being scheduled on non-VT-x systems:

time="2018-10-09T19:24:51Z" level=error msg="open /sys/module/kvm_intel/parameters/nested: no such file or directory" arch=amd64 name=kata-runtime pid=16864 source=runtime
open /sys/module/kvm_intel/parameters/nested: no such file or directory

Is there any way to bump the priority of this issue please?

@jodh-intel
Copy link
Contributor

@grahamwhaley - the metrics CI is still failing for the same reason. I'm wondering if this implies the delta between the main CI code and the metrics code is too wide (can we share more setup code?)

@jodh-intel
Copy link
Contributor

The 16.04 CI failed two of its tests:

Summarizing 2 Failures:

[Fail] package manager update test check apt-get update [It] should not fail 
/tmp/jenkins/workspace/kata-containers-agent-ubuntu-16-04-PR-initrd/go/src/github.com/kata-containers/tests/integration/docker/package_manager_test.go:41

[Fail] CPUs and CPU set updating cpus and cpuset of a running container [It] should have the right number of vCPUs 
/tmp/jenkins/workspace/kata-containers-agent-ubuntu-16-04-PR-initrd/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:432

Ran 202 of 214 Specs in 952.725 seconds
FAIL! -- 200 Passed | 2 Failed | 0 Pending | 12 Skipped --- FAIL: TestIntegration (979.86s)
FAIL

Ginkgo ran 1 suite in 16m20.983774498s
Test Suite Failed

/cc @devimc

@grahamwhaley
Copy link
Contributor

@jodh-intel I'll check the metrics CI job setup against the QA setup on the Jenkins server. We share almost all of the setup/scripts already - the metrics invokes the same jenkins build script that the QA CI does. I'll check there has not been any updates to the pre-script env setup (the stuff that lives in the Jenkins UI/xml files) that we have missed on the metrics.

@grahamwhaley
Copy link
Contributor

Looks like the Jenkins plugin is too smart, and remembers what it was told to do in the first place (when it was mis-configured).
I've mocked up a new job (a jenkins 'parametised build') by kicking a rebuild of a different PR (395, that has merged already), and editing the details to point at this PR. If that does not work, just skip the metrics CI for this PR. It should be configured correctly for future PRs now.

@grahamwhaley
Copy link
Contributor

yay, the hand crafted metrics CI job has acked' the PR ;-)

@amshinde
Copy link
Member

@grahamwhaley Nice. CI has passed :) Merging this one.

@amshinde amshinde merged commit 1aeaf24 into kata-containers:master Oct 10, 2018
@jodh-intel
Copy link
Contributor

Thanks again for your contributions @flx42 and @eguzman3!

flx42 added a commit to flx42/runtime that referenced this pull request Oct 17, 2018
We need kata-containers/agent#393 in order for kata-containers#720 to
be carried forward

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
flx42 added a commit to flx42/runtime that referenced this pull request Oct 17, 2018
We need kata-containers/agent#393 in order for kata-containers#720 to
be carried forward

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
flx42 added a commit to flx42/runtime that referenced this pull request Oct 18, 2018
We need kata-containers/agent#393 in order for kata-containers#720 to
be carried forward

github.com/kata-containers/agent:
3a678a9 agent: Fix the issue of stdout hang on builtin proxy
d12910e agent: judge whether /sys/devices/system/memory/block_size_bytes exist.
980023e agent: add support for guest-hooks
e03f7d7 memory: Fix update memory path.
a396a23 grpc: Add seccomp status to guest details
7b71c10 grpc: Add agent details to guest details call
4fefa1a release: Kata Containers 1.3.0
a628496 device: rescan pci bus before waiting for new devices
1310f3d device: fix redefined notify channel
4163b8b uevent:Revert "device: check for existing device PCI path before waiting"
54f77cf agent: support agent as init process in rootfs images
ef62167 uevent: Add logs for uevents
fc907f6 agent: fix the issue of missing close process terminal
d71dd02 vendor: bump runtime-spec version
28a5ab2 release: Kata Containers 1.3.0-rc1
2b89a0a uevent: fix crash on read errors
9cf56c0 agent: auto-online hotplug memory
33aea09 CI: Add "make proto"
00a5588 Makefile: Conditionally build agent with tag seccomp
7caf1c8 device: check for existing device PCI path before waiting
3441244 device: do not close notify channel when wait timeout

github.com/opencontainers/runtime-spec:
5d9aa69 config-linux: Add Intel RDT/MBA Linux support
6f5fcd4 Support for network namespace in windows
06cf899 config-linux: add Intel RDT CLOS name sharing support
f3be7d2 config: clarify source mount
65fac2b Fix camelCasing on idType to align with other Windows spec conventions
da8adc9 incorporating edits from JTerry's feedback
c182ebc meeting: Bump July meeting from the 4th to the 11th
e01b694 config: Add Windows Devices to Schema
81d81f3 docs: Added kata-runtime to implementations
b0700ad Add gVisor to the implementations list
9e459a6 .travis.yml: Get schema dependencies in before_install
692abcb .travis: Bump minimum Go version to 1.9
fd39559 config: Clarify execution environment for hooks
cd9892d config-linux: Drop console(4) reference
e662e5c Linux devices: uid/gid relative to container
74b670e config: Add VM-based container configuration section
cd39042 uidMappings: change order of fields for clarity
2e241f7 specs-go/config: Define RDMA cgroup
9df387e schema/Makefile: fix test
de688f2 config: Fix Linux mount options links
ef008dd glossary: Bump JSON spec to RFC 8259
4e5a137 schema: Completely drop our JSON Schema 'id' properties
70ba4e6 meeting: Bump January meeting from the 3rd to the 10th
8558116 schema: add allowed values for defaultAction
5d9bbad config: Dedent root paragraphs, since they aren't a list entry
e566cf6 version: put master back to -dev
966a58d fix the link to hook

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants