Split NFD into client and server #209

marquiz · 2019-01-16T14:01:53Z

Refactor NFD into a simple server-client system. Labeling is now done by
a separate 'nfd-master' server. It is a simple service with small
codebase, designed for easy isolation. The feature discovery part is
implemented in a 'nfd-worker' client which sends labeling requests to
nfd-server, thus, requiring no access/permissions to the Kubernetes API
itself.

Client-server communication is implemented by using gRPC. The protocol
currently consists of only one request, i.e. the labeling request.

The spec templates are converted to the new scheme. The nfd-master
server can be deployed using the nfd-master.yaml.template which now also
contains the necessary RBAC configuration. NFD workers can be deployed
by using the nfd-worker-daemonset.yaml.template or
nfd-worker-job.yaml.template (most easily used with the label-nodes.sh
script)

Also, migrates from Glide to dep (#180) in dependency handling. This was necessary because Glide was unable to handle the dependencies of the new codebase.

This PR aims at solving #204.

Testing this should be relatively straightforward:

Build a custom image by running make, tag and push it: docker tag <IMAGE_ID> <TARGET_IMAGE> && docker push <TARGET_IMAGE>
Convert the templates to use your custom image: sed s'!quay.io/kubernetes_incubator/node-feature-discovery:v0.3.0!<TARGET_IMAGE>!' -i nfd-*template
Deploy master and workers: kubectl apply -f nfd-master.yaml.template && kubectl apply -f nfd-worker-daemonset.yaml.template

TODO:

Add TLS authentication
Fix unit tests
Write unit tests for new functionality
Update documentation

k8s-ci-robot · 2019-01-16T14:01:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marquiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [marquiz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

nfd-worker-daemonset.yaml.template

cmd/nfd-worker/main.go

Dockerfile

cmd/nfd-master/main.go

nfd-master.yaml.template

zvonkok · 2019-01-28T14:39:16Z

I have a POC running with #209 , #211 , #189. Still some testing to do.

#211 
feature.node.kubernetes.io/kernel-version.full=3.10.0-957.1.3.el7.x86_64
feature.node.kubernetes.io/kernel-version.major=3
feature.node.kubernetes.io/kernel-version.minor=10
feature.node.kubernetes.io/kernel-version.revision=0

#209 
NAME               READY     STATUS    RESTARTS   AGE
nfd-master-69h6x   1/1       Running   0          56m
nfd-master-l86qm   1/1       Running   0          56m
nfd-master-x8tw7   1/1       Running   0          56m
nfd-worker-sz4nr   1/1       Running   0          56m

nfd-master.yaml.template

marquiz · 2019-01-28T15:06:00Z

I have a POC running with #209 , #211 , #189. Still some testing to do.

Thanks you for testing these.

One major missing piece of functionality (in my backlog) regarding this PR is the TLS encryption. That could be added in a separate PR a bit later, too, of course, if we want to get this merged soon.

One thing that could also be added would be a spec template of running the master and worker in the same pod on the nodes. In case somebody would not want to deploy the split architecture. Do you think this would make sense?

zvonkok · 2019-01-31T11:48:49Z

Yeah makes sense, there are folks who run k8s only on one node, and this would help deploy NFD on a single host. This could be a configuration option for the operator specifying a CR with a special keyword option, have no idea, but something like working-mode: multi-node or working-mode: single-node

nfd-master.yaml.template

marquiz · 2019-02-01T13:55:45Z

Yeah makes sense, there are folks who run k8s only on one node, and this would help deploy NFD on a single host.

OK, I now added nfd-daemonset-combined.yaml.template. I'm not entirely sure about the naming of this, but, couldn't come up with anything better...

I also implemented (mutual) TLS authentication, i.e. both nfd-master and nfd-worker(s) are authenticated. Comments on this are welcome, too.

marquiz · 2019-02-02T10:10:21Z

I wrote a quick update to the documentation. Comments to that are more than welcome 😄

Add a new Makefile target for regenerating these files. Also, add a note that the files are auto-generated, including instructions how to re-generate them. Renames the mock files, using the defaults provided by the mockery tool, in order to make their generation easier.

Glide is not actively developed anymore, and, its documentation recommends migrating to dep. Also, dep is widely used in other k8s projects. Migrating to dep dramatically reduces the size of the populated vendor/ directory from 75MB down to about 20MB.

Update client-go and related packages to the latest version.

Refactor NFD into a simple server-client system. Labeling is now done by a separate 'nfd-master' server. It is a simple service with small codebase, designed for easy isolation. The feature discovery part is implemented in a 'nfd-worker' client which sends labeling requests to nfd-server, thus, requiring no access/permissions to the Kubernetes API itself. Client-server communication is implemented by using gRPC. The protocol currently consists of only one request, i.e. the labeling request. The spec templates are converted to the new scheme. The nfd-master server can be deployed using the nfd-master.yaml.template which now also contains the necessary RBAC configuration. NFD workers can be deployed by using the nfd-worker-daemonset.yaml.template or nfd-worker-job.yaml.template (most easily used with the label-nodes.sh script). Only nfd-worker currently support config file or options. The (default) NFD config file is renamed to nfd-worker.conf.

Refactor old tests and add tests for new functions. Add 'test' target in Makefile.

Makes deployment simpler, but, "softens" the setup by basically giving nodes the capability to label themselves.

Add support for TLS authentication. When enabled, nfd-worker verifies that nfd-master has a valid certificate, i.e. signed by the given root certificate and its Common Name (CN) matches the DNS name of the nfd-master service being used. TLS authentication is enabled by specifying --key-file and --cert-file on nfd-master, and, --ca-file on nfd-worker.

Implement TLS client certificate authentication. It is enabled by specifying --ca-file, --key-file and --cert-file, on both the nfd-master and nfd-worker side. When enabled, nfd-master verifies that the client (worker) presents a valid certificate signed by the root certificate (--ca-file). In addition, nfd-master does authorization based on the Common Name (CN) of the client certificate: CN must match the node name specified in the labeling request. This ensures (assuming that the worker certificates are correctly deployed) that nfd-worker is only able to label the node it is running on, i.e. prevents it from labeling other nodes.

Command line option for overriding the Common Name (CN) expected from the nfd-master TLS certificate. This can be especially handy in testing/development.

Make NodeName based authorization of the workers optional (off by default). This makes it possible for all nfd-worker pods in the cluster to use one shared secret, making NFD deployment much easier. However, this also opens a way for nfd-workers to label other nodes (than what it is running on), too.

Makes solving issues easier when gRPC prints out information e.g. about TLS authentication problems on the server (nfd-master) side, too.

Simplifies the code a bit. Also, log NodeName at startup.

marquiz · 2019-02-25T08:10:37Z

Pushed a new rebased version. With one new patch: logging NodeName, and reading it only once, at at startup

zvonkok · 2019-04-04T18:27:12Z

/lgtm Have done several tests, this looks good to me.

marquiz · 2019-04-04T19:39:30Z

/lgtm

k8s-ci-robot · 2019-04-04T19:39:30Z

@marquiz: you cannot LGTM your own PR.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Version 0.4.0 Node-feature-discovery was migrated into a new repository under the kubernetes-sigs organization in Github (kubernetes-sigs#175). Related to the migration, the final container image registry/repo hasn't been dediced yet (kubernetes-sigs#177) – for this release we still use the old repo. Major changes - Split NFD into client and server (kubernetes-sigs#209) - Changes in labels: - NFD label namespace was changed to 'feature.node.kubernetes.io/' (kubernetes-sigs#176) - 'nfd-' prefix was dropped from all feature labels - NFD version label (feature.node.kubernetes.io/node-feature-discovery.version) was replaced by an annotation (nfd.node.kubernetes.io/version) - network SRIOV labels were changed (kubernetes-sigs#173): - 'network-sriov' -> 'network-sriov.capable' - 'network-sriov-configured' -> 'network-sriov.configured' - selinux detection was moved to kernel feature source - 'selinux' -> 'kernel-selinux.enabled' - cpuid, pstate and RDT labels moved under cpu feature source (kubernetes-sigs#217) - 'cpuid-<cpuid flag>' -> 'cpu-cpuid.<cpuid flag>' - 'pstate-turbo' -> 'cpu-pstate.turbo' - 'rdt-<rdt feature>' -> 'cpu-rdt.<rdt feature>' - Support for config file (kubernetes-sigs#169). Currently with three configurable feature sources i.e. cpu (kubernetes-sigs#224), kernel (kubernetes-sigs#157) and pci (kubernetes-sigs#168) - Support for non-binary labels, with arbitrary values other than plain 'true' - PCI device detection (kubernetes-sigs#168) - Kernel version detection (kubernetes-sigs#157) - Kernel config option detection (kubernetes-sigs#146) - Support for custom feature-detector hooks (kubernetes-sigs#144) - Support OS version detection (kubernetes-sigs#149, kubernetes-sigs#211) - Detection of hardware multithreading, such as Intel Hyper-Threading Technology (kubernetes-sigs#147) - Arm64 support for CPUID detection (kubernetes-sigs#194) - Validation of feature label names and values (kubernetes-sigs#199, kubernetes-sigs#219) - Detection of NVDIMM devices (kubernetes-sigs#214) - Get labels by reading from file in 'local' source (kubernetes-sigs#228) - Detection of Intel SST-BF (Speed Select Technology - Base Frequency) (kubernetes-sigs#235) - Make it possible to create feature labels in non-default namespace (kubernetes-sigs#231). Currently possible for using the local source (hooks and files). Miscellaneous - Template specs converted from json to yaml - Documentation updates and fixes

k8s-ci-robot requested a review from balajismaniam January 16, 2019 14:01

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 16, 2019

marquiz mentioned this pull request Jan 16, 2019

Only masters should label nodes, not worker nodes themselves #204

Closed

marquiz force-pushed the devel/client-server branch from f2fccb4 to 5acd22a Compare January 24, 2019 14:23