Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split NFD into client and server #209

Merged
merged 16 commits into from
Apr 4, 2019

Conversation

marquiz
Copy link
Contributor

@marquiz marquiz commented Jan 16, 2019

Refactor NFD into a simple server-client system. Labeling is now done by
a separate 'nfd-master' server. It is a simple service with small
codebase, designed for easy isolation. The feature discovery part is
implemented in a 'nfd-worker' client which sends labeling requests to
nfd-server, thus, requiring no access/permissions to the Kubernetes API
itself.

Client-server communication is implemented by using gRPC. The protocol
currently consists of only one request, i.e. the labeling request.

The spec templates are converted to the new scheme. The nfd-master
server can be deployed using the nfd-master.yaml.template which now also
contains the necessary RBAC configuration. NFD workers can be deployed
by using the nfd-worker-daemonset.yaml.template or
nfd-worker-job.yaml.template (most easily used with the label-nodes.sh
script)

Also, migrates from Glide to dep (#180) in dependency handling. This was necessary because Glide was unable to handle the dependencies of the new codebase.

This PR aims at solving #204.

Testing this should be relatively straightforward:

  1. Build a custom image by running make, tag and push it: docker tag <IMAGE_ID> <TARGET_IMAGE> && docker push <TARGET_IMAGE>
  2. Convert the templates to use your custom image: sed s'!quay.io/kubernetes_incubator/node-feature-discovery:v0.3.0!<TARGET_IMAGE>!' -i nfd-*template
  3. Deploy master and workers: kubectl apply -f nfd-master.yaml.template && kubectl apply -f nfd-worker-daemonset.yaml.template

TODO:

  • Add TLS authentication
  • Fix unit tests
  • Write unit tests for new functionality
  • Update documentation

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marquiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 16, 2019
@marquiz marquiz force-pushed the devel/client-server branch from f2fccb4 to 5acd22a Compare January 24, 2019 14:23
@marquiz marquiz force-pushed the devel/client-server branch from 5acd22a to c91ffc3 Compare January 28, 2019 12:16
@zvonkok
Copy link
Contributor

zvonkok commented Jan 28, 2019

I have a POC running with #209 , #211 , #189. Still some testing to do.

#211 
feature.node.kubernetes.io/kernel-version.full=3.10.0-957.1.3.el7.x86_64
feature.node.kubernetes.io/kernel-version.major=3
feature.node.kubernetes.io/kernel-version.minor=10
feature.node.kubernetes.io/kernel-version.revision=0
#209 
NAME               READY     STATUS    RESTARTS   AGE
nfd-master-69h6x   1/1       Running   0          56m
nfd-master-l86qm   1/1       Running   0          56m
nfd-master-x8tw7   1/1       Running   0          56m
nfd-worker-sz4nr   1/1       Running   0          56m

@marquiz marquiz force-pushed the devel/client-server branch from c91ffc3 to f2edda8 Compare January 28, 2019 15:00
@marquiz
Copy link
Contributor Author

marquiz commented Jan 28, 2019

I have a POC running with #209 , #211 , #189. Still some testing to do.

Thanks you for testing these.

One major missing piece of functionality (in my backlog) regarding this PR is the TLS encryption. That could be added in a separate PR a bit later, too, of course, if we want to get this merged soon.

One thing that could also be added would be a spec template of running the master and worker in the same pod on the nodes. In case somebody would not want to deploy the split architecture. Do you think this would make sense?

@marquiz marquiz force-pushed the devel/client-server branch from f2edda8 to df5f8e9 Compare January 29, 2019 09:34
@zvonkok
Copy link
Contributor

zvonkok commented Jan 31, 2019

Yeah makes sense, there are folks who run k8s only on one node, and this would help deploy NFD on a single host. This could be a configuration option for the operator specifying a CR with a special keyword option, have no idea, but something like working-mode: multi-node or working-mode: single-node

@marquiz marquiz force-pushed the devel/client-server branch 2 times, most recently from 5aed898 to 2244cbd Compare February 1, 2019 13:51
@marquiz
Copy link
Contributor Author

marquiz commented Feb 1, 2019

Yeah makes sense, there are folks who run k8s only on one node, and this would help deploy NFD on a single host.

OK, I now added nfd-daemonset-combined.yaml.template. I'm not entirely sure about the naming of this, but, couldn't come up with anything better...

I also implemented (mutual) TLS authentication, i.e. both nfd-master and nfd-worker(s) are authenticated. Comments on this are welcome, too.

@marquiz marquiz force-pushed the devel/client-server branch 2 times, most recently from d63ebac to 3d03277 Compare February 2, 2019 10:06
@marquiz
Copy link
Contributor Author

marquiz commented Feb 2, 2019

I wrote a quick update to the documentation. Comments to that are more than welcome 😄

@marquiz marquiz force-pushed the devel/client-server branch 3 times, most recently from b0abe53 to ff48ce2 Compare February 8, 2019 18:45
@marquiz marquiz force-pushed the devel/client-server branch from fd42716 to 33905c4 Compare February 20, 2019 12:21
Add a new Makefile target for regenerating these files.  Also, add a
note that the files are auto-generated, including instructions how to
re-generate them.

Renames the mock files, using the defaults provided by the mockery tool,
in order to make their generation easier.
Glide is not actively developed anymore, and, its documentation
recommends migrating to dep. Also, dep is widely used in other k8s
projects.

Migrating to dep dramatically reduces the size of the populated vendor/
directory from 75MB down to about 20MB.
Update client-go and related packages to the latest version.
Refactor NFD into a simple server-client system. Labeling is now done by
a separate 'nfd-master' server. It is a simple service with small
codebase, designed for easy isolation. The feature discovery part is
implemented in a 'nfd-worker' client which sends labeling requests to
nfd-server, thus, requiring no access/permissions to the Kubernetes API
itself.

Client-server communication is implemented by using gRPC. The protocol
currently consists of only one request, i.e. the labeling request.

The spec templates are converted to the new scheme. The nfd-master
server can be deployed using the nfd-master.yaml.template which now also
contains the necessary RBAC configuration. NFD workers can be deployed
by using the nfd-worker-daemonset.yaml.template or
nfd-worker-job.yaml.template (most easily used with the label-nodes.sh
script).

Only nfd-worker currently support config file or options. The (default)
NFD config file is renamed to nfd-worker.conf.
Refactor old tests and add tests for new functions. Add 'test' target in
Makefile.
Makes deployment simpler, but, "softens" the setup by basically giving
nodes the capability to label themselves.
Add support for TLS authentication. When enabled, nfd-worker verifies
that nfd-master has a valid certificate, i.e. signed by the given root
certificate and its Common Name (CN) matches the DNS name of the
nfd-master service being used. TLS authentication is enabled by
specifying --key-file and --cert-file on nfd-master, and, --ca-file on
nfd-worker.
Implement TLS client certificate authentication. It is enabled by
specifying --ca-file, --key-file and --cert-file, on both the nfd-master
and nfd-worker side. When enabled, nfd-master verifies that the client
(worker) presents a valid certificate signed by the root certificate
(--ca-file). In addition, nfd-master does authorization based on the Common Name
(CN) of the client certificate: CN must match the node name specified in
the labeling request. This ensures (assuming that the worker
certificates are correctly deployed) that nfd-worker is only able to label
the node it is running on, i.e. prevents it from labeling other nodes.
Command line option for overriding the Common Name (CN) expected from
the nfd-master TLS certificate. This can be especially handy in
testing/development.
Make NodeName based authorization of the workers optional (off by
default). This makes it possible for all nfd-worker pods in the cluster
to use one shared secret, making NFD deployment much easier. However,
this also opens a way for nfd-workers to label other nodes (than what it
is running on), too.
Makes solving issues easier when gRPC prints out information e.g. about
TLS authentication problems on the server (nfd-master) side, too.
Simplifies the code a bit. Also, log NodeName at startup.
@marquiz marquiz force-pushed the devel/client-server branch from 33905c4 to e09b380 Compare February 25, 2019 08:05
@marquiz
Copy link
Contributor Author

marquiz commented Feb 25, 2019

Pushed a new rebased version. With one new patch: logging NodeName, and reading it only once, at at startup

@zvonkok
Copy link
Contributor

zvonkok commented Apr 4, 2019

/lgtm Have done several tests, this looks good to me.

@marquiz marquiz removed the request for review from balajismaniam April 4, 2019 19:39
@marquiz
Copy link
Contributor Author

marquiz commented Apr 4, 2019

/lgtm

@k8s-ci-robot
Copy link
Contributor

@marquiz: you cannot LGTM your own PR.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marquiz marquiz merged commit c54551f into kubernetes-sigs:master Apr 4, 2019
@marquiz marquiz mentioned this pull request May 9, 2019
10 tasks
@marquiz marquiz deleted the devel/client-server branch June 11, 2019 07:03
ArangoGutierrez pushed a commit to ArangoGutierrez/node-feature-discovery that referenced this pull request Nov 18, 2020
Version 0.4.0

Node-feature-discovery was migrated into a new repository under the
kubernetes-sigs organization in Github (kubernetes-sigs#175). Related to the migration,
the final container image registry/repo hasn't been dediced yet (kubernetes-sigs#177) –
for this release we still use the old repo.

Major changes
- Split NFD into client and server (kubernetes-sigs#209)
- Changes in labels:
  - NFD label namespace was changed to 'feature.node.kubernetes.io/' (kubernetes-sigs#176)
    - 'nfd-' prefix was dropped from all feature labels
    - NFD version label
      (feature.node.kubernetes.io/node-feature-discovery.version) was
      replaced by an annotation (nfd.node.kubernetes.io/version)
  - network SRIOV labels were changed (kubernetes-sigs#173):
    - 'network-sriov' -> 'network-sriov.capable'
    - 'network-sriov-configured' -> 'network-sriov.configured'
  - selinux detection was moved to kernel feature source
    - 'selinux' -> 'kernel-selinux.enabled'
  - cpuid, pstate and RDT labels moved under cpu feature source (kubernetes-sigs#217)
    - 'cpuid-<cpuid flag>' -> 'cpu-cpuid.<cpuid flag>'
    - 'pstate-turbo' -> 'cpu-pstate.turbo'
    - 'rdt-<rdt feature>' -> 'cpu-rdt.<rdt feature>'
- Support for config file (kubernetes-sigs#169). Currently with three configurable
  feature sources i.e. cpu (kubernetes-sigs#224), kernel (kubernetes-sigs#157) and pci (kubernetes-sigs#168)
- Support for non-binary labels, with arbitrary values other than plain
  'true'
- PCI device detection (kubernetes-sigs#168)
- Kernel version detection (kubernetes-sigs#157)
- Kernel config option detection (kubernetes-sigs#146)
- Support for custom feature-detector hooks (kubernetes-sigs#144)
- Support OS version detection (kubernetes-sigs#149, kubernetes-sigs#211)
- Detection of hardware multithreading, such as Intel Hyper-Threading
  Technology (kubernetes-sigs#147)
- Arm64 support for CPUID detection (kubernetes-sigs#194)
- Validation of feature label names and values (kubernetes-sigs#199, kubernetes-sigs#219)
- Detection of NVDIMM devices (kubernetes-sigs#214)
- Get labels by reading from file in 'local' source (kubernetes-sigs#228)
- Detection of Intel SST-BF (Speed Select Technology - Base Frequency) (kubernetes-sigs#235)
- Make it possible to create feature labels in non-default namespace
  (kubernetes-sigs#231). Currently possible for using the local source (hooks and files).

Miscellaneous
- Template specs converted from json to yaml
- Documentation updates and fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants