Expand source/network #591

Closed
ArangoGutierrez opened this issue Sep 2, 2021 · 16 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ArangoGutierrez
Contributor

Enhance the network source (https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/source/network/network.go) to expose more features.

What would you like to be added:
Information like the following:

[root@cnfde4 ~]# ethtool -i ens1f0
driver: ice
version: 4.18.0-305.2.1.rt7.74.el8_4.ocp
firmware-version: 3.00 0x80008278 1.2992.0
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
[root@cnfde4 ~]# ethtool -i ens5f0
driver: i40e
version: 4.18.0-305.2.1.rt7.74.el8_4.ocp
firmware-version: 7.10 0x800075e6 19.5.12
expansion-rom-version: 
bus-info: 0000:86:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Why is this needed:
To discover the drivers in use on each network interface and to support complex network deployments.
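
(Illustrative sketch only, not existing NFD code: one way to collect these fields is to run the ethtool binary and parse its "key: value" output. It assumes ethtool is installed on the host; a real implementation would more likely use the ETHTOOL_GDRVINFO ioctl directly.)

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// drvInfo returns the ethtool -i fields (driver, version, firmware-version, ...)
// of the given interface as a map.
func drvInfo(iface string) (map[string]string, error) {
	out, err := exec.Command("ethtool", "-i", iface).Output()
	if err != nil {
		return nil, err
	}
	info := map[string]string{}
	for _, line := range strings.Split(string(out), "\n") {
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		info[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return info, nil
}

func main() {
	info, err := drvInfo("ens1f0")
	if err != nil {
		panic(err)
	}
	fmt.Println(info["driver"], info["firmware-version"])
}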

@ArangoGutierrez added the kind/feature label on Sep 2, 2021
@leo8a

leo8a commented Sep 2, 2021

The drivers in use on each network interface available in the cluster could be labeled as in the example below:
"feature.node.kubernetes.io/network-<iface-name>.driver": "<driver-name>",

For the examples above, it might look something like:
"feature.node.kubernetes.io/network-ens1f0.driver": "ice",
"feature.node.kubernetes.io/network-ens5f0.driver": "i40e",

@marquiz
Contributor

marquiz commented Sep 2, 2021

I've been preparing for something like this 🧐 In #553 I have this commit which expands and generalizes network discovery a bit. I would be inclined to add this as a customizable capability only, at least in the form suggested above. I'm dubious about the usefulness of labels that contain e.g. network device names, which can vary between the nodes.

So, I would say any feedback (comments on the concept, code review, testing...) on my somewhat huge PRs would be of great help in enabling this 😉 (#464, #550, #553). Unfortunately the documentation PR is still missing...

@leo8a

leo8a commented Sep 15, 2021

Sharing in case it's useful: when multiple virtual/physical interfaces are present on the cluster node(s), physical net devices can be identified via the command below...

[root@cnfde4 ~]# ll /sys/class/net/ | grep -v virtual
total 0
lrwxrwxrwx. 1 root root 0 Sep 15 10:45 ens1f0 -> ../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/net/ens1f0
lrwxrwxrwx. 1 root root 0 Sep 15 10:45 ens5f0 -> ../../devices/pci0000:85/0000:85:00.0/0000:86:00.0/net/ens5f0
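
(The same check as a Go sketch, illustrative only: entries under /sys/class/net are symlinks, and virtual devices resolve under /sys/devices/virtual/, so physical NICs are the ones whose link target does not contain "/virtual/".)

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// physicalInterfaces lists the network interfaces that are backed by a
// physical device, mirroring `ls -l /sys/class/net/ | grep -v virtual`.
func physicalInterfaces() ([]string, error) {
	entries, err := os.ReadDir("/sys/class/net")
	if err != nil {
		return nil, err
	}
	var phys []string
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join("/sys/class/net", e.Name()))
		if err != nil {
			continue
		}
		if !strings.Contains(target, "/virtual/") {
			phys = append(phys, e.Name())
		}
	}
	return phys, nil
}

func main() {
	ifaces, err := physicalInterfaces()
	if err != nil {
		panic(err)
	}
	fmt.Println(ifaces) // e.g. [ens1f0 ens5f0]
}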

I'd be in favor of making the networking labels a bit more verbose/informative (by default), given that the cpu labels already look great.
(sample output below)

"feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512BW": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512CD": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512DQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX512VL": "true",

@marquiz
Contributor

marquiz commented Sep 16, 2021

physical net devices can be identified via the command below...

virtual might be a worthy addition to the device attributes

I'd be in favor of making the networking labels a bit more verbose/informative (by default)

I'm reluctant to add any device-specific labels (i.e. labels that include the network device name). Any other suggestions or wishes?

given that the cpu labels already look great.

Network devices are a bit different. And even in the case of cpuid labels it's questionable how useful most of these labels really are for the end user (i.e. how many of them are really used) – it'd be nice to have some hard data about this 😆

@marquiz
Contributor

marquiz commented Nov 26, 2021

@ArangoGutierrez @leo8a: #660 added the capability for network device detection. Please test and share feedback on whether something essential/useful is missing. Perhaps the virtual/physical attribute could be added 🤔

@marquiz
Contributor

marquiz commented Dec 1, 2021

On second thought, I'm not sure we want to discover virtual devices (currently we don't). WDYT?

@leo8a

leo8a commented Dec 1, 2021

Hey @marquiz, thanks for pushing this forward.

In my experience so far, discovering features on virtual devices doesn't have a use case as clear as discovering available features on physical devices.

For instance, in cloud-native Telco deployments for 5G there is usually a need to patch the baremetal cluster nodes with the latest OOT drivers [1], or to perform firmware upgrades on the NICs, to meet the stringent packet-processing requirements (i.e., low latency, etc.). These automated day-2 operations can be performed via other operators (like SRO, or others), but there is still a missing component that discovers the current driver and firmware on the NICs in order to fully automate these E2E operations. That's the gap, IMHO, that NFD might help bridge in Telco use cases.

Note that, to decide whether to upgrade the firmware or the driver on a NIC, one usually first needs to discover the current versions on the baremetal cluster nodes.

  • [1] More info on why OOT drivers are needed for some use cases instead of in-tree drivers can be found here, from the LAN Access Division at Intel.

@marquiz
Contributor

marquiz commented Dec 1, 2021

In my experience so far, discovering features on virtual devices doesn't have a use case as clear as discovering available features on physical devices.

Yes, agree

For instance, in cloud-native Telco deployments for 5G there is usually a need to patch the baremetal cluster nodes with the latest OOT drivers [1], or to perform firmware upgrades on the NICs, to meet the stringent packet-processing requirements (i.e., low latency, etc.). These automated day-2 operations can be performed via other operators (like SRO, or others), but there is still a missing component that discovers the current driver and firmware on the NICs in order to fully automate these E2E operations. That's the gap, IMHO, that NFD might help bridge in Telco use cases.

I think we could easily add the driver and driver version, but I'm not sure about the firmware version or how to get it. Maybe we need to rely on an external hook or sidecar for that 🤔
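
(Sketch of where the driver name is available without external tools, as an assumption rather than existing NFD code: /sys/class/net/<iface>/device/driver is a symlink to the bound driver, and some drivers additionally expose a module version under /sys/module/<driver>/version. The NIC firmware version is not in sysfs, which is why it would likely need an ethtool ioctl in a hook or sidecar.)

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// driverInfo returns the driver bound to the interface and, if exported,
// the driver module version. The firmware version is not available here.
func driverInfo(iface string) (driver, version string, err error) {
	link, err := os.Readlink(filepath.Join("/sys/class/net", iface, "device", "driver"))
	if err != nil {
		return "", "", err
	}
	driver = filepath.Base(link)
	// Optional: many modern in-tree drivers no longer export a version file.
	if b, readErr := os.ReadFile(filepath.Join("/sys/module", driver, "version")); readErr == nil {
		version = strings.TrimSpace(string(b))
	}
	return driver, version, nil
}

func main() {
	drv, ver, err := driverInfo("ens1f0")
	if err != nil {
		panic(err)
	}
	fmt.Println(drv, ver) // e.g. "ice" plus a version string, if exported
}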

@zvonkok
Contributor

zvonkok commented Dec 1, 2021

Firmware could be tricky since most of the time this relies on tools installed on the host, be it a driver or some user-space library. There are use cases where such information is encoded in the PCI config space, and there it can be "easily" extracted without any external tools. @marquiz This is e.g. one of the use cases for having NFD expose the PCI config space.

This feature (driver exposure) could also be beneficial for sandboxed environments where e.g. I want to know which devices are bound to VFIO.

@mythi FYI
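
(Sketch of the kind of VFIO check meant here, an assumption rather than existing NFD code: PCI devices bound to the vfio-pci driver appear as symlinks named by their PCI address under /sys/bus/pci/drivers/vfio-pci/.)

package main

import (
	"fmt"
	"os"
	"strings"
)

// vfioBoundDevices lists the PCI addresses currently bound to vfio-pci.
func vfioBoundDevices() ([]string, error) {
	entries, err := os.ReadDir("/sys/bus/pci/drivers/vfio-pci")
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // vfio-pci driver not loaded
		}
		return nil, err
	}
	var devs []string
	for _, e := range entries {
		// Device entries look like "0000:3b:00.0"; skip the driver's own
		// control files such as bind, unbind and new_id.
		if strings.Count(e.Name(), ":") == 2 {
			devs = append(devs, e.Name())
		}
	}
	return devs, nil
}

func main() {
	devs, err := vfioBoundDevices()
	if err != nil {
		panic(err)
	}
	fmt.Println(devs)
}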

@zvonkok
Contributor

zvonkok commented Dec 1, 2021

Do we also need to consider here whether the device is e.g. a bridge or a macvlan interface? Those would show the driver of the parent interface.

@marquiz
Contributor

marquiz commented Dec 1, 2021

Do we also need to consider here whether the device is e.g. a bridge or a macvlan interface? Those would show the driver of the parent interface.

Those are always "virtual" interfaces, right? So we don't cover them currently.

@leo8a

leo8a commented Dec 1, 2021

Do we also need to consider here whether the device is e.g. a bridge or a macvlan interface? Those would show the driver of the parent interface.

I'd say s/parent/physical devices only, as Markus just pointed out. This can be determined with the command below, in case it helps.

ll /sys/class/net/ | grep -v virtual

(sample output)
total 0
lrwxrwxrwx. 1 root root 0 Nov 24 08:18 enp1s0 -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/virtio0/net/enp1s0
lrwxrwxrwx. 1 root root 0 Nov 24 08:18 enp7s0 -> ../../devices/pci0000:00/0000:00:02.6/0000:07:00.0/virtio5/net/enp7s0

For the firmware version of NICs, the command below comes in handy for us. However, IIRC @zvonkok had an idea to integrate this using a sidecar to fill that functionality gap.

ethtool -i eth0

(sample output)
driver: i40e
version: 4.18.0-305.2.1.rt7.74.el8_4.ocp
firmware-version: 7.10 0x800075e6 19.5.12 # <-- this is the firmware version

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Mar 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
