feat: implement collector and analyser for network namespace connectivity #1670

Merged: 12 commits, Nov 6, 2024

ricardomaraschini (Contributor) commented Oct 31, 2024

Description, Motivation and Context

Important

This feature is only supported on Linux; all other platforms return an unsupported platform error.

Checks if two network namespaces can talk to each other on UDP and TCP. Its usage is as follows:

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: test
spec:
  hostCollectors:
  - networkNamespaceConnectivity:
      collectorName: check-network-connectivity
      fromCIDR: 10.0.0.0/24
      toCIDR: 10.0.1.0/24
  hostAnalyzers:
  - networkNamespaceConnectivity:
      collectorName: check-network-connectivity
      outcomes:
      - pass:
          message: "Communication between 10.0.0.0/24 and 10.0.1.0/24 is working"
      - fail:
          message: "Communication between 10.0.0.0/24 and 10.0.1.0/24 isn't working"

It is also available on a HostPreflight object:

apiVersion: troubleshoot.sh/v1beta2
kind: HostPreflight
metadata:
    name: ec-cluster-preflight
spec:
    collectors:
        - networkNamespaceConnectivity:
            collectorName: check-network-connectivity
            fromCIDR: 10.0.0.0/24
            toCIDR: 10.0.1.0/24
            port: 8888
    analyzers:
        - networkNamespaceConnectivity:
            collectorName: check-network-connectivity
            outcomes:
            - pass:
                message: "Communication between 10.0.0.0/24 and 10.0.1.0/24 is working"
            - fail:
                message: "Communication between 10.0.0.0/24 and 10.0.1.0/24 isn't working"

The analyzer output can be templated as follows:

apiVersion: troubleshoot.sh/v1beta2
kind: HostPreflight
metadata:
    name: ec-cluster-preflight
spec:
    collectors:
        - networkNamespaceConnectivity:
            collectorName: check-network-connectivity
            fromCIDR: 10.0.0.0/24
            toCIDR: 10.0.1.0/24
    analyzers:
        - networkNamespaceConnectivity:
            collectorName: check-network-connectivity
            outcomes:
            - pass:
                message: "Communication between {{ .FromNamespace }} and {{ .ToNamespace }} is working"
            - fail:
                message: "{{ .ErrorMessage }}"

Tip

If this fails then you may need to enable IP forwarding with:

sysctl -w net.ipv4.ip_forward=1

If it still fails, you may need to configure firewalld to allow the traffic, or simply disable it for the sake of testing:

firewall-cmd --new-zone=ec --permanent
firewall-cmd --zone=ec --set-target=ACCEPT --permanent
firewall-cmd --zone=ec --add-source=10.0.0.0/24 --permanent
firewall-cmd --zone=ec --add-source=10.0.1.0/24 --permanent
firewall-cmd --reload

or

systemctl stop firewalld
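
Before changing any settings, the forwarding state mentioned in the tip can be inspected without root. The sketch below is only an illustration and is not part of this PR: it reads `/proc/sys/net/ipv4/ip_forward`, the file behind the `net.ipv4.ip_forward` sysctl, and the helper names `parse_ip_forward` and `ip_forward_enabled` are made up for the example.

```python
from pathlib import Path

# File backing the net.ipv4.ip_forward sysctl on Linux.
IP_FORWARD_PATH = Path("/proc/sys/net/ipv4/ip_forward")


def parse_ip_forward(raw: str) -> bool:
    """Interpret the sysctl value: 0 means disabled, non-zero means enabled."""
    value = raw.strip()
    return value.isdigit() and int(value) != 0


def ip_forward_enabled(path: Path = IP_FORWARD_PATH) -> bool:
    """Read the current value; an unreadable file is treated as disabled."""
    try:
        return parse_ip_forward(path.read_text())
    except OSError:
        return False


if __name__ == "__main__":
    state = "enabled" if ip_forward_enabled() else "disabled"
    print(f"net.ipv4.ip_forward is {state}")
```

If this reports "disabled", the `sysctl -w` command above is the likely first fix.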

Notes

  • This is being implemented to be used on our Embedded Cluster.
  • We have chosen to require the user to provide CIDRs to avoid false positives (firewall rules may vary).
  • Documentation hasn't been written yet, as its content may change depending on how this PR review goes.
  • The port property is optional; if none is provided, 8080 is used.

Workflow

                                                                 
  ┌───────────────────────────────────────────────────────────┐  
  │                    DEFAULT NAMESPACE                      │  
  │                                                           │  
  │ ┌────────────┐                             ┌────────────┐ │  
  │ │ From NS    │                             │ To NS      │ │  
  │ │            │                             │            │ │  
  │ │            │                             │            │ │  
  │ │ TCP CLIENT │                             │ UDP SERVER │ │  
  │ │ UDP CLIENT │                             │ TCP SERVER │ │  
  │ │        ▲   │                             │        ▲   │ │  
  │ │   │    │   │                             │  │     │   │ │  
  │ │   │    │   │                             │  │     │   │ │  
  │ │   │    └───┼─────────────────────────────┼──┘     │   │ │  
  │ │   └────────┼─────────────────────────────┼────────┘   │ │  
  │ │  ┌─────────│───────────┐     ┌───────────│─────────┐  │ │  
  │ │  │10.0.0.1 │ 10.0.0.254│     │10.0.1.254 │ 10.0.1.1│  │ │  
  │ │  └─────────│───────────┘     └───────────│─────────┘  │ │  
  │ └────────────┘                             └────────────┘ │  
  │                                                           │  
  └───────────────────────────────────────────────────────────┘  
                                                                 
  • Two namespaces are created (called from and to).
  • A UDP server and a TCP server are started in the to namespace.
  • Connections are made to these servers from the from namespace.
  • Packets leave the from namespace through its veth interface and are routed into the to namespace.
  • The first IP of each range is used as the namespace's internal IP, and the last as the gateway out of the namespace.
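
The address selection described in the last bullet can be sketched with Python's standard `ipaddress` module. This is an illustration only, not the PR's Go code; the function name `namespace_addresses` is hypothetical, and it assumes "first" and "last" mean the first and last usable host addresses, which matches the 10.0.0.1 / 10.0.0.254 pair in the diagram.

```python
import ipaddress


def namespace_addresses(cidr: str):
    """Return (internal_ip, gateway_ip) for a namespace CIDR.

    Assumption: the internal IP is the first usable host address and the
    gateway is the last usable host address of the range.
    """
    net = ipaddress.ip_network(cidr, strict=True)
    if net.num_addresses < 4:
        raise ValueError(f"{cidr} is too small for two host addresses")
    # net[0] is the network address and net[-1] the broadcast address,
    # so the usable hosts run from net[1] to net[-2].
    return net[1], net[-2]


if __name__ == "__main__":
    for cidr in ("10.0.0.0/24", "10.0.1.0/24"):
        internal, gateway = namespace_addresses(cidr)
        print(f"{cidr}: internal={internal} gateway={gateway}")
```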

Optional Configurations

The collector supports the following additional optional configurations:

| Configuration | Default Value | Description |
| --- | --- | --- |
| port | 8080 | The port to be used by both UDP and TCP connections |
| timeout | 5s | The timeout for the UDP and TCP connections |
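
To make the roles of port and timeout concrete, here is a minimal sketch of the same kind of probe on the loopback interface: one TCP and one UDP echo server, and a client that checks each with a timeout. It is not the collector's implementation (which is written in Go and runs across network namespaces); all function names are hypothetical, and the demo uses ephemeral ports rather than 8080 so it can run unprivileged anywhere.

```python
import socket
import threading


def start_tcp_echo(host: str, port: int):
    """Start a one-shot TCP echo server; returns (socket, bound_port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)

    def serve():
        try:
            conn, _ = srv.accept()
            with conn:
                conn.sendall(conn.recv(1024))
        except OSError:
            pass  # server socket was closed before a client arrived

    threading.Thread(target=serve, daemon=True).start()
    return srv, srv.getsockname()[1]


def start_udp_echo(host: str, port: int):
    """Start a one-shot UDP echo server; returns (socket, bound_port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind((host, port))

    def serve():
        try:
            data, addr = srv.recvfrom(1024)
            srv.sendto(data, addr)
        except OSError:
            pass  # server socket was closed before a datagram arrived

    threading.Thread(target=serve, daemon=True).start()
    return srv, srv.getsockname()[1]


def check_tcp(host: str, port: int, timeout: float) -> bool:
    """True if a TCP connection succeeds and the payload is echoed back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as c:
            c.sendall(b"ping")
            return c.recv(1024) == b"ping"
    except OSError:  # includes timeouts and refused connections
        return False


def check_udp(host: str, port: int, timeout: float) -> bool:
    """True if a UDP datagram is echoed back within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as c:
        c.settimeout(timeout)
        try:
            c.sendto(b"ping", (host, port))
            data, _ = c.recvfrom(1024)
            return data == b"ping"
        except OSError:
            return False


if __name__ == "__main__":
    # Port 0 asks the kernel for a free port; the collector defaults to 8080.
    tcp_srv, tcp_port = start_tcp_echo("127.0.0.1", 0)
    udp_srv, udp_port = start_udp_echo("127.0.0.1", 0)
    print("tcp ok:", check_tcp("127.0.0.1", tcp_port, timeout=5.0))
    print("udp ok:", check_udp("127.0.0.1", udp_port, timeout=5.0))
    tcp_srv.close()
    udp_srv.close()
```

The timeout matters most for UDP: with no connection handshake, a dropped datagram is only detectable as a missing reply.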

Message templating

The following templating variables are available when templating the message of a failure outcome:

| Property | Description |
| --- | --- |
| {{ .ErrorMessage }} | Shows all error messages found during the collection |
| {{ .FromNamespace }} | The CIDR provided in the collector's fromCIDR property |
| {{ .ToNamespace }} | The CIDR provided in the collector's toCIDR property |
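
As a rough illustration of what the substitution does to an outcome message, the sketch below replaces `{{ .Name }}` placeholders from a dictionary. It only mimics simple field references; it is not Go's template engine and not the analyzer's code, and the `render` helper is hypothetical.

```python
import re

# Matches simple field references such as "{{ .FromNamespace }}".
_FIELD = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(message: str, values: dict) -> str:
    """Substitute {{ .Field }} placeholders with entries from values."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"unknown template variable: {name}")
        return str(values[name])

    return _FIELD.sub(substitute, message)


if __name__ == "__main__":
    values = {
        "FromNamespace": "10.0.0.0/24",
        "ToNamespace": "10.0.1.0/24",
        "ErrorMessage": "example error text",  # placeholder, not real output
    }
    print(render("Communication between {{ .FromNamespace }} and "
                 "{{ .ToNamespace }} is working", values))
```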

Checklist

  • New and existing tests pass locally with introduced changes.
  • Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

@ricardomaraschini ricardomaraschini added the type::feature New feature or request label Oct 31, 2024
@ricardomaraschini ricardomaraschini marked this pull request as ready for review October 31, 2024 20:18
@ricardomaraschini ricardomaraschini requested a review from a team as a code owner October 31, 2024 20:18
banjoh (Member) left a comment

The collector works well. I was able to perform the tests as described in the PR.

There might be a resource leak somewhere. One of the virtual network interfaces is left behind after the collector completes. The screenshot below shows the support-bundle binary waiting for user input. At this point, collection, redaction and analysis have already taken place, so any resources created by collectors should have been deleted. Exiting support-bundle removes the network interface, which suggests that one of the servers is left running in a goroutine.

image

Review threads (outdated, resolved):
  • pkg/namespaces/network-namespace.go
  • pkg/analyze/host_network_namespace_connectivity.go (two threads)
ricardomaraschini (Contributor, Author) commented Nov 5, 2024

Interesting. I haven't seen this in my tests, nor in a quick attempt to reproduce it just now. Would you mind sharing the YAML you used so I can try to reproduce this?

banjoh (Member) commented Nov 5, 2024

I used the same spec you have in your description

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: test
spec:
  hostCollectors:
  - networkNamespaceConnectivity:
      collectorName: check-network-connectivity
      fromCIDR: 10.0.0.0/24
      toCIDR: 10.0.1.0/24
  hostAnalyzers:
  - networkNamespaceConnectivity:
      collectorName: check-network-connectivity
      outcomes:
      - pass:
          message: "Communication between 10.0.0.0/24 and 10.0.1.0/24 is working"
      - fail:
          message: "Communication between 10.0.0.0/24 and 10.0.1.0/24 isn't working"

I ran support-bundle spec.yaml while watching the existing interfaces with ip -s link show.
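
For anyone trying to reproduce the leak, the manual watching described above could be automated by comparing interface names before and after a run. This is a sketch and was not part of the review; the helper names are hypothetical, and `socket.if_nameindex()` is used as a stand-in for parsing `ip -s link show` output.

```python
import socket


def interface_names() -> set:
    """Names of the network interfaces currently visible to this process."""
    return {name for _, name in socket.if_nameindex()}


def leftover_interfaces(before, after) -> list:
    """Interfaces present after a run that were not there before it."""
    return sorted(set(after) - set(before))


if __name__ == "__main__":
    before = interface_names()
    input("Run the collector now, then press Enter... ")
    leaked = leftover_interfaces(before, interface_names())
    print("leftover interfaces:", leaked or "none")
```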

even though in my tests the interface pair is deleted every time we
delete the namespace, we had better delete it explicitly before we
delete the namespace.

this comes out of a review comment where some people seem to still be
able to see the interface pair even after the namespace is deleted.

i.e. better safe than sorry.
ricardomaraschini (Contributor, Author) commented

I ran this more than a thousand times this morning and could not reproduce it. I tried kernels 6.8.0-47-generic and 5.14.0-427.13.1.el9_4.x86_64 and did not see this behavior even once.

For the sake of testing I have pushed this commit. Can you please check whether you can still reproduce the problem?

Please let me know which kernel you are using.

banjoh (Member) left a comment

LGTM

banjoh (Member) commented Nov 6, 2024

After your last commit to fix deletion of the interface, things look good now.

image

@ricardomaraschini ricardomaraschini merged commit e272683 into main Nov 6, 2024
27 checks passed
@ricardomaraschini ricardomaraschini deleted the network-namespace-collector-and-analyser branch November 6, 2024 10:30