Fix network plugin detection #1061

Merged
merged 9 commits into from
Feb 15, 2021

Contributor

@mkimuram mkimuram commented Feb 3, 2021

This PR fixes network plugin detection by:

  • Making the detection logic for each network plugin and for the CIDRs not fail on an internal error, so that detection properly falls through to the next method,
  • Adding logic to detect pod CIDRs and cluster CIDRs for k8s clusters that run without the k8s pods (api-server, controller, kube-proxy).

Closes: submariner-io/submariner#1116

Contributor Author

mkimuram commented Feb 3, 2021

I've tested it as follows, and it seems to work well:

[Build environment]

# make build
# make images
# docker save quay.io/submariner/submariner-operator:dev > /tmp/submariner-operator.tar

(Copy bin/subctl and /tmp/submariner-operator.tar to test environment.)

[Test environment (after k3s is deployed)]

# k3s ctr images import submariner-operator.tar
# k3s ctr images tag quay.io/submariner/submariner-operator:dev quay.io/submariner/submariner-operator-dev:dev
# k3s ctr images list | grep submariner

# ./subctl deploy-broker --globalnet --globalnet-cidr-range 169.254.0.0/16

# ./subctl join broker-info.subm --clusterid cl1 --clustercidr 10.44.0.0/16 --servicecidr 10.45.0.0/16 --natt=false --globalnet-cidr 169.254.44.0/24 --image-override submariner-operator=quay.io/submariner/submariner-operator-dev:dev

# kubectl edit deploy -n submariner-operator submariner-operator

(Change imagePullPolicy to Never.)

# watch kubectl get pod -n submariner-operator

# ./subctl show networks

Showing network details for cluster "default":
    Discovered network details:
        Network plugin:  generic
        Service CIDRs:   [10.45.0.0/16]
        Cluster CIDRs:   [10.44.0.0/24]

# iptables -t nat -S POSTROUTING | head -n 2
-P POSTROUTING ACCEPT
-A POSTROUTING -j SUBMARINER-POSTROUTING

# kubectl logs -n submariner-operator -l app=submariner-routeagent
I0202 22:58:31.139620       1 routes_iface.go:121] Found nil gw or dst
I0202 22:58:31.139631       1 gw_transition.go:40] Creating the vxlan interface: vx-submariner on the gateway node
I0202 22:58:31.140418       1 vxlan.go:164] Successfully added the bridge fdb entry 192.168.122.67 00:00:00:00:00:00
I0202 22:58:31.140562       1 vxlan.go:251] Successfully configured rp_filter to loose mode(2) on vx-submariner
I0202 22:58:33.297912       1 handler.go:78] A Node with name "k3s" and addresses []v1.NodeAddress{v1.NodeAddress{Type:"InternalIP", Address:"192.168.122.67"}, v1.NodeAddress{Type:"Hostname", Address:"k3s"}} has been updated
I0202 22:58:33.297953       1 node_handler.go:30] A Node with name "k3s" and addresses []v1.NodeAddress{v1.NodeAddress{Type:"InternalIP", Address:"192.168.122.67"}, v1.NodeAddress{Type:"Hostname", Address:"k3s"}} has been updated
I0202 22:58:38.924489       1 handler.go:78] A Node with name "k3s" and addresses []v1.NodeAddress{v1.NodeAddress{Type:"InternalIP", Address:"192.168.122.67"}, v1.NodeAddress{Type:"Hostname", Address:"k3s"}} has been updated
I0202 22:58:38.924530       1 node_handler.go:30] A Node with name "k3s" and addresses []v1.NodeAddress{v1.NodeAddress{Type:"InternalIP", Address:"192.168.122.67"}, v1.NodeAddress{Type:"Hostname", Address:"k3s"}} has been updated
I0202 23:03:41.255994       1 handler.go:78] A Node with name "k3s" and addresses []v1.NodeAddress{v1.NodeAddress{Type:"InternalIP", Address:"192.168.122.67"}, v1.NodeAddress{Type:"Hostname", Address:"k3s"}} has been updated
I0202 23:03:41.256036       1 node_handler.go:30] A Node with name "k3s" and addresses []v1.NodeAddress{v1.NodeAddress{Type:"InternalIP", Address:"192.168.122.67"}, v1.NodeAddress{Type:"Hostname", Address:"k3s"}} has been updated

}
}

return "", fmt.Errorf("no node with Spec.PodCIDR found")
Contributor

I don't think we should return an error here if not found. We should let the caller handle not found as it sees fit (it so happens this is the last resort for the caller but that could change).
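
A minimal sketch of what this suggestion could look like, assuming a simplified stand-in for the Kubernetes Node type. The `node` struct and the function name `findPodCIDRFromNodes` are hypothetical and only illustrate the "not found is not an error" convention:

```go
package main

import "fmt"

// node is a stand-in for the Kubernetes Node type, keeping only the
// fields this sketch needs (hypothetical, for illustration).
type node struct {
	Name    string
	PodCIDR string
}

// findPodCIDRFromNodes returns the first non-empty PodCIDR.
// "Not found" is reported as ("", nil) rather than as an error, so the
// caller decides whether that is fatal.
func findPodCIDRFromNodes(nodes []node) (string, error) {
	for _, n := range nodes {
		if n.PodCIDR != "" {
			return n.PodCIDR, nil
		}
	}
	return "", nil
}

func main() {
	cidr, err := findPodCIDRFromNodes([]node{{Name: "k3s"}})
	fmt.Printf("cidr=%q err=%v\n", cidr, err)
}
```

With this shape, the error return is reserved for real failures such as a failed API call.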

Contributor

@mangelajo mangelajo left a comment

see inline comments, but the new service heuristic is good IMO, "cluster-dump" equivalent is what we do on the other heuristics.

@mkimuram mkimuram force-pushed the issue/1116v3 branch 2 times, most recently from 3866d7a to d89fbc6 Compare February 3, 2021 23:01
Contributor Author

@mkimuram mkimuram left a comment

@tpantelis

I've fixed almost everything as suggested. Please see my inline comments.

@@ -89,7 +97,7 @@ func findClusterIPRangeSvcCreation(clientSet kubernetes.Interface) (string, erro

// creating invalid svc didn't fail as expected
if err == nil {
-return "", fmt.Errorf("creating invalid service(%v) didn't fail", invalidSvcSpec)
+return "", nil
Contributor Author

Keeping the error return here makes the unit tests complex: all existing tests for network plugin detection would need to rely on service creation always returning an error.

Contributor

So this means the API server didn't behave as we expected, likely something changed. I think it would be useful to at least log a message:

status.QueueWarningMessage("Could not determine the service IP range via service creation - the expected error was not returned")

Contributor Author

Changed it to return an error properly and changed unit tests as well.
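
For readers unfamiliar with this heuristic: the idea is to create a service with an out-of-range cluster IP and read the valid range from the resulting error. A rough sketch of the parsing step follows. The error wording matched by the regular expression is an assumption (the exact message is not shown in this thread and may differ between Kubernetes versions), and `parseServiceCIDR` is a hypothetical name:

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumed wording of the API server's validation error; the exact text
// is not shown in this thread and may differ between Kubernetes versions.
var validRangeRE = regexp.MustCompile(`The range of valid IPs is ([0-9a-fA-F.:]+/[0-9]+)`)

// parseServiceCIDR extracts the CIDR from the error message, returning
// an error if the message does not have the expected shape.
func parseServiceCIDR(msg string) (string, error) {
	m := validRangeRE.FindStringSubmatch(msg)
	if m == nil {
		return "", fmt.Errorf("could not determine the service IP range from %q", msg)
	}
	return m[1], nil
}

func main() {
	// Sample message written for this sketch, not captured from a cluster.
	msg := "provided IP is not in the valid range. The range of valid IPs is 10.45.0.0/16"
	cidr, err := parseServiceCIDR(msg)
	fmt.Println(cidr, err)
}
```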

},
},
}
_, err := clientSet.CoreV1().Services("submariner-operator").Create(invalidSvcSpec)
Contributor Author

Creating the svc in the submariner-operator namespace might fail if this is called before the namespace is created. That seems to happen even with the subctl join command. Other candidates are kube-system, default, or a random namespace created just for this check. Any ideas?

Contributor

I think we could use empty namespace but, if not, default.

Contributor Author

This code is called from both the command line and the operator.
subctl actually calls it on subctl join before the submariner-operator namespace is created, so it shouldn't create the service in the submariner-operator namespace. On the other hand, the operator is only allowed to create resources inside the submariner-operator namespace, so it shouldn't create the service in the default namespace.

Therefore, the code was changed to check whether it is running in the operator and to create the service in the appropriate namespace.
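
The namespace choice described above might look roughly like this. It is a sketch based on the WATCH_NAMESPACE check quoted later in this review; the function name `namespaceForProbe` is hypothetical, and the fallback to "default" outside the operator is an assumption drawn from the discussion rather than from the final code:

```go
package main

import (
	"fmt"
	"os"
)

// operatorNamespace is the namespace the operator runs in.
const operatorNamespace = "submariner-operator"

// namespaceForProbe picks the namespace in which the probe service is
// created. watchNamespace is the value of the WATCH_NAMESPACE env var.
// Falling back to "default" is an assumption made for this sketch.
func namespaceForProbe(watchNamespace string) string {
	if watchNamespace == operatorNamespace {
		// Running inside the operator pod.
		return operatorNamespace
	}
	// Running from the command line, possibly before the operator
	// namespace exists.
	return "default"
}

func main() {
	fmt.Println(namespaceForProbe(os.Getenv("WATCH_NAMESPACE")))
}
```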

Base automatically changed from master to devel February 4, 2021 09:37
Contributor Author

mkimuram commented Feb 5, 2021

@tpantelis

Thank you so much for your review with detailed suggestions. PTAL

}

// decide which namespace to create the service
ns := os.Getenv("WATCH_NAMESPACE")
Contributor

What is WATCH_NAMESPACE and how is it set?

Contributor Author

This is set here as an environment variable of the operator pod. I think that this is used by operator-sdk's GetWatchNamespace.

Contributor

OK - could you add a comment explaining this? eg "WATCH_NAMESPACE will be present if running in the operator pod".

Another approach is to pass in the NS so we don't rely on the presence of an env var. The operator code implicitly knows its own namespace. I think this would make it clearer, although it would mean passing a param down the chain.
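
A rough sketch of this alternative, with hypothetical names throughout (`discover`, `discoverGeneric` and `probeTarget` are stand-ins, not the project's real signatures). The point is only that the namespace travels down the call chain as an explicit parameter instead of being read from the environment:

```go
package main

import "fmt"

// probeTarget records where the probe service would be created
// (hypothetical type used only to make the sketch observable).
type probeTarget struct {
	Namespace string
}

// discoverGeneric receives the namespace from its caller.
func discoverGeneric(namespace string) probeTarget {
	return probeTarget{Namespace: namespace}
}

// discover is the entry point shared by both callers; it only threads
// the namespace through to the generic discovery step.
func discover(namespace string) probeTarget {
	return discoverGeneric(namespace)
}

func main() {
	// The operator would pass its own namespace; the command line would
	// pass a namespace that is known to exist (assumed "default" here).
	fmt.Println(discover("submariner-operator").Namespace)
	fmt.Println(discover("default").Namespace)
}
```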

Contributor Author

Changed and squashed.

// decide which namespace to create the service
ns := os.Getenv("WATCH_NAMESPACE")
// use "submariner-operator" if WATCH_NAMESPACE is "submariner-operator" (in operator)
if ns != "submariner-operator" {
Contributor

There's a constant for this in root.go.

Contributor Author

It seems that importing the constant from pkg/subctl/cmd into pkg/discovery/network/generic.go causes an "import cycle not allowed" error. Is it OK to just define a constant in pkg/discovery/network/generic.go, or do we need to move the constant somewhere else so it can be shared?

Note that, as explained, subctl creates the operator pod in the namespace and passes the env variable via the downward API (metadata.namespace). Then, inside the operator, the code checks whether the env variable is set to determine whether it is running inside the operator. So this constant is not directly shared across these functions.

Contributor

If we pass the NS into the function as described above, this becomes moot.


// creating invalid service didn't fail as expected
if err == nil {
return "", fmt.Errorf("creating invalid service(%v) didn't fail", invalidSvcSpec)
Contributor

I prefer this the way you had it, ie not return error here or on L128. This makes it consistent with the other discovery methods and, on join, will result in the user being prompted. But, as I noted earlier, I think it's worth logging a message as the API server didn't behave as expected.

Contributor Author

The callers of discoverGenericNetwork are as follows:

[Operator]
Reconcile
  discoverNetwork
    getClusterNetwork
      Discover
        discoverGenericNetwork

[Command line]
getNetworkDetails
  Discover
    discoverGenericNetwork
  if err != nil {
    status.QueueWarningMessage(fmt.Sprintf("Error trying to discover network details: %s", err))
  }

So, for the command line case, the error is already passed to status.QueueWarningMessage.

Also, other functions called from discoverGenericNetwork, like findClusterIPRangeFromApiserver, actually return an error rather than catching it with status.QueueWarningMessage.

Therefore, the code should already behave as you expect.

Contributor

My point was about whether it should return an error in this case, ie should we treat it like we didn't find the data, similar to the other discovery methods. So for subctl join, this would at least allow it to prompt the user as a last resort. After all, this is just another attempt to find the data. This seems reasonable but it would be beneficial to print a warning. @mangelajo WDYT?

Contributor Author

Waiting for @mangelajo's comment. But I'm now inclined to return nil and log an error here, only for this case.

Meanwhile, I would like to confirm that it is safe to call status.QueueWarningMessage outside CLI's context, or from operator. It looks like a package under pkg/internal/cli and I'm not familiar with its implementation details.

Contributor

Unfortunately status.QueueWarningMessage wouldn't be appropriate in the operator. The operator uses a different logging API. We could pass in a function to log warnings.

Contributor Author

Added logging code in controllers/submariner/submariner_networkdiscovery.go so that the operator logs it (see the first commit). Also reorganized and squashed the commits to make review easier.

})
})

-When("There is a kubeapi pod at least ", func() {
+When("There is a kube-api pod at least ", func() {
Contributor

Suggested change
-When("There is a kube-api pod at least ", func() {
+When("There is a kube-api pod", func() {

var clusterNet *ClusterNetwork

BeforeEach(func() {
clusterNet = testDiscoverGenericWith(
fakePod("kube-controller", []string{"kube-controller", "--cluster-ABCD=1.2.3.4"}, []v1.EnvVar{}),
Member

Suggested change
-fakePod("kube-controller", []string{"kube-controller", "--cluster-ABCD=1.2.3.4"}, []v1.EnvVar{}),
+fakePod("kube-controller-manager", []string{"kube-controller-manager", "--cluster-ABCD=1.2.3.4"}, []v1.EnvVar{}),

Member

In the code, we are actually checking for component kube-controller-manager and this test-case is about validating the arguments (i.e., no expected params).

Contributor Author

Thank you for pointing it out. Fixed.
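
To illustrate what this test case validates, here is a small sketch of looking up a flag value in a pod's command-line arguments. `findArgValue` is a hypothetical helper, and the assumption that the discovery code looks for a `--cluster-cidr=` prefix is mine; the thread only shows that `--cluster-ABCD=1.2.3.4` must not be accepted:

```go
package main

import (
	"fmt"
	"strings"
)

// findArgValue returns the value of the first argument that starts with
// prefix, or "" if no argument matches.
func findArgValue(args []string, prefix string) string {
	for _, a := range args {
		if strings.HasPrefix(a, prefix) {
			return strings.TrimPrefix(a, prefix)
		}
	}
	return ""
}

func main() {
	// Arguments from the test case above: the component name is right,
	// but the expected parameter is absent.
	args := []string{"kube-controller-manager", "--cluster-ABCD=1.2.3.4"}
	fmt.Printf("%q\n", findArgValue(args, "--cluster-cidr="))
}
```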

var clusterNet *ClusterNetwork

BeforeEach(func() {
clusterNet = testDiscoverGenericWith(
fakePod("kube-controller", []string{"kube-api", "--cluster-ABCD=1.2.3.4"}, []v1.EnvVar{}),
Member

Suggested change
-fakePod("kube-controller", []string{"kube-api", "--cluster-ABCD=1.2.3.4"}, []v1.EnvVar{}),
+fakePod("kube-apiserver", []string{"kube-apiserver", "--cluster-ABCD=1.2.3.4"}, []v1.EnvVar{}),

Contributor Author

Fixed as well.

Signed-off-by: Masaki Kimura <masaki.kimura@hitachivantara.com>
Member

@sridhargaddam sridhargaddam left a comment

Thanks for the PR and also addressing the comments @mkimuram

Contributor

@mangelajo mangelajo left a comment

This looks good to me now, the additional heuristics will make this much more robust. Thank you @mkimuram .

@mangelajo mangelajo requested a review from tpantelis February 15, 2021 09:42
@mangelajo
Contributor

@tpantelis have an eye, I believe mkimura handled all your comments.

Contributor

tpantelis commented Feb 15, 2021

> @tpantelis have an eye, I believe mkimura handled all your comments.

Mostly. So we're going to fail fast on join if the service creation API doesn't fail with the CIDR info as expected. I'm fine with that but I'd like to make the messages clearer and add a comment wrt the WATCH_NAMESPACE. I can push a follow-up PR.

Successfully merging this pull request may close these issues.

SUBMARINER-POSTROUTING is not chained from POSTROUTING
5 participants