Fix issues related to fail over cr description and routes
Signed-off-by: Aswin Suryanarayanan <aswinsuryan@gmail.com>
aswinsuryan committed May 30, 2023
1 parent d69a268 commit 66b1be9
Showing 1 changed file with 30 additions and 29 deletions.
59 changes: 30 additions & 29 deletions submariner/OVN-Interconnect.md
@@ -38,19 +38,19 @@ With OVN Interconnect we can have two types of deployment

```bash
annotations:
k8s.ovn.org/ovn-node-transit-switch-port-ips: '["169.254.0.3/16"]'
k8s.ovn.org/ovn-zone: global
k8s.ovn.org/node-transit-switch-port-ifaddr: '["169.254.0.3/16"]'
k8s.ovn.org/zone-name: global
name: cluster1-worker

annotations:
k8s.ovn.org/ovn-node-transit-switch-port-ips: '["169.254.0.5/16"]'
k8s.ovn.org/ovn-zone: az2
k8s.ovn.org/node-transit-switch-port-ifaddr: '["169.254.0.5/16"]'
k8s.ovn.org/zone-name: az2
name: cluster2-worker
```

With the current architecture, Submariner adds routes only in the zone in which it is deployed. For example, if Submariner is deployed in
zone 1 it programs OVN db in zone 1. So only pods in zone 1 nodes will be able to talk to other clusters. Pods in zone 2 or zone 3 will not
be able to reach remote clusters connected via Submariner.
With the current architecture, the Submariner network-plugin-syncer adds routes only in the single zone where the network-plugin-syncer pod runs.
For example, if network-plugin-syncer is deployed in zone 1, it programs the OVN DB in zone 1 only. As a result, only pods on zone 1 nodes can talk
to other clusters; pods in zone 2 or zone 3 cannot reach remote clusters connected via Submariner.

As part of the proposal, we plan to support both modes, as well as OVN cluster deployments where interconnect
is not enabled.
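
As an illustration of how the node annotations above could be consumed, the sketch below reads the zone name and the transit switch port address from a node's annotation map. The `k8s.ovn.org/zone-name` and `k8s.ovn.org/node-transit-switch-port-ifaddr` keys and the JSON-list value format are taken from the snippet above; the `NodeZoneInfo` type, the `node_zone_info` helper, and the choice to treat a missing transit switch annotation as "interconnect not enabled" are assumptions made for this example only.

```python
import ipaddress
import json
from dataclasses import dataclass
from typing import Dict, Optional

# Annotation keys as they appear in the node metadata shown above.
ZONE_ANNOTATION = "k8s.ovn.org/zone-name"
TRANSIT_SWITCH_ANNOTATION = "k8s.ovn.org/node-transit-switch-port-ifaddr"


@dataclass(frozen=True)
class NodeZoneInfo:
    """Hypothetical container for the zone data of one node."""
    zone: Optional[str]
    transit_switch_ip: Optional[str]

    @property
    def interconnect_enabled(self) -> bool:
        # Assumption: no transit switch IP means a non-IC deployment.
        return self.transit_switch_ip is not None


def node_zone_info(annotations: Dict[str, str]) -> NodeZoneInfo:
    """Extract zone name and transit switch IP from node annotations."""
    zone = annotations.get(ZONE_ANNOTATION)
    raw = annotations.get(TRANSIT_SWITCH_ANNOTATION)
    if raw is None:
        return NodeZoneInfo(zone=zone, transit_switch_ip=None)
    addresses = json.loads(raw)  # e.g. ["169.254.0.3/16"]
    if not addresses:
        return NodeZoneInfo(zone=zone, transit_switch_ip=None)
    # Drop the prefix length and keep only the interface address.
    iface = ipaddress.ip_interface(addresses[0])
    return NodeZoneInfo(zone=zone, transit_switch_ip=str(iface.ip))
```

Calling `node_zone_info` with the `cluster2-worker` annotations would yield zone `az2` and transit switch IP `169.254.0.5`, while a node without these annotations is reported as not using interconnect.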
@@ -67,7 +67,7 @@ This CR will be created when a remote endpoint is added and there will be one CR

* NextHops - Specifies the list of next hops to reach the remote cluster. In this case it will be the IP of the ovn-k8s-mp0
interface, the interface used by OVN for host networking.
* RemoteCIDR - Specifies the list of remote CIDRs reachable via this cluster.
* RemoteCIDR - Specifies the list of remote CIDRs reachable via the next hop.

This CR will be used by the route agent pod running on the active-Gateway node to program OVN to send the traffic destined to
remote clusters via the Submariner tunnel.
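
A minimal sketch of the data this CR carries, assuming a plain Python model rather than the actual Kubernetes resource. The fields mirror NextHops and RemoteCIDR as described above; the `GWRouteSpec` class, the `gw_route_for_endpoint` helper, and its arguments are hypothetical names introduced for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GWRouteSpec:
    """Hypothetical model of the gateway route CR spec."""
    next_hops: List[str]      # here: the IP of the ovn-k8s-mp0 interface
    remote_cidrs: List[str]   # CIDRs reachable via the next hop


def gw_route_for_endpoint(mp0_ip: str, endpoint_subnets: List[str]) -> GWRouteSpec:
    """Build the spec for one remote endpoint, validating the inputs."""
    ipaddress.ip_address(mp0_ip)  # raises ValueError on a malformed address
    # strict=True rejects CIDRs that have host bits set, e.g. 10.132.0.1/16.
    cidrs = [str(ipaddress.ip_network(s, strict=True)) for s in endpoint_subnets]
    return GWRouteSpec(next_hops=[mp0_ip], remote_cidrs=cidrs)
```

Validating the addresses before the spec is built keeps malformed endpoint data out of whatever is later programmed into OVN.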
@@ -93,7 +93,8 @@ type SubmarinerRoutePolicySpec struct {

This CR will be created when a remote endpoint is created and there will be one created per endpoint.

* NextHops - Specifies the list of next hops. In this case it will be the transit switch IP.
* NextHops - Specifies the list of next hops. In this case, we will have only one, and it will be the transit switch IP of the zone
where the g/w node is present.
* RemoteCIDR - Specifies the list of remote CIDRs reachable via this gateway.

* In non-g/w node - If the route-agent pod is not in the same zone as the Gateway node, send the traffic to the g/w node zone.
@@ -133,9 +134,9 @@ The Submariner Route-agent pod running on the active gateway node will be respon
only for OVN CNI. For every RemoteEndpointCreated event a SubmarinerGWRoute CR will be created. The nextHop will be the interface IP through
which we can reach the cable driver. In the case of OVN it will be the IP of ovn-k8s-mp0 interface.

The SubmarinerNonGWRoute CRD will also be created by Submariner Route-agent. It will be created per endpoint and will have remote CIDRs from
the endpoint. The nextHop will be the transit switch IP of the G/W node. If the transit switch IP is missing this CRD will not be created,
which means it is a non-IC setup.
The SubmarinerNonGWRoute CR will also be created by the Submariner Route-agent running on the active gateway node. It will be created per endpoint
and will have the remote CIDRs from the endpoint. The nextHop will be the transit switch IP of the G/W node. If the transit switch IP is missing,
which means it is a non-IC setup, this CR will not be created.

The RouteAgent will have these controllers added to it, and the one running on the gateway node responds to the CRUD operations of Submariner
endpoints.
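
The creation rules described above can be sketched as a single function. This is only an illustration under stated assumptions: `routes_for_endpoint`, its arguments, and the dictionary layout are hypothetical, and real CRs would be created through the Kubernetes API rather than returned as dictionaries.

```python
from typing import Dict, List, Optional


def routes_for_endpoint(
    remote_cidrs: List[str],
    mp0_ip: str,
    gw_transit_switch_ip: Optional[str],
) -> Dict[str, Optional[dict]]:
    """Return the route objects the active gateway's route agent would create.

    A SubmarinerGWRoute is always produced. A SubmarinerNonGWRoute is produced
    only when the gateway node has a transit switch IP, i.e. interconnect is on.
    """
    gw_route = {"NextHops": [mp0_ip], "RemoteCIDR": list(remote_cidrs)}
    non_gw_route = None
    if gw_transit_switch_ip:
        non_gw_route = {
            "NextHops": [gw_transit_switch_ip],
            "RemoteCIDR": list(remote_cidrs),
        }
    return {"SubmarinerGWRoute": gw_route, "SubmarinerNonGWRoute": non_gw_route}
```

In a non-IC setup the second entry is `None`, so the non-gateway controllers have nothing to act on.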
@@ -148,14 +149,13 @@ redirect any traffic destined to remote CIDR to the ovn-k8s-mp0 interface IP.

```bash
_uuid : 0459f009-3603-47ac-8ee7-9d958540ed31
bfd : []
action : reroute
external_ids : {}
ip_prefix : "10.132.0.0/16"
nexthop : "10.1.1.2"
options : {}
output_port : []
policy : []
route_table : ""
match : "ip4.dst==10.132.0.0/16"
nexthop : []
nexthops : ["10.1.1.2"]
options : {"external_ids:{submariner"="true}"}
priority : 20000
```
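
For illustration, the `action`, `match`, `nexthops`, and `priority` columns of a policy like the one above could be derived from a remote CIDR and a next hop as sketched here. The function only builds a dictionary mirroring those columns and does not talk to OVN; the name `reroute_policy`, the constant name, and the IPv6 branch are assumptions that the source does not cover.

```python
import ipaddress
from typing import Dict, List

# Priority shown in the policy output above.
SUBMARINER_POLICY_PRIORITY = 20000


def reroute_policy(remote_cidr: str, next_hops: List[str]) -> Dict[str, object]:
    """Build the column values of a reroute policy for one remote CIDR."""
    network = ipaddress.ip_network(remote_cidr)
    # Assumption: IPv6 CIDRs would use the ip6.dst field in the match.
    family = "ip4" if network.version == 4 else "ip6"
    return {
        "action": "reroute",
        "match": f"{family}.dst=={network}",
        "nexthops": list(next_hops),
        "priority": SUBMARINER_POLICY_PRIORITY,
    }
```

With `reroute_policy("10.132.0.0/16", ["10.1.1.2"])` the resulting `match` and `nexthops` values correspond to the ones in the output above.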

It also programs a route in the ovn-cluster-router, to route the traffic coming from other zones destined to remote cluster IP range via the
@@ -175,17 +175,17 @@ route_table : ""

#### SubmarinerNonGWRoute Controller

This controller will run in every route agent pod. This controller connects to the OVN DB. When a SubmarinerNonGWRoute CR is created
in non-g/w node, it updates the ovn-cluster-route with a router policy using a priority of 20000 to send the traffic to
the remote cluster via next hop mentioned, which is the transit switch IP to the g/w node. Before adding the route it checks if
a route exists, if so it skips adding the route again. This is required to prevent duplicate update since there can be more than
one node in each zone and hence more than one RouteAgent.
This controller will run as part of every route agent pod and connects to the OVN DB. When a SubmarinerNonGWRoute CR is created,
the route-agent running on the non-GW node will update the ovn-cluster-router with a logical router policy using a priority of 20000
to send the traffic to the remote cluster via the specified next hop, which is the transit switch IP of the g/w node. Before adding
the route, it checks whether the route already exists and, if so, skips adding it again. This prevents duplicate updates, since there
can be more than one node in each zone and hence more than one RouteAgent.
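
The "check before adding" step can be sketched with an in-memory stand-in for the policy list of the ovn-cluster-router. `PolicyTable` and `ensure_policy` are hypothetical names and no real OVN client is used; an actual controller would look the policy up in the OVN DB instead of in a Python list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PolicyTable:
    """In-memory stand-in for the logical router policies of ovn-cluster-router."""
    policies: List[dict] = field(default_factory=list)

    def ensure_policy(self, match: str, nexthops: List[str], priority: int = 20000) -> bool:
        """Add a reroute policy unless an identical one exists.

        Returns True if the policy was added and False if it was already
        present, so several route agents in one zone can call it safely.
        """
        wanted = {
            "action": "reroute",
            "match": match,
            "nexthops": list(nexthops),
            "priority": priority,
        }
        if wanted in self.policies:
            return False
        self.policies.append(wanted)
        return True
```

Two route agents in the same zone calling `ensure_policy` with the same arguments leave exactly one policy in the table, which is the behavior the paragraph above asks for.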

```bash
_uuid : 22db3005-64c5-4e32-aeb0-642423c30742
action : reroute
external_ids : {}
match : "ip4.dst==10.132.0.0/16"
match : "ip4.dst==10.132.0.0/14"
nexthop : []
nexthops : ["169.254.0.1"]
options : {"external_ids:{submariner"="true}"}
@@ -203,16 +203,17 @@ network-plugin-syncer pods and remove any existing deployments.

If there are two gateway nodes, the passive one will work like a non-gateway node. It will not be responsible for creating the CRs.
In the case of gateway fail-over all the current SubmarinerGWRoute and SubmarinerNonGWRoute will be deleted by the route agent in
the node that is transitioning to non-gateway node and will be recreated by the new gateway node.
the node that is transitioning to gateway node and will be recreated with updated values.

#### Open issues

1. The update from older version of Submariner to a newer version will create a datapath downtime.
2. When multiple clusters are updated we need to check if one cluster can be done at a time. The cluster will be down
until all the nodes are updated.
1. When multiple clusters are updated, we need to check whether they can be updated one cluster at a time. The cluster will be down
until all the nodes are updated.
2. Updating from an older version of Submariner to a newer version will cause datapath downtime.
3. When the Kubernetes cluster is updated to a version that has IC enabled, there could be datapath downtime until the
Submariner g/w node is updated, since the other nodes need the transit switch IP, which will be available only when
the g/w node is updated.
4. Explore the possibility of using a VIP to represent the gateway node switch IP instead of reconfiguring all the routes at non-g/w nodes.

### Alternatives
