Enhancement proposal for Network Observability #921

stleerh · 2021-10-05T16:13:28Z

Network Observability introduces a new category in OpenShift that provides networking information for a single cluster. It gives insight into what's on the network, when and what types of traffic and traffic flows are being made, and by whom. It gathers data to help design, plan, and answer questions about the network and provides a visual representation to help understand, diagnose, and troubleshoot networking issues.

aravindhp · 2021-10-05T17:11:12Z

/uncc

stleerh · 2021-10-05T18:07:08Z

/retest

jotak · 2021-10-06T05:29:26Z

/cc

stleerh · 2021-10-06T14:58:08Z

/cc @russellb @mcurry-rh @bbennett @amorenoz @eraichst @eparis @spadgett

openshift-ci · 2021-10-06T14:58:12Z

@stleerh: GitHub didn't allow me to request PR reviews from the following users: bbennett, amorenoz.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @russellb @mcurry-rh @bbennett @amorenoz @eraichst @eparis @spadgett


stleerh · 2021-10-06T15:05:04Z

/cc knobunc

spadgett · 2021-10-06T17:13:19Z

enhancements/network-observability.md

+information (e.g. pod, services, namespaces) and then saved in internal or
+external storage.
+
+The web console will provide a NetFlow table showing traffic between


Do you plan to contribute this through an OpenShift console dynamic plugin or directly to the openshift/console repo?

Some background on dynamic plugins:
https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md
https://github.com/openshift/console/tree/master/frontend/packages/console-dynamic-plugin-sdk

OpenShift console dynamic plugin

jotak · 2021-10-07T06:52:14Z

enhancements/network-observability.md

+
+#### OVS
+Open vSwitch (OVS) is the NetFlow exporter in the OpenShift cluster.  When
+Network Observability is enabled, each OVS in the pod will be configured to


I think you mean:

Suggested change

-Network Observability is enabled, each OVS in the pod will be configured to
+Network Observability is enabled, each OVS in the cluster will be configured to

?

Actually, shouldn't it be node?

jotak · 2021-10-07T07:06:44Z

enhancements/network-observability.md

+
+
+### Operators
+Two new operators will be added to OCP.


Small remark: the sentence "two new operators will be added" could be misunderstood. We're not going to develop the Loki Operator; we'll just reuse the one developed and maintained by the cluster logging team, though we will expect users to deploy a distinct instance.
So we really create one operator in terms of development.

Clarify that Loki Operator is a separate project. Also, squash commits into one for initial PR. Separate commits after initial PR.

amorenoz · 2021-10-08T14:29:53Z

enhancements/network-observability.md

+what's happening on their network.  Monitoring provides metrics and
+alerts to potential problems.  Network observability will then help you
+analyze, investigate and diagnose those problems by looking at it from
+a control plane perspective instead of device by device.  In addition,


I'd like to make a rather generic observation here:

At least to me, Network Observability can mean two things (maybe it means both at the same time):
1 - Observing the network traffic
2 - Observing the network logic

What we're doing here is adding Traffic Network Observability, since we're capturing information about packets that actually go through the network. But in order to understand why that traffic went through (or why not), we would have to visualize the network's logic. That means observing 4 layers of network logic (from more to less abstract):

k8s/ovn-k8s level logic (probably covered by other parts of the OpenShift console): which networks, pods, services, policies, etc. are configured and how they relate to each other.

OVN logic: how the k8s logic gets translated into a set of Logical Routers, Logical Switches, Load Balancers, ACLs, etc., and how these objects are connected to each other and to the k8s nodes (Chassis).

OVS logic: the OpenFlow flows that OVN configures into OVS and that define what it should do with packets as they come in and out of certain OpenFlow bridges [1] (which are a logical entity that groups OpenFlow flows together).

Datapath logic: the datapath flows configured by OVS into the OVS kernel datapath. This is the logic that actually gets applied to flows that come through the system.

So, whenever someone wants to understand why a packet is being handled in a specific way, they'll probably have to navigate some of these network logic layers.

Now, I think it's OK to focus this enhancement on network traffic observability but I'd say we should not forget about the rest. For instance, the process of determining what logic to program into the virtual network is commonly referred to as "control plane" so this sentence, in this particular context, seems a bit confusing to me.

[1] I'll refer to this logical bridge entity in another comment

This is in line with what Russell says, we all agree that Network Observability is broader than just observing the network traffic and that this proposal only addresses one aspect of the whole.
Currently our focus is clearly on traffic observability, but we don't want to lose track of the network logic (it's tracked in Jira :) )

@stleerh we have this in JIRA, but indeed, this area could as well be mentioned in the "future work" section that Russell suggests

The plan is to focus on network traffic and then see how much we need to expand beyond this based on customer feedback.

The idea is that this enhancement looks at it from a cluster point-of-view instead of a device point-of-view. Perhaps the better term to use is "cluster" instead of "control plane".

amorenoz · 2021-10-08T14:42:21Z

enhancements/network-observability.md

+
+### Implementation Details/Notes/Constraints
+
+Here are some of the limitations and constraints.


Will the IPFIX exporter be configured in all OVS bridges [1]?
If the answer is no, I think there is a limitation of certain flows not being visible.

More generally, is the concept of "logical bridge" going to be abstracted away from the customer visualization? For instance, if you sample at two bridges, A and B, that are connected to each other, you might detect a NetFlow flow on ingressA, that same flow on egressA, again on ingressB, and finally on egressB. The way these flows are de-duplicated can be considered an implementation detail, but I'd like to understand how much "logical" visualization we are adding here.

[1] See other comment above
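
To make the de-duplication question above concrete, here is a minimal Python sketch. It assumes each exported record carries a 5-tuple plus its observation point (bridge and direction). The `FlowRecord` type, its field names, and the keep-the-first-observation policy are hypothetical illustrations, not what the enhancement or OVS actually implements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    # Hypothetical record shape; real IPFIX records carry many more fields.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    bridge: str      # observation point, e.g. "A" or "B"
    direction: str   # "ingress" or "egress"
    byte_count: int

    def key(self):
        # The 5-tuple identifies the conversation regardless of where it was seen.
        return (self.src_ip, self.dst_ip, self.src_port, self.dst_port, self.protocol)


def deduplicate(records):
    """Keep only the first observation of each 5-tuple."""
    seen = {}
    for record in records:
        seen.setdefault(record.key(), record)
    return list(seen.values())


# The same flow observed on ingressA, egressA, ingressB and egressB.
observations = [
    FlowRecord("10.0.0.1", "10.0.1.2", 40000, 8080, "tcp", bridge, direction, 1500)
    for bridge in ("A", "B")
    for direction in ("ingress", "egress")
]

unique = deduplicate(observations)
naive_bytes = sum(r.byte_count for r in observations)
deduped_bytes = sum(r.byte_count for r in unique)
```

Summing the raw observations counts the same bytes once per observation point, while the de-duplicated list counts them once. Which observation to keep (first seen, a preferred bridge, or a merged view) is exactly the design choice the comment asks about.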

No, at the moment only the int bridge is going to be used (i.e., no change from the current implementation, which enables flow exports on the int bridge from the ovn-k layer).
I agree it can be mentioned in the limitations, and potentially mentioned as an area of improvement in "future work"

amorenoz · 2021-10-08T14:45:59Z

enhancements/network-observability.md

+### Visualization
+This is the proposal for the NetFlow table.
+
+<https://marvelapp.com/prototype/h4fei7h>


Is this the final set of columns or will more be added?

This is the current proposal, and it intentionally keeps the list small for simplicity. If there are other attributes that you feel are important, please suggest adding them to the list.

One field comes to mind: the Direction field which can have 2 values INGRESS/EGRESS.
Some flows are only visible on one of the directions, for instance if Pod A accesses Service S that has Pod B as an endpoint you would see:
Pod A -> Service S INGRESS
Pod A -> Pod B EGRESS

How would this be shown?

If no Direction column is shown, it could lead to the (wrong) conclusion that Pod A is generating twice as much traffic.

If we enforce one direction (i.e. INGRESS) and hide the other, we would lose the perspective on the load balancing taking place.

Additionally, OVS provides the port name information. This might need translation (from port name to pod name) and could be considered redundant if we have the Pod Src/Dst Names derived from the IP addresses. But, looking at the amount of traffic that is sent to the geneve port might give you an idea of whether your application is well scaled.
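
The double-counting concern can be illustrated with a small Python sketch. The row layout, the pod and service names, and the byte counts are made up for illustration; they are not the proposed table schema.

```python
from collections import defaultdict

# Hypothetical flow rows for the example above: one connection from Pod A to
# Service S that is load-balanced to Pod B, seen once per direction.
flows = [
    {"src": "pod-a", "dst": "service-s", "direction": "INGRESS", "bytes": 1000},
    {"src": "pod-a", "dst": "pod-b", "direction": "EGRESS", "bytes": 1000},
]


def bytes_by_source(rows, direction=None):
    """Sum bytes per source, optionally restricted to one direction."""
    totals = defaultdict(int)
    for row in rows:
        if direction is None or row["direction"] == direction:
            totals[row["src"]] += row["bytes"]
    return dict(totals)


# Ignoring direction attributes both rows to Pod A and doubles its traffic.
without_direction = bytes_by_source(flows)
# Restricting to one direction gives the actual volume Pod A sent.
ingress_only = bytes_by_source(flows, direction="INGRESS")
# The EGRESS rows are where the load-balancing decision becomes visible.
egress_destinations = [r["dst"] for r in flows if r["direction"] == "EGRESS"]
```

Keeping the direction as a column lets a viewer filter to one side for volume while still inspecting the other side to see which endpoint the service resolved to.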

Sorry, I missed this comment. We will add the Direction field. Regarding the port name, let's see if that will be necessary.

russellb

As a general comment, the first few sections cast a more high-level, aspirational vision of network observability, and then later in the doc it starts getting into a specific proposal about network flow collection and visualization. I would try to focus the document more on what's proposed to be implemented and not try to capture the broad potential scope of network observability overall.

You could consider a section on "future work" where you allude to what might come next, or what might build on the work proposed in this enhancement.

russellb · 2021-10-08T14:40:16Z

enhancements/network-observability.md

+approvers:
+  - "@russellb"
+  - "@mcurry-rh"
+  - "@bbennett"


Need to update this to Ben's GitHub username, @knobunc

Updated in fab47eb

russellb · 2021-10-08T14:41:35Z

enhancements/network-observability.md

+Network Observability introduces a new category in OpenShift that
+provides networking information for a single cluster.  It gives insight
+into what's on the network, when and what types of traffic and traffic
+flows are being made, and by whom.  It gathers data to help design, plan,


A nit, but how would you differentiate "traffic" and "traffic flows"? Should this just be "traffic flows"?

Traffic = IANA service name
Traffic flow = the actual NetFlow data which includes the traffic

It could just be "traffic flow" but wanted to call out the traffic part.

My 2 cents: if we stick with IPFIX terminology, that would be "flow template" and "flow record"
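
To make the distinction between "traffic" and "traffic flow" from this thread concrete, here is a minimal Python sketch. It assumes, following the reply above, that "traffic" is the IANA service name derived from a flow's destination port and protocol. The `TrafficFlow` type, its fields, and the tiny lookup table are hypothetical; a real implementation would consult the full IANA registry.

```python
from dataclasses import dataclass

# A few IANA service names keyed by (port, protocol); a hand-picked subset
# used only for this illustration.
IANA_SERVICES = {
    (22, "tcp"): "ssh",
    (53, "udp"): "domain",
    (80, "tcp"): "http",
    (443, "tcp"): "https",
}


@dataclass(frozen=True)
class TrafficFlow:
    # The "traffic flow": the exported flow data itself (hypothetical fields).
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str
    byte_count: int

    def traffic(self):
        # The "traffic": the service name implied by the flow's port/protocol.
        return IANA_SERVICES.get((self.dst_port, self.protocol), "unknown")


web = TrafficFlow("10.0.0.1", "10.0.1.2", 443, "tcp", 4200)
other = TrafficFlow("10.0.0.1", "10.0.1.3", 9999, "tcp", 100)
labels = [web.traffic(), other.traffic()]
```

In this reading the flow is the record and the traffic type is an attribute derived from it, which is one way to see why "traffic flows" alone might be enough in the summary sentence.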

russellb · 2021-10-08T14:42:14Z

enhancements/network-observability.md

+## Motivation
+
+With Kubernetes, a layer of abstraction is added making it difficult for
+Red Hat and customers who managed their networks to be able to fully see


Suggested change

-Red Hat and customers who managed their networks to be able to fully see
+Red Hat and customers who manage their networks to be able to fully see

Updated in fab47eb

russellb · 2021-10-08T14:58:30Z

enhancements/network-observability.md

+The goal of the first release is to lay the groundwork and foundation
+in place, while still being able to deliver some functionality, even
+if it is at a smaller scale.  The target is to have a Dev Preview to
+generate interest in network observability.


Summary, motivation, and goals are all pretty high level and aspirational. When we get down into the Proposal, we get down to the proposed initial scope. I think it would be helpful to clarify earlier in the doc that while Network Observability can be quite broad, you've explicitly chosen this subset to work on in this iteration. It could be worth explaining why network flow collection and visualization was chosen as the next area to implement.

Added in Proposal in fab47eb

russellb · 2021-10-08T14:59:57Z

enhancements/network-observability.md

+by a user with an admin or cluster-admin role.  This is done by installing
+the Network Observability Operator, which in turn, installs the dependency
+operators necessary for this feature.  The user can do this using the
+web console or the CLI.


Can you summarize the dependencies here, as well? I wondered what "dependency operators" were referred to here, and then found the details in the later design details section. I think a quick summary here would still be helpful of the major components you will depend on.

This will be updated to say it will require installation of Loki Operator. Similar to Cluster Logging, the user will need to install each one separately.

russellb · 2021-10-08T15:16:27Z

enhancements/network-observability.md

+#### Network Observability Operator (NetObserv)
+The Network Observability Operator will need to be installed from OperatorHub
+to enable this feature.  This operator has a dependency on the Loki
+Operator.  This is an OpenShift Console dynamic plugin that is responsible


What is a console dynamic plugin? Not the operators ... so not sure what this refers to exactly.

Do you mean to say that the operator includes a console dynamic plugin? I'd look into how dynamic plugins are delivered, as I think they may need to be included in the console and not delivered as part of the operator? I'm not positive on that though.

It would help for this section to more clearly explain what the new operator will do to justify it. So far, I can guess:

ensure the loki operator is installed?

ensure a loki instance is installed and configured properly?

automatically turn on and configure flow export from OVS?

Yes, it will follow the guidelines at https://github.com/openshift/console/tree/master/frontend/packages/console-dynamic-plugin-sdk. I will add the link to this document and ultimately, another link for the Network Observability Operator.

russellb · 2021-10-08T15:18:23Z

enhancements/network-observability.md

+***Side note:***<br>
+This is a move away from Elasticsearch due to licensing restrictions.
+Open Distro for Elasticsearch was considered, but Loki was favored due to other
+components that plan to ultimately use Loki.


Who is going to own loki and the loki operator and ensure those get shipped and supported appropriately?

Cluster Logging team. Working on this...

russellb · 2021-10-08T15:19:26Z

enhancements/network-observability.md

+
+Measuring ROI on network observability is difficult so it might be hard
+to justify the cost and resources to deploy this.  It attempts to find
+and resolve issues that you might not know exists.  The value you get


Suggested change

-and resolve issues that you might not know exists. The value you get
+and resolve issues that you might not know exist. The value you get

Updated in fab47eb

russellb · 2021-10-08T15:20:00Z

enhancements/network-observability.md

+to justify the cost and resources to deploy this.  It attempts to find
+and resolve issues that you might not know exists.  The value you get
+may not be obvious because it is hard to know how much you saved by
+preventing ransomware or by allocating the right amount of resources


"preventing ransomware" was a bit out of left field for me, since security isn't really a focus throughout the enhancement

Proposed text:
The value you get may not be obvious because it is difficult to calculate how much you save by preventing something from happening such as a network failure.

russellb · 2021-10-08T15:20:49Z

enhancements/network-observability.md

+
+## Alternatives
+
+Sticking with and enhancing traditional pinpoint tools.


What are "traditional pinpoint tools" ?

You've also got some "alternatives considered" content scattered throughout that could all be moved here, like the mention of Elasticsearch, or the use of existing operators.

Proposed text:
Provide scripts and standalone applications to enhance traditional pinpoint tools like pcap, traceroute, netstat, etc.

I will move the other alternatives here.

knobunc · 2021-10-20T13:08:02Z

/approve

openshift-ci · 2021-10-20T13:08:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [knobunc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alibo · 2021-10-22T13:32:33Z

I know you're already working on implementing it here, but is there any chance to consider Pixie as an alternative approach?

stleerh · 2021-10-25T18:54:03Z

> I know you're already working on implementing it here, but is there any chance to consider Pixie as an alternative approach?

https://px.dev/

https://docs.px.dev/tutorials/pixie-101/network-monitoring/

https://docs.px.dev/tutorials/pixie-101/request-tracing/

FAQ: https://docs.px.dev/about-pixie/faq/

Pixie uses eBPF and that is something we will be looking at in the future as an alternative data source. We still want to provide NetFlow as a choice since that is what a number of customers are familiar with and willing to enable.

openshift-bot · 2021-11-23T10:20:47Z

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

stleerh · 2021-11-24T16:16:17Z

/remove-lifecycle stale

Fix lint issue

knobunc · 2021-12-08T16:31:30Z

/lgtm

openshift-ci bot requested review from aravindhp and jwmatthews October 5, 2021 16:13

openshift-ci bot removed the request for review from aravindhp October 5, 2021 17:11

openshift-ci bot requested a review from jotak October 6, 2021 05:29

openshift-ci bot requested review from eparis, spadgett, russellb and mcurry-rh October 6, 2021 14:58

openshift-ci bot requested a review from knobunc October 6, 2021 15:05

spadgett reviewed Oct 6, 2021

jotak reviewed Oct 7, 2021

stleerh force-pushed the network-observability branch from d545ae5 to 953912e Compare October 7, 2021 19:50

stleerh and others added 3 commits October 7, 2021 12:57

Fix markdownlint issue

5229e3e

Address reviewers' comments

1128b21

Clarify that Loki Operator is a separate project. Also, squash commits into one for initial PR. Separate commits after initial PR.

stleerh force-pushed the network-observability branch from 953912e to 1128b21 Compare October 7, 2021 19:58

amorenoz reviewed Oct 8, 2021

russellb reviewed Oct 8, 2021

Address review comments

fab47eb

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 20, 2021

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2021

openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2021

stleerh added 4 commits December 1, 2021 09:34

Merge branch 'openshift:master' into network-observability

7cd4286

Update network-observability.md

4a63f16

Fix lint issue

Add missing sections for lint compliance

ea7aa1a

Add API Extensions

277e530

openshift-ci bot assigned knobunc Dec 8, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021

openshift-merge-robot merged commit 8762e02 into openshift:master Dec 8, 2021

